I have a head node and 4 worker nodes for high-performance computing (HPC).
Recently, I had to shut the system down for maintenance at our data center. When I tried to turn it back on, I got the following error message:

[ 5.215623][ C14] nvme0: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2)
You are in emergency mode. After logging in, type "journalctl -xb" to view system logs, "systemctl reboot" to reboot, "systemctl default" or "exit" to boot into default mode.
Give root password for maintenance
(or press Control-D to continue):

The system seems to be stuck in a loop at this prompt.
Initially, I pressed Ctrl+D as suggested to boot into the default mode, but it just cycles back to the same emergency mode prompt every time.
A couple of things that might be relevant:
I wasn't aware of it at the time, but an external USB device was left plugged into the back of the system when I turned it on after maintenance. I'm not sure whether this could be causing the issue, but it's worth mentioning.
Each node requires two power cables plugged into the power adapter. During the reconnection, I realized that one of the power cables for a node was not initially connected to a power source. However, I have fixed this issue, and now all nodes are receiving power as required.
I'm not a Linux expert, so I'm a bit lost as to what could be causing this problem. I've tried searching for solutions online, but nothing seems to be working for me.
If any of you have encountered a similar issue or have expertise with SUSE Linux and HPC systems, I would greatly appreciate any advice or guidance on how to troubleshoot and resolve this "emergency mode" problem.
Do you get the prompt "Give root password for maintenance (or press Control-D to continue):"? If so, you can just type the root password instead of pressing Ctrl+D, and you'll get into the shell. Then you can view and edit /etc/fstab. Does journalctl -xb provide any information?
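
To make the /etc/fstab check more concrete, here is a rough sketch of a helper you could paste into the emergency shell. It assumes the problem might be an fstab entry whose device is not present at boot (one common reason for landing in emergency mode, though not confirmed for this system). The function name fstab_missing_devices is made up for this example, and it only handles unquoted UUID=, LABEL=, PARTUUID= and /dev/ style entries.

```shell
# Hypothetical helper (not a standard tool): print the fstab entries whose
# backing device cannot be found. Assumes udev has populated /dev/disk/by-*.
# Usage: fstab_missing_devices [path-to-fstab]   (defaults to /etc/fstab)
fstab_missing_devices() {
    fstab="${1:-/etc/fstab}"
    # "|| [ -n "$dev" ]" also processes a last line that has no trailing newline.
    while read -r dev mnt fstype opts rest || [ -n "$dev" ]; do
        # Skip blank lines and comments.
        case "$dev" in
            ''|'#'*) continue ;;
        esac
        # Map the first fstab field to a path that should exist.
        case "$dev" in
            UUID=*)     path="/dev/disk/by-uuid/${dev#UUID=}" ;;
            LABEL=*)    path="/dev/disk/by-label/${dev#LABEL=}" ;;
            PARTUUID=*) path="/dev/disk/by-partuuid/${dev#PARTUUID=}" ;;
            /dev/*)     path="$dev" ;;
            *)          continue ;;  # tmpfs, proc, network mounts, etc.
        esac
        if [ ! -e "$path" ]; then
            echo "MISSING: $dev -> $mnt ($fstype, options: $opts)"
        fi
    done < "$fstab"
}
```

After logging in with the root password, running fstab_missing_devices with no arguments checks /etc/fstab. Any "MISSING" line is a candidate to compare against the errors shown by journalctl -xb. If you need to edit /etc/fstab and the root filesystem turns out to be mounted read-only, "mount -o remount,rw /" should make it writable first.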