I heard the term Kernel Panic for the first time today. When I woke up this morning, I found that one of my servers was offline and another server had a MongoDB crash. Definitely a bad way to start a Monday morning. I thought I would write up my experience as an article so that I can revisit it later if I need a reference, and of course it may also be useful for someone having a similar issue on their servers.
The MongoDB crash is something we deal with frequently. But the other server was not even reachable. I was unable to SSH into it, and I did not understand the options I had on Hetzner's console page.
I found that there was an option to activate KVM on my server for free for 3 hours to troubleshoot the issue. I created a support ticket with Hetzner asking them to activate KVM. They replied that my server had a Kernel panic and was unable to boot, and they activated KVM for me. When I logged into the KVM console, I was able to see only this.
This was not very helpful. I could see the words Kernel Panic there, but I was not sure what the cause was. Thanks to @rishi556, who gave me a few links to read about it. After reading them, I understood that a Kernel panic is similar to the BSOD (Blue Screen of Death) we used to have in Windows. I dealt with BSODs many times, especially when I was using Windows XP, so this now felt familiar and I could relate to it. A BSOD can have many causes, in hardware as well as in software. The hardest part is finding out what the actual cause was.
Continuing my story, I still had no clue what I should do to troubleshoot this. I was one step away from doing a hardware reset and configuring everything from scratch. That would have been a nightmare because it would again eat up several hours of my time.
Rescue system to the rescue
In the meantime I kept writing emails to Hetzner about each step, complaining that I had no clue what was happening. They helped me with prompt and proper replies. They suggested I boot into the rescue system and go through the logs to find out what the issue was. I again had no clue how to do that. I logged into Hetzner's console and activated the rescue system.
The rescue system does not clear the data on our hard disk. It lets us log in to Hetzner's rescue environment, and from there we can take a look at our hard drives. Before you log in to the rescue system, make sure you have the old password saved, because activating the rescue system gives us a new password to log in with. The next step is to reboot the server by hitting the power button on the Reset tab. I was very cautious not to hit "Execute an automatic hardware reset" by mistake.
I had to hit that power button twice. The first time I hit it, the server shut down, and when I hit it again, the server turned on. It took me a while to understand this.
From the KVM console I could see my server booting into the rescue system. Once it was ready, I logged into the rescue system via SSH with the new password that was provided. I really had no clue what to check to find out what had caused the Kernel Panic. But I was able to see my drives and other details from the rescue system. Just to try my luck, I rebooted the server again from the rescue system.
Usually, if we reboot while in the rescue system, the server boots from the hard drive the next time. I could see this happening on the KVM console. This time it did not end up in a Kernel panic.
I connected to the server with my old password to check that the connection worked. All good, and I was inside my server once again. I could see that my configuration and data were all intact. All I had to do was restart my services with pm2 resurrect. Luckily I had run pm2 save earlier.
Interesting part
Actually, I did not do a deep dive to understand what went wrong. I used my instincts from years ago, when we used to have frequent BSODs. The only solution back then was to hit the reset button on my PC to boot again. I tried a similar trick and luckily it worked.
I don't think the software I'm running is the problem, because I have the same set of software on the same operating system running on another server and I have not had any issues there. I suspect this was a temporary hardware issue. Hopefully, it was just a one-off event. Now at least I know what to do if this happens again. If it does, I will try to do a deep dive or request a change in hardware.
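A deep dive of that kind could start with the kernel logs on the server's disk, read from the rescue system. The sketch below is only an illustration in Python, not what was done during this incident. It assumes the server's root partition has already been mounted at /mnt inside the rescue system and that kernel messages are written to /var/log/kern.log or /var/log/syslog. The mount point, the file names, and the function names are all assumptions, and the pattern list is just a guess at typical crash markers.

```python
import re
from pathlib import Path

# Markers that commonly show up in Linux kernel logs around a crash.
# This list is an assumption for illustration, not an exhaustive set.
PANIC_PATTERNS = re.compile(
    r"Kernel panic|Oops:|BUG:|Call Trace:|Out of memory|Hardware Error|Machine check"
)


def find_panic_lines(lines):
    """Return (line_number, text) pairs for lines that look crash-related."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PANIC_PATTERNS.search(line)
    ]


def scan_log(path):
    """Scan one log file; return an empty list if the file does not exist."""
    log_file = Path(path)
    if not log_file.is_file():
        return []
    # errors="replace" avoids failing on stray bytes in a damaged log.
    with log_file.open(errors="replace") as handle:
        return find_panic_lines(handle)


if __name__ == "__main__":
    # Hypothetical paths: assumes the root partition is mounted at /mnt.
    for candidate in ("/mnt/var/log/kern.log", "/mnt/var/log/syslog"):
        for number, text in scan_log(candidate):
            print(f"{candidate}:{number}: {text}")
```

A panic can stop the kernel before the last messages reach the disk, so an empty result does not rule anything out. Lines mentioning hardware or machine check errors shortly before the outage would support the temporary hardware issue theory.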
The reboot trick really saved my day. I did not want to spend hours configuring the server once again. Maintaining our own servers is not an easy thing to do. Incidents like this keep reminding us that we should have more backup plans in place.
If you like what I'm doing on Hive, you can vote for me as a witness.