I don't like to anthropomorphize technology, but sometimes I swear computers and the like know very well the most inconvenient times to fail. Such as on a Saturday evening, just as I arrive in Tokyo exhausted after a trip halfway round the world. Going online to check my mail before falling into a deep, jetlagged sleep, I discovered the rental server I was running at the time was not responding. Not even to pings. It seemed to have vanished from the net completely.
The hosting company has a very good remote administrative interface which allows you to reboot the system. At this point I was assuming the server had just crashed, as it had done on one or two other occasions due to a freak combination of server load and an obscure kernel corner-case error, so I requested a reboot. Nothing happened.
Fortunately the hosting company also has a very good human support service, with someone in the hosting center even on a Saturday (and due to the time difference it was still morning over there). Within a few minutes I was able to raise someone by email and have them physically restart the server.
The server started and booted, and everything was as normal. Looking at the logs, however, I found no signs of any abnormalities which might have caused a total crash, but I was too tired to go into it any further.
Half an hour later, the ssh session was dead. top had been running, but it too showed no signs of unusual load activity. I went through the restart routine again, finally dashing off another email to the support people - this time with a request that they take a quick look at the server to see if there was anything that might be wrong.
The reply was commendably quick, along the lines that "looks OK to us, the lights were on, but we rebooted it for you anyway". The server was up and running again, and instead of retiring to my futon I kept watching it... and watching it... and it kept running as normal. So I popped out to the local konbini (convenience store) for something to keep me awake, and on my return: dead as a dodo.
Now, from previous communications with support at the hosting company, I get the impression they have to deal with a fair number of people with little experience of running servers - which, if you don't know what you're doing, are quite easy to "crash" in some way. The general tone of their mails is "we'll reboot your server as often as you want, but after a while it isn't funny any more". However, at that point I was sure the problem wasn't on my end: neither the logs nor the server while running showed any signs of unusual activity, and it just appeared to stop dead from one second to the next. Yet the support people claimed the server box appeared to be running.
OK... what else could be the problem? At this point I was pretty sure it was a hardware glitch, and started to run through the possible causes. CPU and mainboard were presumably in working order, otherwise it wouldn't have run at all after a restart. The transformer - a common cause of failure - was also presumably working; all the transformer failures I had experienced until then were total failures (either they work, or they don't - nothing in between). RAM is also a frequent source of issues - on a different server from the same company I had experienced a dodgy RAM stick, but at worst that had caused an OS-level crash and/or reboot with traces in the logs. In this case, however, the server was starting and then simply stopping after a short period of time, leaving no traces at all. Hard disks seemed to be working properly.
The other major source of hardware problems is the main moving part in a server (besides the transformer fan) - the CPU ventilator. If this had failed, that would explain the server's behaviour: it would start, but with the ventilator (placed directly above the CPU) not working (or more likely working only intermittently), the mainboard would after a while detect a rise in temperature and switch the CPU off to prevent overheating.
I dashed off a third mail to support asking them to take another look at the server, and specifically to check whether the CPU ventilator was working. Sure enough, not long afterwards came the reply "we've just replaced it". The server started up and I was finally able to get some sleep.
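In hindsight, what made this hard to diagnose was that a thermal shutdown leaves nothing in the ordinary logs. A periodic temperature log might have left a trace. The following is only a minimal sketch, not what was running on the server at the time: it assumes a Linux system that exposes readings in millidegrees Celsius under `/sys/class/thermal/thermal_zone*/temp` (not every machine does), and the function names and the 80 degree warning threshold are made up for illustration.

```python
from pathlib import Path
import time


def parse_millidegrees(raw: str) -> float:
    """Convert a sysfs reading such as '45000\n' to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_zone_temps(base: Path = Path("/sys/class/thermal")) -> dict:
    """Return {zone name: degrees Celsius} for every readable thermal zone.

    Assumption: the kernel exposes thermal_zone*/temp files under `base`.
    Zones that are missing or unreadable are silently skipped.
    """
    temps = {}
    for zone in sorted(base.glob("thermal_zone*")):
        try:
            temps[zone.name] = parse_millidegrees((zone / "temp").read_text())
        except (OSError, ValueError):
            continue
    return temps


def format_log_line(temps: dict, warn_at: float = 80.0) -> str:
    """Build one timestamped log line; warn_at is an arbitrary example threshold."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    level = "WARN" if any(deg >= warn_at for deg in temps.values()) else "ok"
    readings = " ".join(f"{name}={deg:.1f}C" for name, deg in temps.items())
    return f"{stamp} {level} {readings}"


if __name__ == "__main__":
    # Take a single reading; run it from cron (say once a minute) and
    # append the output to a file so the last lines survive a shutdown.
    print(format_log_line(read_zone_temps()), flush=True)
```

With something like this appending to a file every minute, a steadily climbing temperature in the last few lines before each silence would have pointed at cooling rather than at software.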
Posted at 2007-11-26 15:14:00