Determining cause of SSD failure after 30-ish minutes

ssd

We have a 64GB SSD drive in a tower server with a local colocation company. The drive and the entire system were built about six months ago from brand-new parts.

Until this weekend the SSD/system were working perfectly. We're running CentOS 6.2.

After booting perfectly, the system can be used for about 20-30 minutes (no real consistency in the timing) before the drive starts acting funny.

Libraries start saying they can't load, ssh starts denying public-key logins, shutdown starts reporting "input/output error", and some programs start indicating the drive is read-only.

Only 25GB of the 64GB are used.

I can't find any errors that indicate what happened. I ran fsck on the drive from a live CD and it showed no problems, and most of the time the system boots fine. One boot failed with a "couldn't find OS" message, but that hasn't happened since.
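For reference, what I ran from the live CD was roughly the following (the volume group name is taken from the fsck output further down; I'm assuming the live CD needed the LVM volumes activated first):

    # make the LVM logical volumes visible from the rescue environment
    vgchange -ay
    # force a full check even if the filesystem is flagged clean
    e2fsck -f /dev/mapper/vg_192-lv_root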

Where can I look to find logs about what's happening? Are there any other disk checks I should run? It seems like a repairable problem rather than a drive that needs replacing.
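For what it's worth, the places I've been assuming disk errors would land (an assumption about CentOS defaults, not something I've confirmed) are the kernel ring buffer and the syslog file:

    # kernel messages since boot; I/O and ATA errors normally show up here first
    dmesg | tail -n 100
    # persistent syslog on CentOS 6
    grep -iE 'error|ata|sd[a-z]' /var/log/messages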

Update:

I enabled SMART after rebooting the server. After about an hour of uptime and normal system operation (the running services are httpd and mysql, with little to no traffic), things suddenly stop working. During that hour the SMART health check returned PASS. When I tried it again after the hour (through webmin), it now says SMART is disabled.
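For reference, the command-line equivalent of what I was doing through webmin is roughly this (I'm assuming the SSD shows up as /dev/sda; adjust for the actual device):

    # turn SMART support on for the drive
    smartctl -s on /dev/sda
    # quick overall health verdict (this is the check that was returning PASS)
    smartctl -H /dev/sda
    # full identity, attributes and error log, worth saving while the drive still responds
    smartctl -a /dev/sda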

The hard drive is now showing the same issues I've seen before: most commands return "input/output error".

Running a smart health check now shows:

Log Sense failed, IE page [scsi response fails sanity test]

What can I do to figure out what's causing this to fail after a random period of time? It runs perfectly for 30-60 minutes and then it starts acting odd like this.
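Next time it happens I intend to check whether the kernel can still see the drive at all before rebooting (again assuming the device is /dev/sda):

    # is the block device still listed?
    cat /proc/partitions
    # last kernel messages; a drive that dropped off the bus usually leaves ATA/SCSI errors here
    dmesg | tail -n 30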

Update 2:

Some people requested that I try dmesg; this was the result: http://www.pastie.org/private/hk7jfhxilj7ypy828irna. Someone else recommended that I not assume it's the drive, since it could be the drive controller instead. I don't understand how to determine whether the error is in the controller or the drive, aside from trying a different drive. If I have to buy a replacement motherboard or drive, I need to know which one is failing first.
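Based on those suggestions, my rough understanding (which may be wrong) is that link resets and timeouts in the kernel log tend to implicate the controller or cabling, while media/uncorrectable errors and a growing SMART error log point at the drive itself; swapping the SATA cable and port would presumably be a cheaper first test than replacing either part. The checks I plan to run look something like this (assuming /dev/sda again):

    # ATA/SCSI layer errors: resets and timeouts suggest the link/controller,
    # media or UNC errors suggest the drive
    dmesg | grep -iE 'ata[0-9]|reset|timeout|unc|i/o error'
    # the drive's own error log, only readable while it still answers SMART commands
    smartctl -l error /dev/sda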

Running fsck shows:

fsck from util-linux-ng 2.17.2
e2fsck 1.41.12 (17-May-2010)
fsck.ext4: Superblock invalid, trying backup blocks...
fsck.ext4: Bad magic number in super-block while trying to open /dev/mapper/vg_192-lv_root

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
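For the record, the alternate-superblock attempt it suggests would look something like the lines below. The 8193 value is e2fsck's generic hint for 1 KiB block filesystems; an ext4 filesystem with 4 KiB blocks usually keeps its first backup at 32768, so I assume both are worth trying:

    # retry the filesystem check against backup superblocks
    e2fsck -b 8193 /dev/mapper/vg_192-lv_root
    e2fsck -b 32768 /dev/mapper/vg_192-lv_root
    # if the volume is readable at all, this lists where the backup superblocks actually live
    dumpe2fs /dev/mapper/vg_192-lv_root | grep -i superblock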

Best Answer

SSDs are notoriously fragile. Jeff Atwood outlines some failure rates here. They will fail without any warning and turn your data into a distant memory.

Looks like it's time to RMA and restore from backup. It shouldn't be a problem though, because you're not running a production server on a single, non-RAIDed disk, right? And you definitely have recent backups you can use to get back on your feet, right?

Right?