Our system is running well, RAID is running fine, backups are fine, and our server can send email alerts.

Hard disk failures cannot be predicted reliably (see this article published by Google), but it is still worth keeping an eye on the HDDs.

S.M.A.R.T.

Most HDDs (and SSDs) implement the S.M.A.R.T. monitoring system. It logs information into non-volatile memory on the disk, and it can be queried using smartctl.

Setup

sudo apt-get install smartmontools

I won’t explain every field in the output of sudo smartctl -a /dev/sdc, but here are some interesting attributes (see the descriptions here).

Load_Cycle_Count

This one should be watched because the number of load cycles a drive can sustain is limited (around 50’000 for a standard desktop drive, or 300’000 for a laptop hard drive). It’s clearly possible to kill a drive just by letting it go into low-power mode and waking it up again soon after, over and over (search for “WD Green Load_Cycle_Count” in your favorite search engine!).

sudo smartctl -a /dev/sdc | grep Load_Cycle_Count
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       205
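To put the raw value in perspective, here is a minimal sketch that pulls the last column out of an attribute line and compares it with a cycle budget. The function names smart_raw_value and cycles_used_percent are mine, and the budget is an assumption you have to supply yourself (I use the 50’000 figure mentioned above).

```shell
#!/bin/sh
# Print the RAW_VALUE column (10th field) of one SMART attribute.
# stdin: attribute table as printed by smartctl; $1: attribute name.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Percentage of a load-cycle budget already consumed.
# $1: raw Load_Cycle_Count, $2: budget (assumption: pick the figure for your drive).
cycles_used_percent() {
    awk -v n="$1" -v max="$2" 'BEGIN { printf "%.1f\n", 100 * n / max }'
}

# Sample line copied from the output above, so the sketch runs without a disk.
sample='193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       205'
raw=$(printf '%s\n' "$sample" | smart_raw_value Load_Cycle_Count)
echo "raw=$raw used=$(cycles_used_percent "$raw" 50000)%"   # prints raw=205 used=0.4%
```

On a real system the sample line would be replaced by something like sudo smartctl -A /dev/sdc | smart_raw_value Load_Cycle_Count.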

Power_On_Hours

sudo smartctl -a /dev/sdc | grep Power_On_Hours
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       15818

It seems that my HDD has been running for almost 2 years!
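The "almost 2 years" comes from a simple division. As a small sketch, assuming the raw value really is in hours (as the attribute name suggests) and using 365-day years; hours_to_years is just a helper name I made up:

```shell
#!/bin/sh
# Convert a Power_On_Hours raw value to years (365-day years, one decimal).
# Assumption: the raw value is expressed in hours.
hours_to_years() {
    awk -v h="$1" 'BEGIN { printf "%.1f\n", h / 24 / 365 }'
}

hours_to_years 15818   # prints 1.8
```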

Reallocated_Sector_Ct and Reallocated_Event_Count

sudo smartctl -a /dev/sdc | grep -E 'Reallocated_Sector_Ct|Reallocated_Event_Count'
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0

It seems the disk has no surface problems ;)
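If you would rather have a script tell you this, the check could look like the sketch below. It only looks at the raw value (last column) of the two attributes; no_reallocations is a hypothetical helper name, and treating any non-zero raw value as a problem is my own assumption, not an official threshold.

```shell
#!/bin/sh
# Exit status 0 if both reallocation counters have a raw value of 0, 1 otherwise.
# stdin: attribute table as printed by smartctl.
no_reallocations() {
    awk '$2 == "Reallocated_Sector_Ct" || $2 == "Reallocated_Event_Count" {
             if ($10 + 0 > 0) bad = 1
         }
         END { exit bad + 0 }'
}

# Sample lines copied from the output above, so the sketch runs without a disk.
sample='  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0'

if printf '%s\n' "$sample" | no_reallocations; then
    echo "no reallocated sectors"
else
    echo "reallocations detected"
fi
```

With a real disk, the input would come from sudo smartctl -A /dev/sdc instead of the sample.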

For SSD: Wear_Leveling_Count and Total_LBAs_Written

sudo smartctl -a /dev/sda | grep -E 'Wear_Leveling_Count|Total_LBAs_Written'
177 Wear_Leveling_Count     0x0013   091   091   000    Pre-fail  Always       -       179
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       30052738530

According to this online calculator, I have already written 13.99 TiB (the raw value counts 512-byte LBAs), and the estimated wear level is 91% (the normalized Wear_Leveling_Count value). It would be better called a health level, since 100% means no wear and the drive should be dead at 0%.
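The same number can be obtained without an online calculator. A minimal sketch, assuming one LBA is 512 bytes (this is what makes the figure match; other drives may count in different units), with lbas_written being a helper name of my own:

```shell
#!/bin/sh
# Convert a Total_LBAs_Written raw value to TiB (2^40 bytes) and TB (10^12 bytes).
# $1: raw LBA count, $2: bytes per LBA (assumed to be 512 when omitted).
lbas_written() {
    awk -v lbas="$1" -v size="${2:-512}" 'BEGIN {
        bytes = lbas * size
        printf "%.2f TiB (%.2f TB)\n", bytes / 2^40, bytes / 10^12
    }'
}

lbas_written 30052738530   # prints 13.99 TiB (15.39 TB)
```

The 13.99 figure is therefore in binary units (TiB); in decimal terabytes the same amount of data is about 15.39 TB.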

hddtemp

As you have seen in the output of smartctl, most HDDs have temperature sensors, and this information can also be read using hddtemp.

Setup:

sudo apt-get install hddtemp

Use:

hddtemp /dev/sd{c,d,e,f}
/dev/sdc: WDC WD20EFRX-68EUZN0: 29°C
/dev/sdd: WDC WD20EFRX-68EUZN0: 29°C
/dev/sde: WDC WD20EFRX-68EUZN0: 29°C
/dev/sdf: WDC WD20EFRX-68EUZN0: 29°C

Some disks will fail (for example, this SSD):

hddtemp /dev/sda
WARNING: Drive /dev/sda doesn't seem to have a temperature sensor.
WARNING: This doesn't mean it hasn't got one.
WARNING: If you are sure it has one, please contact me (hddtemp@guzu.net).
WARNING: See --help, --debug and --drivebase options.
/dev/sda: Samsung SSD 850 EVO 120G B              :  no sensor

Another failing example (an external USB disk):

hddtemp /dev/sdg
/dev/sdg: Seagate Expansion:  drive supported, but it doesn't have a temperature sensor.
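Since the output mixes real readings with "no sensor" messages, a script that reacts to temperatures has to skip the lines without a number. Here is a small sketch of such a filter; hot_disks is a name I made up, and the limit is whatever you consider too hot for your drives.

```shell
#!/bin/sh
# Print the devices whose reported temperature exceeds a limit.
# stdin: hddtemp output; $1: limit in degrees Celsius.
# Lines without a numeric reading ("no sensor", warnings) are skipped.
hot_disks() {
    awk -F': ' -v limit="$1" '$NF ~ /^[0-9]+/ && $NF + 0 > limit { print $1, $NF }'
}

# Sample lines copied from the outputs above, so the sketch runs without a disk.
sample='/dev/sdc: WDC WD20EFRX-68EUZN0: 29°C
/dev/sda: Samsung SSD 850 EVO 120G B              :  no sensor'

# The limit is set artificially low here so that the sample triggers.
printf '%s\n' "$sample" | hot_disks 25   # prints /dev/sdc 29°C
```

In real use the input would be piped from hddtemp itself, for example hddtemp /dev/sdc /dev/sdd | hot_disks 45.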

Monitoring

Both smartmontools and hddtemp can be configured to periodically watch the health of the HDDs; see /etc/default/smartmontools and /etc/default/hddtemp.
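As a rough idea of what these two files can contain, here is a sketch. The variable names are the ones I know from the Debian packages and are an assumption for your system: they may differ between releases, and some may be ignored on systemd-based installs, so check the comments in your own copies before relying on them. The disk list and the intervals are just example values.

```shell
# --- /etc/default/smartmontools (sketch, names assumed from the Debian package) ---
# Start the smartd monitoring daemon at boot.
start_smartd=yes
# Extra options passed to smartd: poll the disks every 1800 seconds.
smartd_opts="--interval=1800"

# --- /etc/default/hddtemp (sketch, names assumed from the Debian package) ---
# Run hddtemp as a daemon.
RUN_DAEMON="true"
# Disks to watch (example list, adapt to your machine).
DISKS="/dev/sdc /dev/sdd /dev/sde /dev/sdf"
# Log the temperatures to syslog every 600 seconds (0 disables logging).
RUN_SYSLOG="600"
```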
