ubuntu 18.04 hw2018 server backup
Our system is running well, RAID is running fine, backup are fine, and our server can send email alerts.
Hard disk failure may not be predicted reliably (see this article published by google), it may be interesting to keep an eye on HDD.
S.M.A.R.T.
Most HDDs (and SSDs) implements the S.M.A.R.T.
monitoring system. This system logs informations into some non-volatile memory in
the disk, and can be queried usint smartctl
.
Setup
sudo apt-get install smartmontools
I won’t explain every fields in the output of sudo smartctl -a /dev/sdc
, but
here are some interesting attributes see description here.
Load_Cycle_Count
This one should be watched because this number is limited (around 50’000 for a standard desktop drive, or 300’000 for a laptop hard drive). It’s clearly possible to kill a drive just by letting it go to low power and restart it soon after (search for “WD Green Load_Cycle_Count” in your favorite search engine!).
sudo smartctl -a /dev/sdc | grep Load_Cycle_Count
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 205
Power_On_Hours
sudo smartctl -a /dev/sdc | grep Load_Cycle_Count
9 Power_On_Hours 0x0032 079 079 000 Old_age Always - 15818
It seems that my HDD is running for almost 2 years!
Reallocated_Sector_Ct and Reallocated_Event_Count
sudo smartctl -a /dev/sdc | grep -E 'Reallocated_Sector_Ct|Reallocated_Event_Count'
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
Seems the disk a no surface problems ;)
For SSD: Wear_Leveling_Count and Reallocated_Event_Count
sudo smartctl -a /dev/sda | grep -E 'Wear_Leveling_Count|Total_LBAs_Written'
177 Wear_Leveling_Count 0x0013 091 091 000 Pre-fail Always - 179
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 30052738530
According to this online calculator, I have already written 13.99 TB, and the estimated wear level is 91% (Wear_Leveling_Count value). Or it should be called health level since at 100% it has no wear, and should be dead at 0%.
hddtemp
As you have seen in the output of smartctl
, most HDD have temperature sensors,
and this information can be accessed using hddtemp
.
Setup:
sudo apt-get install hddtemp
Use:
hddtemp /dev/sd{c,d,e,f}
/dev/sdc: WDC WD20EFRX-68EUZN0: 29°C
/dev/sdd: WDC WD20EFRX-68EUZN0: 29°C
/dev/sde: WDC WD20EFRX-68EUZN0: 29°C
/dev/sdf: WDC WD20EFRX-68EUZN0: 29°C
Some disk will fail (example this ssd:)
hddtemp /dev/sda
WARNING: Drive /dev/sda doesn't seem to have a temperature sensor.
WARNING: This doesn't mean it hasn't got one.
WARNING: If you are sure it has one, please contact me (hddtemp@guzu.net).
WARNING: See --help, --debug and --drivebase options.
/dev/sda: Samsung SSD 850 EVO 120G B : no sensor
Another failing example (external USB disk)
hddtemp /dev/sdg
/dev/sdg: Seagate Expansion: drive supported, but it doesn't have a temperature sensor.
Monitoring
Both smartmontools
and hddtemp
can be configured to periodicaly watch the
HDDs health, see /etc/default/smartmontools
and /etc/default/hddtemp
.
~~~
Question, remark, bug? Don't hesitate to contact me or report a bug.