smartmontools: Should I replace my SSHD?

Question

Today, while I was watching a video in Firefox, the following window suddenly popped up:

Or the output from GSmartControl:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-4.19.0-22-amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Laptop SSHD
Device Model:     ST500LM000-1EJ162-SSHD
Serial Number:    W3715AR9
LU WWN Device Id: 5 000c50 06e236b9f
Firmware Version: HPD3
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Oct 23 14:41:09 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM level is:     254 (maximum performance)
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 634) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  99) minutes.
SCT capabilities:              (0x1081) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   118   099   006    -    195697992
  3 Spin_Up_Time            PO---K   099   099   000    -    0
  4 Start_Stop_Count        -O--CK   093   093   020    -    7676
  5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
  7 Seek_Error_Rate         POSR-K   082   060   030    -    4473742513
  9 Power_On_Hours          -O--CK   087   087   000    -    11853
 10 Spin_Retry_Count        PO--CK   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   093   093   020    -    7668
180 Unknown_HDD_Attribute   -O-R-K   100   100   000    -    64025461
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        PO--CK   100   100   097    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   099   000    -    2
189 High_Fly_Writes         -O-RCK   063   063   000    -    37
190 Airflow_Temperature_Cel -O---K   069   055   045    -    31 (Min/Max 28/32)
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    228
193 Load_Cycle_Count        -O--CK   097   097   000    -    7777
194 Temperature_Celsius     -O---K   031   045   000    -    31 (0 14 0 0 0)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O--CK   100   100   000    -    16
198 Offline_Uncorrectable   ----CK   100   100   000    -    16
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
254 Free_Fall_Sensor        -O--CK   100   100   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x24       GPL     R/O   1223  Current Device Internal Status Data log
0x25       GPL     R/O   1223  Saved Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1       GPL,SL  VS      20  Device vendor specific log
0xa2       GPL     VS    3900  Device vendor specific log
0xa8       GPL,SL  VS     129  Device vendor specific log
0xa9       GPL,SL  VS       1  Device vendor specific log
0xab       GPL     VS       1  Device vendor specific log
0xae       GPL     VS       1  Device vendor specific log
0xb0       GPL     VS    4580  Device vendor specific log
0xb6       GPL     VS    1918  Device vendor specific log
0xbe-0xbf  GPL     VS   65535  Device vendor specific log
0xc1       GPL,SL  VS      10  Device vendor specific log
0xc2       GPL,SL  VS      50  Device vendor specific log
0xc4       GPL,SL  VS       5  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
Device Error Count: 1
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 [0] occurred at disk power-on lifetime: 8134 hours (338 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 00 a0 3a 40 00 00  Error: UNC at LBA = 0x00a03a40 = 10500672

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 00 2a 00 00 00 a0 3a 40 e0 00     01:31:49.827  READ DMA EXT
  25 00 00 00 35 00 00 00 a0 42 0b e0 00     01:31:49.348  READ DMA EXT
  25 00 00 00 0b 00 00 00 a0 42 00 e0 00     01:31:49.345  READ DMA EXT
  25 00 00 00 15 00 00 03 93 ac 6b e0 00     01:31:49.342  READ DMA EXT
  25 00 00 00 2b 00 00 03 93 ac 40 e0 00     01:31:49.339  READ DMA EXT

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     11852         -
# 2  Short offline       Completed without error       00%     11847         -
# 3  Short offline       Completed without error       00%     11844         -
# 4  Short offline       Completed without error       00%     11835         -
# 5  Short offline       Completed without error       00%     11830         -
# 6  Short offline       Completed without error       00%     11823         -
# 7  Short offline       Completed without error       00%     11818         -
# 8  Short offline       Completed without error       00%     11814         -
# 9  Short offline       Completed without error       00%     11806         -
#10  Short offline       Completed without error       00%     11801         -
#11  Short offline       Completed without error       00%     11792         -
#12  Short offline       Completed without error       00%     11790         -
#13  Short offline       Completed without error       00%     11780         -
#14  Short offline       Completed without error       00%     11772         -
#15  Short offline       Completed without error       00%     11765         -
#16  Short offline       Completed without error       00%     11756         -
#17  Short offline       Completed without error       00%     11751         -
#18  Short offline       Completed without error       00%     11747         -
#19  Short offline       Completed without error       00%     11740         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       522 (0x020a)
Device State:                        Active (0)
Current Temperature:                    31 Celsius
Power Cycle Min/Max Temperature:     25/32 Celsius
Lifetime    Min/Max Temperature:     16/44 Celsius
Under/Over Temperature Limit Count:   0/2

SCT Data Table command not supported

SCT Error Recovery Control command not supported

Device Statistics (GP/SMART Log 0x04) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2            3  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS

Also today, Linux failed to boot on the first attempt. I restarted the boot and it worked without problems. This happened before the warning popped up, so I have no idea whether the boot issue has anything to do with the smartmontools error.

Confusing: the report contains the line "Error 1 [0] occurred at disk power-on lifetime: 8134 hours (338 days + 22 hours)", but no date. I expected a date for the error, so that I could compare it with today's date and tell whether the error happened today. Since I could not find a date anywhere in the output, I looked for the current lifetime of my SSHD, i.e. the number of hours it has run so far, to compare it with the 8134 hours at which the error occurred. But I did not find that either.

Which host's syslog is meant? Maybe this one: /var/log/syslog ?

If yes: Here it is: https://workupload.com/file/NVD2gpdrvHp

But my main question is: Is there a high risk that my SSHD will die soon?

The warning says that the hard disk health status has changed. But where can I find the current health status?

Thank you.

Your topic asks whether you should replace your sshd? The answer to that is probably no, but it's completely unrelated to anything in the actual question.
– Henrik supports the community, Commented Oct 24, 2022 at 9:17
@Henriksupportsthecommunity In this context, SSHD would mean a hybrid disk, or in other words, an HDD assisted by an integrated SSD component. That's just how Seagate has chosen to name them.
– telcoM, Commented Oct 24, 2022 at 10:31
Ohh, I wasn't aware of that, then the usage of that makes sense.
– Henrik supports the community, Commented Oct 24, 2022 at 11:42

Vlastimil Burián · Accepted Answer · 2022-10-25 07:09:11Z

Offline Uncorrectable Sectors

From the image you have posted, and also from the text output, there are already 16 unreadable/unwritable sectors.

As a past worker in data recovery, I recommend using ddrescue (man page) to copy the healthy remaining parts of your disk to some external medium ASAP.

At this point, the SMART PASSED status is irrelevant, and so is the power-on hours (POH) count.

Now that you have used ddrescue and can confirm there is an actual problem, a separate question is which files are affected. You cannot find that out from ddrescue's log file.

You need to successfully mount the ddrescue image, as root:

# SECTOR_SIZE is usually 512; START_SECTOR is the partition's first sector (see: fdisk -l /path/to/ddrescue/image)
mount -o ro,loop,offset=$(( SECTOR_SIZE * START_SECTOR )) /path/to/ddrescue/image /mnt/point/
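As a worked example of the offset arithmetic in the mount command above, here is a minimal sketch. The helper name partition_offset_bytes and the start sector 2048 are assumptions for illustration; read the real start sector from the partition table of your own image.

```sh
#!/bin/sh
# Hypothetical helper: byte offset of a partition inside a disk image.
# $1 = sector size in bytes, $2 = first sector of the partition.
partition_offset_bytes() {
    echo $(( $1 * $2 ))
}

# Assumed example values: 512-byte sectors, partition starting at sector 2048.
partition_offset_bytes 512 2048
```

The printed number is what would go after offset= in the mount options.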

Find errors = files affected:

cp -PRv /mnt/point/ /path/to/extracted/files/ 2>>/path/to/extracted/files/ERRORS.txt

These are just examples. Always double check paths and do not copy-paste.
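Once ERRORS.txt exists, the affected paths can be pulled out of it. The following is only a sketch: the helper name affected_files is made up, and it assumes GNU cp messages of the form "cp: error reading '<path>': Input/output error" with plain ASCII quotes. Other cp versions or locales may word or quote the message differently.

```sh
#!/bin/sh
# Hypothetical helper: list the paths that cp failed to read, taken from
# the stderr log written by the cp command above.
# Assumed message format (GNU cp, ASCII quotes):
#   cp: error reading '/mnt/point/some/file': Input/output error
affected_files() {
    grep 'Input/output error' "$1" |
        sed -e "s/^cp: [^']*'//" -e "s|': Input/output error\$||"
}

# Demonstration with made-up log content.
sample=$(mktemp)
cat > "$sample" <<'EOF'
cp: error reading '/mnt/point/videos/holiday.mkv': Input/output error
cp: cannot open '/mnt/point/private/key' for reading: Permission denied
EOF
affected_files "$sample"
rm -f "$sample"
```

Only the first sample line is reported, because the second one is a permission problem rather than a read error.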

The result of ddrescue (without options): imgur.com/QnjymGo - How to find out now which files are affected?
– Wogehu, Commented Oct 24, 2022 at 21:17
@Wogehu You cannot find the affected files until you create a ddrescue image, mount it, and copy all files from it elsewhere with errors logged; that would be a separate question.
– Vlastimil Burián, Commented Oct 25, 2022 at 4:56
I have made my ddrescue image, but there are no errors logged in the logfile: workupload.com/file/QMe8gbu2MnA
– Wogehu, Commented Oct 25, 2022 at 6:54
@Wogehu I did not mean the log file, which serves only ddrescue's own purposes. I will edit my answer to make things clearer for you.
– Vlastimil Burián, Commented Oct 25, 2022 at 6:58

frostschutz · Accepted Answer · 2022-10-23 14:22:36Z

The drive itself does not know any date, nor is there a way to set one. It simply counts its own power on hours, and even that counter may be a rough one and not count correctly, if the drive only ever runs a few minutes at a time.

Your current Power On Hours is 11853 so maybe you can deduce the date based on average time this system is running per day. Or maybe you are logging the Power On Hours value somewhere else, so you could deduce a more exact date that way.
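The arithmetic behind that estimate can be sketched as follows. The two hour counts come from the smartctl output in the question. The 8 hours of use per day is purely an assumption for illustration, and hours_since_error is a made-up helper name.

```sh
#!/bin/sh
# Values taken from the smartctl output in the question.
current_poh=11853   # raw value of attribute 9 (Power_On_Hours)
error_poh=8134      # "Error 1 [0] occurred at disk power-on lifetime: 8134 hours"

# Assumption for illustration only: the machine runs about 8 hours per day.
hours_per_day=8

# Hypothetical helper: power-on hours elapsed between two counter readings.
hours_since_error() {
    echo $(( $1 - $2 ))
}

elapsed=$(hours_since_error "$current_poh" "$error_poh")
days=$(( elapsed / hours_per_day ))
echo "Error logged about $elapsed power-on hours ago (roughly $days days at $hours_per_day h/day)"
```

With a different daily usage the day count changes accordingly, which is why this can only ever give a rough date.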

Your drive has unreadable (pending, uncorrectable) sectors so it's possible you already lost some data. Do you have any backups to compare with, or checksums you could check?

Personally I would replace it first (use ddrescue to handle read errors) and then test it more thoroughly. Error counters reported by SMART are always minimum values, i.e. problems the drive encountered without deliberately looking for them.

So there could still be many more errors currently not being reported.

In the future, also consider running long self-tests (or selective self-tests) as the short test may not be enough to detect read errors reliably.
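For the selective variant, smartctl accepts an LBA range as -t select,N-M. The sketch below only prints such a command for a window around the LBA from the error log (10500672). The helper name selective_test_cmd, the window size, and /dev/sda are assumptions, and the range is not checked against the size of the disk.

```sh
#!/bin/sh
# Hypothetical helper: print (not run) a selective self-test command that
# covers RADIUS sectors on each side of a suspect LBA.
# $1 = suspect LBA, $2 = radius in sectors, $3 = device node.
selective_test_cmd() {
    lba=$1; radius=$2; dev=$3
    start=$(( lba > radius ? lba - radius : 0 ))   # do not go below LBA 0
    end=$(( lba + radius ))                        # not clamped to the disk size
    echo "smartctl -t select,${start}-${end} ${dev}"
}

# LBA from the error log in the question; radius and device are assumed.
selective_test_cmd 10500672 100000 /dev/sda
```

Review the printed command, then run it as root and read the result later with smartctl -l selftest on the same device.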

Can one see in the GSmartControl output whether the SSD part or the HDD part of the SSHD is affected?
– Wogehu, Commented Oct 23, 2022 at 16:16

dirkt · Accepted Answer · 2022-10-24 08:41:45Z

I would be worried in particular about this:

 7 Seek_Error_Rate POSR-K 082 060 030 - 4473742513

You have a significant seek error rate (which has been worse in the past).

One uncorrectable error for a block can happen, and is nothing to worry about by itself, and even the 16 pending ones can happen, but based on the seek error rate, I wouldn't trust this drive, and when these drives fail, they usually fail quickly, to a significant degree, and surprisingly.

Run a badblocks scan, run a long self-test, and decide what to do based on the outcome. This disk may be fine for system files (or anything else you can recover easily), but I probably wouldn't put important data on it.

Which host's syslog is meant? /var/log/syslog?

Yes. It will likely show the same error that's in the internal log, an uncorrectable READ DMA EXT at LBA 0x00a03a40.

I was looking for the actual lifetime of my sshd

 9 Power_On_Hours -O--CK 087 087 000 - 11853

SMART values are normalized to 100 (lower is worse), and when they go below the indicated threshold, the drive is considered "failing". That's why your drive still passes: All values are above the threshold.
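To see how far each attribute is from its threshold, the VALUE and THRESH columns can be compared directly. This is a sketch that assumes the column layout of the smartctl -x output in the question (ID, name, flags, VALUE, WORST, THRESH, FAIL, raw). The helper name attribute_margins is made up, and other smartctl options print a different layout.

```sh
#!/bin/sh
# Hypothetical helper: for each attribute row on stdin, print
#   NAME VALUE THRESH MARGIN STATUS
# An attribute counts as failing when VALUE is at or below THRESH.
# Attribute rows are recognised by their six-character flags column.
attribute_margins() {
    awk 'length($3) == 6 && $3 ~ /^[-POSRCK]+$/ {
        status = ($4 + 0 <= $6 + 0) ? "FAILING" : "ok"
        print $2, $4 + 0, $6 + 0, $4 - $6, status
    }'
}

# Three rows copied from the output in the question.
attribute_margins <<'EOF'
  7 Seek_Error_Rate         POSR-K   082   060   030    -    4473742513
  9 Power_On_Hours          -O--CK   087   087   000    -    11853
197 Current_Pending_Sector  -O--CK   100   100   000    -    16
EOF
```

For a live drive, the output of smartctl -x could be piped into the function instead of the sample rows.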

It is still working, it has a few bad blocks (which can happen), and it's possible that once you reallocate those blocks it will be fine for quite some time. So you can still use it, but as I wrote, when it fails, it will probably fail suddenly, as the high seek error rate already indicates some problem (probably mechanical).

I think I will replace this SSHD as soon as possible. But why is the "SMART overall-health self-assessment test result" nevertheless "PASSED"? I always thought this meant that the SSHD is healthy.
– Wogehu, Commented Oct 23, 2022 at 20:11
The "passed" or "failed" is a decision the firmware makes, you'd have to ask the manufacturer to get exact information how a specific drive decides that (and you're unlikely to get an answer). It might have to do with the fact that the drive probably never executed a self-test, so the self-test status is still good.
– Guntram Blohm, Commented Oct 24, 2022 at 10:48

Austin Hemmelgarn · Accepted Answer · 2022-10-24 12:35:50Z

Probably, but run a proper test first.

Specifically, you want a long self-test of the disk. As root, run smartctl -t long /dev/sda from a terminal (assuming the drive is /dev/sda), then come back in roughly an hour and forty minutes and check the output of GSmartControl again.

This will force the disk firmware to run its own internal test suite, and should result in some changes in the output of GSmartControl. In particular, you are looking for any of:

The ‘SMART overall-health self-assessment test result’ changing to something other than PASSED.
An increase in the raw values of any of attributes 5, 196, 197, or 198.
One or more additional errors in the ‘SMART Extended Comprehensive Error Log’ section of the output.
A new entry in the ‘SMART Extended Self-test Log’ section showing something other than a - in the LBA_of_first_error column.

If you see any of those things after running the extended self test, you should look at replacing the drive immediately.

If you see none of those things after running the extended self test, still consider replacing the drive, but it’s probably not as urgent. Definitely keep monitoring it though.
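Two of these checks can be scripted if you save the smartctl -x output to a file before and after the test. The sketch below assumes the attribute and self-test log layouts shown in the question. The function names and the "after" numbers in the demonstration are made up.

```sh
#!/bin/sh
# Assumes two saved `smartctl -x` reports in the layout from the question.

# Hypothetical helper: print watched attributes (5, 196, 197, 198) whose
# raw value is larger in the second report than in the first.
changed_attrs() {
    awk '
        length($3) == 6 && $3 ~ /^[-POSRCK]+$/ && $1 ~ /^(5|196|197|198)$/ {
            if (FILENAME == ARGV[1]) before[$1] = $8
            else if ($8 + 0 > before[$1] + 0) print $2 ": " before[$1] " -> " $8
        }
    ' "$1" "$2"
}

# Hypothetical helper: print self-test log rows whose last column
# (LBA_of_first_error) is something other than "-".
selftest_failures() {
    awk '$1 ~ /^#/ && $NF != "-"' "$1"
}

# Demonstration; the "after" report is invented for illustration.
before=$(mktemp); after=$(mktemp)
cat > "$before" <<'EOF'
197 Current_Pending_Sector  -O--CK   100   100   000    -    16
# 1  Short offline       Completed without error       00%     11852         -
EOF
cat > "$after" <<'EOF'
197 Current_Pending_Sector  -O--CK   100   100   000    -    24
# 1  Extended offline    Completed: read failure       90%     11860         10500672
# 2  Short offline       Completed without error       00%     11852         -
EOF
changed_attrs "$before" "$after"
selftest_failures "$after"
rm -f "$before" "$after"
```

Any line printed by either function corresponds to one of the warning signs listed above.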

But what about that logged error?

The drive has spent 11853 hours powered on (raw value of attribute 9; this can also be inferred from the ‘SMART Extended Self-test Log’), so this error, logged at 8134 hours, happened about 3700 power-on hours ago and can probably be safely ignored.

As a quick bit of background, this stuff is not listed with dates because there is no way for the system to map these numbers to exact dates. The drive has no internal clock, so it can’t record dates itself, and the system itself has no idea how much time the drive has spent powered off (which would be required to map the time spent powered on to a specific date and time).

What about the Offline Uncorrectable Sectors / Current Pending Sectors?

These metrics actually highlight one of the big issues with SMART. Because you only get a point-in-time snapshot of the current values with no historical data, and there are no timestamps for when the last change in the counter happened, there is no way to differentiate between events that happened in the distant past and those that happened recently, or between sudden changes and steady increases.

These particular metrics are one where this differentiation actually matters. If you get a sudden unexpected jump in either of these numbers (or the count of reallocated sectors), or they are increasing steadily, then those situations are concerning. If you get only one or two over the course of hundreds of hours, and it mostly just stays the same, then it’s not really as much of an issue (still something to watch, but it’s not going to eat your babies).

For your particular case you’re probably fine (you’re nowhere near what typical drives have available as backup sectors for reallocation) unless the number keeps changing or jumps up again suddenly.
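One way to get the missing history is to record the counters yourself at regular intervals. The following sketch turns one report into a single CSV line. It assumes the smartctl -x attribute layout from the question. The helper name smart_history_line and the log path mentioned in the comment are hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: read one `smartctl -x` report on stdin and print
#   TIMESTAMP,raw5,raw196,raw197,raw198
# $1 = timestamp string to put in the first column.
smart_history_line() {
    awk -v ts="$1" '
        length($3) == 6 && $3 ~ /^[-POSRCK]+$/ { raw[$1] = $8 }
        END { print ts "," raw[5] "," raw[196] "," raw[197] "," raw[198] }
    '
}

# Possible use from a daily job, run as root (the path is an assumption):
#   smartctl -x /dev/sda | smart_history_line "$(date +%F)" >> /var/log/smart-history.csv

# Demonstration with rows copied from the question.
smart_history_line 2022-10-23 <<'EOF'
  5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O--CK   100   100   000    -    16
198 Offline_Uncorrectable   ----CK   100   100   000    -    16
EOF
```

A file of such lines makes it possible to tell a sudden jump from a slow, steady increase.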

Then why do you suggest replacement if everything is probably fine?

However, there are other things that are potentially concerning here. The biggest issue I see is the particularly high seek error rate (attribute 7). This is almost never zero, but it’s unusual for it to be high enough that the normalized attribute value dips below about 90. In most cases, this is indicative of mechanical issues inside the drive itself, which in turn is a pretty reliable indicator of impending failure. You also have a non-zero number of high-fly writes (also generally indicative of mechanical issues).

Given this, I would seriously consider at least starting to plan out replacement of this drive (with an SSD if possible; they solve most of the issues with using traditional hard drives in a laptop, and should both speed things up and give you a slight boost to battery life). You absolutely want to replace it before it fails, though: mechanical failures of hard drives are almost always sudden and catastrophic, and it’s often not possible to actually recover any data afterwards.

This was very helpful. After the long self-test I now have 20 new errors and an entry in the "LBA_of_first_error" column.
– Wogehu, Commented Oct 25, 2022 at 7:10
