T420 SMART and i915 challenges
Continuing the topic of old(er) hardware, my focus now turned to my wife's Lenovo ThinkPad T420. I recently replaced its old 2.5 inch hard drive (spinning rust, 320GB) with a Crucial MX500 1TB SSD. As expected, the performance increase was stellar (mostly because the old hard drive was really old and at the end of its life). After about a week, maybe two, the new SSD gave a SMART error and VLC crashed the graphical user interface when playing an MPG video.
SMART scsi errors
When I bought the SSD I ran a short and a long SMART test to make sure the disk was OK. When my wife told me she got a SMART error (a GUI tool popped up; she's rocking Ubuntu MATE), I had to take a look. In the SMART information I suddenly saw a "scsi error badly formed scsi parameters" line which I had never seen before. And I know I checked the drive beforehand. Will this be my first Crucial DOA?
# smartctl -a /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-72-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron BX/MX1/2/3/500, M5/600, 1100 SSDs
Device Model:     CT1000MX500SSD1
Serial Number:    1204A4E8045B
LU WWN Device Id: 5 00a074 1e4e8043b
Firmware Version: M3CR033
User Capacity:    1.000.204.886.016 bytes [1,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat May 8 09:17:13 2021 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x03) Offline data collection activity
                                        is in progress.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                (   93) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  30) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x0031) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       12
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       13
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   100   100   000    Old_age   Always       -       1
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       0
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       27
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   075   059   000    Old_age   Always       -       25 (Min/Max 0/41)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   100   100   001    Old_age   Offline      -       0
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       746175948
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       6982955
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       5272629

Read SMART Error Log failed: scsi error badly formed scsi parameters

Read SMART Self-test Log failed: scsi error badly formed scsi parameters

Read SMART Selective Self-test Log failed: scsi error badly formed scsi parameters
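The symptom is the set of "Read SMART ... failed" messages at the end of that output, while the overall health result still says PASSED. A minimal sketch for spotting them in a script follows; the function name failed_smart_logs is made up for illustration, and it only looks at text on stdin, so it makes no claim about why a log could not be read.

```shell
#!/bin/sh
# Report which SMART logs smartctl could not read.
# Reads `smartctl -a` output on stdin, prints one line per log that
# failed to read, and returns non-zero if there was at least one.
# (failed_smart_logs is a hypothetical helper, not part of smartmontools.)
failed_smart_logs() {
    # Matching lines look like:
    #   Read SMART Error Log failed: scsi error badly formed scsi parameters
    failed=$(grep -E '^Read SMART .* failed:' |
        sed -E 's/^Read (SMART .*) failed:.*/\1/')
    if [ -n "$failed" ]; then
        printf '%s\n' "$failed"
        return 1
    fi
    return 0
}

# Usage (needs root and smartmontools; /dev/sda is the disk from this post):
#   smartctl -a /dev/sda | failed_smart_logs
```

For the output above this would print the names of the three logs (error log, self-test log, selective self-test log) and return a non-zero status.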
Multiple posts can be found online from people who have this issue (including a ticket at smartmontools), but one post on the openSUSE forums by hawake turned out to be the solution (so far, for me): turn on AHCI. I had not bothered to look at the BIOS settings, since I assumed AHCI would be enabled by default. Then I remembered this is old(er) hardware... The BIOS was set to "Compatibility" and after changing it to "AHCI", smartctl worked fine again.
BIOS -> Config -> Serial ATA (SATA) -> SATA Controller Mode Option -> AHCI
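To confirm from within Linux that the BIOS change took effect, one option is to check which kernel driver is bound to the storage controller. The sketch below is built on assumptions: that lspci -k labels the controller "SATA controller" in AHCI mode or "IDE interface" in compatibility mode, and that the driver shows up as ahci or ata_piix respectively. Those strings are not taken from this T420, and sata_driver is a made-up helper name.

```shell
#!/bin/sh
# Print the kernel driver bound to each SATA/IDE storage controller.
# Reads `lspci -k` output on stdin.
# (sata_driver is a hypothetical helper; the class names matched below
# are assumptions about how lspci labels the controller.)
sata_driver() {
    awk '
        # Device header lines start with the bus address, e.g. "00:1f.2 ...";
        # detail lines are indented. Remember whether the current device
        # is a storage controller.
        /^[0-9a-f]/ { storage = ($0 ~ /SATA controller|IDE interface/) }
        # Print only the driver name (last field) for storage controllers.
        storage && /Kernel driver in use:/ { print $NF }
    '
}

# Usage:
#   lspci -k | sata_driver
```

If the assumptions hold, this prints ahci once the controller mode is set to AHCI.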
VLC and i915
While watching a video with VLC (I believe it was an MPEG file), my wife complained about the system freezing up. Wait, this is a Linux system, VLC is my default media player and this has not happened before, so she must be mistaken, right? Unfortunately no, she was right, horribly right even. I reproduced the error and indeed, the entire graphical user interface froze. I could 'escape' to a VT, but that was about it. Looking at the logs, I found this:
kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 6:1:f100ff7b, in vlc [9334]
kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
kernel: i915 0000:00:02.0: [drm] vlc[9334] context reset due to GPU hang
kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 6:1:7100ff7b, in vlc [9334]
kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
kernel: i915 0000:00:02.0: [drm] vlc[9334] context reset due to GPU hang
A serious GPU hang in the (in my opinion) usually rock-stable Intel i915 kernel module. No way I am going to be able to fix that (nor will anyone want to fix it for old hardware, probably). Upgrading from 20.04 to 20.10 and then to 21.04 did not fix the issue. So I worked around it by letting her use the Celluloid media player, which was installed by default.
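To check whether another player triggers the same hang, the GPU HANG lines can be pulled out of the kernel log. This is a small sketch that assumes the message format shown in the log excerpt from this post; gpu_hangs is a made-up helper name and other kernel versions may word the message differently.

```shell
#!/bin/sh
# List GPU hangs reported by the i915 driver.
# Reads kernel log lines on stdin and prints "process pid ecode"
# for every "GPU HANG" message.
# (gpu_hangs is a hypothetical helper; the pattern is based only on the
# log lines quoted in this post.)
gpu_hangs() {
    sed -n -E 's/.*GPU HANG: ecode ([^,]+), in ([^ ]+) \[([0-9]+)\].*/\2 \3 \1/p'
}

# Usage (kernel messages of the current boot; use "-b -1" for the
# previous boot after a forced restart):
#   journalctl -k -b | gpu_hangs
```

An empty result after playing the same file in a different player suggests the workaround holds, at least for that file.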
Looking online, I found information that seems to indicate that these kinds of errors will not be fixed, so I made the right call by working around the issue.