
T420 SMART and i915 challenges

Continuing the topic of old(er) hardware, my focus now turned to my wife's Lenovo ThinkPad T420. I recently replaced the old 2.5 inch spinning-rust hard drive (320GB) with a Crucial MX500 1TB SSD. As expected, the performance increase was stellar (mostly because the old hard drive was really old and at the end of its life). After a week or two, the new SSD gave a SMART error and VLC froze the graphical user interface when playing an MPG video.

SMART scsi errors

When I bought the SSD I ran a short and a long SMART self-test to make sure the disk was OK. When my wife told me she got a SMART error (a GUI tool popped up; she's rocking Ubuntu MATE), I had to take a look. In the SMART information I suddenly saw the line "scsi error badly formed scsi parameters", which I had never seen before, even though I had checked the drive beforehand. Would this be my first Crucial DOA?

# smartctl -a /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-72-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron BX/MX1/2/3/500, M5/600, 1100 SSDs
Device Model:     CT1000MX500SSD1
Serial Number:    1204A4E8045B
LU WWN Device Id: 5 00a074 1e4e8043b
Firmware Version: M3CR033
User Capacity:    1.000.204.886.016 bytes [1,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat May  8 09:17:13 2021 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x03) Offline data collection activity
                    is in progress.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                    90% of test remaining.
Total time to complete Offline 
data collection:        (   93) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    (  30) minutes.
Conveyance self-test routine
recommended polling time:    (   2) minutes.
SCT capabilities:          (0x0031) SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       12
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       13
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   100   100   000    Old_age   Always       -       1
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       0
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       27
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   075   059   000    Old_age   Always       -       25 (Min/Max 0/41)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Remain 0x0030   100   100   001    Old_age   Offline      -       0
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       746175948
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       6982955
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       5272629

Read SMART Error Log failed: scsi error badly formed scsi parameters

Read SMART Self-test Log failed: scsi error badly formed scsi parameters

Read SMART Selective Self-test Log failed: scsi error badly formed scsi parameters
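The telling part of that output is the three "Read SMART ... failed" lines at the bottom; the attributes and the overall health result look fine. As a rough sketch, a small filter can pull only those lines out of a smartctl report. The function name smart_log_failures is made up for this example, and /dev/sda in the usage comment is simply the device from the output above.

```shell
# Hypothetical helper: read a smartctl report on stdin and print only the
# lines where smartctl could not read one of the SMART logs.
# Exit status is grep's: 0 if at least one such line was found, 1 otherwise.
smart_log_failures() {
    grep -E '^Read SMART .* failed:'
}

# Usage (needs root and a real disk, so it is left commented out here):
#   smartctl -a /dev/sda | smart_log_failures
```

If the filter prints nothing and returns a non-zero status, smartctl was able to read all of its logs.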

Multiple posts can be found online from people who have this issue (including a ticket at smartmontools), but one post on the openSUSE forums by hawake turned out to be the solution (so far, for me): turn on AHCI. I had not bothered to look at the BIOS settings, since I assumed AHCI would be enabled by default. Then I remembered this is old(er) hardware... The BIOS was set to "Compatibility" and after changing it to "AHCI", smartctl worked fine again.

BIOS -> Config -> Serial ATA (SATA) -> SATA Controller Mode Option -> AHCI
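To double-check from the running system which mode the controller ended up in, one option is to look at the kernel driver bound to it. The sketch below reads `lspci -k` output and prints the driver of any SATA or IDE controller. It assumes that in AHCI mode the controller shows up as a "SATA controller" handled by the ahci driver, while in compatibility mode it shows up as an "IDE interface" handled by a driver such as ata_piix; I believe that holds for Intel chipsets like the one in the T420, but I did not verify it on this machine. The function name sata_driver is made up.

```shell
# Hypothetical helper: read `lspci -k` output on stdin and print the kernel
# driver in use for each SATA controller / IDE interface.
sata_driver() {
    awk '
        # Device header lines start with a hex PCI address.
        /^[0-9a-f]/ { keep = ($0 ~ /SATA controller|IDE interface/) }
        # Indented detail line belonging to the most recent header.
        keep && /Kernel driver in use:/ { print $NF }
    '
}

# Usage (left commented out, since the result depends on the machine):
#   lspci -k | sata_driver
```

An answer of ahci would suggest the BIOS change took effect.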

VLC and i915

While playing a video with VLC (I believe it was an MPEG file), my wife complained about the system freezing up. Wait, this is a Linux system, VLC is my default media player and this has not happened before, so she must be mistaken, right? Unfortunately no, she was right, horribly right even. I reproduced the error and indeed, the entire graphical user interface froze. I could 'escape' to a VT, but that was about it. Looking at the logs, I found this:

kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 6:1:f100ff7b, in vlc [9334]
kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
kernel: i915 0000:00:02.0: [drm] vlc[9334] context reset due to GPU hang
kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 6:1:7100ff7b, in vlc [9334]
kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
kernel: i915 0000:00:02.0: [drm] vlc[9334] context reset due to GPU hang
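To see at a glance which program triggers the hangs, the "GPU HANG" lines can be counted per process. This is only a sketch: the function name gpu_hang_summary is made up, and it assumes the log lines have the shape shown above, with the process name following the word "in".

```shell
# Hypothetical helper: read kernel log lines on stdin and print
# "<process> <number of GPU hangs>" for each process named in a
# "[drm] GPU HANG: ... in <process> [pid]" line.
gpu_hang_summary() {
    awk '
        /\[drm\] GPU HANG:/ {
            # The process name is the field right after the word "in".
            for (i = 1; i < NF; i++) {
                if ($i == "in") { count[$(i + 1)]++; break }
            }
        }
        END { for (p in count) print p, count[p] }
    '
}

# Usage (kernel messages from the current boot):
#   journalctl -k -b | gpu_hang_summary
```

For the six lines above this would report two hangs, both attributed to vlc.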

A serious GPU hang in the (in my opinion) usually rock-stable Intel i915 kernel module. No way I am going to be able to fix that (nor will anyone want to fix it for old hardware, probably). Upgrading from Ubuntu 20.04 to 20.10 and then to 21.04 did not fix the issue. So I worked around it by letting her use the Celluloid media player, which was installed by default.

Looking online, I found information that seems to indicate that these kinds of errors will not be fixed, so I made the right call by working around this issue.