Bad Block Howto
Your hard disk fails on Linux - now what?
It happened to me a few times in the last few weeks: a filesystem check for a hard disk would take forever and my system would not boot. One of the hard disks was reporting bad block errors. At this time I have four different hard disks that are giving me this kind of trouble. All broken hard disks have been taken out of the system, and I am analyzing the problem using a virtual machine and a USB-SATA bridge.
First step: remove the hard disk from the system and put it into a SATA USB docking station
Hopefully the disk is not needed to boot or fully operate your system. If it is, you might want to boot from a USB stick or other media. In any case, my procedure is to remove the disk from the original system and use a different system for analysis. In my case I am using an Ubuntu-based virtual machine and connect the drive via a USB-SATA bridge. With USB 2 devices the performance is poor, which is why I no longer use my old Logilink QP002 SATA docking station.
The USB 3 device I bought in 2015 still serves me well.
Recently I added a 2-port docking station.
Second step - check problems
Tools needed
- A Linux virtual machine
- smartctl
- hdparm
- debugfs
- mount
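Before plugging in a suspect drive it helps to know that the tools from the list are actually installed. The following is a minimal sketch of such a check; `missing_tools` is a hypothetical helper name and not part of the fixbad script shown later.

```bash
# Report which of the given commands are not on PATH.
# missing_tools is a hypothetical helper name, not part of fixbad.
missing_tools() {
  for tool in "$@"; do
    # command -v prints nothing and returns non-zero for unknown commands
    command -v "$tool" > /dev/null 2>&1 || echo "$tool"
  done
}

# List anything from the tool list above that still needs installing
missing_tools smartctl hdparm debugfs mount
```

An empty result means everything was found. On Ubuntu, smartctl comes from the smartmontools package and debugfs from e2fsprogs.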
Examples
disk Digda
Where is it mounted?
sudo mount | grep Digda
/dev/sdb1 on /media/wf/Digda type ext2 (rw,nosuid,nodev,relatime,uhelper=udisks2)
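Grepping for the volume label works when you know the label. If you only know the disk device, a sketch like the following extracts the mounted partitions from `mount` output. `partitions_of` is a hypothetical helper name; it matches on the device column only, so a mount point that happens to contain the device name is not picked up.

```bash
# Print the device column of every mount-style line whose device starts
# with the given disk name. Reads "mount" output on stdin.
# partitions_of is a hypothetical helper name.
partitions_of() {
  awk -v disk="$1" 'index($1, disk) == 1 { print $1 }'
}

# Sample line taken from the output above
sample='/dev/sdb1 on /media/wf/Digda type ext2 (rw,nosuid,nodev,relatime,uhelper=udisks2)'
printf '%s\n' "$sample" | partitions_of /dev/sdb
# prints /dev/sdb1

# On a live system: mount | partitions_of /dev/sdb
```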
get basic info and check
get basic info
- using fixbad -i
- using smartctl -i
- using hdparm -I
See the details below.
check
fixbad /dev/sdb -c
checking Current_Pending_Sector count for /dev/sdb partition /dev/sdb1 blocksize 4096 startsector
197 Current_Pending_Sector 0x0032 252 252 000 Old_age Always - 0
Current_Pending_Sector is zero!
So this drive has no bad block issues.
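The decision above rests on the raw value in the last column of SMART attribute 197. A minimal sketch of extracting that value from `smartctl -A` output could look like this; `pending_sectors` is a hypothetical helper name, and the sample line is the one shown above.

```bash
# Print the raw value (last column) of SMART attribute 197 from
# "smartctl -A" output on stdin; print nothing if the attribute is absent.
# pending_sectors is a hypothetical helper name.
pending_sectors() {
  awk '$1 == 197 && $2 == "Current_Pending_Sector" { print $NF }'
}

line='197 Current_Pending_Sector 0x0032 252 252 000 Old_age Always - 0'
count=$(printf '%s\n' "$line" | pending_sectors)
if [ "$count" = "0" ]; then
  echo "no pending sectors"
else
  echo "pending sectors: $count"
fi

# On a live system: sudo smartctl -A /dev/sdb | pending_sectors
```

Taking the last field instead of a fixed column number keeps the helper independent of how many spaces smartctl uses for alignment.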
fixbad
The fixbad script tries to summarize the information that matters most for fixing bad blocks.
fixbad /dev/sdb -i
Model Family: SAMSUNG SpinPoint F4 EG (AF)
Device Model: SAMSUNG HD204UI
Serial Number: S2H7J9EZC04171
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 5400 rpm
Model Number: SAMSUNG HD204UI
Serial Number: S2H7J9EZC04171
Partition: /dev/sdb1
Blocksize: 4096
smartctl
sudo smartctl -i /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-72-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint F4 EG (AF)
Device Model: SAMSUNG HD204UI
Serial Number: S2H7J9EZC04171
LU WWN Device Id: 5 0024e9 203f40fb7
Firmware Version: 1AQ10001
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 6
SATA Version is: SATA 2.6, 3.0 Gb/s
Local Time is: Mon Oct 5 11:41:44 2020 CEST
==> WARNING: Using smartmontools or hdparm with this
drive may result in data loss due to a firmware bug.
****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
Buggy and fixed firmware report same version number!
See the following web pages for details:
http://knowledge.seagate.com/articles/en_US/FAQ/223571en
http://www.smartmontools.org/wiki/SamsungF4EGBadBlocks
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
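One value from this output matters later: the logical sector size, because the block calculation in the fixbad script multiplies by 512. smartctl prints it either as "Sector Size:" (logical and physical equal) or as "Sector Sizes:" (different values), as the outputs on this page show. The sketch below reads the logical size from either form; `logical_sector_size` is a hypothetical helper name.

```bash
# Print the logical sector size in bytes from "smartctl -i" output on stdin.
# Handles both "Sector Size:" and "Sector Sizes:" lines.
# logical_sector_size is a hypothetical helper name.
logical_sector_size() {
  sed -n 's/^Sector Sizes\{0,1\}:[[:space:]]*\([0-9][0-9]*\) bytes.*/\1/p'
}

printf '%s\n' 'Sector Size: 512 bytes logical/physical' | logical_sector_size
printf '%s\n' 'Sector Sizes: 512 bytes logical, 4096 bytes physical' | logical_sector_size
# both lines print 512

# On a live system: sudo smartctl -i /dev/sdb | logical_sector_size
```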
hdparm
sudo hdparm -I /dev/sdb
/dev/sdb:
ATA device, with non-removable media
Model Number: SAMSUNG HD204UI
Serial Number: S2H7J9EZC04171
Firmware Revision: 1AQ10001
Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6
Standards:
Used: unknown (minor revision code 0x0028)
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 3907029168
Logical Sector size: 512 bytes
Physical Sector size: 512 bytes
device size with M = 1024*1024: 1907729 MBytes
device size with M = 1000*1000: 2000398 MBytes (2000 GB)
cache/buffer size = unknown
Form Factor: 3.5 inch
Nominal Media Rotation Rate: 5400
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = ?
Advanced power management level: disabled
Recommended acoustic management value: 254, current value: 0
DMA: mdma0 mdma1 *mdma2 udma0 udma1 udma2 udma3 udma4 udma5 udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* NOP cmd
* DOWNLOAD_MICROCODE
Advanced Power Management feature set
Power-Up In Standby feature set
* SET_FEATURES required to spinup after power up
SET_MAX security extension
Automatic Acoustic Management feature set
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* 64-bit World wide name
* WRITE_UNCORRECTABLE_EXT command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Native Command Queueing (NCQ)
* Host-initiated interface power management
* Phy event counters
* NCQ priority information
DMA Setup Auto-Activate optimization
Device-initiated interface power management
* Software settings preservation
* SMART Command Transport (SCT) feature set
* SCT Read/Write Long (AC1), obsolete
* SCT Write Same (AC2)
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
344min for SECURITY ERASE UNIT. 344min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 50024e9203f40fb7
NAA : 5
IEEE OUI : 0024e9
Unique ID : 203f40fb7
Checksum: correct
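As a quick sanity check, the hdparm and smartctl outputs can be cross-checked: the number of LBA48 user addressable sectors times the logical sector size should equal the user capacity smartctl reports. A small sketch of the arithmetic, using the values shown above (`capacity_bytes` is a hypothetical helper name):

```bash
# Multiply the sector count by the logical sector size.
# capacity_bytes is a hypothetical helper name.
capacity_bytes() {
  # $1: number of addressable sectors, $2: logical sector size in bytes
  echo $(( $1 * $2 ))
}

# LBA48 user addressable sectors and logical sector size from hdparm -I
capacity_bytes 3907029168 512
# prints 2000398934016, matching "User Capacity: 2,000,398,934,016 bytes"
```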
disk wendi
Where is it mounted?
sudo mount | grep wendi
/dev/sdb1 on /media/wf/wendi type ext3 (rw,nosuid,nodev,relatime,data=ordered,uhelper=udisks2)
get basic info and check
fixbad /dev/sdb -i -c
Model Family: SAMSUNG SpinPoint F4 EG (AF)
Device Model: SAMSUNG HD204UI
Serial Number: S2H7J9EZB05266
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 5400 rpm
Model Number: SAMSUNG HD204UI
Serial Number: S2H7J9EZB05266
Partition: /dev/sdb1
Blocksize: 4096
checking Current_Pending_Sector count for /dev/sdb partition /dev/sdb1 blocksize 4096 startsector
197 Current_Pending_Sector 0x0032 252 252 000 Old_age Always - 0
Current_Pending_Sector is zero!
disk riako
fixbad /dev/sdb -i -c
Model Family: Western Digital Green
Device Model: WDC WD40EZRX-00SPEB0
Serial Number: WD-WCC4E3LNJA7X
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Model Number: WDC WD40EZRX-00SPEB0
Serial Number: WD-WCC4E3LNJA7X
Partition: /dev/sdb1
Blocksize: 4096
checking Current_Pending_Sector count for /dev/sdb partition /dev/sdb1 blocksize 4096 startsector 2048
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 30
Current_Pending_Sector is not zero but 30
running short smartctl test for /dev/sdb ...
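Once the self-test reports the LBA of the first error, the next step in the smartmontools BadBlockHowto (see Links) is to translate that LBA into a file system block: b = int((L-S)*512/B), with L the LBA, S the start sector of the partition and B the file system block size. The sketch below works this out with the start sector 2048 and block size 4096 from the output above. The LBA 1000000 is a made-up value for illustration only, and `lba_to_fs_block` is a hypothetical helper name.

```bash
# Translate a disk LBA into a file system block number.
# lba_to_fs_block is a hypothetical helper name.
lba_to_fs_block() {
  # $1: LBA of the bad sector
  # $2: start sector of the partition
  # $3: file system block size in bytes
  # Integer division drops the fractional part, as the formula requires.
  echo $(( ($1 - $2) * 512 / $3 ))
}

# Hypothetical LBA 1000000 on the riako partition (start 2048, block size 4096)
lba_to_fs_block 1000000 2048 4096
# prints 124744
```

The factor 512 assumes 512-byte logical sectors, which is what both drive models on this page report. The resulting block number is what debugfs expects when you check whether the block is in use.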
Links
- https://www.smartmontools.org/wiki/BadBlockHowto
- https://github.com/hradec/fix_smart_last_bad_sector
- http://dcere.com/hardware/2016/09/18/hard-disk.html
- https://serverfault.com/questions/461203/how-to-use-hdparm-to-fix-a-pending-sector
- https://serverfault.com/a/641135/162693
- https://linux.die.net/man/8/smartctl
fixbad script
This script has not been used for fixing disks yet. Its main goal is the automation of the fixing process. Do not use this script without checking the steps beforehand - you might fatally damage your disk!
#!/bin/bash
# see http://wiki.bitplan.com/index.php/Bad_Block_Howto
# see https://github.com/hradec/fix_smart_last_bad_sector/blob/master/fix_smart_last_bad_sector.sh
# see https://www.thomas-krenn.com/de/wiki/Analyse_einer_fehlerhaften_Festplatte_mit_smartctl
# WF 2020-10-04
disk=/dev/sdb
mode=short
# verbose
verbose=false
# should commands only be shown?
dry=false
# range of sectors to modify after bad sector
range=8
# set to sudo if sudo is needed
sudo=sudo
# serial number
serial="-?-"
#ansi colors
#http://www.csc.uvic.ca/~sae/seng265/fall04/tips/s265s047-tips/bash-using-colors.html
blue='\033[0;34m'
red='\033[0;31m'
green='\033[0;32m' # '\e[1;32m' is too bright for white bg.
endColor='\033[0m'
#
# a colored message
# params:
# 1: l_color - the color of the message
# 2: l_msg - the message to display
#
color_msg() {
local l_color="$1"
local l_msg="$2"
echo -e "${l_color}$l_msg${endColor}"
}
#
# error
#
# show an error message and exit
#
# params:
# 1: l_msg - the message to display
error() {
local l_msg="$1"
# use ansi red for error
color_msg $red "Error: $l_msg" 1>&2
exit 1
}
#
# show the usage
#
usage() {
echo "usage: $0 [disk]"
echo " [-c|--check]"
echo " [-d|--dry]"
echo " [-h|--help]"
echo " [-i|--info]"
echo " [-l|--loop]"
echo " [[-m|--mode] mode]"
echo " [[-r|--range] range]"
echo " [[-s|--serial] [serial]]"
echo " [-v|--verbose]"
echo
echo " -h|--help: show this usage"
echo " -c|--check: check the Current_Pending_Sector count of the given disk"
echo " -d|--dry: dry run - show commands only"
echo " -i|--info: show info about the given disk"
echo " -l|--loop: run selective self-tests in a loop"
echo " -m|--mode: set mode: default=short"
echo " -r|--range: range of sectors to modify after bad sector"
echo " -s|--serial: get serial number or confirm serial number"
echo " -v|--verbose: set verbose mode"
echo ""
echo "example:"
echo " $0 /dev/sdb -i"
echo ""
echo "for any write operation you need to confirm the serial number"
echo "to get serial number: "
echo " $0 disk -s "
exit 1
}
#
# get a number range from 0 to the given n-1
#
# param $1: n
function getRange() {
local l_n="$1"
# seq works everywhere; the former "print i," statement only ran on Python 2
# and the result was stored in the global variable range
seq 0 $((l_n - 1))
}
#
# read the result of the smartctl test for the given disk
#
function readResult() {
local l_disk="$1"
$sudo smartctl -l selftest $l_disk | grep -E "# [0-9]"
}
#
# get the serial number of the device
#
function getSerialNumber() {
local l_disk="$1"
serial=$($sudo smartctl -i $l_disk | grep "Serial Number" | cut -f 2 -d':')
echo $serial
}
#
# get the blocksize of the given file system
#
function getBlockSize() {
local l_fs="$1"
blocksize=$($sudo tune2fs -l $l_fs | grep "Block size:" | cut -f2 -d':')
echo $blocksize
}
#
# get the partition for the given disk
#
function getPartition() {
local l_disk="$1"
fs=$(mount | grep $l_disk | cut -f1 -d' ')
echo $fs
}
#
# get the start sector for the given disk
#
function getStartSector() {
local l_disk="$1"
local l_fs="$2"
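# fdisk aligns its columns with a varying number of spaces, so pick the field with awk
# and skip the boot flag column if it is set
startsector=$($sudo fdisk -l $l_disk | awk -v part="$l_fs" '$1 == part { print (($2 == "*") ? $3 : $2) }')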
echo $startsector
}
#
# get Info about the given disk
#
function getInfo() {
local l_disk="$1"
$sudo smartctl -i $l_disk | grep -E "(Model|Serial|Rotation|Sector|Capacity)"
$sudo hdparm -I $l_disk | grep -E "(Serial Number|Model)"
fs=$(getPartition $l_disk)
color_msg $blue "Partition: $fs"
blocksize=$(getBlockSize $fs)
color_msg $blue "Blocksize: $blocksize"
}
#
# get the current pending sector count for the given disk
#
function getCurrentPendingSector() {
local l_disk="$1"
# if msg is empty don't show message but only return the current pending sector count
local l_msg="$2"
psector=0
psectorline=$($sudo smartctl -A $l_disk | grep Current_Pending_Sector)
# test the grep exit status right away, before another command resets $?
if [ $? -eq 0 ]
then
if [ "$l_msg" != "" ]; then color_msg $green "$psectorline"; fi
psector=$(echo $psectorline | cut -f 10 -d ' ')
if [ $psector -gt 0 ]
then
if [ "$l_msg" != "" ]; then color_msg $red "Current_Pending_Sector is not zero but $psector"; fi
else
if [ "$l_msg" != "" ]; then color_msg $green "Current_Pending_Sector is zero!"; fi
fi
else
if [ "$l_msg" != "" ]; then color_msg $red "smartctl -A did not output Current_Pending_Sector"; fi
psector=-1
fi
if [ "$l_msg" == "" ]; then echo $psector; fi
}
#
# fix the given bad sector on the given disk with the given range of sectors to fix
#
# param 1: disk e.g. /dev/sdb1
# param 2: defect sector to repair
# param 3: range - range of sectors to repair e.g. 8
#
fixBad() {
local l_disk="$1"
local l_sector="$2"
local l_range="$3"
color_msg $blue "repairing sector $l_sector to $l_sector+$l_range on $l_disk ..."
r=$(getRange $l_range)
for i in $r ; do
let b1=$l_sector+$i
if [ "$dry" == "true" ]
then
echo hdparm --repair-sector $b1 --yes-i-know-what-i-am-doing $l_disk
else
color_msg $red "not implemented yet"
#hdparm --repair-sector $b1 --yes-i-know-what-i-am-doing $l_disk >> /tmp/smart_repaired.log
fi
done
#tail -n 60 /tmp/smart_repaired.log | grep writing | tail -n 20
#grep '#' /tmp/smart | head -5
#hdparm -I $disk > /tmp/hdparm
}
#
# check the needed software
#
checkSoftware() {
for sw in awk debugfs hdparm smartctl python $sudo
do
bin=$(which $sw)
if [ $? -eq 0 ]
then
if [ "$verbose" == "true" ]
then
color_msg $green "will use $bin as $sw"
fi
else
error "$0 needs $sw to work please install it"
fi
done
}
#
# check the given disk in the given mode
#
function checkDisk() {
local l_disk="$1"
local l_mode="$2"
local l_serial="$3"
fs=$(getPartition $l_disk)
blocksize=$(getBlockSize $fs)
startsector=$(getStartSector $l_disk $fs)
color_msg $blue "checking Current_Pending_Sector count for $l_disk partition $fs blocksize $blocksize startsector $startsector"
getCurrentPendingSector "$l_disk" show
psector=$(getCurrentPendingSector "$l_disk")
if [ $psector -gt 0 ]
then
echo "running $l_mode smartctl test for $l_disk ..."
$sudo smartctl -t $l_mode $l_disk > /dev/null
readResult "$l_disk" | while read line
do
echo $line | grep "read failure" > /dev/null
if [ $? -eq 0 ]
then
index=$(echo $line | cut -f2 -d' ')
state=$(echo $line | cut -f3-4 -d ' ')
progress=$(echo $line | cut -f8 -d ' ')
lba=$(echo $line | cut -f10 -d ' ')
echo $index $state
echo "progress: $progress"
echo "lba: $lba"
# calculate the file system block: integer part of (L-S)*512/B
fsb=$(awk -v L=$lba -v S=$startsector -v B=$blocksize 'BEGIN {printf ("%.0f", int((L-S)*512/B))}')
echo "file system block: $fsb"
fixBad $l_disk $lba $range
fi
done
fi
}
#
# start a check loop on the given drive
#
function checkLoop() {
local baddrive="$1"
badsect=1
while true; do
color_msg $blue "Testing $baddrive from LBA $badsect"
$sudo smartctl -t select,${badsect}-max ${baddrive} > /dev/null 2>&1
echo "Waiting for test to stop (each dot is 5 sec)"
while [ "$($sudo smartctl -l selective ${baddrive} | awk '/^ *1/{print substr($4,1,9)}')" != "Completed" ]; do
echo -n .
sleep 5
done
echo
badsect=$($sudo smartctl -l selective ${baddrive} | awk '/# 1 Selective offline Completed: read failure/ {print $10}')
# stop when no read failure is reported (empty result or "-")
if [ -z "$badsect" ] || [ "$badsect" = "-" ]; then exit 0; fi
echo Attempting to fix sector $badsect on $baddrive
echo hdparm --repair-sector ${badsect} --yes-i-know-what-i-am-doing $baddrive
echo running next test
done
}
# make sure the needed software is available
checkSoftware
# commandline option
while [ "$1" != "" ]
do
option=$1
shift
case $option in
-h|--help)
usage
;;
-i|--info)
getInfo $disk
;;
-m|--mode)
if [ $# -lt 1 ]
then
usage
else
mode=$1
shift
fi
;;
-c|--check)
checkDisk $disk $mode $serial
;;
-d|--dry)
dry=true
;;
-l|--loop)
checkLoop $disk
;;
-r|--range)
if [ $# -lt 1 ]
then
usage
else
range=$1
shift
fi
;;
-s|--serial)
if [ $# -lt 1 ]
then
getSerialNumber $disk
exit 0
else
serial=$1
shift
fi
;;
-v|--verbose)
verbose=true
;;
*)
disk=$option
;;
esac
done
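The usage text says that any write operation needs a confirmed serial number, but the listing above does not check this yet. A minimal sketch of such a guard might look like the following. `confirm_serial` is a hypothetical helper that is not part of fixbad as listed; it treats the script's default value "-?-" as "no serial given".

```bash
# Return 0 only if the serial the user passed matches the serial of the drive.
# confirm_serial is a hypothetical helper, not part of fixbad as listed.
confirm_serial() {
  expected="$1"   # serial given on the command line via -s
  actual="$2"     # serial read from the drive, e.g. by getSerialNumber
  if [ -z "$expected" ] || [ "$expected" = "-?-" ]; then
    echo "no serial number given, refusing to write" >&2
    return 1
  fi
  if [ "$expected" != "$actual" ]; then
    echo "serial mismatch: expected $expected but drive reports $actual" >&2
    return 1
  fi
  return 0
}

# Example with the serial from the riako output; no drive is touched here
if confirm_serial "WD-WCC4E3LNJA7X" "WD-WCC4E3LNJA7X"; then
  echo "serial confirmed"
fi
```

In the script, such a check would belong in front of the hdparm call in fixBad, so that a drive letter that changed after replugging the USB bridge cannot lead to writes on the wrong disk.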