Bad Block Howto

From BITPlan Wiki

Latest revision as of 06:11, 9 October 2020

Your hard disk fails on Linux - now what?

It happened to me a few times in the last few weeks: a file system check of a hard disk took forever and my system wouldn't boot. One of the hard disks was reporting bad block errors. At this point I have four different hard disks giving me this kind of trouble. All broken hard disks have been taken out of the system, and I am analyzing the problem using a virtual machine and a USB-SATA bridge.

First step - remove the hard disk from the system and put it into a SATA-USB docking station

Hopefully your disk is not needed to boot or fully operate your system. If it is, you might want to boot off a USB stick or other media. In any case, my procedure is to remove the disk from the original system and use a different system for the analysis. I use an Ubuntu-based virtual machine and connect the drive via a USB-SATA bridge. With USB 2 devices the performance is poor, which is why I no longer use my old Logilink QP002 SATA docking station.

The USB 3 device I bought in 2015 still serves me well.

Recently I added a 2-port docking station.

Second step - check problems

Tools needed

  1. A Linux virtual machine
  2. smartctl
  3. hdparm
  4. debugfs
  5. mount
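Before starting, it may help to check that the command-line tools from the list are installed. A minimal sketch; missing_tools is a hypothetical helper, not part of fixbad. Note that hdparm and debugfs usually live in /sbin or /usr/sbin, which may not be in a normal user's PATH on every distribution.

```bash
#!/bin/bash
# Print the names of the given commands that are not found in PATH,
# separated by spaces; prints an empty line if everything is available.
# missing_tools is a hypothetical helper, not part of fixbad.
missing_tools() {
  local tool missing=""
  for tool in "$@"; do
    # command -v fails for unknown commands
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="${missing:+$missing }$tool"
    fi
  done
  echo "$missing"
}

# usage: check the command-line tools from the list above
# missing_tools smartctl hdparm debugfs mount
```

An empty result means all tools were found. On Debian/Ubuntu, smartctl comes with the smartmontools package and debugfs with e2fsprogs.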

Examples

https://www.smartmontools.org/wiki/BadBlockHowto describes how to handle bad blocks using the tools manually. The fixbad script on this page helps to automate this process. Below are some examples of how to use the script.

disk Digda

Where is it mounted?

sudo mount | grep Digda
/dev/sdb1 on /media/wf/Digda type ext2 (rw,nosuid,nodev,relatime,uhelper=udisks2)
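The later steps need two names: the whole disk (/dev/sdb) for smartctl and hdparm, and the partition (/dev/sdb1) for debugfs, which only handles ext2/ext3/ext4 file systems. A minimal sketch that splits such a mount line and derives the disk from the partition, assuming sdX-style device names and a mount point without spaces. Both function names are hypothetical.

```bash
#!/bin/bash
# Split a line of `mount` output ("DEVICE on MOUNTPOINT type FSTYPE (OPTIONS)")
# into device, mount point and file system type.
# Assumes the mount point contains no spaces.
parse_mount_line() {
  # fields: $1 = device, $2 = "on", $3 = mount point, $4 = "type", $5 = fs type
  echo "$1" | awk '{ print $1, $3, $5 }'
}

# Derive the whole disk from a partition by stripping the trailing digits,
# e.g. /dev/sdb1 -> /dev/sdb. Only valid for sdX-style names (not nvme0n1p1).
disk_of_partition() {
  local part="$1"
  echo "${part%%[0-9]*}"
}

# usage:
# parse_mount_line "$(sudo mount | grep Digda)"   # -> /dev/sdb1 /media/wf/Digda ext2
```

As an alternative, findmnt -n -o SOURCE,FSTYPE /media/wf/Digda prints the device and file system type directly.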

get basic info and check

get basic info

  1. using fixbad -i
  2. using smartctl -i
  3. using hdparm -I

See the details below.

check:

fixbad /dev/sdb -c
checking Current_Pending_Sector count for /dev/sdb partition /dev/sdb1 blocksize 4096 startsector 
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
Current_Pending_Sector is zero!

So this drive has no bad block issues.
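Current_Pending_Sector (attribute 197) counts sectors the drive could not read and which are waiting to be remapped; its raw value is the last column of the line. A minimal sketch of reading it from smartctl -A output, assuming the usual attribute table layout. pending_sectors is a hypothetical name, not a fixbad function.

```bash
#!/bin/bash
# Print the raw value of SMART attribute 197 (Current_Pending_Sector) from
# `smartctl -A` output read on stdin, or "unknown" if the attribute is missing.
# Assumes the usual attribute table where the raw value is the last column.
pending_sectors() {
  awk '$1 == 197 && $2 == "Current_Pending_Sector" { print $NF; found = 1; exit }
       END { if (!found) print "unknown" }'
}

# usage: sudo smartctl -A /dev/sdb | pending_sectors
```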


fixbad

The fixbad script summarizes the information that matters most for fixing bad blocks.

fixbad /dev/sdb -i
Model Family:     SAMSUNG SpinPoint F4 EG (AF)
Device Model:     SAMSUNG HD204UI
Serial Number:    S2H7J9EZC04171
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5400 rpm
	Model Number:       SAMSUNG HD204UI                         
	Serial Number:      S2H7J9EZC04171      
Partition:        /dev/sdb1
Blocksize:        4096
smartctl
sudo smartctl -i /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-72-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F4 EG (AF)
Device Model:     SAMSUNG HD204UI
Serial Number:    S2H7J9EZC04171
LU WWN Device Id: 5 0024e9 203f40fb7
Firmware Version: 1AQ10001
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon Oct  5 11:41:44 2020 CEST

==> WARNING: Using smartmontools or hdparm with this
drive may result in data loss due to a firmware bug.
****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
Buggy and fixed firmware report same version number!
See the following web pages for details:
http://knowledge.seagate.com/articles/en_US/FAQ/223571en
http://www.smartmontools.org/wiki/SamsungF4EGBadBlocks

SMART support is: Available - device has SMART capability.
SMART support is: Enabled
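Two of the disks analysed here (Digda and wendi) are SAMSUNG HD204UI drives, the model this firmware warning is about. A script could stop before issuing further smartctl or hdparm commands when it sees that model. A minimal sketch, assuming the model is read from smartctl -i output; since smartctl -i has already been run at that point, the check can only prevent further commands, not the first one. warn_if_hd204ui is a hypothetical name.

```bash
#!/bin/bash
# Read `smartctl -i` output on stdin and return 1 (with a warning on stderr)
# if the Device Model is the SAMSUNG HD204UI named in the firmware warning.
# warn_if_hd204ui is a hypothetical helper, not part of fixbad.
warn_if_hd204ui() {
  local model
  # split "Device Model:     SAMSUNG HD204UI" at the colon and following blanks
  model=$(awk -F': *' '/^Device Model:/ { print $2; exit }')
  case "$model" in
    *HD204UI*)
      echo "warning: $model, see http://www.smartmontools.org/wiki/SamsungF4EGBadBlocks" >&2
      return 1
      ;;
  esac
  return 0
}

# usage: sudo smartctl -i /dev/sdb | warn_if_hd204ui || exit 1
```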
hdparm
sudo hdparm -I /dev/sdb

/dev/sdb:

ATA device, with non-removable media
	Model Number:       SAMSUNG HD204UI                         
	Serial Number:      S2H7J9EZC04171      
	Firmware Revision:  1AQ10001
	Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6
Standards:
	Used: unknown (minor revision code 0x0028) 
	Supported: 8 7 6 5 
	Likely used: 8
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:    16514064
	LBA    user addressable sectors:   268435455
	LBA48  user addressable sectors:  3907029168
	Logical  Sector size:                   512 bytes
	Physical Sector size:                   512 bytes
	device size with M = 1024*1024:     1907729 MBytes
	device size with M = 1000*1000:     2000398 MBytes (2000 GB)
	cache/buffer size  = unknown
	Form Factor: 3.5 inch
	Nominal Media Rotation Rate: 5400
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, no device specific minimum
	R/W multiple sector transfer: Max = 16	Current = ?
	Advanced power management level: disabled
	Recommended acoustic management value: 254, current value: 0
	DMA: mdma0 mdma1 *mdma2 udma0 udma1 udma2 udma3 udma4 udma5 udma6 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	NOP cmd
	   *	DOWNLOAD_MICROCODE
	    	Advanced Power Management feature set
	    	Power-Up In Standby feature set
	   *	SET_FEATURES required to spinup after power up
	    	SET_MAX security extension
	    	Automatic Acoustic Management feature set
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	SMART self-test
	   *	General Purpose Logging feature set
	   *	64-bit World wide name
	   *	WRITE_UNCORRECTABLE_EXT command
	   *	{READ,WRITE}_DMA_EXT_GPL commands
	   *	Segmented DOWNLOAD_MICROCODE
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Gen2 signaling speed (3.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Host-initiated interface power management
	   *	Phy event counters
	   *	NCQ priority information
	    	DMA Setup Auto-Activate optimization
	    	Device-initiated interface power management
	   *	Software settings preservation
	   *	SMART Command Transport (SCT) feature set
	   *	SCT Read/Write Long (AC1), obsolete
	   *	SCT Write Same (AC2)
	   *	SCT Error Recovery Control (AC3)
	   *	SCT Features Control (AC4)
	   *	SCT Data Tables (AC5)
Security: 
	Master password revision code = 65534
		supported
	not	enabled
	not	locked
	not	frozen
	not	expired: security count
		supported: enhanced erase
	344min for SECURITY ERASE UNIT. 344min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 50024e9203f40fb7
	NAA		: 5
	IEEE OUI	: 0024e9
	Unique ID	: 203f40fb7
Checksum: correct

disk wendi

Where is it mounted?

sudo mount | grep wendi
/dev/sdb1 on /media/wf/wendi type ext3 (rw,nosuid,nodev,relatime,data=ordered,uhelper=udisks2)

get basic info and check

fixbad /dev/sdb -i -c
Model Family:     SAMSUNG SpinPoint F4 EG (AF)
Device Model:     SAMSUNG HD204UI
Serial Number:    S2H7J9EZB05266
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5400 rpm
	Model Number:       SAMSUNG HD204UI                         
	Serial Number:      S2H7J9EZB05266      
Partition:        /dev/sdb1
Blocksize:        4096
checking Current_Pending_Sector count for /dev/sdb partition /dev/sdb1 blocksize 4096 startsector 
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
Current_Pending_Sector is zero!

disk riako

check for bad sectors

fixbad /dev/sdb -i -c
Model Family:     Western Digital Green
Device Model:     WDC WD40EZRX-00SPEB0
Serial Number:    WD-WCC4E3LNJA7X
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
	Model Number:       WDC WD40EZRX-00SPEB0                    
	Serial Number:      WD-WCC4E3LNJA7X
Partition:        /dev/sdb1
Blocksize:        4096
checking Current_Pending_Sector count for /dev/sdb partition /dev/sdb1 blocksize 4096 startsector 2048
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       30
Current_Pending_Sector is not zero but 30
running short smartctl test for /dev/sdb ...

find the defective file system block

fixbad /dev/sdb -x
1 Short offline
progress:  90%
lba: 1374590864
file system block: 171823602
2 Selective offline
progress:  50%
lba: 1374590880
file system block: 171823604
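The file system block follows from the LBA with the formula given in the smartmontools BadBlockHowto, b = (int)((L-S)*512/B), where L is the LBA, S the start sector of the partition and B the file system block size. For riako, (1374590864 - 2048) * 512 / 4096 = 171823602, which matches the output above. A minimal sketch in bash integer arithmetic; fs_block is a hypothetical name.

```bash
#!/bin/bash
# Map an LBA from the SMART self-test log to a file system block number,
# following b = (int)((L-S)*512/B) from the smartmontools BadBlockHowto.
#   $1 L: LBA of the error (counted in 512-byte logical sectors)
#   $2 S: start sector of the partition
#   $3 B: file system block size in bytes
fs_block() {
  local lba="$1" start="$2" blocksize="$3"
  if (( lba < start )); then
    echo "LBA $lba lies before the partition start $start" >&2
    return 1
  fi
  # bash uses 64-bit integer arithmetic and truncates like (int) in the formula
  echo $(( (lba - start) * 512 / blocksize ))
}

# usage with the values reported for riako:
# fs_block 1374590864 2048 4096   # -> 171823602
```

The start sector is shown by sudo fdisk -l /dev/sdb, the block size by sudo tune2fs -l /dev/sdb1 | grep 'Block size'.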

use debugfs to find the corresponding file

sudo debugfs /dev/sdb1
debugfs 1.44.1 (24-Mar-2018)
debugfs:  testb 171823602
Block 171823602 marked in use
debugfs:  icheck 171823602
Block	Inode number
171823602	61244964
debugfs:  ncheck 61244964
Inode	Pathname
61244964	/vmware/carme10/Windows 10 x64-s006.vmdk
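The same two lookups (icheck, then ncheck) can be scripted. A minimal sketch, assuming an ext2/ext3/ext4 file system on the partition: debugfs -R runs a single request and, without -w, opens the file system read-only. The helper names are hypothetical and not part of fixbad; the parsing assumes the two-column output shown above.

```bash
#!/bin/bash
# Parse `icheck` output on stdin: print the inode number for block $1.
inode_from_icheck() {
  awk -v b="$1" '$1 == b { print $2; exit }'
}

# Parse `ncheck` output on stdin: print the path for inode $1.
# The path may contain spaces, so only the leading inode number is removed.
path_from_ncheck() {
  awk -v i="$1" '$1 == i { sub(/^[0-9]+[ \t]+/, ""); print; exit }'
}

# file_of_block PARTITION BLOCK: print the file that owns BLOCK (needs root).
file_of_block() {
  local part="$1" block="$2" inode
  inode=$(sudo debugfs -R "icheck $block" "$part" 2>/dev/null | inode_from_icheck "$block")
  # icheck reports "<block not found>" for a block no inode claims
  if ! [[ "$inode" =~ ^[0-9]+$ ]]; then
    echo "block $block is not used by any file" >&2
    return 1
  fi
  sudo debugfs -R "ncheck $inode" "$part" 2>/dev/null | path_from_ncheck "$inode"
}

# usage: file_of_block /dev/sdb1 171823602
```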

write sectors - this is the dangerous part, protected by having to enter the serial number of the disk!

fixbad /dev/sdb -r 16 -s WD-WCC4E3LNJA7X -f -x 
1 Extended offline
progress:  90%
lba: 1374590889
file system block: 171823605
repairing sector 1374590889 to 1374590889+16 on /dev/sdb ...
4 Short offline
progress:  90%
lba: 1374590864
file system block: 171823602
repairing sector 1374590864 to 1374590864+16 on /dev/sdb ...
5 Selective offline
progress:  50%
lba: 1374590880
file system block: 171823604
repairing sector 1374590880 to 1374590880+16 on /dev/sdb ...
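Each repair above starts exactly at the reported LBA. riako reports 512-byte logical and 4096-byte physical sectors, so eight logical sectors share one physical sector, and a range starting at an unaligned LBA such as 1374590889 begins in the middle of a physical sector. Whether that matters depends on the drive. A minimal sketch for checking which physical sector an LBA falls into, assuming logical sector 0 is aligned to a physical sector (on such drives hdparm -I usually reports this as "Logical Sector-0 offset"); physical_span is a hypothetical name.

```bash
#!/bin/bash
# Print the first and last logical sector of the physical sector that
# contains a given LBA. Defaults: 512-byte logical, 4096-byte physical sectors.
# Assumes logical sector 0 is aligned to the start of a physical sector.
physical_span() {
  local lba="$1" logical="${2:-512}" physical="${3:-4096}"
  local per=$(( physical / logical ))   # logical sectors per physical sector
  local first=$(( lba - lba % per ))    # round down to the physical boundary
  echo "$first $(( first + per - 1 ))"
}

# usage: physical_span 1374590889   # -> 1374590888 1374590895
```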

disk hypno

get device info and check for pending sectors

fixbad /dev/sdb -i -c
Model Family:     Western Digital Green
Device Model:     WDC WD30EZRX-00D8PB0
Serial Number:    WD-WMC4N0H2L71F
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
	Model Number:       WDC WD30EZRX-00D8PB0                    
	Serial Number:      WD-WMC4N0H2L71F
Partition:        /dev/sdb1
Blocksize:        4096
checking Current_Pending_Sector count for /dev/sdb partition /dev/sdb1 blocksize 4096 startsector 2048
197 Current_Pending_Sector  0x0032   199   066   000    Old_age   Always       -       844
Current_Pending_Sector is not zero but 844
running short smartctl test for /dev/sdb ...

run selective tests in a loop (-l) as a dry run (-d)

fixbad /dev/sdb -d -l

wait for the test result

fixbad /dev/sdb -v -w selective
: 1 1 5860533167 Self_test_in_progress [10% left] (5196737056-5196802591)
Completed: 1 1 5860533167 Completed_read_failure [10% left] (5196737056-5196802591)
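The --try step uses the upper end of the LBA span shown in parentheses in the selective self-test log (see tryFix in the script below). The sketch below does the same extraction with a single sed call. The name span_end is made up, and the function prints nothing when no span is present:

```shell
#!/bin/bash
# Sketch: print the upper LBA of the "(start-end)" span in a
# "smartctl -l selective" status line read from stdin; prints nothing
# if the line contains no such span.
span_end() {
  sed -n 's/.*([0-9]*-\([0-9][0-9]*\)).*/\1/p'
}

# live usage (needs root):
#   sudo smartctl -l selective /dev/sdb | span_end
```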

cautiously start trying the repair

dry run
wf@fur:~/source/bash/mp$ fixbad /dev/sdb -d --try
attempting to fix sector 5196802591 on /dev/sdb
repairing sector 5196802591 to 5196802591+8 on /dev/sdb ...
hdparm --repair-sector 5196802591 --yes-i-know-what-i-am-doing /dev/sdb
hdparm --repair-sector 5196802592 --yes-i-know-what-i-am-doing /dev/sdb
hdparm --repair-sector 5196802593 --yes-i-know-what-i-am-doing /dev/sdb
hdparm --repair-sector 5196802594 --yes-i-know-what-i-am-doing /dev/sdb
hdparm --repair-sector 5196802595 --yes-i-know-what-i-am-doing /dev/sdb
hdparm --repair-sector 5196802596 --yes-i-know-what-i-am-doing /dev/sdb
hdparm --repair-sector 5196802597 --yes-i-know-what-i-am-doing /dev/sdb
hdparm --repair-sector 5196802598 --yes-i-know-what-i-am-doing /dev/sdb
get the serial number, which is required to enable write operations
fixbad /dev/sdb -s
WD-WMC4N0H2L71F
use the serial number and the -f parameter to execute the fix
wf@fur:~/source/bash/mp$ fixbad /dev/sdb -s WD-WMC4N0H2L71F -f --try
attempting to fix sector 5196802591 on /dev/sdb
repairing sector 5196802591 to 5196802591+8 on /dev/sdb ...
cat /tmp/smart_repaired.log
/dev/sdb:
re-writing sector 5196802591: succeeded

/dev/sdb:
re-writing sector 5196802592: succeeded

/dev/sdb:
re-writing sector 5196802593: succeeded

/dev/sdb:
re-writing sector 5196802594: succeeded

/dev/sdb:
re-writing sector 5196802595: succeeded

/dev/sdb:
re-writing sector 5196802596: succeeded

/dev/sdb:
re-writing sector 5196802597: succeeded

/dev/sdb:
re-writing sector 5196802598: succeeded
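fixBad appends hdparm's output to /tmp/smart_repaired.log, so the file grows across runs. The small sketch below counts the rewrite attempts and the successful ones. The name summarize_repair_log is made up, and the sketch assumes one "re-writing sector" line per attempt, as in the log above:

```shell
#!/bin/bash
# Sketch: summarize a repair log written by fixbad (default path as in
# the script). Lines that mention "re-writing sector" but do not contain
# ": succeeded" are counted as not successful.
summarize_repair_log() {
  local log="${1:-/tmp/smart_repaired.log}"
  local total ok
  # grep -c prints 0 (and exits 1) when nothing matches
  total=$(grep -c 're-writing sector' "$log" || true)
  ok=$(grep -c 're-writing sector .*: succeeded' "$log" || true)
  echo "$ok of $total sector rewrites succeeded"
}

# usage: summarize_repair_log /tmp/smart_repaired.log
```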

Links

fixbad script

This script has only been used on a few disks so far. Its main goal is to automate the fixing process. It is not yet clear whether the script's calculations are correct; e.g. the sector size might be wrong or used incorrectly. So please only use it on disks where further damage does no harm, e.g. disks you intend to throw away anyway. Do not use this script without first checking the steps manually - you might fatally damage your disk!

usage

fixbad -h
usage: fixbad [disk]
   [-c|--check]
   [-d|--dry]
   [-h|--help]
   [-i|--info]
   [[-m|--mode] mode]
   [[-r|--range] range]
   [[-s|--serial [serial]]
   [-t|--test]
   [[-w|--wait [type]]
   [-v|--verbose]

  -h|--help: show this usage
  -c|--check: check the disk
  -d|--dry:  dry run - show commands only
  -i|--info: show info about the given disk
  -m|--mode: set mode: default=short
  -r|--range: range of sectors to modify after bad sector
  -s|--serial: get serial number of confirm serial number
  -t|--test: run test for the given type e.g. selective selftest
  -w|--wait: wait for the result of the given testype e.g. selective selftest
  -v|--verbose: set verbose mode

example:
   fixbad /dev/sdb -i

for any write operation you need to confirm the serial number
to get serial number: 
   fixbad disk -s

fixbad

#!/bin/bash 
# see http://wiki.bitplan.com/index.php/Bad_Block_Howto
# see https://github.com/hradec/fix_smart_last_bad_sector/blob/master/fix_smart_last_bad_sector.sh
# see https://www.thomas-krenn.com/de/wiki/Analyse_einer_fehlerhaften_Festplatte_mit_smartctl
# WF 2020-10-04 
disk=/dev/sdb
mode=short
# verbose
verbose=false
# should commands only be shown?
dry=false
# should write fixes be performed?
fix=false
# range of sectors to modify after bad sector
range=8
# set to sudo if sudo is needed
sudo=sudo
# serial number
serial="-?-"

#ansi colors
#http://www.csc.uvic.ca/~sae/seng265/fall04/tips/s265s047-tips/bash-using-colors.html
blue='\033[0;34m'  
red='\033[0;31m'  
green='\033[0;32m' # '\e[1;32m' is too bright for white bg.
endColor='\033[0m'

#
# a colored message 
#   params:
#     1: l_color - the color of the message
#     2: l_msg - the message to display
#
color_msg() {
  local l_color="$1"
  local l_msg="$2"
  echo -e "${l_color}$l_msg${endColor}"
}

#
# error
#
#   show an error message and exit
#
#   params:
#     1: l_msg - the message to display
error() {
  local l_msg="$1"
  # use ansi red for error
  color_msg $red "Error: $l_msg" 1>&2
  exit 1
}

#
# show the usage
#
usage() {
  echo "usage: $0 [disk]"
  echo "   [-a|--try]"
  echo "   [-c|--check]"
  echo "   [-d|--dry]"
  echo "   [-f|--fix]"
  echo "   [-h|--help]"
  echo "   [-i|--info]"
  echo "   [-l|--loop]"
  echo "   [[-m|--mode] mode]"
  echo "   [[-r|--range] range]"
  echo "   [-s|--serial [serial]]"
  echo "   [-t|--test]"
  echo "   [[-w|--wait] type]"
  echo "   [-v|--verbose]"
  echo "   [-x|--lba]"
  echo
  echo "  -h|--help: show this usage"
  echo "  -a|--try: attempt a sector repair based on selective test result"
  echo "  -c|--check: check the disk"
  echo "  -d|--dry:  dry run - show commands only"
  echo "  -f|--fix: perform write operations (requires the serial number, see -s)"
  echo "  -i|--info: show info about the given disk"
  echo "  -l|--loop: repeat selective tests and repair attempts until no bad sector is reported"
  echo "  -m|--mode: set mode: default=short"
  echo "  -r|--range: range of sectors to modify after bad sector"
  echo "  -s|--serial: get serial number or confirm serial number"
  echo "  -t|--test: run a self test in the mode set with -m (default: short)"
  echo "  -w|--wait: wait for the result of the given test type e.g. selective selftest"
  echo "  -v|--verbose: set verbose mode"
  echo "  -x|--lba: attempt a sector repair on short or long test result"
  echo ""
  echo "example:"
  echo "   $0 /dev/sdb -i"
  echo ""
  echo "for any write operation you need to confirm the serial number"
  echo "to get serial number: "
  echo "   $0 disk -s "
  exit 1
}

#
# get a number range from 0 to the given n-1 (one number per line)
#
# params 
#   1: n 
function getRange() {
  local l_n="$1"
  local l_i
  # plain bash arithmetic loop - no external interpreter needed
  for (( l_i = 0; l_i < l_n; l_i++ )); do
    echo $l_i
  done
}

#
# read the result of the smartctl test for the given disk
#
# params
#   1: l_disk: the disk under test e.g. /dev/sdb
#   2: l_type: the type of the test e.g. selective
function readResult() {
   local l_disk="$1"
   local l_type="$2"
   $sudo smartctl -l $l_type $l_disk | egrep "^#?[[:space:]]*[0-9]"
}

#
# show the Result
#
function showResult() {
  local l_logline="$1"
  local l_logstatus="$2"
  if [ "$verbose" == "true" ]
  then
    echo $l_logstatus:$l_logline  
  else
    echo $l_logline | gawk '
    /#/ {
      print $0; exit
    }
    { 
       status=substr($4,1,9)
       progress=$5;
       gsub("\\[","",progress);
       range=$7 
       printf("\r%s",progress);
     }'
  fi
}

#
# wait for the result of a running selftest
#
# param 1: l_disk: the disk under test e.g. /dev/sdb
# param 2: l_type: the type of the test e.g. selective
# param 3: l_wait: number of seconds to wait 
#
function waitForResult() {
   # example
   #=== START OF READ SMART DATA SECTION ===
   #SMART Selective self-test log data structure revision number 1
   #SPAN  MIN_LBA     MAX_LBA  CURRENT_TEST_STATUS
   #         1  7814037167  Self_test_in_progress [90% left] (2564632-2630167)
   local l_disk="$1"
   local l_type="$2"
   local l_wait="$3"
   local l_logline=""
   local l_logstatus=""
   color_msg $blue "Waiting for $l_type test of $l_disk to stop (checking every $l_wait sec)"
   while [ "$l_logstatus" != "Completed" ]; do
     l_logline=$(readResult "$l_disk" "$l_type"  | egrep "^#?[[:space:]]*1")
     l_logstatus=$(echo $l_logline | gawk ' /Completed/ { print "Completed"; }')
     showResult "$l_logline" "$l_logstatus"
     sleep $l_wait 
   done
}

#
# get the serial number of the device
#
function getSerialNumber() {
  local l_disk="$1"
  serial=$($sudo smartctl -i  $l_disk  | grep "Serial Number" | cut -f 2 -d':')
  echo $serial
}

#
# get the blocksize of the given file system
#
function getBlockSize() {
  local l_fs="$1"
  blocksize=$($sudo tune2fs -l $l_fs | grep "Block size:" | cut -f2 -d':')
  echo $blocksize
}

#
# get the partition for the given disk
#
function getPartition() {
  local l_disk="$1"
  fs=$(mount | grep $l_disk | cut -f1 -d' ')
  echo $fs
}

#
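# get the start sector for the given partition of the given disk
# (column 2 of the partition's fdisk -l line, column 3 if a boot flag * is shown)
#
function getStartSector() {
  local l_disk="$1"
  local l_fs="$2"
  startsector=$($sudo fdisk -l $l_disk | gawk -v fs="$l_fs" '$1 == fs { if ($2 == "*") print $3; else print $2 }')
  echo $startsector
}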

#
# get Info about the given disk
#
function getInfo() {
  local l_disk="$1"
  $sudo smartctl -i $l_disk | egrep "(Model|Serial|Rotation|Sector|Capacity)"
  $sudo hdparm -I $l_disk | egrep "(Serial Number|Model)"
  fs=$(getPartition $l_disk)
  if [ "$fs" != "" ]
  then
    color_msg $blue "Partition:        $fs"
    blocksize=$(getBlockSize $fs)
    color_msg $blue "Blocksize:        $blocksize"
  else
    color_msg $red "couldn't find mounted partition for $l_disk"
  fi
}

#
# get the current pending sector count for the given disk
#
function getCurrentPendingSector() {
   local l_disk="$1"
   # if msg is empty don't show message but only return the current pending sector count
   local l_msg="$2"
   psectorline=$($sudo smartctl -A $l_disk | grep Current_Pending_Sector)
   # save grep's exit status right away - the next command would overwrite $?
   local l_found=$?
   psector=0
   if [ $l_found -eq 0 ]
   then
     if [ "$l_msg" != "" ]; then color_msg $green "$psectorline"; fi
     psector=$(echo $psectorline | cut -f 10 -d ' ')
     if  [ $psector -gt 0 ]
     then
        if [ "$l_msg" != "" ]; then color_msg $red "Current_Pending_Sector is not zero but $psector"; fi
     else
        if [ "$l_msg" != "" ]; then color_msg $green "Current_Pending_Sector is zero!"; fi
     fi
   else
     if [ "$l_msg" != "" ]; then color_msg $red "smartctl -A did not output Current_Pending_Sector"; fi
     psector=-1
   fi
   if [ "$l_msg" == "" ]; then echo $psector; fi
}

#
# fix the given bad sector on the given disk with the given range of sectors to fix
#
# param 1: disk e.g. /dev/sdb
# param 2: defect sector to repair
# param 3: range - range of sectors to repair e.g. 8
# 
fixBad() {
  local l_disk="$1"
  local l_sector="$2"
  local l_range="$3"
  # never start repairing without a sector - an empty value would target sector 0
  if [ -z "$l_sector" ]
  then
    error "no sector given for $l_disk - refusing to repair"
  fi
  local diskserial=$(getSerialNumber $l_disk)
  color_msg $blue "repairing sector $l_sector to $l_sector+$l_range on $l_disk ..."
  r=$(getRange $l_range)
  for i in $r ; do
    let b1=$l_sector+$i
    if [ "$dry" == "true" ]
    then
      # dry run: only show the command
      echo hdparm --repair-sector $b1 --yes-i-know-what-i-am-doing $l_disk
    else
      if [ "$fix" == "true" ]
      then
        # only write if the serial number given with -s matches the disk
        if [ "$serial" != "$diskserial" ]
        then
          color_msg $red "you need to provide the serial number of $l_disk to perform fix operations"
        else
          $sudo hdparm --repair-sector $b1 --yes-i-know-what-i-am-doing $l_disk >> /tmp/smart_repaired.log
        fi
      fi
    fi
  done
  #tail -n 60 /tmp/smart_repaired.log | grep writing | tail -n 20
  #grep '#' /tmp/smart | head -5
  #hdparm -I $disk > /tmp/hdparm
}

#
# check the needed software
#
checkSoftware() {
  for sw in gawk debugfs fdisk hdparm smartctl tune2fs python $sudo
  do
    bin=$(which $sw)
    if [ $? -eq 0 ]
    then
      if [ "$verbose" == "true" ]
      then
        color_msg $green "will use $bin as $sw"
      fi
    else
      error "$0 needs $sw to work please install it"
    fi
  done
}

#
# run a test for the given disk in the given mode
# 
# params
#   1: l_disk: the disk under test e.g. /dev/sdb
#   2: l_mode: the mode of the self test e.g. short/long 
function runTest() {
   local l_disk="$1"
   local l_mode="$2"
   color_msg $blue  "running $l_mode smartctl test for $l_disk ..."
   $sudo smartctl -t $l_mode $l_disk > /dev/null
}

#
# check the given disk in the given mode
#
function checkDisk() {
   local l_disk="$1"
   local l_mode="$2"
   local l_serial="$3"
   fs=$(getPartition $l_disk)
   blocksize=$(getBlockSize $fs)
   startsector=$(getStartSector $l_disk $fs)
   color_msg $blue "checking Current_Pending_Sector count for $l_disk partition $fs blocksize $blocksize startsector $startsector"
   getCurrentPendingSector "$l_disk" show
   psector=$(getCurrentPendingSector "$l_disk")
   if [ $psector -gt 0 ]
   then
     runTest $l_disk $l_mode
   fi
}

#
# check the lba block
#
function lbaCheck() {
  local l_disk="$1"
  fs=$(getPartition $l_disk)
  blocksize=$(getBlockSize $fs)
  startsector=$(getStartSector $l_disk $fs)
  readResult "$l_disk" selftest | while read line
  do
    echo $line | grep "read failure" > /dev/null
    if [ $? -eq 0 ]
    then
      if [ "$verbose" == "true" ]
      then
        echo $line
      fi
      index=$(echo $line | cut -f2 -d' ')
      state=$(echo $line | cut -f3-4 -d ' ')
      progress=$(echo $line | cut -f8 -d ' ')
      lba=$(echo $line | cut -f10 -d ' ')
      if [ "$lba" == "" ]
      then 
        lba=0
      fi
      if  [ "$lba" -gt 0 ]
      then
        echo $index $state 
        echo "progress:  $progress"
        echo "lba: $lba"
        # calculate the file system block
        fsb=$(gawk -v L=$lba -v S=$startsector -v B=$blocksize 'BEGIN {printf ("%.0f",((L-S)*512/B))}')
        echo "file system block: $fsb"
        fixBad $l_disk $lba $range
      fi
    fi
  done
}

#
# try Fixing bad sectors
#
function tryFix() {
   local l_disk="$1"
   badsect=$($sudo smartctl -l selective ${l_disk} | grep Completed | cut -f2 -d "(" | cut -f2 -d'-'| sed -e 's/)//')
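   # stop if no bad sector was reported - an empty value must never reach fixBad
   if [ -z "$badsect" ] || [ "$badsect" = "-" ]; then exit 0; fi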
   color_msg $blue "attempting to fix sector $badsect on $l_disk"
   fixBad $l_disk $badsect $range
}

#
# start a check loop on the given drive
#
function checkLoop() {
   local baddrive="$1"
   badsect=1
   while true; do
      color_msg $blue "Testing $baddrive from LBA $badsect"
      # discard smartctl's normal output, keep error messages visible
      $sudo smartctl -t select,${badsect}-max ${baddrive} > /dev/null
      waitForResult $baddrive selective 5
      tryFix $baddrive
      color_msg $blue "running next test" 
   done
}
  
# make sure the needed software is available
checkSoftware
# commandline option
while [  "$1" != ""  ]
do
  option=$1
  shift
  case $option in
    -a|--try)
      tryFix $disk
      ;;
    -c|--check)
      checkDisk $disk $mode $serial
      ;;
    -d|--dry)
      dry=true
      ;;
    -f|--fix)
      fix=true
      ;;
    -h|--help)
       usage
       ;;
    -i|--info)
      getInfo $disk
      ;;
    -m|--mode)
      if [ $# -lt 1 ]
      then
        usage
      else
        mode=$1
        shift
      fi
      ;;
    -l|--loop)
      checkLoop $disk
      ;;
    -r|--range)
      if [ $# -lt 1 ]
      then
        usage
      else
        range=$1
        shift
      fi
      ;;
    -s|--serial)
      if [ $# -lt 1 ]
      then
        getSerialNumber $disk
        exit 1
      else
        serial=$1
        shift
      fi
      ;;
    -t|--test)
      runTest $disk $mode
      ;;
    -v|--verbose)
      verbose=true
      ;;
    -w|--wait)
      if [ $# -lt 1 ]
      then
        usage
      else
        type=$1
        shift
        waitForResult $disk $type 5
      fi
      ;;
    -x|--lba)
      lbaCheck $disk $serial;;
    *)
      disk=$option
      ;;
  esac
done
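getStartSector parses fdisk -l, whose column layout depends on the partition table type (for example, a DOS boot flag adds a column). On Linux, the kernel also exposes the start of each partition in sysfs, in 512-byte units. The sketch below uses that alternative. partition_start is a made-up name, and the optional second argument exists only so the function can be tested against a fake directory:

```shell
#!/bin/bash
# Sketch: read the first sector (in 512-byte units) of a partition from
# sysfs instead of parsing "fdisk -l". Linux only.
partition_start() {
  local part="${1#/dev/}"                 # /dev/sdb1 -> sdb1
  local sysfs="${2:-/sys/class/block}"    # overridable for testing
  if [ ! -r "$sysfs/$part/start" ]; then
    echo "no start sector found for $1" >&2
    return 1
  fi
  cat "$sysfs/$part/start"
}

# usage: partition_start /dev/sdb1
```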