KB 10014
What type of error handling do VTrak and SuperTrak offer?
Bad Block Tables and Their Handling
Introduction
Bad block handling in RAID implementations is often confusing to end users.
This article clarifies how the Promise RAID core handles the bad block cases
that may be encountered in the field.
BBM - Bad Block Map
Feature Description
The Bad Block Map (BBM) keeps track of the grown defect entries for each
physical drive. Each entry records the physical location of an LBA. For SATA
drives, the map is kept in the DDF (a Promise metadata region not exposed to the
host) of the physical drive, where it tracks the error sector entries along with
the reconstructed data for those sectors. An entry is made only if the drive’s
own auto reassignment of the sector fails. The DDF can store the data for up to
512 remapped sectors per physical drive. For SCSI drives, a request for the BBM
list returns the drive’s G-List. The user can view this list for each drive
with the BBM command from the CLI. The BBM command is not supported in the CLU
or the GUI.
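As an illustration only, the per-drive map described above can be modeled as a
bounded table keyed by physical LBA. This is a minimal sketch, not the firmware's
actual layout: the class name BadBlockMap, its methods, and the choice to refresh
an existing entry in place are assumptions made for the example. Only the limit
of 512 entries comes from this article.

```python
from dataclasses import dataclass, field

BBM_CAPACITY = 512  # per-drive limit stated in the article


@dataclass
class BadBlockMap:
    """Hypothetical per-drive map: physical LBA -> reconstructed sector data."""
    capacity: int = BBM_CAPACITY
    entries: dict = field(default_factory=dict)

    def is_full(self) -> bool:
        return len(self.entries) >= self.capacity

    def add(self, physical_lba: int, data: bytes) -> bool:
        """Record a remapped sector; return False when there is no room left."""
        if physical_lba in self.entries:
            # Assumption: rewriting a known bad sector refreshes its entry.
            self.entries[physical_lba] = data
            return True
        if self.is_full():
            return False
        self.entries[physical_lba] = data
        return True

    def lookup(self, physical_lba: int):
        """Return the stored data for a remapped sector, or None."""
        return self.entries.get(physical_lba)
```

In this model a False return from add() corresponds to the "BBM is full"
condition discussed in the following sections.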
Problem Manifestation
When a drive encounters a medium error on a physical LBA, the firmware
regenerates the data and rewrites the error location. If the write command
fails on a SCSI drive, a reassign operation follows. SATA drives do not have an
equivalent of the G-List data structure, so the RAID firmware virtualizes it.
SATA drives attempt an auto reassign on a write. Thus if a write command fails,
either with a medium error or with a time-out, an entry is made in the BBM.
Problem Causes
An entry is made in the BBM list when the drive returns a medium error for a
write command. SATA drives automatically reassign failing sectors during a
write command and fail the command only if the re-map area on the drive is
full. Thus once auto reassign on the drive fails, the entry is made in the BBM.
This error recovery is not visible to the user unless the BBM is full. When the
list is full, the command is returned with a medium error. In that scenario,
where the error sectors cannot be reassigned in the BBM area, the drive is
marked dead. For SAS drives, a write error is followed by a reassign command,
which in turn updates the drive’s G-List. If the reassign command fails on a
SAS drive, the drive is marked dead.
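The decision path described in this section can be summarized in a short
sketch. It is a simplified model under stated assumptions, not firmware code:
the Outcome enum, the function handle_write_error, and its parameters are
hypothetical names, and the drive type is passed as a plain string.

```python
from enum import Enum


class Outcome(Enum):
    REASSIGNED_BY_DRIVE = "sector auto reassigned by the SATA drive"
    RECORDED_IN_BBM = "entry made in the BBM"
    G_LIST_UPDATED = "reassign succeeded, G-List updated"
    DRIVE_MARKED_DEAD = "drive marked dead"


def handle_write_error(drive_type: str, reassign_ok: bool,
                       bbm_full: bool = False) -> Outcome:
    """Model what follows a failing write, per the article's description.

    drive_type  -- "SATA" or "SAS" (hypothetical string labels)
    reassign_ok -- SATA: the drive's auto reassign worked;
                   SAS: the reassign command worked
    bbm_full    -- SATA only: the BBM has no room for another entry
    """
    if drive_type == "SAS":
        return Outcome.G_LIST_UPDATED if reassign_ok else Outcome.DRIVE_MARKED_DEAD
    if drive_type == "SATA":
        if reassign_ok:
            return Outcome.REASSIGNED_BY_DRIVE
        if bbm_full:
            return Outcome.DRIVE_MARKED_DEAD
        return Outcome.RECORDED_IN_BBM
    raise ValueError(f"unknown drive type: {drive_type!r}")
```

For example, handle_write_error("SATA", reassign_ok=False) models the common
case in which the drive's re-map area is exhausted but the BBM still has room.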
Problem Resolution
No user action is necessary. The BBM command only shows how many entries have
been made for each drive. Physical drive replacement is recommended only once
the BBM is full, because any further errors found on the drive can bring the
drive offline.
IBT - Inconsistent Block Table
Feature Description
Entries in this table are data blocks found to be inconsistent when a
Redundancy Check operation is executed. The firmware does not use this table.
Its contents are passed to the user upon request, to determine which locations
on the LUN are suspect. The data blocks in the table are logical LBAs. The user
can view this table with the CLI command CHECKTABLE. Deleting the logical drive
or the array deletes this table.
Problem Manifestation
It is extremely rare for this error to occur “naturally”. When it does occur,
an unspecified amount of data, up to the number of stripes affected, is in
question. It is also possible that only the parity is affected and the data is
okay.
Problem Causes
This happens when the parity or the mirror data does not match the data. This
is expected while the system is initializing, but it is not expected during
normal operation, where it is the result of an undetected data corruption on
the physical drive. The user can force this condition by marking a drive
offline, completing write operations to the critical logical drive, and then
forcing the offline drive back online.
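To make the mismatch concrete, the sketch below shows how a parity check over
stripes might flag inconsistent locations. It assumes simple XOR parity, as
used in RAID 5, and a simplified layout in which each stripe is a list of data
blocks plus one parity block. The function names are hypothetical, and the
sketch reports stripe indices where the real table holds logical LBAs.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def redundancy_check(stripes):
    """Return the indices of stripes whose parity does not match the data.

    stripes -- list of (data_blocks, parity_block) tuples
    """
    inconsistent = []
    for index, (data_blocks, parity) in enumerate(stripes):
        if xor_blocks(data_blocks) != parity:
            inconsistent.append(index)
    return inconsistent


# Usage: the second stripe carries stale parity and is reported.
good = ([b"\x01\x02", b"\x04\x08"], b"\x05\x0a")
stale = ([b"\x01\x02", b"\x04\x08"], b"\x00\x00")
suspect = redundancy_check([good, stale, good])
```

Note that a mismatch alone does not say whether the data or the parity is
wrong, which is why the affected locations are treated as suspect.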
Problem Resolution
The user needs to check the data on the affected LUN. If no data errors are
detected, no user action is necessary, as the default redundancy check
operation (Auto Fix enabled) will fix the parity or mirror metadata. If data
errors are detected, the affected files need to be restored. After file
restoration, an additional redundancy check is recommended, as well as a file
system check at the host level.
RCT - Read Check Table
Feature Description
An invalid data block is a sector which is physically good but whose data may
be invalid. The Read Check Table records the logical LBAs which have invalid
data. A read to such an LBA by the host is returned as a Medium Error. A write
to that LBA clears the entry. The user can view this table with the CLI command
CHECKTABLE. Deleting the logical drive or the array deletes this table.
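The read and write behavior of an RCT entry can be sketched as follows. This is
a toy model under stated assumptions: the LogicalDrive class, its methods, and
the MediumError exception are hypothetical stand-ins for the firmware's
internal structures and for the status returned to the host.

```python
class MediumError(Exception):
    """Stand-in for a medium error status returned to the host."""


class LogicalDrive:
    """Toy logical drive with a Read Check Table of logical LBAs."""

    def __init__(self):
        self.blocks = {}  # logical LBA -> data
        self.rct = set()  # logical LBAs whose data may be invalid

    def mark_invalid(self, lba: int) -> None:
        """Record an LBA in the RCT."""
        self.rct.add(lba)

    def read(self, lba: int) -> bytes:
        # A read to an LBA listed in the RCT fails with a medium error.
        if lba in self.rct:
            raise MediumError(f"LBA {lba} is listed in the RCT")
        return self.blocks.get(lba, b"")

    def write(self, lba: int, data: bytes) -> None:
        # A write supplies fresh data, so the RCT entry is cleared.
        self.blocks[lba] = data
        self.rct.discard(lba)
```

This is why restoring or rewriting the affected data is enough to clear the
condition: the write itself removes the entry.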
Problem Manifestation
The user will see errors for read commands to non-redundant logical drives.
The host will see medium errors from read commands to a particular LBA
location.
Problem Causes
The issue is caused by unrecoverable medium errors returned from a physical
drive for read operations while the logical drive is not in a redundant state.
The RAID firmware is therefore unable to regenerate and write the correct data
to the drive. For example, the controller is rebuilding a RAID 5 logical drive
and encounters an error while reading from one of the online members to
reconstruct data.
Problem Resolution
This is corrected by rewriting or restoring the data to that location. This is
normally handled by the operating system’s file system, which flags any files
affected by the error. Typically a disk verify operation (chkdsk/fsck) is
necessary to check all of the partitions. In this case, however, any affected
files need to be restored.
WCT - Write Check Table
Feature Description
Entries in this table are LBAs for which the read-verify command that follows a
write command to the physical drive failed, and for which the reassign command
also failed (this can happen if the BBM is full). The Write Check Table records
these LBA entries. Any read or write to these blocks is returned with a medium
error. The data blocks in the table are logical LBAs. The user can view this
table with the CLI command CHECKTABLE. Deleting the logical drive or the array
deletes this table. A rebuild operation converts any WCT entry which falls on
the rebuilding physical drive to an RCT entry.
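The conversion performed by a rebuild can be illustrated with a small sketch.
It assumes, purely for the example, that each WCT entry remembers which
physical drive its logical LBA falls on. The function name and the data shapes
(a dict for the WCT, a set for the RCT) are hypothetical.

```python
def convert_wct_on_rebuild(wct, rct, rebuilding_drive):
    """Move WCT entries that fall on the rebuilding drive into the RCT.

    wct              -- dict: logical LBA -> id of the physical drive it falls on
    rct              -- set of logical LBAs
    rebuilding_drive -- id of the physical drive being rebuilt
    Returns the sorted list of LBAs that were converted.
    """
    moved = [lba for lba, drive in wct.items() if drive == rebuilding_drive]
    for lba in moved:
        del wct[lba]  # the entry no longer blocks writes
        rct.add(lba)  # reads still fail until the LBA is rewritten
    return sorted(moved)


# Usage: entries on "pd1" become RCT entries, the entry on "pd2" stays put.
wct = {10: "pd1", 20: "pd2", 30: "pd1"}
rct = set()
converted = convert_wct_on_rebuild(wct, rct, "pd1")
```

After the conversion, a host write to one of the moved LBAs would succeed and
clear the entry, following the RCT behavior described earlier in this article.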
Problem Manifestation
The user detects this problem when both read and write operations to the same
location fail with medium errors. This is considered a rare event. The error
occurs when there are write error blocks in the WCT.
Problem Causes
For SAS drives, there have been many medium errors and the G-List is full. For
SATA drives, the BBM is full.
Problem Resolution
It is recommended to replace the physical drive which is giving the errors
(reconstructing its data) before other corrective actions are taken. Any files
affected by this error will be flagged by the operating system and will need to
be restored.