KB 10014 

What type of error handling do Vtrak and SuperTrak offer?

 

Bad Block Tables and Their Handling

Introduction

Bad block handling in RAID implementations is often confusing to end users. This article clarifies how the Promise RAID core handles some of the cases that can be encountered in the field.


BBM - Bad Block Map


Feature Description


The Bad Block Map (BBM) keeps track of the grown defect entries for each physical drive. Each entry is the physical location of an LBA. For SATA drives, the BBM is kept in the DDF (the Promise metadata region, which is not exposed to the host) of the physical drive. It records the error sector entries along with the reconstructed data for ATA drives, and an entry is made only if the drive’s auto reassignment of the sector fails. The DDF can store the data for up to 512 remapped sectors per physical drive. For SCSI drives, a request for the BBM list returns the drive’s G-List. The user can view this list for each drive with the BBM command from the CLI. The BBM command is not supported in the CLU or GUI.

Problem Manifestation

When a drive encounters a medium error on a physical LBA, the firmware regenerates the data and rewrites the error location. For SCSI drives, if the write command fails, a reassign operation follows. SATA drives have no equivalent of the G-List data structure, so the RAID firmware virtualizes it. SATA drives attempt an auto reassign on a write, so if a write command fails, either with a medium error or with a time-out, an entry is made in the BBM.

Problem Causes

Entries are made in the BBM list when the drive returns a medium error for a write command. SATA drives auto reassign failing sectors on a write and fail the write command if the re-map area on the drive is full. Thus, once auto reassign on the drive fails, the entry is made in the BBM. This error recovery is not visible to the user unless the BBM is full. When the list is full, the command is returned with a medium error. In that scenario, where the error sectors cannot be reassigned in the BBM area, the drive is marked dead. For SAS drives, a write error is followed by a reassign command, which in turn updates the drive’s G-List. If the reassign command fails on a SAS drive, the drive is marked dead.
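
The write-failure handling described in this section can be sketched as a small decision function. This is an illustrative model only, not the Promise firmware: the names `Drive`, `handle_write_failure`, and `BBM_CAPACITY` are hypothetical, and the 512-entry limit is taken from the Feature Description.

```python
from dataclasses import dataclass, field

# Remapped sectors that can be stored per physical drive (from the Feature Description).
BBM_CAPACITY = 512


@dataclass
class Drive:
    """Hypothetical model of a physical drive as seen by the RAID firmware."""
    kind: str                                # "SATA" or "SAS"
    reassign_ok: bool = True                 # SAS only: does the reassign command succeed?
    bbm: dict = field(default_factory=dict)  # SATA only: physical LBA -> reconstructed data
    dead: bool = False


def handle_write_failure(drive: Drive, lba: int, data: bytes) -> str:
    """Decide what happens after a write to `lba` fails with a medium error or time-out."""
    if drive.kind == "SAS":
        # SAS: the write error is followed by a reassign command (updates the G-List).
        if drive.reassign_ok:
            return "reassigned"
        drive.dead = True
        return "dead"

    # SATA: the drive's own auto reassign has already failed at this point,
    # so the sector and its reconstructed data go into the BBM.
    if len(drive.bbm) < BBM_CAPACITY:
        drive.bbm[lba] = data
        return "bbm_entry"

    # BBM full: the sector cannot be reassigned, so the drive is marked dead.
    drive.dead = True
    return "dead"
```

For example, a SATA drive accepts 512 such entries without any visible effect, and the 513th failing sector marks it dead, which is why the BBM entry count is the figure to watch.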

Problem Resolution

No user action is necessary. The BBM command only shows how many entries have been made for each drive. Physical drive replacement is recommended only once the BBM is full, because any further errors on the drive can bring it offline.


IBT - Inconsistent Block Table

Feature Description

Entries in this table are data blocks found to be inconsistent when a Redundancy Check operation is executed. These entries are recorded in the Inconsistent Block Table. The firmware does not use this table; its data is passed to the user upon request to determine which locations on the LUN are suspect. The data blocks in the table are logical LBAs. The user can view this table with the CLI command CHECKTABLE. Deleting the logical drive or the array deletes this table.

Problem Manifestation

It is extremely rare for this error to occur “naturally”. When it does occur, an unspecified amount of data, up to the number of stripes affected, is in question. It is also possible that only the parity is affected and the data is okay.

Problem Causes

This happens when the parity or the mirror data does not match the data. This is initially true while the system is initializing, but it is not expected during normal operation, where it is the result of an undetected data corruption on the physical drive. The user can force this condition by marking a drive offline, completing write operations to the critical logical drive, and then forcing the offline drive back online.
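
As an illustration of what "parity does not match the data" means, the following sketch checks stripes under the assumption of RAID 5 style XOR parity. The function names are hypothetical, the actual Redundancy Check algorithm is not described in this article, and stripe indices are used here in place of the logical LBAs that the real table records.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together (RAID 5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def redundancy_check(stripes):
    """Return the indices of stripes whose stored parity does not match the data.

    `stripes` is a list of (data_blocks, parity) tuples. The returned list
    stands in for the Inconsistent Block Table.
    """
    return [index for index, (data, parity) in enumerate(stripes)
            if xor_blocks(data) != parity]
```

Note that a mismatch alone does not say whether the data or the parity is wrong, which is why the affected locations are only reported as suspect.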

Problem Resolution

The user needs to check the data on the affected LUN. If no data errors are detected, no user action is necessary, as the default redundancy check operation (Auto Fix enabled) will fix the parity or mirror metadata. If data errors are detected, the affected files will need to be restored. After file restoration, an additional redundancy check is recommended, as well as a file system check at the host level.

RCT - Read Check Table

Feature Description

An Invalid Data Block is a sector that is physically good but whose data may be invalid. The Read Check Table records the logical LBAs that have invalid data. A host read to such an LBA is returned as a Medium Error. A write to that LBA clears the entry. The data blocks in the table are logical LBAs. The user can view this table with the CLI command CHECKTABLE. Deleting the logical drive or the array deletes this table.
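
The read and write behavior of an RCT entry can be modeled in a few lines. This is a minimal sketch with hypothetical names (`LogicalDrive`, `MediumError`); it only mirrors the rules stated above and says nothing about how the firmware stores the table.

```python
class MediumError(Exception):
    """Stands in for the medium error returned to the host."""


class LogicalDrive:
    """Hypothetical logical drive with a Read Check Table (RCT)."""

    def __init__(self):
        self.blocks = {}  # logical LBA -> data
        self.rct = set()  # logical LBAs whose data is invalid

    def read(self, lba):
        # A host read to an LBA listed in the RCT is returned as a medium error.
        if lba in self.rct:
            raise MediumError(f"LBA {lba} is listed in the RCT")
        return self.blocks.get(lba, b"")

    def write(self, lba, data):
        # A host write stores fresh data and clears the RCT entry.
        self.blocks[lba] = data
        self.rct.discard(lba)
```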

Problem Manifestation
The user will see errors for read commands on non-redundant logical drives. The host will see medium errors from read commands to a particular LBA location.


Problem Causes
The issue is caused by unrecoverable medium errors returned from a physical drive for read operations while the logical drive is not in a redundant state. Thus the RAID firmware is unable to re-generate and write the correct data to the drive. An example of this case is if the controller is rebuilding a RAID 5 and it encounters an error while reading from one of the online members to reconstruct data.
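
The RAID 5 rebuild example can be sketched as follows, assuming XOR parity. The function `rebuild_stripes` is hypothetical, a `None` block stands in for an unrecoverable medium error on a surviving member, and stripe indices are used in place of the logical LBAs that the real RCT records.

```python
from functools import reduce


def rebuild_stripes(surviving_members, num_stripes):
    """Regenerate the missing member of a RAID 5 set, stripe by stripe.

    `surviving_members` is a list of per-drive lists, indexed by stripe.
    Returns (rebuilt, rct): the regenerated blocks and the stripes that
    could not be regenerated.
    """
    rebuilt = {}
    rct = []
    for stripe in range(num_stripes):
        blocks = [member[stripe] for member in surviving_members]
        if any(block is None for block in blocks):
            # No redundancy is left to recover from, so the location is recorded.
            rct.append(stripe)
            continue
        rebuilt[stripe] = bytes(
            reduce(lambda a, b: a ^ b, column) for column in zip(*blocks)
        )
    return rebuilt, rct
```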


Problem Resolution
This is corrected by rewriting or restoring the data to that location. This is normally handled by the operating system’s file system, which will flag any files affected by the error. Typically a disk verify operation (chkdsk/fsck) is necessary to check all of the partitions. In this case, however, any affected files need to be restored.


WCT - Write Check Table


Feature Description

Entries in this table are LBAs for which the read-verify command that follows a write command to the physical drive failed and the reassign command also failed (this can happen if the BBM is full). The Write Check Table records these LBA entries. Any read or write to these blocks is returned with a medium error. The data blocks in the table are logical LBAs. The user can view this table with the CLI command CHECKTABLE. Deleting the logical drive or the array deletes this table. The rebuild operation converts any WCT entry that falls on the rebuilding physical drive to an RCT entry.
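
The difference between WCT and RCT entries, and the conversion performed by a rebuild, can be sketched as below. The class `CheckTables` and its methods are hypothetical, and keeping a physical drive identifier alongside each WCT entry is an assumption made so that the sketch can tell which entries fall on the rebuilding drive.

```python
class MediumError(Exception):
    """Stands in for the medium error returned to the host."""


class CheckTables:
    """Hypothetical per-logical-drive view of the RCT and WCT."""

    def __init__(self):
        self.rct = set()  # logical LBAs with invalid data
        self.wct = {}     # logical LBA -> id of the physical drive with the failing sector

    def check_read(self, lba):
        # Reads fail for both WCT and RCT entries.
        if lba in self.wct or lba in self.rct:
            raise MediumError(f"read of LBA {lba} failed")

    def check_write(self, lba):
        # Writes fail only for WCT entries; a write to an RCT entry clears it.
        if lba in self.wct:
            raise MediumError(f"write of LBA {lba} failed")
        self.rct.discard(lba)

    def rebuild(self, drive_id):
        # WCT entries that fall on the rebuilding drive become RCT entries.
        for lba in [l for l, d in self.wct.items() if d == drive_id]:
            del self.wct[lba]
            self.rct.add(lba)
```

After the rebuild, the converted locations behave like any other RCT entry: reads still fail, but the next host write to the location succeeds and clears it.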

Problem Manifestation

The user detects this problem when both read and write operations to the same location fail with medium errors. This happens when there are write error blocks in the WCT. It is considered a rare event.

Problem Causes

For SAS drives, the drive has had many medium errors and its G-List is full. For SATA drives, the BBM is full.

Problem Resolution

It is recommended to replace the physical drive that is giving the errors (reconstructing its data) before any other corrective action is taken. Any files affected by this error will be flagged by the operating system and will need to be restored.