LTO-4 drive disappears after using tar and I/O error

I’ve got a very similar issue to the thread at IT Resource Center forums - Tar command issue with Ultrium LTO-4 1760 on SAS bus with an EH919A - HP LTO4 ULTRIUM 1760 SAS Internal Drive attached to an HP SC44Ge on openSUSE 11.2. I’ve posted this question at itrc.hp.com as well but without any responses as of yet. Hopefully somebody here can lend a hand.

The drive shows up as follows:

server3:~ # hwinfo --tape
30: SCSI 500.0: 10601 Tape
[Created at scsi.1452]
UDI: /org/freedesktop/Hal/devices/pci_1000_58_scsi_host_scsi_device_lun0_scsi_generic
Unique ID: Er1e.jH5QXkQpQf6
Parent ID: B35A.C24hfcjfg26
SysFS ID: /class/scsi_tape/st0
SysFS BusID: 5:0:0:0
SysFS Device Link: /devices/pci0000:00/0000:00:0a.0/0000:02:00.0/host5/port-5:0/end_device-5:0/target5:0:0/5:0:0:0
Hardware Class: unknown
Model: “HP Ultrium 4-SCSI”
Vendor: “HP”
Device: “Ultrium 4-SCSI”
Revision: “U52D”
Driver: “mptsas”, “st”
Driver Modules: “mptsas”
Device File: /dev/st0 (/dev/sg3)
Device Files: /dev/st0, /dev/char/9:0, /dev/tape/by-id/scsi-3500110a0012cee22
Device Number: char 9:0 (char 21:3)
Config Status: cfg=no, avail=yes, need=no, active=unknown
Attached to: #26 (SCSI storage controller)

and mt reports:

server3:~ # mt -f /dev/st0 status
drive type = Generic SCSI-2 tape
drive status = 1174405120
sense key error = 0
residue count = 0
file number = -1
block number = -1
Tape block size 0 bytes. Density code 0x46 (unknown).
Soft error count since last status=0
General status bits on (1010000):
ONLINE IM_REP_EN
server3:~ #

The device files present are:

server3:~ # ls -l /dev/st0
crw-rw---- 1 root tape 9, 128 Oct 30 12:05 /dev/nst0
crw-rw---- 1 root tape 9, 224 Oct 30 12:05 /dev/nst0a
crw-rw---- 1 root tape 9, 160 Oct 30 12:05 /dev/nst0l
crw-rw---- 1 root tape 9, 192 Oct 30 12:05 /dev/nst0m
crw-rw---- 1 root tape 9, 0 Oct 30 12:05 /dev/st0
crw-rw---- 1 root tape 9, 96 Oct 30 12:05 /dev/st0a
crw-rw---- 1 root tape 9, 32 Oct 30 12:05 /dev/st0l
crw-rw---- 1 root tape 9, 64 Oct 30 12:05 /dev/st0m

/proc/scsi/scsi reports:

server3:/proc/scsi # cat scsi
Attached devices:
Host: scsi2 Channel: 00 Id: 00 Lun: 00
Vendor: TSSTcorp Model: DVD±RW TS-H653G Rev: DW10
Type: CD-ROM ANSI SCSI revision: 05
Host: scsi4 Channel: 00 Id: 32 Lun: 00
Vendor: DP Model: BACKPLANE Rev: 1.07
Type: Enclosure ANSI SCSI revision: 05
Host: scsi4 Channel: 02 Id: 00 Lun: 00
Vendor: DELL Model: PERC 6/i Adapter Rev: 1.22
Type: Direct-Access ANSI SCSI revision: 05
Host: scsi5 Channel: 00 Id: 00 Lun: 00
Vendor: HP Model: Ultrium 4-SCSI Rev: U52D
Type: Sequential-Access ANSI SCSI revision: 05

Small backups done using dd or tar complete without any problem. So far I’ve tested up to about 500MB. I’m not exactly sure yet, but somewhere in the small GB range, tar reports an I/O error on st0 and then the device files disappear - i.e. /dev/st0 is gone. The only way to get them back it seems is to not just reboot, but to power off, and wait a few minutes.

Here’s what happens on a 5 GB test backup:

server3:~ # mt -f /dev/st0 status
drive type = Generic SCSI-2 tape
drive status = 1174405120
sense key error = 0
residue count = 0
file number = 0
block number = 0
Tape block size 0 bytes. Density code 0x46 (unknown).
Soft error count since last status=0
General status bits on (41010000):
BOT ONLINE IM_REP_EN
server3:~ # tar -cf /dev/st0 /home/bob/
tar: Removing leading `/’ from member names
tar: /dev/st0: Cannot write: Input/output error
tar: Error is not recoverable: exiting now
server3:~ # echo $?
2
server3:~ # mt -f /dev/st0 status
mt: /dev/st0: rmtopen failed: No such device or address

and in fact /dev/st0 is now missing. I shut down udev (after power cycling the box to get the drive to show up again) to see what would happen - the device files now persist as expected, however, the error is the same and the drive is inaccessible.

After the I/O error is reported, the following (snippet) is seen in dmesg:

[191698.629375] st0: Block limits 1 - 16777215 bytes.
[218431.969364] mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
[218432.450842] mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
[218432.450875] st0: Error 20000 (driver bt 0x0, host bt 0x2).
[218434.696385] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
[218434.696414] st0: Error 80000 (driver bt 0x0, host bt 0x8).
[218434.696420] st0: Error on write filemark.
[218434.696441] st0: Error 10000 (driver bt 0x0, host bt 0x1).
[218434.696855] end_device-5:0: mptsas: ioc0: removing ssp device: fw_channel 0, fw_id 4, phy 3,sas_addr 0x500110a0012cee20
[218434.696860] phy-5:3: mptsas: ioc0: delete phy 3, phy-obj (0xffff88022f1d9000)
[218434.696870] port-5:0: mptsas: ioc0: delete port 0, sas_addr (0x500110a0012cee20)
[218434.697167] scsi target5:0:0: mptsas: ioc0: delete device: fw_channel 0, fw_id 4, phy 3, sas_addr 0x500110a0012cee20
[218492.093127] mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 4, phy 3, sas_addr 0x500110a0012cee20
[218512.709936] mptscsih: ioc0: attempting task abort! (sc=ffff88022f25c5c0)
[218512.709943] scsi 5:0:1:0: CDB: Inquiry: 12 00 00 00 24 00
[218517.033156] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
[218517.033308] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88022f25c5c0)
[218546.973836] mptscsih: ioc0: attempting task abort! (sc=ffff88022f25c5c0)
[218546.973843] scsi 5:0:1:0: CDB: Test Unit Ready: 00 00 00 00 00 00
[218551.464821] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
[218551.464998] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88022f25c5c0)
[218551.465008] mptscsih: ioc0: attempting target reset! (sc=ffff88022f25c5c0)
[218551.465014] scsi 5:0:1:0: CDB: Inquiry: 12 00 00 00 24 00
[218552.463027] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88022f25c5c0)
[218582.403452] mptscsih: ioc0: attempting task abort! (sc=ffff88022f25c5c0)
[218582.403459] scsi 5:0:1:0: CDB: Test Unit Ready: 00 00 00 00 00 00
[218586.894622] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
[218586.894814] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88022f25c5c0)
[218586.894824] mptscsih: ioc0: attempting bus reset! (sc=ffff88022f25c5c0)
[218586.894830] scsi 5:0:1:0: CDB: Inquiry: 12 00 00 00 24 00
[218587.892803] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff88022f25c5c0)
[218627.814237] mptscsih: ioc0: attempting task abort! (sc=ffff88022f25c5c0)
[218627.814245] scsi 5:0:1:0: CDB: Test Unit Ready: 00 00 00 00 00 00
[218632.304679] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
[218632.304865] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88022f25c5c0)
[218632.304886] mptscsih: ioc0: attempting host reset! (sc=ffff88022f25c5c0)
[218632.304893] mptbase: ioc0: Initiating recovery
[218686.364004] INFO: task mpt/0:147 blocked for more than 120 seconds.
[218686.364014] “echo 0 > /proc/sys/kernel/hung_task_timeout_secs” disables this message.
[218686.364021] mpt/0 D 0000000000000002 0 147 2 0x00000000
[218686.364031] ffff88022c7ef5d0 0000000000000046 ffff88022c7ef550 0000000000013a00
[218686.364041] 0000000000013a00 ffff88022cbd6c78 0000000000013a00 0000000000013a00
[218686.364053] 0000000000013a00 0000000000013a00 ffff88022cbd6c78 0000000000013a00
[218686.364059] Call Trace:
[218686.364072] <ffffffff81554225>] schedule_timeout+0x1c5/0x220
[218686.364079] <ffffffff815531fa>] wait_for_common+0xca/0x1a0
[218686.364086] <ffffffff8155341b>] wait_for_completion+0x2b/0x50
[218686.364094] <ffffffff81260ce7>] blk_execute_rq+0x87/0xe0
[218686.364102] <ffffffff8136e401>] scsi_execute+0x121/0x1a0
[218686.364109] <ffffffff8136e6f7>] scsi_execute_req+0xc7/0x210
[218686.364115] <ffffffff8136f9eb>] scsi_probe_lun+0x11b/0x450
[218686.364121] <ffffffff81370707>] scsi_probe_and_add_lun+0x197/0x5b0
[218686.364128] <ffffffff81371956>] __scsi_scan_target+0xe6/0x210
[218686.364134] <ffffffff81371faf>] scsi_scan_target+0xdf/0xf0
[218686.364148] <ffffffffa002a11a>] sas_rphy_add+0x12a/0x180 [scsi_transport_sas]
[218686.364159] <ffffffffa005d609>] mptsas_add_end_device+0x109/0x140 [mptsas]
[218686.364168] <ffffffffa0062e78>] mptsas_hotplug_work+0x288/0x4c0 [mptsas]
[218686.364178] <ffffffffa0063178>] mptsas_send_sas_event+0xc8/0x100 [mptsas]
[218686.364188] <ffffffffa00635c8>] mptsas_firmware_event_work+0x188/0x210 [mptsas]
[218686.364198] <ffffffff810857ea>] run_workqueue+0x9a/0x1e0

Presumably, this is due to the device file having disappeared.

Does anybody have any thoughts or suggestions on how to troubleshoot this further? One potential avenue I might explore is using the HP supplied driver for SLES as opposed to the out of the box driver in openSUSE. I’ve since looked at this and the supported kernels for the SC44Ge controller at hp.com for SLES do not match the kernel running on openSUSE 11.2.

I’m guessing at this point that it’s a driver issue with the SC44Ge rather than the tape drive itself. Clearly, there is an hp supplied SC44Ge driver for SLES but not for openSUSE which is using the native mptbase module.

Many thanks in advance!

I’ve had this recently with an LSI SAS controller and a HP autoloader.

Turns out that if the “blocking factor” is incorrect, your drive may disappear after attempting some writes. Blocking factor for LTO-5 for example, is 1024.

HTH