
Thread: MPT3SAS / LSI 9305-16i Issues (HBA/SATA Drive resets) on Multiple machines

  1. #1

    Default MPT3SAS / LSI 9305-16i Issues (HBA/SATA Drive resets) on Multiple machines

    I think I have found a problem with the LSI 9305-16i SAS/SATA HBA (and/or the MPT3SAS driver) under LEAP 15.2 (latest patches). I have been working this issue for about 4 weeks now and I think I've pretty much tried everything I can to mitigate it (short of simply using a different HBA). I have lots of info (including dmesg dumps), but here is the basic gist of things:

    I have a couple of systems configured as prototypes for a virtual machine server (we are an OEM integrator). Both machines are running LEAP 15.2 on ASUS WS621E dual-Xeon motherboards with recent online updates, patches, drivers, BIOS, and firmware (part of the troubleshooting process). These systems run virtual machines, each on its own dedicated SSD: one VM guest to one 2.5" SATA SSD. The SATA SSDs are connected to LSI 9305-16i HBAs. The firmware and BIOS on these HBAs have been updated to the very latest revision (Dec 2020), but more on that later.

    Anywho... The host OS has been dismounting and remounting (read-only) the host filesystem on each SSD about once a week for each of these VMs. This has happened pretty consistently on BOTH machines, on two different networks, and has happened to each and every VM regardless of the guest OS (Ubuntu appliance, SUSE Linux, openSUSE SMG appliance). The host filesystem (on each SSD) was originally EXT4, but we have also tried XFS as part of the troubleshooting process. (The only difference is HOW the disconnect problem manifests itself - changing the host FS didn't resolve the issue.) Host filesystems are also mounted "noatime" and "nodiratime" to minimize unnecessary updates. Moving the VMs off the SATA drives and onto an internal RAID array mitigates the issue (likely because it takes the LSI HBA and the MPT3SAS driver out of the path), but this is not a desirable end configuration for us.

    dmesg shows the disk/HBA timeouts, and they are clearly related to MPT3SAS. One of the latest online updates actually upgraded this driver (to 35.101.00.00) but did not resolve the issue.
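
    For anyone who wants to pull the same entries out of their own logs, something like this should surface the relevant lines (the patterns are just what I've been matching on; adjust to your own devices):

    dmesg -T | grep -iE 'mpt3sas|task abort|blk_update_request|remount'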

    I also found an interesting link (from June of 2020) about SATA drive timeouts on LSI 9305 HBAs and a firmware update (16.00.12.00) which was privately released to resolve them. Specifically, the fix was supposed to resolve timeouts which only affected SATA drives on this SAS/SATA HBA. I also noticed that a similarly versioned firmware release was made publicly available in Dec 2020. Needless to say, I was pretty convinced this would solve the issue, as it was a pretty good description of what we were seeing. I applied the latest firmware release and, unfortunately, it did not:

    https://www.truenas.com/community/re...re-update.145/

    I opened a ticket with LSI/Avago/Broadcom tech support but have not heard anything back. (Surprise, surprise...) I have a bunch of data and observations (and can surely get whatever else is needed), and I'd appreciate getting a dialog started with anyone here who might be able to provide more insight and/or help resolve the issue.

  2. #2

    Default Re: MPT3SAS / LSI 9305-16i Issues (HBA/SATA Drive resets) on Multiple machines

    Update...

    It looks like the weekly TRIM operation is directly related to these issues. That operation runs at midnight every Monday morning, and that is precisely when the SATA SSDs connected to the 9305-16i typically see the "remount" (reset) issues.
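
    If you want to confirm when the timer fires on your own system, the schedule is visible with the usual systemd tooling (this is the stock fstrim.timer from the distro, nothing custom on my end):

    systemctl list-timers fstrim.timer
    systemctl cat fstrim.timer fstrim.service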



    Output from my latest fstrim logs (journalctl -b -u fstrim.service) looks like this. You can see one VM ("Dimension") drop offline with an ioctl failure:

    -- Logs begin at Fri 2021-04-16 18:12:36 CDT, end at Wed 2021-04-21 12:00:01 CDT. --
    Apr 19 00:00:01 sundance systemd[1]: Starting Discard unused blocks on filesystems from /etc/fstab...
    Apr 19 00:03:43 sundance fstrim[13415]: fstrim: /VMDisks/Dimension: FITRIM ioctl failed: Input/output error
    Apr 19 00:07:46 sundance fstrim[13415]: /VMDisks/VMD6: 460.2 GiB (494097096704 bytes) trimmed on /dev/sdi1
    Apr 19 00:07:46 sundance fstrim[13415]: /VMDisks/VMD5: 411 GiB (441261481984 bytes) trimmed on /dev/sdh1
    Apr 19 00:07:46 sundance fstrim[13415]: /VMDisks/VMD4: 435 GiB (467064438784 bytes) trimmed on /dev/sde1
    Apr 19 00:07:46 sundance fstrim[13415]: /VMDisks/Mail: 435 GiB (467063312384 bytes) trimmed on /dev/sdg1
    Apr 19 00:07:46 sundance fstrim[13415]: /VMDisks/Web: 411 GiB (441261477888 bytes) trimmed on /dev/sdd1
    Apr 19 00:07:46 sundance systemd[1]: fstrim.service: Main process exited, code=exited, status=64/n/a
    Apr 19 00:07:46 sundance systemd[1]: Failed to start Discard unused blocks on filesystems from /etc/fstab.
    Apr 19 00:07:46 sundance systemd[1]: fstrim.service: Unit entered failed state.
    Apr 19 00:07:46 sundance systemd[1]: fstrim.service: Failed with result 'exit-code'.



    Corresponding info from dmesg looks like this:

    [193707.808764] sd 10:0:2:0: attempting task abort!scmd(0x00000000a6d4920f), outstanding for 30748 ms & timeout 30000 ms
    [193707.808773] sd 10:0:2:0: [sdf] tag#4967 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
    [193707.808779] scsi target10:0:2: handle(0x001a), sas_address(0x300062b2069af741), phy(1)
    [193707.808783] scsi target10:0:2: enclosure logical id(0x500062b2069af740), slot(2)
    [193707.808786] scsi target10:0:2: enclosure level(0x0000), connector name( )
    .
    . (a bunch more like the above)
    .
    [193836.917666] scsi target10:0:2: target reset: SUCCESS scmd(0x00000000fa8806c1)
    [193837.771473] sd 10:0:2:0: Power-on or device reset occurred
    [193865.852538] sd 10:0:2:0: attempting task abort!scmd(0x00000000d20880d1), outstanding for 7028 ms & timeout 7000 ms
    [193865.852547] sd 10:0:2:0: [sdf] tag#4952 CDB: ATA command pass through(16) 85 06 20 00 00 00 00 00 00 00 00 00 00 00 e5 00
    [193865.852552] scsi target10:0:2: handle(0x001a), sas_address(0x300062b2069af741), phy(1)
    [193865.852556] scsi target10:0:2: enclosure logical id(0x500062b2069af740), slot(2)
    [193865.852559] scsi target10:0:2: enclosure level(0x0000), connector name( )
    [193865.881899] sd 10:0:2:0: [sdf] tag#4979 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
    [193865.881915] sd 10:0:2:0: [sdf] tag#4981 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
    [193865.881917] sd 10:0:2:0: [sdf] tag#4979 CDB: Write(10) 2a 08 1d d8 66 08 00 00 08 00
    [193865.881926] blk_update_request: I/O error, dev sdf, sector 500721160 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
    [193865.881929] sd 10:0:2:0: [sdf] tag#4981 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
    [193865.881933] sd 10:0:2:0: [sdf] tag#4980 timing out command, waited 180s
    [193865.881940] blk_update_request: I/O error, dev sdf, sector 500721160 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
    [193865.881951] blk_update_request: I/O error, dev sdf, sector 0 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
    [193865.881953] sd 10:0:2:0: [sdf] tag#4980 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
    [193865.881970] sd 10:0:2:0: [sdf] tag#4980 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
    [193865.881993] blk_update_request: I/O error, dev sdf, sector 249563136 op 0x3:(DISCARD) flags 0x4800 phys_seg 1 prio class 0
    [193865.882004] sd 10:0:2:0: task abort: SUCCESS scmd(0x00000000d20880d1)
    [193865.882011] Aborting journal on device sdf1-8.
    [193865.882032] sd 10:0:2:0: [sdf] tag#4978 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
    [193865.882042] sd 10:0:2:0: [sdf] tag#4978 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
    [193865.882051] blk_update_request: I/O error, dev sdf, sector 249825279 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
    [193866.521415] sd 10:0:2:0: Power-on or device reset occurred
    [193866.525954] EXT4-fs error (device sdf1): ext4_journal_check_start:61: Detected aborted journal
    [193866.525957] EXT4-fs (sdf1): Remounting filesystem read-only
    [193866.528151] EXT4-fs (sdf1): ext4_writepages: jbd2_start: 9223372036854775806 pages, ino 14; err -30


    So, it appears the LSI/Avago/Broadcom 9305-16i HBA does not "like" the fstrim operation. I don't know if this is unique to the TRIM request or if it just exacerbates some underlying issue...

    I expect I can just disable the service, but it seems like I shouldn't have to...
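
    For anyone wanting the same temporary workaround, disabling the timer (rather than the service it triggers) should be enough; roughly:

    # stop the weekly trigger; fstrim.service itself only runs when the timer fires
    systemctl disable --now fstrim.timer
    # confirm nothing is scheduled any more
    systemctl list-timers fstrim.timer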

    Anyone?

  3. #3

    Default Re: MPT3SAS / LSI 9305-16i Issues (HBA/SATA Drive resets) on Multiple machines

    I would love to know how I could get the above info in front of the engineers at LSI/Broadcom/Avago! They are not responding to the support form I submitted on their web site.

    I think I have this pretty well cornered, and you'd think they would be interested in solving the problem...

  4. #4

    Default Re: MPT3SAS / LSI 9305-16i Issues (HBA/SATA Drive resets) on Multiple machines


  5. #5

    Default Re: MPT3SAS / LSI 9305-16i Issues (HBA/SATA Drive resets) on Multiple machines

    Quote Originally Posted by Svyatko View Post
    Thanks for the reply. I am running the latest distro with all online patches applied. This is not a custom kernel. The issue has survived several online updates. As this is a fully configured environment, I'm not inclined to replace/rebuild everything. What I am already running should be a "stable kernel" insofar as it has not been customized or rebuilt in any way. (Correct me if I am wrong...)
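
    (For anyone who wants to check the same thing on their own box, something like this shows the running kernel and the distro package it came from:)

    uname -r
    rpm -qf /boot/vmlinuz-$(uname -r)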

    In other news... I did finally hear back from LSI. After a brief discussion over the phone they are requesting some additional information generated by their log collection utility. I am sending this to them now. As part of the discussion, the engineer I am working with hinted that this very well may be a TRIM issue. He is not even sure the HBA supports TRIM. (If it doesn't, I would simply think it should ignore the request.)

    Anyway, it seems that the weekly fstrim cycle may very well be a key piece of this issue... This problem may also only manifest itself if you have an SSD connected to these HBAs.
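
    One thing worth checking on an affected setup is whether the kernel even reports the HBA-attached SSDs as discard-capable (sdf here is just the device from my dmesg output; substitute your own):

    lsblk --discard /dev/sdf
    cat /sys/block/sdf/queue/discard_max_bytes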

  6. #6

    Default Re: MPT3SAS / LSI 9305-16i Issues (HBA/SATA Drive resets) on Multiple machines

    I just got the following info back from LSI (even before I submitted my diagnostics log to them). It appears that the pre-configured (automatic) fstrim service may be what is triggering these problems. I would expect that anyone attaching SATA SSDs to a 9305 HBA may well be exposed to this type of issue. I have temporarily disabled the fstrim timer to hopefully mitigate the problem. However, as these are SSDs, I think fully operational TRIM support is needed to take advantage of wear leveling in these types of drives.

    (BTW, we are using 512 GB Samsung 860 Pros in this application):
    ------------------------------------------------------------------------

    TRIM is partially supported.
    I think fstrim may be sending incorrectly encapsulated commands through the SAT-L (SCSI-to-ATA translation layer) data packet.
    See the following IT firmware limitations:

    LSI SAS HBAs with IT firmware do support TRIM, but with these limitations:

    * The drives must support both "Data Set Management TRIM supported (limit 8 blocks)" and "Deterministic read ZEROs after TRIM" in their ATA options.
    * The Samsung 850 PROs don't have "Deterministic read ZEROs after TRIM" support, and thus TRIM cannot be run on these drives when attached to an LSI SAS HBA with IT firmware.

    You can also use sg_unmap to send a SCSI UNMAP command. sg_unmap is part of sg3_utils, which can be downloaded from here: http://sg.danny.cz/sg/sg3_utils.html

    Usage: sg_unmap [--grpnum=GN] [--help] [--in=FILE] [--lba=LBA,LBA...]
                    [--num=NUM,NUM...] [--timeout=TO] [--verbose] [--version] DEVICE

    Send a SCSI UNMAP command to DEVICE to unmap one or more logical blocks. The unmap capability is closely related to the ATA DATA SET MANAGEMENT command with the "Trim" bit set. See the man page for a more detailed description: http://manpages.ubuntu.com/manpages/...g_unmap.8.html

    Example:

    In this example there is a SATA SSD at sdc. To determine the capacity of the SATA SSD:

    sg_readcap /dev/sdc

    Read Capacity results:

    Last logical block address=117231407 (0x6fccf2f), Number of block=117231408

    Logical block length=512 bytes

    Hence:

    Device size: 60022480896 bytes, 57241.9 MiB, 60.02 GB

    Then run the sg_unmap command:

    sg_unmap --lba=0 --num=117231407 /dev/sdc

    or

    sg_unmap --lba=0 --num=117231408 /dev/sdc
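
    As a quick sanity check against the two requirements LSI lists above, hdparm will show whether a drive actually advertises them (again, sdf is just my device; use your own):

    hdparm -I /dev/sdf | grep -i trim

    On a drive that meets both requirements you should see both the "Data Set Management TRIM supported" and "Deterministic read ZEROs after TRIM" lines in that output.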

  7. #7

    Default Re: MPT3SAS / LSI 9305-16i Issues (HBA/SATA Drive resets) on Multiple machines

    So, LSI/Avago/Broadcom is telling us their SAS HBAs don't fully support TRIM unless the target drive supports BOTH

    * Data Set Management TRIM supported (limit 8 blocks)
    * Deterministic read ZEROs after TRIM

    and when I ask hdparm for a detailed listing of one of our Samsung 860 Pro SSDs, it specifically says they ***DO*** support both of those features:

    hdparm -I /dev/sdf1:

    ATA device, with non-removable media
    Model Number: Samsung SSD 860 PRO 512GB
    Serial Number: S5HTNS0NA02561F
    Firmware Revision: RVM02B6Q
    Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
    Standards:
    Used: unknown (minor revision code 0x005e)
    Supported: 11 8 7 6 5
    Likely used: 11
    Configuration:
    Logical max current
    cylinders 16383 16383
    heads 16 16
    sectors/track 63 63
    --
    CHS current addressable sectors: 16514064
    LBA user addressable sectors: 268435455
    LBA48 user addressable sectors: 1000215216
    Logical Sector size: 512 bytes
    Physical Sector size: 512 bytes
    Logical Sector-0 offset: 0 bytes
    device size with M = 1024*1024: 488386 MBytes
    device size with M = 1000*1000: 512110 MBytes (512 GB)
    cache/buffer size = unknown
    Form Factor: 2.5 inch
    Nominal Media Rotation Rate: Solid State Device
    Capabilities:
    LBA, IORDY(can be disabled)
    Queue depth: 32
    Standby timer values: spec'd by Standard, no device specific minimum
    R/W multiple sector transfer: Max = 1 Current = 1
    DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
    Cycle time: min=120ns recommended=120ns
    PIO: pio0 pio1 pio2 pio3 pio4
    Cycle time: no flow control=120ns IORDY flow control=120ns
    Commands/features:
    Enabled Supported:
    * SMART feature set
    Security Mode feature set
    * Power Management feature set
    * Write cache
    * Look-ahead
    * Host Protected Area feature set
    * WRITE_BUFFER command
    * READ_BUFFER command
    * NOP cmd
    * DOWNLOAD_MICROCODE
    SET_MAX security extension
    * 48-bit Address feature set
    * Device Configuration Overlay feature set
    * Mandatory FLUSH_CACHE
    * FLUSH_CACHE_EXT
    * SMART error logging
    * SMART self-test
    * General Purpose Logging feature set
    * WRITE_{DMA|MULTIPLE}_FUA_EXT
    * 64-bit World wide name
    Write-Read-Verify feature set
    * WRITE_UNCORRECTABLE_EXT command
    * {READ,WRITE}_DMA_EXT_GPL commands
    * Segmented DOWNLOAD_MICROCODE
    * Gen1 signaling speed (1.5Gb/s)
    * Gen2 signaling speed (3.0Gb/s)
    * Gen3 signaling speed (6.0Gb/s)
    * Native Command Queueing (NCQ)
    * Phy event counters
    * READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
    * DMA Setup Auto-Activate optimization
    Device-initiated interface power management
    * Asynchronous notification (eg. media change)
    * Software settings preservation
    Device Sleep (DEVSLP)
    * SMART Command Transport (SCT) feature set
    * SCT Write Same (AC2)
    * SCT Error Recovery Control (AC3)
    * SCT Features Control (AC4)
    * SCT Data Tables (AC5)
    * reserved 69[4]
    * DOWNLOAD MICROCODE DMA command
    * SET MAX SETPASSWORD/UNLOCK DMA commands
    * WRITE BUFFER DMA command
    * READ BUFFER DMA command
    * Data Set Management TRIM supported (limit 8 blocks)
    * Deterministic read ZEROs after TRIM

    Security:
    Master password revision code = 65534
    supported
    not enabled
    not locked
    not frozen
    not expired: security count
    supported: enhanced erase
    2min for SECURITY ERASE UNIT. 2min for ENHANCED SECURITY ERASE UNIT.
    Logical Unit WWN Device Identifier: 5002538e30a290ec
    NAA : 5
    IEEE OUI : 002538
    Unique ID : e30a290ec
    Device Sleep:
    DEVSLP Exit Timeout (DETO): 50 ms (drive)
    Minimum DEVSLP Assertion Time (MDAT): 30 ms (drive)
    Checksum: correct


    So, something is not as it appears... I have asked LSI to clarify this information, as according to hdparm (and LSI's own criteria) we should NOT be having any TRIM (fstrim) related issues...

  8. #8

    Default Re: MPT3SAS / LSI 9305-16i Issues (HBA/SATA Drive resets) on Multiple machines

    So you're playing "guess what" games, holding back needed info about the used drives... Not good...

    Samsung SSDs have a long history of errors:

    1. With TRIM https://www.algolia.com/blog/enginee...ot-that-solid/
    https://git.kernel.org/pub/scm/linux...ad2c8ae67acf81
    Now: https://github.com/torvalds/linux/bl.../libata-core.c
    Check your libata-core.c.

    2. With NCQ - the 860 SATA series.
    I had problems with Samsung 860 Evo + NCQ + some chipsets.

    Check what you're using, and what works, with different controllers.
    IMHO Samsung products poison systems.

  9. #9

    Default Re: MPT3SAS / LSI 9305-16i Issues (HBA/SATA Drive resets) on Multiple machines

    Quote Originally Posted by Svyatko View Post
    So you're playing "guess what" games, holding back needed info about the used drives... Not good...

    Samsung SSDs have a long history of errors:

    1. With TRIM https://www.algolia.com/blog/enginee...ot-that-solid/
    https://git.kernel.org/pub/scm/linux...ad2c8ae67acf81
    Now: https://github.com/torvalds/linux/bl.../libata-core.c
    Check your libata-core.c.

    2. With NCQ - the 860 SATA series.
    I had problems with Samsung 860 Evo + NCQ + some chipsets.

    Check what you're using, and what works, with different controllers.
    IMHO Samsung products poison systems.
    Why so angry?

    These are BRAND NEW drives. These are not "used". ...and I'm NOT playing any "games" or "holding anything back". I clearly identified that we were using 860 Pros in an earlier post. (But thanks for that anyway...)

    Yes, I am aware of the blacklisting of Samsung SSDs, but the code I find (listed on the net) is not always consistent. Sometimes it basically says "Samsung*", sometimes "Samsung 8*", and sometimes it specifically calls out the 840s and 850s. (I'm not sure which variant is being used in my specific LEAP 15.2 kernel.)
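
    (If anyone wants to check which variant their own kernel carries, and assuming the matching kernel-source package is installed, the blacklist table can be inspected directly:)

    # requires the kernel-source package for the running kernel (zypper in kernel-source);
    # the path may be /usr/src/linux-<version> depending on how the package lays it out
    grep -n 'Samsung' /usr/src/linux/drivers/ata/libata-core.c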

    ...and FSTRIM is absolutely "trying" to work on these drives (despite any blacklisting). So my thought is that this particular model (860 Pro) may not be blacklisted in my specific kernel.

    Regardless, it seems that using these drives on my specific HBA (LSI 9305) is problematic and FSTRIM appears to be the initiator.

    Furthermore, I just got another email from LSI last night. LSI Support has now confirmed that they do not support TRIM for ANY SATA SSD connected to ANY of their SAS/SATA HBAs. (They said they DO support the SAS variant of the command - for SAS SSDs - but fully implementing the SATA variant would essentially be too problematic/difficult/not-worth-it, etc.)

    They specifically recommend connecting SATA SSDs directly to a dedicated SATA controller (not a SAS controller which also supports SATA). Good luck finding one with anything more than a basic performance rating. Maybe that's what I need to do in any case...

    I have a couple of Areca ARC-1330-8i HBAs arriving next week for testing. I need support for 16 drives in total, so two HBAs will be required. We have a very good working relationship with Areca and they tell me they don't expect any problems. I also have a couple of WD RED SSDs on order to throw into the mix.

    I'd like to report back what I find, but I'm not sure it's worth doing if I'm going to get accused of playing games and holding stuff back. I have been the major contributor to my own thread, to make sure all the information I was receiving was available to anyone following along, but it appears that was not appreciated (at least by some). It seems some know better and don't appreciate folks asking for help, so good for you!

  10. #10

    Default Re: MPT3SAS / LSI 9305-16i Issues (HBA/SATA Drive resets) on Multiple machines

    Quote Originally Posted by evanevery View Post
    I'd like to report back what I find, but I'm not sure it's worth doing if I'm going to get accused of playing games and holding stuff back. I have been the major contributor to my own thread, to make sure all the information I was receiving was available to anyone following along, but it appears that was not appreciated (at least by some). It seems some know better and don't appreciate folks asking for help, so good for you!
    You have to go on and contribute despite some criticism on the forum.
    You are playing games with hardware manufacturers, who are eager to sell you pro-grade goods at inflated prices, while you are trying to use consumer-level parts.
    For consumer SSDs you may use consumer controllers: https://www.asmedia.com.tw/products-...dYQ8bxZ4UR9wG5 - ASM1166/ASM1164/etc.
    To use them you may need to add their IDs to the Linux AHCI driver (a rough sketch follows below).
    IMHO SAS is better than SATA in your case.
    IMHO you could also use NVMe SSDs.
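
    If such a controller isn't picked up automatically, one way to attach it to the in-kernel ahci driver without rebuilding anything is the dynamic new_id interface (the IDs below are placeholders; take the real ones from lspci -nn):

    # find the SATA controller's [vendor:device] IDs
    lspci -nn | grep -i sata
    # bind it to the stock ahci driver at runtime, assuming the controller
    # exposes a standard AHCI programming interface
    echo "<vendor> <device>" | sudo tee /sys/bus/pci/drivers/ahci/new_id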
