Kernel errors with 32 GB Transcend SSD as only system disk

We have multiple systems running openSUSE 11.2 on a 32 GB Transcend SSD with three partitions: root, swap, and home.
All systems are loaded from a single image file created with the dd command after we installed the OS, loaded the application software, and configured everything to meet our requirements.

Initially we had no trim support turned on.
The systems ran for a while, but eventually each of them got media errors or the drive froze.
This led me to investigate trimming and to implement the following improvements:

  1. Added the discard, noatime, and nodiratime options to the root and home file system mounts in /etc/fstab.
  2. Changed the global I/O scheduler to “elevator=noop” in YaST -> Kernel Settings -> Kernel Settings tab -> Global I/O Scheduler drop-down.
  3. Enabled drive write caching by adding “hdparm -W1” to the /etc/init.d/boot.local file.
  4. Used the GParted live bootable USB tool to move the /dev/sda1 partition to start at sector 2048 for proper SSD alignment, ensuring that partitions 2 and 3 also start on sectors divisible by 2048. We then rebuilt our base image file with the dd command to incorporate the alignment changes.
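
The alignment rule in item 4 comes down to simple arithmetic, so here is a minimal Python sketch of the check. It assumes 512-byte logical sectors, so that 2048 sectors is a 1 MiB boundary. The start sectors in the example are made up for illustration and are not our real partition table.

```python
# Assumption: 512-byte logical sectors, so 2048 sectors = 1 MiB.
SECTOR_SIZE = 512
ALIGN_SECTORS = 2048


def is_aligned(start_sector, align=ALIGN_SECTORS):
    """Return True if a partition's start sector falls on the alignment boundary."""
    return start_sector % align == 0


# Hypothetical start sectors, for illustration only.
example_starts = {"sda1": 2048, "sda2": 41945088, "sda3": 63}

for name, start in sorted(example_starts.items()):
    offset_bytes = start * SECTOR_SIZE
    state = "aligned" if is_aligned(start) else "NOT aligned"
    print("%s starts at sector %d (byte offset %d): %s" % (name, start, offset_bytes, state))
```

The real start sectors would have to be read from the partition table on each unit (for example from the sector listing that fdisk or parted prints) and fed into the same check.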

Changes not made:
I have yet to change the swappiness value or run the wiper.sh from a cron job at regular intervals. I have not found documentation indicating these are necessary. Your input would be helpful here. We did not want to put the temporary file systems like /var/log into memory because we’d lose the ability to troubleshoot sudden shutdowns.

Despite all these changes I still get kernel ata1 errors on some of the fielded systems.

Here are some examples from /var/log/messages:


Feb  7 03:39:56 Unit-6 kernel: [48412.000103] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb  7 03:39:56 Unit-6 kernel: [48412.000146] ata1.00: cmd ca/00:08:37:64:c1/00:00:00:00:00/e1 tag 0 dma 4096 out
Feb  7 03:39:56 Unit-6 kernel: [48412.000152]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb  7 03:39:56 Unit-6 kernel: [48412.000188] ata1.00: status: { DRDY }
Feb  7 03:40:01 Unit-6 kernel: [48417.051908] ata1: link is slow to respond, please be patient (ready=0)
Feb  7 03:40:07 Unit-6 kernel: [48422.036051] ata1: device not ready (errno=-16), forcing hardreset
Feb  7 03:40:07 Unit-6 kernel: [48422.036086] ata1: soft resetting link
Feb  7 03:40:12 Unit-6 kernel: [48427.334844] ata1: link is slow to respond, please be patient (ready=0)
Feb  7 03:40:17 Unit-6 kernel: [48432.044056] ata1: SRST failed (errno=-16)
Feb  7 03:40:17 Unit-6 kernel: [48432.044097] ata1: soft resetting link
Feb  7 03:40:22 Unit-6 kernel: [48437.340076] ata1: link is slow to respond, please be patient (ready=0)
Feb  7 03:40:27 Unit-6 kernel: [48442.101049] ata1: SRST failed (errno=-16)
Feb  7 03:40:27 Unit-6 kernel: [48442.101079] ata1: soft resetting link
Feb  7 03:40:32 Unit-6 kernel: [48447.397084] ata1: link is slow to respond, please be patient (ready=0)
Feb  7 03:41:02 Unit-6 kernel: [48477.157299] ata1: SRST failed (errno=-16)
Feb  7 03:41:02 Unit-6 kernel: [48477.157337] ata1: soft resetting link

Feb 13 06:48:10 Unit-6 kernel: [488890.989174] ata1.00: exception Emask 0x0 SAct 0xff SErr 0x0 action 0x6 frozen
Feb 13 06:48:10 Unit-6 kernel: [488890.989256] ata1.00: cmd 61/08:00:5f:f1:a8/00:00:01:00:00/40 tag 0 ncq 4096 out
Feb 13 06:48:10 Unit-6 kernel: [488890.989268]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 13 06:48:10 Unit-6 kernel: [488890.989342] ata1.00: status: { DRDY }
Feb 13 06:48:10 Unit-6 kernel: [488890.989395] ata1.00: cmd 61/08:08:b7:ab:19/00:00:02:00:00/40 tag 1 ncq 4096 out
Feb 13 06:48:10 Unit-6 kernel: [488890.989406]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 13 06:48:10 Unit-6 kernel: [488890.989479] ata1.00: status: { DRDY }
Feb 13 06:48:10 Unit-6 kernel: [488890.989532] ata1.00: cmd 61/10:10:9f:1d:0e/00:00:01:00:00/40 tag 2 ncq 8192 out
Feb 13 06:48:10 Unit-6 kernel: [488890.989543]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 13 06:48:10 Unit-6 kernel: [488890.989616] ata1.00: status: { DRDY }
Feb 13 06:48:10 Unit-6 kernel: [488890.989669] ata1.00: cmd 61/18:18:4f:5a:87/00:00:02:00:00/40 tag 3 ncq 12288 out
Feb 13 06:48:10 Unit-6 kernel: [488890.989680]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 13 06:48:10 Unit-6 kernel: [488890.989753] ata1.00: status: { DRDY }
Feb 13 06:48:10 Unit-6 kernel: [488890.989806] ata1.00: cmd 61/28:20:47:cd:85/00:00:00:00:00/40 tag 4 ncq 20480 out
Feb 13 06:48:10 Unit-6 kernel: [488890.989817]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 13 06:48:10 Unit-6 kernel: [488890.989890] ata1.00: status: { DRDY }
Feb 13 06:48:10 Unit-6 kernel: [488890.989943] ata1.00: cmd 60/08:28:4f:d4:0c/00:00:01:00:00/40 tag 5 ncq 4096 in
Feb 13 06:48:10 Unit-6 kernel: [488890.989954]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 13 06:48:10 Unit-6 kernel: [488890.990027] ata1.00: status: { DRDY }
Feb 13 06:48:10 Unit-6 kernel: [488890.990080] ata1.00: cmd 60/08:30:9f:d4:0c/00:00:01:00:00/40 tag 6 ncq 4096 in
Feb 13 06:48:10 Unit-6 kernel: [488890.990091]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 13 06:48:10 Unit-6 kernel: [488890.990163] ata1.00: status: { DRDY }
Feb 13 06:48:10 Unit-6 kernel: [488890.990216] ata1.00: cmd 60/10:38:b7:d4:0c/00:00:01:00:00/40 tag 7 ncq 8192 in
Feb 13 06:48:10 Unit-6 kernel: [488890.990227]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 13 06:48:10 Unit-6 kernel: [488890.990300] ata1.00: status: { DRDY }
Feb 13 06:48:10 Unit-6 kernel: [488890.990342] ata1: hard resetting link
Feb 13 06:48:10 Unit-6 kernel: [488891.309145] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Feb 13 06:48:10 Unit-6 kernel: [488891.310226] ata1.00: configured for UDMA/133
Feb 13 06:48:10 Unit-6 kernel: [488891.310272] ata1.00: device reported invalid CHS sector 0
Feb 13 06:48:10 Unit-6 kernel: [488891.310313] ata1.00: device reported invalid CHS sector 0
Feb 13 06:48:10 Unit-6 kernel: [488891.310352] ata1.00: device reported invalid CHS sector 0
Feb 13 06:48:10 Unit-6 kernel: [488891.310391] ata1.00: device reported invalid CHS sector 0
Feb 13 06:48:10 Unit-6 kernel: [488891.310429] ata1.00: device reported invalid CHS sector 0
Feb 13 06:48:10 Unit-6 kernel: [488891.310468] ata1.00: device reported invalid CHS sector 0
Feb 13 06:48:10 Unit-6 kernel: [488891.310506] ata1.00: device reported invalid CHS sector 0
Feb 13 06:48:10 Unit-6 kernel: [488891.310545] ata1.00: device reported invalid CHS sector 0
Feb 13 06:48:10 Unit-6 kernel: [488891.310643] ata1: EH complete
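
To compare units it may help to boil excerpts like these down to counts of which commands timed out. The following is a rough Python sketch, not a complete libata log parser. It only looks at the "cmd" lines and maps the first byte of the taskfile to an ATA command name. The opcode table covers just the three opcodes that appear above, and the function name summarize_failed_commands is my own.

```python
import re
from collections import Counter

# ATA command opcodes, limited to the ones seen in the excerpts above.
ATA_OPCODES = {
    0x60: "READ FPDMA QUEUED",
    0x61: "WRITE FPDMA QUEUED",
    0xCA: "WRITE DMA",
}

# Matches e.g. "ata1.00: cmd 61/08:00:5f:f1:a8/00:00:01:00:00/40 tag 0 ncq 4096 out"
CMD_RE = re.compile(
    r"(ata\d+\.\d+): cmd ([0-9a-f]{2})/\S+ tag (\d+) (\w+) (\d+) (in|out)"
)


def summarize_failed_commands(log_text):
    """Count failed commands per (port, command name, protocol)."""
    counts = Counter()
    for match in CMD_RE.finditer(log_text):
        port, opcode_hex, _tag, proto, _nbytes, _direction = match.groups()
        name = ATA_OPCODES.get(int(opcode_hex, 16), "opcode 0x" + opcode_hex)
        counts[(port, name, proto)] += 1
    return counts


# Two lines copied from the excerpts above.
sample = (
    "Feb  7 03:39:56 Unit-6 kernel: [48412.000146] ata1.00: cmd "
    "ca/00:08:37:64:c1/00:00:00:00:00/e1 tag 0 dma 4096 out\n"
    "Feb 13 06:48:10 Unit-6 kernel: [488890.989943] ata1.00: cmd "
    "60/08:28:4f:d4:0c/00:00:01:00:00/40 tag 5 ncq 4096 in\n"
)

for key, count in sorted(summarize_failed_commands(sample).items()):
    print(key, count)
```

Run against the full /var/log/messages from each unit, this would show whether the timeouts are mostly writes, mostly NCQ commands, or spread evenly.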

I have been communicating with Transcend but have yet to arrive at a definitive solution to the data corruption. Updating to their latest firmware bricked one of our drives. My main question for the community is: am I doing everything I should to correct this problem, or am I barking up the wrong tree entirely? A side question: are there other Transcend SSD users out there who have had similar problems, and did you find a solution?

Other information: We have to stay with openSUSE 11.2 for now. All the computers are running identical hardware. See here for the specs: Adlink industrial computer.
Let me know what other information would help in troubleshooting this issue.

Also, we are considering moving to an Intel SSD.

Thanks in advance!

-TWG

Having to stay with 11.2 is a big issue. My guess is that no members still run 11.2 and have the knowledge to help you here. The things one can do to prevent an SSD from wearing out have already been done, though I couldn’t say whether the 11.2 kernel supports them.

On 2013-02-25 22:56, twgregory wrote:
>
> We have multiple systems running OpenSUSE 11.2 on a 32 Gig Transcend SSD
> with 3 partitions.

Well, the first step would be to upgrade to a recent and supported
openSUSE version; 11.2 is outdated. The minimum is 12.2. You might use 11.4
with Evergreen, but support is minimal (long, but minimal).

But nobody is going to investigate a kernel problem on an 11.2 distro.
And if the problem was found some time ago, the patches would be in a recent
version.


Cheers / Saludos,

Carlos E. R.
(from 12.1 x86_64 “Asparagus” at Telcontar)

> OpenSUSE 11.2

sometimes SUSE Linux Enterprise Server/Desktop (SLES/D) 11 SP2 users
mistakenly type that they use “openSUSE 11.2”…if the output of


cat /etc/SuSE-release

shows that you are running SLE_ 11 SP2 then you are in great good
luck as you only need to get to the correct forum, here:
http://forums.suse.com/ where your system continues to be supported…

your ID/Pass used here works there also.

if however you are running openSUSE 11.2 (which went past its end of
life in May of 2011) and have not yet extended its life by applying
the Project Evergreen updates
http://en.opensuse.org/openSUSE:Evergreen you should do
that right away (to roll in all the security patches accumulated over
the last 21 months)…and, maybe those might help with your
problem… or maybe not.

note: backup (of course) and read my sig caveat first…

additionally, if you were to move to Evergreen then you might wish to
join the Evergreen Mail list and ask your question there, as there
may be some openSUSE 11.2 Evergreen users reading there who have
faced the same issue…

note: move soon as 11.2 Evergreen support ends November 2013…if you
need a longer life product you need to look at SLE_ or another distro…

also, you might find an 11.2 Evergreen SSD user on the openSUSE Mail
List or on IRC, the path to those communications channels begins here:
http://en.opensuse.org/openSUSE:Communication_channels

note: if need be it would be wise to review the “IRC for the newbies”
and/or “Mailing list netiquette” links first (some get overly cranky
[or just won’t answer] when conventions are ignored)

you are, of course welcome to check back here often to see if someone
with helpful experience has wandered through and dropped some bread
crumbs…good luck.


dd
openSUSE®, the “German Engineered Automobile” of operating systems!
http://tinyurl.com/DD-Caveat

On 2013-02-26 09:28, dd wrote:
> also, you might find an 11.2 Evergreen SSD user on the openSUSE Mail
> List or on IRC, the path to those communications channels begins here:
> http://en.opensuse.org/openSUSE:Communication_channels

Evergreen has its own mail list, not hosted by openSUSE, thus it does
not appear on that page. It is here:

openSUSE:Evergreen


Cheers / Saludos,

Carlos E. R.
(from 12.1 x86_64 “Asparagus” at Telcontar)

On 02/26/2013 04:28 PM, Carlos E. R. wrote:
> Evergreen has its own mail list, not hosted by openSUSE, thus it does
> not appear on that page. It is here:

maybe you missed it but i’d mentioned the Evergreen list and expected
the OP to find that on the Evergreen URL i had already given…

since i don’t expect a lot more help from this forum, i was
suggesting the OP could also try IRC and opensuse@opensuse.org
(where might be found [as you know] a generally higher level of
expertise as well as highly experienced and practicing system
administrators running both Evergreen and just plain old systems by
self-hacking them into an acceptable state of layered security…i
thought one or two might ‘take pity’ and help…)


dd
openSUSE®, the “German Engineered Automobile” of operating systems!

This response highlights some of the reasons I posted here and the questions I need answered.
If I knew for sure that upgrading to 12.x would definitely fix the problems it would make a good case to present to my superiors for doing so. As of right now I have no evidence it will. They are considering moving to Scientific Linux in the future as faith in Suse seems to be waning.

Thank you, everyone, for the suggestions about the Evergreen project. I have applied their patches and updates for 11.2, brought a system up to date, and imaged it to a file with dd so we can replicate the OS. I am now starting system testing on a box with those software changes to ensure nothing was broken by the Evergreen changes (e.g. configurations overwritten).

My reading of forum articles and SSD postings seems to indicate that the 11.2 kernel does support trim, but how do I know for certain it’s working? I ran a test script from Nicolay Doytchev and it indicated trim was not working on our setup. The forum articles indicated that trim support was added in kernel v2.6.33 and newer, but that the changes were backported into 11.2. Can anyone verify this is true?
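
One part of the question that can be checked separately is whether the drive itself advertises TRIM at all. Here is a small Python sketch that scans saved "hdparm -I" output for the TRIM feature line. It assumes an hdparm version recent enough to report that feature (older versions may not print it). The sample text is illustrative and not taken from our drives. A positive result only says the drive claims support. It does not prove that the kernel or file system actually sends discards.

```python
def drive_advertises_trim(hdparm_identify_output):
    """Return True if 'hdparm -I' output lists TRIM as a supported feature.

    Assumption: the feature appears as a line containing 'TRIM supported'
    that is marked with '*' in the Commands/features section.
    """
    for line in hdparm_identify_output.splitlines():
        if "TRIM supported" in line and line.strip().startswith("*"):
            return True
    return False


# Illustrative output fragment, NOT captured from one of our units.
sample_with_trim = (
    "Commands/features:\n"
    "\tEnabled\tSupported:\n"
    "\t   *\tSMART feature set\n"
    "\t   *\tData Set Management TRIM supported (limit 8 blocks)\n"
)
sample_without_trim = (
    "Commands/features:\n"
    "\tEnabled\tSupported:\n"
    "\t   *\tSMART feature set\n"
)

print(drive_advertises_trim(sample_with_trim))
print(drive_advertises_trim(sample_without_trim))
```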

The systems we are running into problems on were new in August 2012; that seems like a really short lifetime for an SSD. Maybe there are other problems afoot. Your comments and input are welcome.

On 03/01/2013 11:56 PM, twgregory wrote:

> support was
> added in kernel v2.6.33 and newer, but that the changes were back ported
> into 11.2. Is there anyone that can verify this is true?

You would have to read the changelog files of the kernel sources, or
ask the people that might have done it.

> These systems we are running into the problems on were new in August
> 2012, that seems like a really short lifetime for an SSD drive. Maybe
> there are other problems afoot. Your comments and input are welcome.

IMHO, a general purpose Linux distribution is not geared toward caring
for long life of SSDs. Linux logs things often, syncs buffers, whatever.
And the older a distro is, the worse.


Cheers/Saludos
Carlos E. R. (12.3 Dartmouth test at Minas-Anor)

On 03/01/2013 11:56 PM, twgregory wrote:
> They are considering moving to Scientific Linux in the
> future as faith in Suse seems to be waning.

interesting!!

someone there made the decision to install openSUSE 11.2 on or after
its release (November 12, 2009) at which time its end-of-life was
known to be May 12, 2011!!

and now TWENTY-TWO months after May 2011, someone there is finally
considering switching to a supported Linux because “faith in Suse
seems to be waning”??

did that someone just wake up??

gimme a break!!


dd

On 03/02/2013 11:41 AM, dd wrote:
> and now TWENTY-TWO months after May 2011, someone there is finally
> considering switching to a supported Linux because “faith in Suse seems
> to be waning”??
>
> did that someone just wake up??
>
> gimme a break!!

Business wants free (as in gratis) Linux and long time support.

Me, I want Long Time Windows 7 support for home versions O:-)


Cheers/Saludos
Carlos E. R. (12.3 Dartmouth test at Minas-Anor)

Robin,
Thanks for the input on the changelogs. I’ll look into that. All of my testing thus far with trim and the wiper.sh script while running the 2.6.31.14 kernel has failed to verify it is actually working (test_trim.sh script). However, we have not had another kernel freeze since enabling BIOS AHCI and activating trim in fstab so maybe we are good.
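
For reference, the version comparison behind the backport question can be written out as a short Python sketch. It compares the running 2.6.31.14 kernel with 2.6.33, the version the forum articles gave for trim support. The helper name kernel_version_tuple is my own. A plain version comparison cannot detect distribution backports, so a result of "older" here does not settle whether the 11.2 kernel has the feature.

```python
import re


def kernel_version_tuple(release):
    """Turn a release string such as '2.6.31.14-0.8-default' into (2, 6, 31, 14)."""
    match = re.match(r"\d+(?:\.\d+)*", release)
    if match is None:
        raise ValueError("not a kernel release string: %r" % release)
    return tuple(int(part) for part in match.group(0).split("."))


running = kernel_version_tuple("2.6.31.14")
trim_mainline = kernel_version_tuple("2.6.33")

if running < trim_mainline:
    print("Running kernel predates 2.6.33; trim would depend on a backport.")
else:
    print("Running kernel is 2.6.33 or newer.")
```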

On the hardware side, we will probably go with a more robust SSD model from another vendor, as the current drive's reliability, especially during power failures, is contributing to the data corruption issues. Unfortunately we have to stick with SSDs, as these systems are in outdoor environments where startup temperatures can be below freezing, and a traditional drive would be far worse off under those conditions.

@dd

> and now TWENTY-TWO months after May 2011, someone there is finally
> considering switching to a supported Linux because “faith in Suse seems
> to be waning”??
>
> did that someone just wake up??
>
> gimme a break!!

The Suse 11.2 version was chosen because 11.4 had problems with hardware video drivers 2.5 years ago. This was long before I worked here and I was not involved in the decision process at that point, inheriting the current setup upon hire. So as to someone waking up, I believe the previous choices were driven by software price (free for openSUSE) and by what software worked with previously purchased hardware (11.2). You go down that kind of decision path and you end up where we are… For now I am keeping things going as best I can until another upgrade path is agreed upon. I like Suse after having previously worked on Red Hat, Mandrake, Ubuntu, Mint, and Android. It has its quirks, but YaST is pretty useful as far as GUI interfaces go and you always have the command line to fall back on. openSUSE 12.3 might be an option moving forward. As far as support, I wholeheartedly would get behind going with a paid distribution like SLES but like Robin said, gratis is sometimes the expectation by management out in industry.

On 2013-03-06 21:36, twgregory wrote:

> On the hardware side, we will probably go with a more robust SSD
> model from another vendor as the current reliability especially during
> power failures is contributing to the data corruption issues.
> Unfortunately we have to stick with SSD as these systems are in outdoor
> environments where the start up temperatures can be below freezing and a
> traditional drive would be far worse off under those conditions.

Interesting job yours :-)

I had another idea. Look at the laptop_mode scripts. They have settings
for delaying syncs to the hard disk when running on battery, and these may be
useful to you as well.

What about high temperatures? Maybe you want to log them periodically.
Some of the smartctl output entries are different temperatures on disk.
Never tried on an SSD, though.
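
As a starting point for periodic logging, here is a minimal Python sketch that pulls temperature attributes out of saved "smartctl -A" output. It assumes the usual ten-column attribute table with the raw value in the last column. The sample rows are illustrative and not from a real Transcend drive. Whether a given SSD reports a temperature attribute at all is something to verify on the actual hardware.

```python
def read_temperatures(smartctl_output):
    """Return {attribute_name: raw_value} for temperature rows in 'smartctl -A' output."""
    temps = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(fields) >= 10 and fields[0].isdigit() and "Temperature" in fields[1]:
            try:
                temps[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value was not a plain integer; skip it
    return temps


# Illustrative rows, NOT captured from one of the fielded units.
sample = (
    "ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE\n"
    "  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       4321\n"
    "194 Temperature_Celsius     0x0022   036   053   000    Old_age   Always       -       36\n"
)

print(read_temperatures(sample))
```

A cron job could save the smartctl output at intervals and append the parsed values to a log with a timestamp, so that temperature can later be lined up against the ata1 errors.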

I would have a good read on those SSD specs re temperature, maybe they
are stricter than you thought. :-?

> The Suse 11.2 version was chosen because 11.4 had problems with
> hardware video drivers 2.5 years ago. This was long before I worked here
> and I was not involved in the decision process at that point, inheriting
> the current setup upon hire.

Development times in industry are often too slow compared to “off the
shelf” hardware and software.

I know of some people that developed a personnel entry control system,
and they used what at the time was state of the art MsDOS-NET, ie, a
network capable msdos version.

By the time they developed and tried and deployed the system just two
years later, that operating system was no longer sold… big problem. I
think they switched to install Win 95/98 in Dos mode. I don’t know what
they use now >:-)

In the Linux world, it is even worse.

> As far as support, I wholeheartedly would get behind going with
> a paid distribution like SLES but like Robin said, gratis is sometimes
> the expectation by management out in industry.

Indeed :-)


Cheers / Saludos,

Carlos E. R.
(from 11.4, with Evergreen, x86_64 “Celadon” (Minas Tirith))