Hello,
I am currently working on an openSUSE Leap 15 based appliance for VMware. After I installed it, I got errors like
[   68.547238] EXT4-fs (dm-2): Delayed block allocation failed for inode 394558 at logical offset 16 with max blocks 4 with error 95
[   68.547458] EXT4-fs (dm-2): This should not happen!! Data will be lost
My first idea was a corrupted datastore, so I tried another one (on a completely different SAN). I saw the same effect. I ran fsck.ext4 -f on the volume, and it came up clean. I did not see these effects on Leap 42.x machines.
Furthermore, I discovered the same errors on another machine. This machine was recently upgraded from Leap 42.3 to 15.
I have the system patched; it has kernel 4.12.14-lp150.12.4-default.
The underlying host is ESXi 5.5; the SAN was first FreeNAS (over FC, we’re using it as a SAN, not a NAS) and DataCore on the second try.
I did not find anything about this on the internet or in the release notes. Has anyone else experienced this problem, or does anyone have an idea for a solution?
I forgot to describe the disk layout:
The layout is /var and / in separate LVM volumes formatted with ext4 (for several reasons I do not use btrfs in this setup) and a separate /boot disk as a traditional partition (also formatted with ext4). I mostly observed these errors on /var (possibly because there is a database running on it).
You need to take a look at whether that offset is expected or not.
If it’s an offset that simply points to the edge of a partition exactly, then it’s expected.
But if it’s not, then you need to either determine why it happened or just discard the entire virtual disk and create a new one.
I describe virtual file system mis-alignment in the following post… In it, I also describe the most common cause of mis-alignment, which is to use a disk layout created for a Windows OS and then migrate it to a new machine. Supposedly MS is getting around to eliminating this offset caused by MS disk format tools, so the problem should not occur with the most recent tools. But on a SAN, mis-alignment might happen for other reasons too, which may be covered in your SAN documentation. https://forums.opensuse.org/showthread.php/501319-Advice-on-Possible-Virtualization-Solution?p=2667932#post2667932
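As a rough illustration of what "aligned" means here, the following Python sketch checks whether a start offset falls exactly on a chosen boundary. The 512-byte sector size and the 1 MiB boundary are assumptions for illustration only, not values taken from your setup; the start sector itself would have to come from a tool such as fdisk or parted.

```python
# Minimal sketch: is a start offset aligned to a given boundary?
# SECTOR_SIZE and ALIGNMENT are assumed values, adjust them to your storage.
SECTOR_SIZE = 512            # bytes per logical sector (assumption)
ALIGNMENT = 1024 * 1024      # 1 MiB boundary (assumption)


def misalignment_bytes(start_sector, sector_size=SECTOR_SIZE, alignment=ALIGNMENT):
    """Return how many bytes the start offset lies past the previous boundary."""
    return (start_sector * sector_size) % alignment


def is_aligned(start_sector, sector_size=SECTOR_SIZE, alignment=ALIGNMENT):
    """Return True if the start offset is an exact multiple of the boundary."""
    return misalignment_bytes(start_sector, sector_size, alignment) == 0
```

With these assumed values, a partition starting at sector 2048 sits exactly on a 1 MiB boundary, while one starting at sector 63 does not.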
If that doesn’t provide clues, then I might suggest you try building a test VM without LVM and see if the problem disappears.
You might also inspect your system log (the journal) for errors. Do you notice whether anything in particular triggers the error? If so, you can try displaying the journal in real time.
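If it helps, here is a minimal Python sketch for picking the delayed-allocation messages out of a stream of kernel log lines. The pattern is modelled only on the message quoted at the top of this thread, and the function names are made up for this example.

```python
import re

# Pattern modelled on the EXT4 message quoted at the top of this thread.
DELALLOC_RE = re.compile(
    r"EXT4-fs \((?P<device>[^)]+)\): Delayed block allocation failed "
    r"for inode (?P<inode>\d+) at logical offset (?P<offset>\d+) "
    r"with max blocks (?P<max_blocks>\d+) with error (?P<error>\d+)"
)


def parse_delalloc_error(line):
    """Return the fields of a delayed-allocation error line, or None if no match."""
    match = DELALLOC_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "device": fields["device"],
        "inode": int(fields["inode"]),
        "offset": int(fields["offset"]),
        "max_blocks": int(fields["max_blocks"]),
        "error": int(fields["error"]),
    }


def watch(stream):
    """Print one summary line for every matching message read from stream."""
    for line in stream:
        hit = parse_delalloc_error(line)
        if hit is not None:
            print("{device}: inode {inode}, error {error}".format(**hit))
```

One way to use it would be to call watch(sys.stdin) in a small script and pipe the output of journalctl -k -f into it, so that only the matching kernel messages are shown while the database starts.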
Thanks Tsu,
I read your article, and it is a good one. Unfortunately, I am not quite sure whether it relates to my problem. (I’ll have to do a bit of learning - as I haven’t dealt much with file system problems yet, there are some basics I’m missing. So I’m at the stage of trying to figure out whether the offset is expected.) And I’m frankly a bit puzzled by the fact that this problem seems to have surfaced only since I moved to Leap 15.
journalctl does not give any clues. I can see that the start of mariadb seems to be the trigger - but that is not exactly surprising, given that it writes to the volume in question.
The problem seems to be sporadic - if I start the VM and it occurs right at the beginning, it then occurs frequently. If I start it and it does not occur, it seems not to come up at all. (Or at least not for some hours.)
I’ll keep digging; in the meantime, I’m thankful for every suggestion.
Thinking about your situation a bit,
You might not want to overlook the possibility that a partition misalignment might have existed for some time but was not noticed or reported by an application before now.
You may also want to consider what already exists in your storage partition.
Is that database file in a virtual disk or directly on the system file system?
Although it’s probably possible to mix virtual disks and non-virtualized files, the general recommendation is to set up your virtual disk storage in its own partition.
And consider the possible effect of large fragmented files (both virtual disks and database data files would be considered “large”). If you have more than 50% free space, then fragmentation might be only a “sometimes” issue, but otherwise it could be serious. For both database data files and virtual disks, there are maintenance procedures for defragmentation and possible compaction. Fragmentation increases the number of non-contiguous disk blocks that must be mapped, which increases the chance that problems will be reported.
Aside from general policy and practice in how you use the partition where your disk storage is located, as I suggested in my previous post, you can also start by simply creating a test machine in a different storage location and/or configured differently.
Thanks again, this is one of the leads I will definitely follow.
Maybe I was a bit too unspecific about the setup, so I’ll try to describe it in more detail. (That might show why I think this is not a matter of mis-alignment - although I will look into that.)
The host system is an ESXi server. The storage it uses is DataCore or FreeNAS (it does not make a difference here). The LUNs on that storage are VMFS5 formatted; they were created under vSphere 5.5 via vCenter, so they should be correctly aligned (at least according to the VMware documentation), but I will look into that.
My VMs have two disks. The first, about 500 MB, contains a single partition (/dev/sda1) to boot from. The second does not hold any partitions, but is used in its entirety as a PV in LVM. (This makes enlargement much easier.) This PV is the only PV in a VG, and this VG includes the affected LV.
Mis-alignments of VMFS datastores have been a problem in the past, but normally they only had a performance impact. I’m still trying to get this running without giving up LVM, as I really don’t want to lose its advantages. But if I do not get any closer, I will try to set up a new layout.
Best regards and thanks for the suggestions
Ortwin
Well, it seems my problem is solved. I’m not sure what to make of this.
Yesterday, I installed the latest kernel patches for Leap 15. I am now on kernel 4.12.14-lp150.12.7-default. Since then, the file system error has not occurred anymore. Maybe I read it wrong, but I did not find any indication of changes to any of the modules involved in this setup - as far as I understood, there were security patches to totally different modules.
In case something shows up (or the error returns), I will leave this thread open for a few days.
OK, since I did the kernel update, no delayed block allocation errors have occurred. So it looks like I ran into some weird kernel bug… (No other package connected with this phenomenon was updated at that time, neither lvm nor ext4.)