Bash script hangs

Since installing Leap 15 I’ve had a bash script of mine hang several times. Today I tried to run strace on it and strace responded that the pid was in x32 mode and therefore wouldn’t give me anything else. If I run strace on the script when it is running OK I get the normal output.

I’m therefore not sure whether the script is in some kind of strange loop or is genuinely stuck.

If you have no objection, you should post your script and describe how it’s invoked.

TSU

I don’t have any basic objection to posting the script, but it is 6000 lines long, so I doubt anyone would want to go through it in detail. It is invoked via systemd.

OTOH, what do you expect from others here? Some sort of script radar?

The only thing we can do is give hints on how to debug scripts. (But I guess that is nothing new to you?)

So, at important points (e.g. where your script starts a new phase in the process), insert echo statements that write to a log file. The statements should tell where you are in the script and maybe show the values of some important parameters.

echo "parameter checking ends here, number of gurus is: ${NGURU}" >>/tmp/scriptlog

I assume you get the idea.
Of course, fine-tune this once you have an idea of where in the script the hang occurs.
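In a script this size it may pay to wrap that echo in a small helper, so every log line automatically carries a timestamp, the process ID and the calling function and line number. A minimal sketch, assuming bash 4 or later; `log`, `LOGFILE` and `LOGLEVEL` are hypothetical names, not anything from the original script:

```bash
#!/bin/bash
# Hypothetical logging helper; LOGFILE and LOGLEVEL are illustrative names.
LOGFILE=${LOGFILE:-/tmp/scriptlog}
LOGLEVEL=${LOGLEVEL:-2}   # 1 = errors only, 2 = normal, 3 = debug

# log LEVEL MESSAGE...
# Appends one line: date/time, PID of the current (sub)shell,
# calling function and the line it was called from, then the message.
log() {
    local level=$1
    shift
    if (( level > LOGLEVEL )); then
        return 0                      # more verbose than requested: skip
    fi
    # BASHPID (unlike $$) differs inside background subshells,
    # so lines written by background functions can be told apart.
    printf '%s pid=%s %s:%s %s\n' \
        "$(date '+%F %T')" "$BASHPID" \
        "${FUNCNAME[1]:-main}" "${BASH_LINENO[0]}" \
        "$*" >> "$LOGFILE"
}

NGURU=3   # example value
log 2 "parameter checking ends here, number of gurus is: ${NGURU}"
```

Because each entry records the caller's line number, the last entry written before a hang points at a spot in the file rather than at whichever echo you remember adding.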

I have different logging levels built into the script, so I can see what state it was in when it got stuck. I’ve analysed that information and there doesn’t seem to be anything consistent. What I do find very strange, however, is that strace won’t give me anything once the script is in the stuck state.

Still, you either have to provide a better description of where your script is hanging or provide the actual detailed info, so people can evaluate for themselves what is likely happening.

If the script is too long, then post it to a pastebin.
But even so, just posting a long script by itself tells us little.
If your script is invoked without a problem, then you’ll have to instrument it. One main way is what henk describes: at various points in your script, generate some output either to a file or to stdout.

TSU

Eh, with 6000 lines in one single script, you don’t have to doubt :slight_smile: . But this says something very important: nobody wants scripts of 6000 lines, because you lose sight of the details. When a script of mine grows towards ~1000 lines, I tend to split it into multiple scripts that call each other and return to the parent script, and I make sure that everything is in functions. To be honest, I can’t think of anything that would need that many lines of bash.

I would be more than happy to provide a better description of where the script is hanging if there was some way I could realistically monitor it. My logging suggests that the point at which it hangs is random and doesn’t give me any indication as to where to add more logging.
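When the hang point looks random, one option (a different technique from the echo statements suggested earlier in the thread) is to let bash trace every command it executes into a separate file. The last lines written before the hang then show which command, in which (sub)shell, never finished. A sketch, assuming bash 4.1 or later for `BASH_XTRACEFD` and `{var}>` redirection; the trace path is a placeholder:

```bash
#!/bin/bash
# Place near the top of the script. The trace file path is a placeholder.
TRACEFILE=${TRACEFILE:-/tmp/script-trace.$$}

# Open a dedicated descriptor for the trace, so the script's own
# stdout/stderr (and anything reading them) are left alone.
exec {TRACEFD}>>"$TRACEFILE"
BASH_XTRACEFD=$TRACEFD

# PS4 is expanded before each traced command: seconds since start,
# PID of the current (sub)shell, source file, line number, function.
PS4='+ ${SECONDS}s pid=${BASHPID} ${BASH_SOURCE##*/}:${LINENO} ${FUNCNAME[0]:-main}: '
set -x

# Example workload: background functions inherit the trace settings,
# and their lines carry their own pid=...
worker() { local n=$1; echo "worker $n" >/dev/null; }
worker 1 &
worker 2
wait
```

When it hangs, `tail -n 30` on the trace file shows the last command each PID started. On a 6000-line script the file can grow large, so it may be worth enabling the trace only around suspect sections with `set -x` / `set +x`.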

The real issue is the fact that strace isn’t following it but simply coming up with this message which says something about the process being x32. However when I use strace on it when it is not hanging strace follows it quite happily. To me that suggests that there is something odd happening, plus the fact that it didn’t start happening until I upgraded to Leap 15.
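If strace won’t attach usefully, the kernel’s /proc files can still help answer the original question of whether the script is spinning in a loop or genuinely blocked. The sketch below is Linux only, and the function names are hypothetical. It prints, for a PID and all of its descendants, the process state, the CPU time used so far and the kernel wait channel. Running it twice a few seconds apart is the useful part: a process in state R whose CPU ticks keep climbing is looping, while one in state S with unchanged ticks is waiting on something, and WCHAN may hint at what (it can be blank or 0 depending on kernel settings).

```bash
#!/bin/bash
# Inspect a (possibly hung) process tree via /proc, without strace. Linux only.
# Function names are illustrative.

# proc_summary PID: print "PID STATE CPU_TICKS WCHAN CMDLINE"
proc_summary() {
    local pid=$1 line wchan cmd
    local -a f
    { read -r line < "/proc/$pid/stat"; } 2>/dev/null || return 1
    # Drop "pid (comm) " so f[0] is field 3 (state), f[1] is field 4 (ppid),
    # f[11]/f[12] are fields 14/15 (utime/stime in clock ticks).
    read -ra f <<< "${line##*) }"
    wchan=$(cat "/proc/$pid/wchan" 2>/dev/null)
    cmd=$(tr '\0' ' ' 2>/dev/null < "/proc/$pid/cmdline")
    printf '%-7s %-2s %8d %-20s %s\n' \
        "$pid" "${f[0]}" "$(( f[11] + f[12] ))" "${wchan:-?}" "$cmd"
}

# inspect_tree PID: summarise PID, then recurse into every process
# whose parent (field 4 of /proc/N/stat) is PID.
inspect_tree() {
    local pid=$1 s line
    local -a f
    proc_summary "$pid" || return 1
    for s in /proc/[0-9]*/stat; do
        { read -r line < "$s"; } 2>/dev/null || continue
        read -ra f <<< "${line##*) }"
        if [[ ${f[1]} == "$pid" ]]; then
            inspect_tree "${s//[^0-9]/}"
        fi
    done
    return 0
}

if (( $# > 0 )); then
    printf '%-7s %-2s %8s %-20s %s\n' PID S TICKS WCHAN CMDLINE
    inspect_tree "$1"
fi
```

Saved as a standalone file, it would be run with the PID of the hung script as its argument; for a script started by systemd as another user, reading some /proc entries may need root.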

How could we possibly say anything decent about “this message which says something about the process being x32”? We need to see the code and the output; otherwise we cannot help you.

Hi,

6k lines of shell script code?! Wow, I haven’t written a shell script longer than 1.5k lines, but hey!
So what changed from the previous openSUSE version to Leap 15?
Is it some package or service, or is the bash version lower in the previous openSUSE release?
All of these need to be accounted for so you can troubleshoot.

Previous version of Leap was 42.3 with bash at 4.3.42. New version is Leap 15 with bash at 4.4.19.

You could take a look in the file “/usr/share/doc/packages/bash/COMPAT”.
Fault finding rule number «something»:

Break the large code segment into several smaller sections and review and check the operation of each section.

“Slash and burn” method: break the script into two halves, check that the first half doesn’t hang, then add parts of the second half until the script hangs, and debug from there.
If the first half hangs, break that half into two parts and proceed as above, and so on.
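The halving step can be partly automated. A sketch, with its assumptions spelled out: it runs only the first N lines of the script under GNU `timeout`, it relies on `bash -n` to reject cut points that fall inside a function, loop or `if`, and it assumes that running a partial script has no harmful side effects (which may not hold for a script normally started by systemd). `try_prefix` is a hypothetical helper name.

```bash
#!/bin/bash
# try_prefix SCRIPT N SECONDS
# Runs the first N lines of SCRIPT with a time limit.
# Returns 0 if the prefix finished, 1 if it hung (timed out),
# 2 if line N is not a clean cut (the prefix is not valid bash).
try_prefix() {
    local script=$1 n=$2 limit=$3 part rc
    part=$(mktemp) || return 2
    head -n "$n" "$script" > "$part"
    if ! bash -n "$part" 2>/dev/null; then
        rm -f "$part"
        return 2                      # cut falls inside a function, loop or if
    fi
    timeout "$limit" bash "$part" >/dev/null 2>&1
    rc=$?
    rm -f "$part"
    if (( rc == 124 )); then
        return 1                      # timeout(1) exits with 124 when the limit is hit
    fi
    return 0
}

# Example: does the first half of 6K.sh finish within 10 minutes?
# half=$(( $(wc -l < 6K.sh) / 2 ))
# try_prefix 6K.sh "$half" 600; echo "result: $?"
```

Since the hang described in this thread seems to be intermittent, a prefix that finishes once is not proof that it can never hang; repeating each run several times makes the result more trustworthy.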

Remove all “2>&1” occurrences and run the script, let’s say “6K.sh”, like this:


nohup bash 6K.sh

This writes the output to a file named nohup.out. When the script hangs, run tail on that file to see where it stopped.

My reason for pointing to “/usr/share/doc/packages/bash/COMPAT” is that it possibly means a lot of hard work for everyone maintaining older Bash scripts:

Compatibility with previous versions

This document details the incompatibilities between this version of bash, bash-4.4, and the previous widely-available versions, bash-3.x (which is still the `standard’ version for Mac OS X), 4.1/4.2 (which are still standard on a few Linux distributions), and bash-4.3, the current widely-available version. These were discovered by users of bash-2.x through 4.x, so this list is not comprehensive. Some of these incompatibilities occur between the current version and versions 2.0 and above.
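One cheap experiment along those lines: bash 4.4 has a `compat43` shell option that restores bash-4.3 behaviour for a handful of the changes described in that file (the bash manual lists exactly which ones). If the hang disappears with it switched on, that points at one of those changes; if it doesn’t, the cause is more likely something the option does not cover (and, as the document says itself, its list is not comprehensive). A sketch, meant to go near the top of the script; setting `BASH_COMPAT=4.3` in the environment, e.g. via the systemd unit, should be an alternative way to request the same level.

```bash
#!/bin/bash
# Near the top of the script: ask bash 4.4 for bash-4.3 behaviour
# in the areas the compat43 option covers (not a full 4.3 emulation).
if (( BASH_VERSINFO[0] == 4 && BASH_VERSINFO[1] == 4 )); then
    shopt -s compat43
fi

# Record which mode this run used, so logs of hanging and
# non-hanging runs can be compared.
if shopt -q compat43 2>/dev/null; then
    echo "bash $BASH_VERSION, compat43 on" >&2
else
    echo "bash $BASH_VERSION, compat43 off" >&2
fi
```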

Thank you for suggesting looking at the compatibility document. I’ve had a look and nothing obvious springs out at me, but I will look in more detail when I have the time.

I made a change to my script so that a part of it which was being executed as a background function is now run from a separate script. It is the same piece of code, but instead of being inside the main script it is now outside. The main script was monitoring the background task and timing it out if it ran too long. The problem seems to be that this version of bash was somehow stopping the script from running anything, and hence everything hung. With the routine external to the main script I occasionally see time-outs, but the main script no longer hangs.
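For anyone wanting to try the same workaround, here is a sketch of the pattern as I understand it from the description above: the worker runs as a separate script in the background, and the main script polls it and kills it after a time limit. `run_with_limit`, `worker.sh` and the 300-second limit are hypothetical; this is not the poster’s actual code.

```bash
#!/bin/bash
# run_with_limit SECONDS COMMAND [ARGS...]
# Starts COMMAND in the background and polls it once a second.
# Returns COMMAND's exit status, or 124 if it was killed for taking too long.
run_with_limit() {
    local limit=$1
    shift
    "$@" &                            # e.g. a separate script, not a function defined here
    local pid=$! waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if (( waited >= limit )); then
            kill "$pid" 2>/dev/null || true
            wait "$pid" 2>/dev/null || true   # reap the killed process
            return 124
        fi
        sleep 1
        waited=$(( waited + 1 ))      # avoids ((waited++)), which fails under set -e at 0
    done
    wait "$pid"                       # collect the real exit status
}

# Hypothetical usage:
# run_with_limit 300 bash ./worker.sh || echo "worker failed or timed out" >&2
```

If the main script has nothing else to do while it waits, GNU `timeout 300 bash ./worker.sh` does the same job in one line; the explicit loop is closer to the monitoring the original script did.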

As I said right at the start of this thread, the script had been working with the previous version of bash without any problem at all.

Just as a follow-up on this, in case somebody else has a similar problem: the original script, which was working fine with the older version of bash, had a number of places where it ran functions inside the script as background tasks. After I moved these functions out into another script, the problem disappeared completely. So the script is still large and works fine. Therefore something must have changed in the later version of bash.

Congratulations on finding the problem :slight_smile: