Recently we have run into a big problem running our Korn shell scripts on SUSE Linux systems at some customer sites. The problem occurs only on the SUSE platform (SLES) with kernel version 126.96.36.199-0.21-smp (output of the “uname -r” command).
The scenario is basically like this:
We have two kinds of Korn shell scripts, the “parent” script and the “child” script. At its start, the parent script spawns a limited number of child script processes in the background. The number of concurrent children is limited by a constant (typically about 10) and controlled by “semaphore” files (ordinary files in the filesystem): every child signals its start and its end by creating a separate semaphore file for each of the two events. The parent process cycles in a loop, processing the semaphore files and removing them from the filesystem afterwards, and saves the current number of processes in a special file of its own.
Now, under some circumstances, the parent process seems to become “frozen” (ps ax shows the “S” state), while the child processes keep running normally, waiting for their semaphore files to be removed by the parent. Normally the child process spawns another script which starts some database processing, but we are able to reproduce the “frozen” state with simple echo commands in place of the real processing, so it seems not to matter what the real processing is.
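To make the setup easier to picture, here is a minimal sketch of the scheme described above, reduced to echo commands. It is not our production code: the names SEMDIR, MAX_CHILDREN, TOTAL_JOBS, COUNT_FILE and the start.N / end.N file naming are made up for illustration, the child is a shell function instead of a separate script, and the 1-second polling interval is an assumption.

```shell
#!/bin/sh
# Hypothetical sketch only; all names and the file layout are assumptions.
SEMDIR=$(mktemp -d)          # directory holding the semaphore files
MAX_CHILDREN=3               # limit of concurrent children (about 10 in real use)
TOTAL_JOBS=6                 # total number of children to run
COUNT_FILE="$SEMDIR/count"   # parent stores the current process count here

child() {
    id=$1
    : > "$SEMDIR/start.$id"          # signal "started"
    echo "child $id working"         # stands in for the real processing
    : > "$SEMDIR/end.$id"            # signal "finished"
    # wait until the parent has removed the end semaphore
    while [ -f "$SEMDIR/end.$id" ]; do
        sleep 1
    done
}

running=0
launched=0
finished=0
while [ "$finished" -lt "$TOTAL_JOBS" ]; do
    # spawn children in the background while below the limit
    while [ "$running" -lt "$MAX_CHILDREN" ] && [ "$launched" -lt "$TOTAL_JOBS" ]; do
        launched=$((launched + 1))
        child "$launched" &
        running=$((running + 1))
    done
    # acknowledge "start" semaphores by removing them
    for f in "$SEMDIR"/start.*; do
        [ -f "$f" ] && rm -f "$f"
    done
    # process "end" semaphores: remove them and free a slot
    for f in "$SEMDIR"/end.*; do
        [ -f "$f" ] || continue
        rm -f "$f" "$SEMDIR/start.${f##*.}"
        running=$((running - 1))
        finished=$((finished + 1))
    done
    echo "$running" > "$COUNT_FILE"  # save the current number of processes
    sleep 1
done
wait                                 # let the last children exit
echo "finished=$finished running=$running"
rm -rf "$SEMDIR"
```

In this sketch the parent spends most of its time in sleep, so “S” in ps is its normal state between polls; the problem we see is that the real parent stays in that state and never gets back to removing the semaphore files.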
We ran some tests to get to the bottom of this and found out that:
- the state typically occurs when the OS is loaded with a lot of external processes/threads (context switches?), but neither CPU nor RAM necessarily needs to be 100% used
- once the parent script has become “frozen”, it is only very rarely woken up by the OS, even after the overload is over (external processes killed)
- we are not able to reproduce this behaviour on other Linux/Unix systems (we have tried CentOS, Ubuntu and AIX so far)
- we have to overload a machine with 4 CPUs with many more processes than a machine with 2 CPUs to trigger the state.
If anyone has a clue about this, thank you in advance for your note here. It's quite urgent!