I need a systemd service’s FailureAction to execute one of two different commands depending on whether the service crashed or hung (meaning it failed to call sd_notify() in time; I’m using the WatchdogSec option).
According to Table 1 in the Restart= section of http://www.freedesktop.org/software/systemd/man/systemd.service.html the failure type information is known to systemd as it’s used to specify the restart policy. However, I don’t see how I can access this information in FailureAction=. I’m hoping for a better answer than “parse the logs”.
Which openSUSE version are you using? 13.2 does not support FailureAction yet.
That’s even worse… So what would be the work-around to the lack of FailureAction? I need to send a message to a socket the moment the service has crashed.
On second look, FailureAction= doesn’t allow arbitrary commands, just reboot/shutdown. So even upgrading wouldn’t help me.
So how do I achieve this? I basically need one message to be sent to a port when the monitored program crashes, and another message to be sent when systemd detects it as hung.
Polling the logs is really a bad option. I need something with near-instant response.
OnFailure?
Hmm, looks like that starts another service, and I can’t pass it arguments. But I think it’ll make do, as I can get specific properties about the service state using systemctl show <unit> --property=[ActiveState|SubState|Result]
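As a rough sketch of that idea, the output of systemctl show can be parsed into KEY=VALUE pairs and the Result property used to tell a watchdog timeout from an ordinary crash. The function names and the grouping of Result values into "crashed" are my own assumptions for illustration, not anything systemd defines:

```python
# Sketch only: classify a failed unit from `systemctl show` output.
# Function names and the "crashed" grouping are illustrative assumptions.

def parse_show_output(text):
    """Turn KEY=VALUE lines (as printed by `systemctl show`) into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without '='
            props[key.strip()] = value.strip()
    return props

# Result= values treated here as "the process died on its own".
CRASH_RESULTS = {"exit-code", "signal", "core-dump"}

def classify_failure(props):
    """Return 'hung', 'crashed', or 'other' based on the Result property."""
    result = props.get("Result", "")
    if result == "watchdog":
        return "hung"  # the WatchdogSec= deadline was missed
    if result in CRASH_RESULTS:
        return "crashed"
    return "other"

if __name__ == "__main__":
    sample = "ActiveState=failed\nSubState=failed\nResult=watchdog\n"
    print(classify_failure(parse_show_output(sample)))
```

Anything that is neither a watchdog timeout nor a plain exit (for example a start timeout) falls into "other", so it can be reported separately rather than guessed at.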
You can use a template unit and pass it the name of the failing service as the instance, for example, or any other arbitrary tag that your templated service understands.
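To illustrate the template idea: assuming a hypothetical template unit failure-report@.service whose ExecStart= line passes %i to a script, and OnFailure=failure-report@%n.service in the monitored service, the script receives the failed unit's name as its first argument. A minimal sketch of such a script (the unit names, script path and function names are all made up):

```python
#!/usr/bin/env python3
# Sketch only. Assumed (hypothetical) wiring:
#   monitored unit:          OnFailure=failure-report@%n.service
#   failure-report@.service: ExecStart=/usr/local/bin/report-failure %i
import subprocess
import sys

PROPERTIES = ("ActiveState", "SubState", "Result")

def build_show_command(unit):
    """Build the systemctl command that queries the failed unit's state."""
    return ["systemctl", "show", unit, "--property=" + ",".join(PROPERTIES)]

def report(unit):
    """Run systemctl for the given unit and echo the properties it prints."""
    completed = subprocess.run(
        build_show_command(unit), capture_output=True, text=True, check=False
    )
    sys.stdout.write(completed.stdout)
    return completed.returncode

# Only act when exactly one argument (the instance name) was passed.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(report(sys.argv[1]))
```

With this arrangement one handler unit can serve any number of monitored services, since the instance name tells it which one failed.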
IMO you’re likely on the right path with this approach.
Recommend:
- There are a multitude of ways to kill or terminate a process. I suspect that these relate to various states you are trying to evaluate.
- You may need to more technically define what you mean by “crash” and whether the type of process stoppage might actually relate to a type of termination or if the process is really hung.
- If the process is hung and is the result of your code, isn’t the proper solution debugging? Debugging implies inserting debug code to generate notifications and possibly execute a temporary resolution… instead of waiting for some end result to be handled by systemd.
- Another approach is a combination of the above (if one wishes), which is to separate your service code into smaller pieces, each of which operates differently and can be invoked by defined logic. Note that it’s possible to define scripted (or command line) logic as a parameter defined in the systemd Unit file (not sure why you’re finding difficulty with “I can’t pass it arguments”; may need an example).
TSU
I’m deploying on many kiosks, so I can’t rely on user detection and reporting. When the service enters a failed state (exit or hang) I leave it up to systemd to restart the process, but I also trigger sending of appropriate messages to the server (as well as a copy of the log data around the failure time). It’s useful for troubleshooting to know whether systemd restarted the process because it exited, or because it was frozen (the latter meaning it didn’t sd_notify() within WatchdogSec=). The service itself is based on a third-party game engine with one of the largest codebases, so I can’t guarantee that it will never freeze.
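For anyone unfamiliar with the watchdog mechanism mentioned here: the service is expected to send WATCHDOG=1 to the socket named in the NOTIFY_SOCKET environment variable more often than WatchdogSec=, and systemd treats a missed deadline as a hang. The actual service in this thread is a game engine calling sd_notify() natively; the following is only a Python sketch of the same notification, with helper names of my own choosing:

```python
# Sketch only: a minimal stand-in for sd_notify() using the NOTIFY_SOCKET
# datagram protocol. Helper names are illustrative assumptions.
import os
import socket

def notify_address(raw):
    """Map a leading '@' (abstract socket) to the NUL byte Linux expects."""
    return "\0" + raw[1:] if raw.startswith("@") else raw

def sd_notify(state):
    """Send a state string such as 'WATCHDOG=1'; return False if unsupervised."""
    raw = os.environ.get("NOTIFY_SOCKET")
    if not raw:
        return False  # not started by systemd with notification enabled
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("utf-8"), notify_address(raw))
    return True

if __name__ == "__main__":
    # In a real main loop this would be called well within WatchdogSec=.
    print(sd_notify("WATCHDOG=1"))
```

If the main loop stalls and stops making this call, the next deadline passes without a ping and systemd records the failure as a watchdog timeout.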
I should think that if you’re triggering only on a service restart, you should be able to append a command to the Exec command to copy the journal entry to a messaging service.
Or,
If this is a truly enterprise (large deployment) app, I’d think that it should be worthwhile to copy <all> journal and other log data to a centralized server.
The benefits of doing this would be manifold… archiving and preservation of data about not only historical routine performance but every event that ever happens, even if it’s overlooked at the time. Depending on what type of solution you decide upon, there are tools which allow sifting through the mountains of data for whatever you want, including, in the specific case you describe, possible STOP or HALT events.
With a very big “log everything” approach, then you probably only need to trigger a minimal notification… maybe the name of the app or service and timestamp. You’d then be able to locate the specific event in the logs and then search for anything around it that might be relevant.
IMO,
TSU
Network bandwidth and reliability are an issue. I prefer to optimize for network traffic, as I’ve done for all the other protocols.
In any case, I’ve set OnFailure to call a script that parses the status properties and grabs logging data as needed to send the appropriate messages, so I guess my query is resolved.
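For reference, the sending half of such a script might look roughly like the sketch below. The thread does not say what protocol or message format the server expects, so UDP with a small JSON payload, and the host, port and field names, are all assumptions made for illustration:

```python
# Sketch only: send a compact failure notice to a server.
# UDP + JSON, the field names, and the host/port are assumptions.
import json
import socket

def build_message(unit, kind):
    """Encode a small notice; kind is e.g. 'crashed' or 'hung'."""
    payload = {"unit": unit, "event": kind}
    return json.dumps(payload, sort_keys=True).encode("utf-8")

def send_message(payload, host, port):
    """Fire-and-forget a single UDP datagram to the server."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

if __name__ == "__main__":
    # Hypothetical values; a real script would take these from the
    # failed unit's properties and local configuration.
    print(build_message("game.service", "hung").decode("utf-8"))
```

A single small datagram keeps traffic low, but UDP gives no delivery guarantee, so on an unreliable link a TCP connection or a retry scheme may be the better trade-off.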
Thanks