journalctl slow response time...

Hello,

After several searches of the forums I could not find an answer. I would appreciate any assistance, or a pointer in the right direction if this has already been solved/answered. I apologize if this is a lot of information; I tried to be very detailed.

Quick Information:

openSUSE version you are working on -> openSUSE 13.2 (Tumbleweed Repositories)
the name and the version number of the application that you believe is giving you problems -> journalctl (systemd 210)
an appropriate description of the symptoms -> journalctl slow response time.

Background Details:

Many, many years ago (geeze, now I feel old) I set up ssh to allow only one user into the server by setting the AllowUsers option. I then wrote a Perl script which monitored the /var/log/messages file for ssh brute-force login attempts. It would pull out the IP address of the attempted login and then DROP that IP address with a call to iptables. The script was really simple, and searching /var/log/messages ran very fast (less than a second). It looks something like:


open(MESSAGES, '/usr/bin/tail -f /var/log/messages |');

while(<MESSAGES>)
{       
        chomp();
        
        if((/Invalid user/                                ||
            /allowed because not listed in AllowUsers/    ||
            /but this does not map back to the addres/)   &&
          (/\b(\d{1,3}(?:\.\d{1,3}){3})\b/)) { block_punk($1); }
}

The Problem:

After recently upgrading to openSUSE 13.2 I noticed my simple little script started to fail miserably. Some googling quickly revealed that /var/log/messages no longer exists and that I should now use journalctl. OK, I can deal with change, so I adjusted my script:


open(MESSAGES, '/usr/bin/journalctl -f -u sshd |');  # openSUSE 13.2 journalctl change

while(<MESSAGES>)
{       
        chomp();
        
        if((/Invalid user/                                ||
            /allowed because not listed in AllowUsers/    ||
            /but this does not map back to the addres/)   &&
          (/\b(\d{1,3}(?:\.\d{1,3}){3})\b/)) { block_punk($1); }
}

This now runs, but I noticed that pulling records from the journal is very slow. I pulled the new journalctl call out of the script and ran it on the command line; for example, when running this command directly:

time journalctl -u sshd | tail -100

I get the results of:


real    1m28.185s
user    1m5.448s
sys     0m28.105s

So it is taking 1 minute 28 seconds to pull the last 100 sshd log entries out of the journal.
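As an aside, journalctl also has its own `-n`/`--lines` switch, which asks the journal directly for just its tail instead of decoding every record and trimming the result with tail. A quick comparison sketch, assuming this systemd 210 journalctl supports `-n` (the snippet is guarded so it is a no-op on systems without journalctl):

```shell
# Compare dumping the whole sshd log and trimming with tail vs. asking
# journalctl itself for only the last entries.
if command -v journalctl >/dev/null 2>&1; then
    # The measurement above: decode every sshd record, then trim with tail
    time journalctl -u sshd | tail -100

    # Let journalctl return only the last 100 entries itself
    time journalctl -u sshd -n 100
fi
```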

And Now Finally A Question:

I appreciate you making it this far. **Can you please tell me why journalctl is running so slow?**

Thanks For Your Patience,

Jim

P.S. Just one more thing. I know there are plenty of ways to prevent brute-force ssh attacks (fail2ban, SSHGuard, DenyHosts, etc.), and I know I can re-enable syslog-ng or rsyslog to work around the issue. But what I really want to know is: what is up with journalctl being so slow?

You don’t NEED to use journalctl; you can install (for example) rsyslog, start it, enable it at startup, and you’ll have a /var/log/messages again.

So in essence;
zypper in rsyslog
systemctl enable rsyslog
systemctl start rsyslog

Boom, you’re done :slight_smile: I personally use this over journalctl for simplicity.

On 2014-12-21 19:06, JimOpensuse wrote:


>   open(MESSAGES, '/usr/bin/tail -f /var/log/messages |');

You should use “tailf” instead, which doesn’t cause disk activity when
there is no new data.

> And Now Finally A Question:
>
> I appreciate you making it this far. Can you please tell me why
> journalctl is running so slow?

IMO, bad design, plain and simple.

The designers were using SSD devices, which are especially fast on seeks
compared to rotating hard disks. When normal people use it they find it
slow, but the devs’ answer is to buy an SSD (no kidding, we got that
answer on the mailing list).

The basic reason is that it is an on-disk database, heavily fragmented.
Reading it makes the disk head jump all over the place.

You could try “dmesg --follow” instead.


Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 “Bottle” at Telcontar)

From time to time I see these posts about slow times parsing journalctl, but I wonder why I haven’t seen them myself (so I can’t really speculate on a cause).

So, for example on one of my systems

time journalctl -u sshd | tail -100
-- Logs begin at Sat 2014-11-15 16:36:51 PST, end at Sun 2014-12-21 20:55:35 PST. --


real	0m1.468s
user	0m0.000s
sys	0m0.026s

I don’t know that I would blame disk fragmentation; the amount of data that’s parsed is really tiny (only the last 100 entries). I wonder if system resources might be committed elsewhere… So, for instance, I’d look at:

RAM committed to buffers and the amount available free, using the Free tool:
https://en.opensuse.org/User:Tsu2/free_tool
Note: Is this happening only on machines which have been running for a long while? If so, my above article contains a command to free committed RAM. You’d have to weigh the upside (faster overall system response after heavy taskloads) vs the downside (slower re-invoking previous heavy taskloads).

Reading process activity in realtime with a “top” console already open.

Reading disk activity in realtime (sensors?). You might see whether disk activity starts immediately when your command runs, or whether, as I suspect, there is no change in activity for a long while and it spikes only just before your results are returned and displayed.
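For reference, the usual knob for freeing committed cache RAM (presumably what the command in the linked article does; I’m assuming the standard `/proc/sys/vm/drop_caches` interface here) looks like:

```shell
# Free RAM committed to clean caches (page cache, dentries, inodes).
# This needs root, frees only *clean* cache pages, and the next heavy
# workload will have to repopulate the cache from disk.
sync                                      # flush dirty pages to disk first
if [ -w /proc/sys/vm/drop_caches ]; then  # only attempt it when running as root
    echo 3 > /proc/sys/vm/drop_caches
fi
```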

TSU

Thank you very much for the information, Miuku!

I will certainly look into that.

I was, however, really trying to figure out what was going on with journalctl.

Thanks for the information, Carlos!

The great thing about Linux is that I learn something new all the time (your tailf suggestion is great; I was not aware of that one).

In regard to journalctl, I am actually running with SSDs and have lots of free RAM:


jim@slacker:~> free
             total       used       free     shared    buffers     cached
Mem:      32906756   19671536   13235220      38876     942384   15629576
-/+ buffers/cache:    3099576   29807180
Swap:      8400892          0    8400892

I will continue to play with it.

Thanks Again…

On 2014-12-23 05:16, JimOpensuse wrote:
>
> Thanks for the information Carlos!
>
> The great thing about Linux is that I learn something new all the time
> (Your tailf suggestion is great, I was not aware of that one)
>
> In regard to journalctl, I am actually running with SSDs and have
> lots of free RAM:

In that case, its processing should be fast.

Try with dmesg instead…

Or…

With syslog, you can place hooks in syslog itself, so that instead of you
reading the logs watching for certain entries, the syslog daemon, which is
already reading the logs, calls you when it sees an entry you told it you
are interested in.

Meaning that you can run code when a certain message appears, directly
from syslog.
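With rsyslog, for instance, this kind of hook can be a property-based filter plus the “^” action, which executes a program for each matching message. A sketch (the file and script paths are hypothetical, and the filter text assumes the sshd wording from the OP’s script):

```
# /etc/rsyslog.d/block-punk.conf  -- illustrative file name
# Run a program for every message containing "Invalid user"; rsyslog's
# "^" action executes the program, passing the message as its argument.
:msg, contains, "Invalid user" ^/usr/local/bin/block_punk
```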

I’m not aware of the systemd journal having a similar capability, but it
should.


Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 “Bottle” at Telcontar)

On 2014-12-22 19:16, tsu2 wrote:
>
> From time to time I see these posts about slow times parsing journalctl,
> but I wonder why I haven’t seen them (so can’t really speculate on a
> cause)

It happens when the persistent storage on rotating hard disk is large.


Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 “Bottle” at Telcontar)

OK,
So, out of curiosity I did a little Internet digging…

First, an original discussion from 2012 when a “slow journalctl” was observed. Note Lennart Poettering’s comments in the thread. Less important are the specific problems with that journalctl, since today’s versions have switches which address some of those issues (and there may be code changes)… perhaps most importantly the ability to view data “newest entry first”.
https://bugzilla.redhat.com/show_bug.cgi?id=867841

But in general, the various Internet postings about slowness have been about journal files several hundred megabytes in size. Mine certainly aren’t. You may want to check whether yours is still <100MB as well.
Example, this thread
https://bbs.archlinux.org/viewtopic.php?id=153922
Checking log size and fragmentation
http://upalmin.blogspot.com/2014/01/blog-post.html

More likely seems to be journal corruption. You can check your journal for errors with the following:

journalctl --verify

The problem is that identifying journal corruption still leaves you with limited options… You can delete your logs, but there seems to be almost no way to know how they became corrupted or how to prevent it from happening again in the future.
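If `--verify` does report errors, cleanup on this systemd generation is essentially manual; a hedged sketch with the destructive steps left commented out (default journal paths assumed):

```shell
# Check every journal file for internal consistency (guarded in case
# journalctl is not installed; --verify returns non-zero on corruption).
if command -v journalctl >/dev/null 2>&1; then
    journalctl --verify || true
fi

# If a file under /var/log/journal/ is flagged as corrupt, about the only
# remedy is to move it aside and let journald start a fresh file, e.g.:
#   mv /var/log/journal/<machine-id>/system@*.journal /root/journal-backup/
#   systemctl restart systemd-journald
```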

Although there are some postings about slowness on btrfs, nothing specific was pinned down. So, absent some btrfs-specific discovery, it’s not likely relevant, and the above (<extreme> size and file corruption) are more likely causes.

Lastly,
You’ll notice the references I posted are a couple of years old. It may only be a limitation of my googling, but I found <very> few postings about a slow journalctl after 2012.

TSU

On 2014-12-24 19:06, tsu2 wrote:
>
> robin_listas;2684805 Wrote:

>> It happens when the persistent storage on rotating hard disk is large.

> OK,
> So, out of curiosity I did a little Internet digging…

My recollections are of openSUSE mail list, mostly. For instance, I see
one comment Nov 2013, grepping for “slow” and “journalctl” in the same
email.

BTW journalctl is terrible slow. For example
journalctl --full
would take a few hours (or days) I guess.

But there was a long and angry thread that I can not locate.

There are other comments mentioning that there are tools for verifying
the journal, but none for repair, and that it was not planned to create
one (2013).

I can not do any testing myself, because I use syslog instead for long
term logs; I only allow systemd to keep the journal of the current
session, and even then it is big:


Telcontar:~ # journalctl --disk-usage
Journals take up 408.3M on disk.
Telcontar:~ #

cer@Telcontar:~> uptime
15:27  up 12 days 18:55,  72 users,  load average: 0,83, 0,57, 0,60
cer@Telcontar:~>

Telcontar:~ # time journalctl > logs

real    1m14.254s
user    0m9.986s
sys     0m2.741s
Telcontar:~ # ls -lh logs
-rw-r--r-- 1 root root 44M Dec 25 15:28 logs
Telcontar:~ #

A second run is faster:

Telcontar:~ # time journalctl -l > logs


real    0m11.065s
user    0m9.680s
sys     0m1.372s
Telcontar:~ # ls -lh logs
-rw-r--r-- 1 root root 44M Dec 25 15:31 logs
Telcontar:~ #

which indicates that the main time factor is the hard disk reading. If I
had my 2 year log in the journal, it would be terribly big and slow
(consider: mine has 400 MB in 12 days).

Still the text dump is an order of magnitude smaller than the journal.


Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 “Bottle” at Telcontar)

Carlos,
You fall into that category of extremely large journals (yours reads over 400MB).
In the links I posted, you can restrict size to something much smaller (25MB is probably very small; something around 100MB might be a reasonable max, but YMMV). I don’t know that keeping 400MB of logs is useful (but again, everyone has their own reasons to keep every logfile since installation).
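For reference, the size cap those links describe lives in /etc/systemd/journald.conf; a minimal sketch, assuming the stock `SystemMaxUse=` and `RuntimeMaxUse=` options (the 100M values are illustrative, not a recommendation):

```
# /etc/systemd/journald.conf (excerpt)
[Journal]
# Cap the persistent journal under /var/log/journal
SystemMaxUse=100M
# Cap the volatile per-boot journal under /run/log/journal
RuntimeMaxUse=100M
```

Restarting systemd-journald (or rebooting) applies the limit, after which journald rotates out the oldest entries to stay under it.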

That said, there are some native journalctl equivalents of tail and head which might provide better performance:


# The following initially prints the most recent 10 lines, so it is
# equivalent to tail, but it will also continuously update
journalctl --follow
# Displays entries from the specified time onward
journalctl --since=-1day
# Similar to "--since", but displays entries up to a specified date
journalctl --until=2014-12-25
# or
journalctl --until=now

Since I’m not experiencing a slow journalctl, I can’t test whether the above commands fix the problem.

You’ll also find in the links I posted (Lennart Poettering’s comments) why the legacy syslog returns data quickly… The full syslog is stored as a series of smaller files, so parsing only the relevant small logfile is quick.

TSU

On 2014-12-26 17:56, tsu2 wrote:
>
> robin_listas;2685118 Wrote:

>> which indicates that the main time factor is the hard disk reading. If I
>> had my 2 year log in the journal, it would be terribly big and slow
>> (consider: mine has 400 MB in 12 days).
>>
>> Still the text dump is an order of magnitude smaller than the journal.

> Carlos,
> You fall in that category of extremely large journals (Yours reads over
> 400MB).
> In the links I posted, you can restrict size to something much smaller
> (25MB is probably very small, something around 100MB might be a max but
> YMMV). I don’t know that keeping 400MB of logs is useful (but again,
> everyone has their own reasons to keep every logfile since
> installation).

No, I don’t have a need for them, because what I use is syslog, and it is much smaller, covering months of logs.

What you see is the current session logs, the runtime, which are automatically deleted on reboot. And it is large because systemd is keeping nntp and mail debug logs; I know how to prune them from syslog (rotated out), but I don’t know how to selectively prune systemd log records.

The other problem is that the runtime logs do not seem to be compacted, as the dump to text is much smaller.

> That said, there are some native equivalents of tail and head which
> should be used which might provide better performance

That will be interesting for the OP.

> Since I’m not experiencing a slow journalctl, I can’t test whether the
> above commands fixes the problem.
>
> You’ll also find in the links I posted (Lennart Poettering’s comments) why
> the legacy syslog returns data quickly… The full syslog is stored as a
> series of smaller files so parsing only the relevant small logfile is
> quick.

LOL.

Look:


Telcontar:~ # ls /var/log/messages*
/var/log/messages               /var/log/messages-20140301.xz  /var/log/messages-20140506.xz  /var/log/messages-20140907.xz
/var/log/messages-20120816.bz2  /var/log/messages-20140305.xz  /var/log/messages-20140525.xz  /var/log/messages-20141001.xz
/var/log/messages-20131205.xz   /var/log/messages-20140402.xz  /var/log/messages-20140629.xz  /var/log/messages-20141028.xz
/var/log/messages-20140123.xz   /var/log/messages-20140411.xz  /var/log/messages-20140726.xz  /var/log/messages-20141130.xz
Telcontar:~ # time zgrep "Restarting kernel threads" /var/log/messages*xz
/var/log/messages-20131205.xz:<0.4> 2013-11-05 11:22:32 Telcontar kernel - - - [112002.376799] Restarting kernel threads ... done.
/var/log/messages-20131205.xz:<0.4> 2013-11-05 19:14:00 Telcontar kernel - - - [130783.602403] Restarting kernel threads ... done.
....
/var/log/messages-20141130.xz:<0.4> 2014-11-29 21:55:33 Telcontar kernel - - - [921439.104283] Restarting kernel threads ... done.
/var/log/messages-20141130.xz:<0.4> 2014-11-30 14:13:34 Telcontar kernel - - - [941075.361466] Restarting kernel threads ... done.
real    0m0.865s
user    0m0.443s
sys     0m0.115s
Telcontar:~ #

As you see, running a “query” on a year of syslog logs took less than a second - and it is not cached.
Overall, a much more efficient method :wink:


Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 “Bottle” at Telcontar)