Bug 1200682 - rsyslog hangs during shutdown
rsyslog hangs during shutdown
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Basesystem
Current
x86-64 openSUSE Tumbleweed
: P5 - None : Normal with 1 vote (vote)
: ---
Assigned To: Andreas Schwab
E-mail List
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2022-06-19 07:33 UTC by Oliver Schwabedissen
Modified: 2022-07-06 07:54 UTC (History)
16 users (show)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
supportconfig log (2.69 MB, application/x-xz-compressed-tar)
2022-06-20 13:37 UTC, Stratos Zolotas
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Oliver Schwabedissen 2022-06-19 07:33:13 UTC
Since one of the last snapshots my system takes very log to shutdown. I found that there is a 90 second timeout when trying to stop system logging services.

The message I see is

"A stop job is running for system logging service (1 min 30 sec)"

After 90 seconds rsyslog is killed and shutdown continues.

I also see this when trying to stop rsyslog on a running system (systemctl stop rsyslog.service):

Jun 19 09:03:14 frodo systemd[1]: Stopping System Logging Service...
Jun 19 09:04:44 frodo systemd[1]: rsyslog.service: State 'stop-sigterm' timed out. Killing.
Jun 19 09:04:44 frodo systemd[1]: rsyslog.service: Killing process 1783 (rsyslogd) with signal SIGKILL.
Jun 19 09:04:44 frodo systemd[1]: rsyslog.service: Killing process 1787 (in:imklog) with signal SIGKILL.
Jun 19 09:04:44 frodo systemd[1]: rsyslog.service: Killing process 1788 (rs:main Q:Reg) with signal SIGKILL.
Jun 19 09:04:44 frodo systemd[1]: rsyslog.service: Main process exited, code=killed, status=9/KILL
Jun 19 09:04:44 frodo systemd[1]: rsyslog.service: Failed with result 'timeout'.
Jun 19 09:04:44 frodo systemd[1]: Stopped System Logging Service.

I found a thread about this in the openSUSE forums: https://forums.opensuse.org/showthread.php/571409-Ninety-second-shutdown-delay-quot-A-stop-job-is-running-for-system-logging-service-quot?p=3135179

Some tried to replace rsyslog with syslog-ng which solved the problem. But this is a workaround and I think a fix should be found.
Comment 1 Stuart Rogers 2022-06-19 08:49:26 UTC
I have had this exact issue on two Tumbleweed installs, one desktop and one laptop. I also removed rsyslog and so far no further problems.
Comment 2 Stratos Zolotas 2022-06-20 09:37:27 UTC
Can confirm the same issue on my Tumbleweed system.
Comment 3 Thomas Blume 2022-06-20 12:03:40 UTC
Please install the supportutils package, create a supportconfig log from your machine and attach it.
Comment 4 Stratos Zolotas 2022-06-20 13:37:47 UTC
Created attachment 859709 [details]
supportconfig log

Requested supportconfig log.
Comment 5 Thomas Blume 2022-06-20 16:11:14 UTC
(In reply to Stratos Zolotas from comment #4)
> Created attachment 859709 [details]
> supportconfig log
> 
> Requested supportconfig log.

Thanks, I could reproduce the issue.
It hangs at canceling the kernel logging thread:

-->
155.554233929:main thread    : ../threads.c: request term via canceling for input thread imklog
--<

The code (https://github.com/rsyslog/rsyslog/blob/master/threads.c)
shows that it uses pthread_cancel therefore:

-->
/* terminate a thread gracefully.
 */
rsRetVal thrdTerminate(thrdInfo_t *pThis)
{
	DEFiRet;
	assert(pThis != NULL);

	if(pThis->bNeedsCancel) {
		DBGPRINTF("request term via canceling for input thread %s\n", pThis->name);
		if(dbgTimeoutToStderr) {
			fprintf(stderr, "rsyslogd debug: request term via canceling for "
				"input thread %s\n", pThis->name);
		}
		pthread_cancel(pThis->thrdID);
[...]
--<

Andreas, are you the right one to ask for that glibc stuff?
If so, do you know anything that would block a pthread_cancel on a recent tumbleweed?
Comment 6 Andreas Schwab 2022-06-21 08:37:33 UTC
Does the process block any signals?
Comment 7 Thomas Blume 2022-06-21 10:05:39 UTC
(In reply to Andreas Schwab from comment #6)
> Does the process block any signals?

rsyslog has a signal handler, but that only sets a variable and leaves the action to the main process:

-->
/* rsyslogdDoDie() is a signal handler. If called, it sets the bFinished variable
 * to indicate the program should terminate. However, it does not terminate
 * it itself, because that causes issues with multi-threading. The actual
 * termination is then done on the main thread. This solution might introduce
 * a minimal delay, but it is much cleaner than the approach of doing everything
 * inside the signal handler.
 * rgerhards, 2005-10-26
 * Note:
 * - we do not call DBGPRINTF() as this may cause us to block in case something
 *   with the threading is wrong.
 * - we do not really care about the return state of write(), but we need this
 *   strange check we do to silence compiler warnings (thanks, Ubuntu!)
 */
--<


the main loop sets a signal mask:

-->
static void
mainloop(void)
{
!·······time_t tTime;
!·······sigset_t origmask;
!·······sigset_t sigblockset;
!·······int need_free_mutex;

!·······sigemptyset(&sigblockset);
!·······sigaddset(&sigblockset, SIGTERM);
!·······sigaddset(&sigblockset, SIGCHLD);
!·······sigaddset(&sigblockset, SIGHUP);

!·······do {
!·······!·······pthread_sigmask(SIG_BLOCK, &sigblockset, &origmask);
!·······!·······pthread_mutex_lock(&mutChildDied);
!·······!·······need_free_mutex = 1;
!·······!·······if(bChildDied) {
!·······!·······!·······bChildDied = 0;
!·······!·······!·······pthread_mutex_unlock(&mutChildDied);
!·······!·······!·······need_free_mutex = 0;
!·······!·······!·······reapChild();
!·······!·······}
!·······!·······if(need_free_mutex) {
!·······!·······!·······pthread_mutex_unlock(&mutChildDied);
!·······!·······}

!·······!·······pthread_mutex_lock(&mutHadHUP);
!·······!·······need_free_mutex = 1;
!·······!·······if(bHadHUP) {
!·······!·······!·······bHadHUP = 0;
!·······!·······!·······need_free_mutex = 0;
!·······!·······!·······pthread_mutex_unlock(&mutHadHUP);
!·······!·······!·······doHUP();
!·······!·······}
!·······!·······if(need_free_mutex) {
!·······!·······!·······pthread_mutex_unlock(&mutHadHUP);
!·······!·······}

!·······!·······processImInternal();

!·······!·······if(bFinished)
!·······!·······!·······break;!·/* exit as quickly as possible */

!·······!·······wait_timeout(&origmask);
!·······!·······pthread_sigmask(SIG_UNBLOCK, &sigblockset, NULL);

!·······!·······janitorRun();

!·······!·······datetime.GetTime(&tTime);
!·······!·······checkGoneAwaySenders(tTime);

!·······} while(!bFinished); /* end do ... while() */
}
--<

Does that help or do you need more information?
Comment 8 Jiri Slaby 2022-06-22 08:05:18 UTC
FWIW, the thread waits in:
[<0>] syslog_print+0xda/0x2e0
[<0>] kmsg_read+0x3e/0x50
[<0>] vfs_read+0x9b/0x190
[<0>] ksys_read+0x69/0xf0
[<0>] do_syscall_64+0x5b/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae


The very first entry translates to:
$ nm vmlinux|grep -w syslog_print
ffffffff8112c280 t syslog_print
$ addr2line -e vmlinux `printf "0x%x" $((0xffffffff8112c280+0xda))`
/usr/src/debug/kernel-default-5.18.5-27.1.g983c97f.x86_64/linux-5.18/linux-obj/../kernel/printk/printk.c:1537 (discriminator 9)

Which is:
                len = wait_event_interruptible(log_wait,
                                prb_read_valid(prb, seq, NULL));

So it's the S state. I'm not sure why sending a TERM signal to the thread (via gdb's "signal" command while being switched to the input thread) doesn't cause it to die...
Comment 9 Jiri Slaby 2022-06-22 08:56:58 UTC
FTR, I checked 5.17.9, 5.18.2 and 5.18.4 kernels. They are all affected. The same holds for older glibc without nptl patches.

But it is enough to do:
rpm -Uvh http://download.opensuse.org/history/20220618/tumbleweed/repo/oss/x86_64/rsyslog-8.2204.1-1.1.x86_64.rpm --oldpackage

So I assume the full rebuild caused this, perhaps the linkage changes?
Comment 10 Jiri Slaby 2022-06-22 08:58:33 UTC
(In reply to Jiri Slaby from comment #9)
> FTR, I checked 5.17.9, 5.18.2 and 5.18.4 kernels. They are all affected. The
> same holds for older glibc without nptl patches.
> 
> But it is enough to do:
> rpm -Uvh
> http://download.opensuse.org/history/20220618/tumbleweed/repo/oss/x86_64/
> rsyslog-8.2204.1-1.1.x86_64.rpm --oldpackage
> 
> So I assume the full rebuild caused this, perhaps the linkage changes?

BTW doing update from the very same snapshot, but the newer (rebuilt) version:
rpm -Uvh http://download.opensuse.org/history/20220618/tumbleweed/repo/oss/x86_64/rsyslog-8.2204.1-1.2.x86_64.rpm

makes it hang again.
Comment 11 Jiri Slaby 2022-06-22 09:25:55 UTC
The info was provided, I think.
Comment 12 Thomas Blume 2022-06-22 10:48:35 UTC
(In reply to Jiri Slaby from comment #11)
> The info was provided, I think.

Thanks a lot colleagues!
Indeed version rsyslog-8.2204.1-1.1.x86_64 works, but rsyslog-8.2204.1-1.2.x86_64 hangs, even though both builds are from identical sources sources.
Comment 13 Andreas Schwab 2022-06-22 12:01:50 UTC
It's an age-old bug that was uncovered by _FORTIFY_SOURCE=3.
Comment 14 Michael Pujos 2022-06-26 10:30:21 UTC
Unsurprisingly, also happens on a 'systemctl stop rsyslog'
Comment 15 OBSbugzilla Bot 2022-06-27 10:40:03 UTC
This is an autogenerated message for OBS integration:
This bug (1200682) was mentioned in
https://build.opensuse.org/request/show/985279 Factory / glibc
Comment 16 Andreas Schwab 2022-07-06 07:54:17 UTC
Fixed.