Bugzilla – Bug 918226
systemd segfaults after updating from 208-23.3 to 208-28.1
Last modified: 2015-05-05 06:50:33 UTC
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0
Build Identifier:

On two different openSUSE 13.1 x86_64 systems, systemd segfaults several hours after installing:

systemd-208-28.1.x86_64.rpm
systemd-32bit-208-28.1.x86_64.rpm
systemd-rpm-macros-2-28.1.noarch.rpm
systemd-sysvinit-208-28.1.x86_64.rpm
...

/var/log/messages contains:

kernel: [680150.869695] systemd[1]: segfault at 137a7020 ip 000000000040e526 sp 00007fffd861c290 error 4 in systemd[400000+ed000]
systemd[1]: Caught <SEGV>, dumped core as pid 32253.
systemd[1]: Freezing execution

Going back to v208-23.3 seems to solve the problem. The first symptom that made me notice this problem is that thousands of defunct processes accumulate on the system.

I tried running gdb against the core dump file, but am not sure how much value this will be due to the lack of symbol table information:

# gdb /bin/systemd core
Reading symbols from /usr/lib/systemd/systemd...Missing separate debuginfo for /usr/lib/systemd/systemd
Try: zypper install -C "debuginfo(build-id)=14e9c2ba2f551a445f792d053a7f9dc593a60a2e"
(no debugging symbols found)...done.
[New LWP 32253]
Core was generated by `/usr/lib/systemd/systemd --system --deserialize 20'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f5644b2d8cb in ?? ()
(gdb) backtrace
#0  0x00007f5644b2d8cb in ?? ()
#1  0x000000000040cbcb in ?? ()
#2  0x00007f5644b2d9f0 in ?? ()
#3  0x0000000000000001 in ?? ()
#4  0x0000000000000000 in ?? ()

If anyone can provide a link as to how to get symbol table information, I'll certainly try to get it (see also the debuginfo sketch after this report). The suggested zypper command doesn't help without the missing repository:

# zypper install -C "debuginfo(build-id)=14e9c2ba2f551a445f792d053a7f9dc593a60a2e"
Loading repository data...
Reading installed packages...
No provider of 'debuginfo(build-id) = 14e9c2ba2f551a445f792d053a7f9dc593a60a2e' found.
Resolving package dependencies...
Nothing to do.

Reproducible: Always

Steps to Reproduce:
0. HAPPENS EVERY TIME ON SOME COMPUTERS AFTER SEVERAL HOURS OR OVERNIGHT
1. Install systemd 208-28.1
2. Wait several hours (overnight)
3. grep segfault /var/log/messages

Actual Results:
- systemd segfault message in /var/log/messages
- ps -ef | grep defunct | wc -l
  8992
- cron jobs don't run:

systemd-logind[10804]: Failed to start session scope session-7576.scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken. org.freedesktop.DBus.Error.NoReply
/usr/sbin/cron[32297]: pam_systemd(crond:session): Failed to create session: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2015-02-16T09:41:03.642884-05:00 iggy dbus[920]: [system] Reloaded configuration
/usr/sbin/cron[32296]: pam_systemd(crond:session): Failed to create session: Input/output error

Expected Results:
I did not expect to see a segfault message in /var/log/messages, or thousands of defunct processes.

I have two computers exhibiting this problem: one is an email server running kernel 3.11.10-25-default in multiuser text mode (not running X11); the other is my KDE desktop running kernel 3.11.10-25-desktop with NVIDIA-Linux-x86_64-346.35.run.
Another system running in text mode does not exhibit the problem at all (at least not yet). Thank you for looking into this!! Please let me know if I can be of further assistance.
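For reference, the missing symbol information normally comes from the openSUSE debuginfo repositories. A minimal sketch of how to pull it in, assuming the standard debug repository layout for 13.1 on download.opensuse.org (the repository URLs and alias names here are assumptions; adjust them to your mirror):

# Add the debug repositories for 13.1 (URLs assumed; verify against your mirror)
zypper ar -f http://download.opensuse.org/debug/distribution/13.1/repo/oss/ repo-debug
zypper ar -f http://download.opensuse.org/debug/update/13.1/ repo-debug-update
# Install the matching debuginfo/debugsource packages, then re-run gdb on the core
zypper in systemd-debuginfo systemd-debugsource
gdb /usr/lib/systemd/systemd core

With the debuginfo package installed, gdb can resolve the frames instead of printing "?? ()".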
I also have this problem: The new systemd segfaults (I can provide cores). Additionally there's no way to safely shutdown/reboot the system after this crash.
See also Bug 918158 and Bug 918231
All my 13.1 machines are affected by this bug. Here is the command to get back to the previous version:

zypper in --oldpackage systemd-32bit-208-23.3.x86_64 systemd-208-23.3.x86_64 systemd-sysvinit-208-23.3.x86_64
I can confirm, all our 13.1 servers (IBM x3650 and x3550) are affected. Going back to v208-23.3 solves the problem.
You need the systemd debuginfo packages to see details of the backtrace. From the core in bug 918231 I see this:

-->--
Core was generated by `/usr/lib/systemd/systemd --system --deserialize 20'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f38df5818cb in raise () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f38df5818cb in raise () from /lib64/libpthread.so.0
#1  0x000000000040cbcb in crash (sig=11) at src/core/main.c:143
#2  <signal handler called>
#3  0x000000000047912e in unit_unwatch_pid (u=u@entry=0xb951a0, pid=12842) at src/core/unit.c:1682
#4  0x000000000040e50f in manager_dispatch_sigchld (m=m@entry=0xae69d0) at src/core/manager.c:1392
#5  0x0000000000413a51 in manager_process_signal_fd (m=<optimized out>) at src/core/manager.c:1636
#6  process_event (ev=0x7fff72a0c4c0, m=0xae69d0) at src/core/manager.c:1661
#7  manager_loop (m=0xae69d0) at src/core/manager.c:1858
#8  0x000000000040ad44 in main (argc=<optimized out>, argv=0x7fff72a0cd28) at src/core/main.c:1652
--<--

Investigating...
*** Bug 918231 has been marked as a duplicate of this bug. ***
So, a SIGCHLD is sent to PID 12842:

-->--
(gdb) bt full
#0  0x00007f38df5818cb in raise () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000000000040cbcb in crash (sig=11) at src/core/main.c:143
        rl = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        sa = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0x0}
        pid = 0
        __func__ = "crash"
        __PRETTY_FUNCTION__ = "crash"
#2  <signal handler called>
No symbol table info available.
#3  0x000000000047912e in unit_unwatch_pid (u=u@entry=0xb951a0, pid=12842) at src/core/unit.c:1682
        __PRETTY_FUNCTION__ = "unit_unwatch_pid"
#4  0x000000000040e50f in manager_dispatch_sigchld (m=m@entry=0xae69d0) at src/core/manager.c:1392
        si = {si_signo = 17, si_errno = 0, si_code = 1, _sifields = {_pad = {12842, 40, 1, 0 <repeats 25 times>}, _kill = {si_pid = 12842, si_uid = 40}, _timer = {si_tid = 12842, si_overrun = 40, si_sigval = {sival_int = 1, sival_ptr = 0x1}}, _rt = {si_pid = 12842, si_uid = 40, si_sigval = {sival_int = 1, sival_ptr = 0x1}}, _sigchld = {si_pid = 12842, si_uid = 40, si_status = 1, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x280000322a, si_addr_lsb = 1}, _sigpoll = {si_band = 171798704682, si_fd = 1}, _sigsys = {_call_addr = 0x280000322a, _syscall = 1, _arch = 0}}}
        u = 0xb951a0
        r = <optimized out>
        __PRETTY_FUNCTION__ = "manager_dispatch_sigchld"
        __func__ = "manager_dispatch_sigchld"
#5  0x0000000000413a51 in manager_process_signal_fd (m=<optimized out>) at src/core/manager.c:1636
        sfsi = {ssi_signo = 17, ssi_errno = 0, ssi_code = 1, ssi_pid = 12842, ssi_uid = 40, ssi_fd = 0, ssi_tid = 0, ssi_band = 0, ssi_overrun = 0, ssi_trapno = 0, ssi_status = 1, ssi_int = 0, ssi_ptr = 0, ssi_utime = 0, ssi_stime = 0, ssi_addr = 0, __pad = '\000' <repeats 47 times>}
        sigchld = true
--<--

which is a re-executed (--deserialize) systemd system instance:

(gdb) info proc 12842
exe = '/usr/lib/systemd/systemd --system --deserialize 20'
The problem seems to be in unit_unwatch_pid, which calls:

1682    hashmap_remove_value(u->manager->watch_pids, LONG_TO_PTR(pid), u);

u->manager is a null pointer:

-->--
(gdb) down
#3  0x000000000047912e in unit_unwatch_pid (u=u@entry=0xb951a0, pid=12842) at src/core/unit.c:1682
1682            hashmap_remove_value(u->manager->watch_pids, LONG_TO_PTR(pid), u);
(gdb) print *u->manager
Cannot access memory at address 0x0
(gdb) print *u
$7 = {manager = 0x0, type = UNIT_SERVICE, load_state = UNIT_STUB, merged_into = 0xa0, id = 0x40 <Address 0x40 out of bounds>, instance = 0xb95bd0 "\001\001", names = 0xb95120, dependencies = {0x0, 0x0, 0x0, 0x0, 0x0, 0x281, 0xb95ad0, 0xb955d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x771e800000000, 0x0, 0x0, 0x0, 0x0, 0x0}, requires_mounts_for = 0x0, description = 0xa0 <Address 0xa0 out of bounds>, documentation = 0x40, fragment_path = 0xb95a10 "\001\001", source_path = 0xb95200 "\320Z\271", dropin_paths = 0x0, fragment_mtime = 0, source_mtime = 0, dropin_mtime = 0, job = 0x0, nop_job = 0x161, job_timeout = 12146960, refs = 0xb95610, conditions = 0x0, condition_timestamp = {realtime = 0, monotonic = 0}, inactive_exit_timestamp = {realtime = 0, monotonic = 0}, active_enter_timestamp = {realtime = 0, monotonic = 0}, active_exit_timestamp = {realtime = 0, monotonic = 0}, inactive_enter_timestamp = {realtime = 2095570378293248, monotonic = 0}, cgroup_path = 0x0, cgroup_mask = (unknown: 0), slice = {unit = 0x0, refs_next = 0x0, refs_prev = 0x0}, units_by_type_next = 0xa0, units_by_type_prev = 0x40, has_requires_mounts_for_next = 0xb95910, has_requires_mounts_for_prev = 0xb952e0, load_queue_next = 0x0, load_queue_prev = 0x0, dbus_queue_next = 0x0, dbus_queue_prev = 0x0, cleanup_queue_next = 0x0, cleanup_queue_prev = 0x41, gc_queue_next = 0xb954b0, gc_queue_prev = 0xb95650, cgroup_queue_next = 0x0, cgroup_queue_prev = 0x0, pids = 0x0, gc_marker = 0, deserialized_job = 0, load_error = 288, unit_file_state = UNIT_FILE_ENABLED, stop_when_unneeded = 64, default_dependencies = false, refuse_manual_start = false, refuse_manual_stop = false, allow_isolate = false, on_failure_isolate = false, ignore_on_isolate = false, ignore_on_snapshot = false, condition_result = 16, transient = 86, in_load_queue = true, in_dbus_queue = false, in_cleanup_queue = false, in_gc_queue = true, in_cgroup_queue = true, sent_dbus_new_signal = true, no_gc = false, in_audit = true, cgroup_realized = false}
--<--
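To make the failure mode concrete at the code level: the relevant section of src/core/unit.c is quoted in the addr2line analysis further down. A defensive guard along the following lines would avoid this particular NULL dereference. This is an illustrative sketch only, built on top of the quoted systemd source (so it relies on systemd-internal types and macros); it is not the fix that was eventually released, which reverted the offending backport instead:

-->--
/* Sketch only -- based on the unit.c excerpt quoted later in this report.
 * The NULL check is an illustrative addition, not the released fix. */
void unit_unwatch_pid(Unit *u, pid_t pid) {
        assert(u);
        assert(pid >= 1);

        /* The core dumps above show u->manager == 0x0 at this point;
         * returning early avoids the crash, though not its root cause. */
        if (!u->manager)
                return;

        hashmap_remove_value(u->manager->watch_pids, LONG_TO_PTR(pid), u);
        set_remove(u->pids, LONG_TO_PTR(pid));
}
--<--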
(In reply to Martin Schröder from comment #1)
> I also have this problem: The new systemd segfaults (I can provide cores).
> Additionally there's no way to safely shutdown/reboot the system after this
> crash.

Please provide your cores to double check the root cause.
(In reply to Thomas Blume from comment #9)
> Please provide your cores to double check the root cause.

I can't; see bug 918366. :-(
One core is attached to bug #918231
Created attachment 623677 [details]
coredump of systemd process

systemd just core dumped on my openSUSE 13.1 machine. I just attached the core file.
Found another core in bug 878853. This one goes further than the one I've analyzed in comment #8. u->manager is correct:

-->--
(gdb) down
#5  0x000000000047913a in unit_unwatch_pid (u=u@entry=0x1f4f130, pid=16907) at src/core/unit.c:1682
1682            hashmap_remove_value(u->manager->watch_pids, LONG_TO_PTR(pid), u);
(gdb) print u->manager
$23 = (Manager *) 0x7f1fa9a3fa68 <main_arena+1096>
(gdb) print u->manager->watch_pids
$24 = (Hashmap *) 0x1f60590
--<--

But it crashes a little later in a hashmap function.
Created attachment 623680 [details] Coredump of systemd crash
Created attachment 623682 [details] And another Coredump
Created attachment 623683 [details] core of systemd
Created attachment 623684 [details] core of systemd
I got no segfaults on 13.1. The machine is running as a desktop, but surely longer than 8 hours a day.

(Mo 16 Feb 2015 13:56:43 CET) systemd-rpm-macros
(So 15 Feb 2015 21:46:02 CET) kcm_systemd
(Mo 16 Feb 2015 13:56:42 CET) systemd-32bit
(Mo 16 Feb 2015 13:57:40 CET) systemd
(Di 01 Apr 2014 03:58:07 CEST) systemd-presets-branding-openSUSE
(Do 29 Mai 2014 16:52:37 CEST) systemd-ui
(Mo 16 Feb 2015 13:57:43 CET) systemd-sysvinit

2015-01-18 01:27:42|install|kcm_systemd|0.7.0-1.7|x86_64||KDE4.12extra|b34643b07b85b829ee27bd4310367eaa178ffefa890e8c4f58d02c0ad07067c9|
2015-02-15 21:46:02|install|kcm_systemd|0.7.0-1.8|x86_64||KDE4.12extra|6a696479d6955c916a0ae4ac14c51f02c773390f33135d15e8a8905e1fc09eb7|
2015-02-16 13:56:43|install|systemd-32bit|208-28.1|x86_64||repo-update|322f5c42d35d0229a3e596c0d6da21a473f1a53c7f5337361adb3830b4e5200b|
2015-02-16 13:56:43|install|systemd-rpm-macros|2-28.1|noarch||repo-update|170153237e58549dcb5f7934fa2e37cd7ea2a217a326c7a73dd225d82cb94293|
2015-02-16 13:57:42|install|systemd|208-28.1|x86_64||repo-update|25a12bc00ef463abca468bfb518372ffd9baec8fb49295596708c15ee09ee8ce|
2015-02-16 13:57:43|install|systemd-sysvinit|208-28.1|x86_64||repo-update|b0b85eee3dd9453dd39b45835f634f86dd4faf4a3e12948bcf085de76ad00357|
We have about a dozen 13.1 systems; all experienced this issue a couple of hours after the new systemd started running. Adding to the earlier comment, here's a temporary workaround we put in until a fixed version is released. Of course, this only works if you still have a functioning systemd:

zypper -n in --oldpackage systemd-32bit-208-23.3.x86_64 systemd-208-23.3.x86_64 systemd-sysvinit-208-23.3.x86_64 && zypper al systemd-32bit systemd systemd-sysvinit
Hi folks, not sure whether it is related, but you can list the old versions of systemd and pick one to use with zypper's --oldpackage option:

lynx -dump -nonumbers -nolist http://download.opensuse.org/update/13.1/x86_64/ | awk '/systemd/{print $NF}'
Unfortunately, I still cannot reproduce the issue myself; it works perfectly in my VM. However, all the cores I've seen so far are related to the call to unit_unwatch_pid from manager.c. I have now built some test packages with a proposed fix for this call. Please find them at:

http://download.opensuse.org/repositories/home:/tsaupe:/branches:/openSUSE:/13.1:/Update:/bsc918226/standard/

It would be good if you could quickly test them and report feedback.
I have your 208-31.1 packages installed (specifically systemd, systemd-sysvinit, and systemd-32bit) and everything seems fine so far.
Got the proposed updates running on two systems as a test. One received a complete reboot after the update was applied, the second one didn't. Since it took at least two hours for the segfault to happen on either system, we may need to wait a little bit for verification.

Simple recipe:

zypper ar -G http://download.opensuse.org/repositories/home:/tsaupe:/branches:/openSUSE:/13.1:/Update:/bsc918226/standard systemd-patch
zypper dup --repo systemd-patch

Note: beware that this will yank out rsyslog if you happen to use it.
13.1, up-to-date updates & patches, kernel 3.11, desktop.

Message from syslogd@bhaal at Feb 18 22:36:02 ...
kernel:[ 6639.687047] systemd[1]: segfault at 1188a120 ip 000000000040e526 sp 00007fffb308bf00 error 4 in systemd[400000+ed000]
kernel:[ 6639.687047] systemd[1]: segfault at 1188a120 ip 000000000040e526 sp 00007fffb308bf00 error 4 in systemd[400000+ed000]

Feb 18 22:36:02 bhaal kernel: [ 6639.687047] systemd[1]: segfault at 1188a120 ip 000000000040e526 sp 00007fffb308bf00 error 4 in systemd[400000+ed000]
Feb 18 22:36:03 bhaal systemd[1]: Caught <SEGV>, dumped core as pid 11629.
Feb 18 22:36:03 bhaal systemd[1]: Freezing execution.
Feb 18 22:36:01 bhaal systemd-logind[664]: message repeated 1130 times: [ Failed to store session release timer fd]
Feb 18 22:36:03 bhaal systemd-logind[664]: Failed to abandon scope session-1186.scope
Feb 18 22:36:03 bhaal systemd-logind[664]: Failed to abandon session scope: Message did not receive a reply (timeout by message bus)

bhaal:~ # systemctl daemon-reload
Failed to get D-Bus connection: Failed to connect to socket /run/systemd/private: Connection refused

bhaal:~ # ps auxw|grep dbus
message+   663  0.0  0.0  41964  2236 ?      Ss   20:45   0:03 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root     21043  0.0  0.0   9264   928 pts/2  S+   22:39   0:00 /usr/bin/grep --color=auto dbus
My server machine is monitored by nagios via SSH, so I get tons of new processes during a few hours. Maybe the mass of processes is part of the cause?
I have had 13k+ zombified processes before the machine bailed out with "can't allocate memory" for even the simplest ls command. Those processes had been started by e.g. cron (like logfile scanning for intrusions). But the zombification started only AFTER the segv of systemd...
My productive servers die every day now - and must be restarted via hardware reset (due to the script-kiddie SSH attacks, which cause 15k+ sshd zombies within a few hours). Perhaps it would be a good idea to distribute VERY SOON a patch that just resets the modifications of the last systemd patch (2015-149?) to the prior state?
(In reply to Sysadmin VBI from comment #24)
> Got the proposed updates running on two systems as a test. One received a
> complete reboot after the update was applied, the second one didn't. Since
> it took at least two hours for the segfault to happen on either system, we
> may need to wait a little bit for verification.

Two hours seems to be the maximum amount of time systemd survives before it segfaults on one of those hosts. This holds whether or not the host was rebooted after applying the patched version. Not sure if it helps, but the host in question is quite busy, running a nagios server, a zabbix server, and mysqld for zabbix. All the other hosts we've had issues with happen to run the NRPE service.
Mine has also segfaulted and core dumped twice in the last two days. The first core dump is attached, but it is likely useless since there is no debug info.

2015-02-19T01:09:32.081094+08:00 ntmph kernel: [61701.439204] systemd[1]: segfault at a8 ip 000000000047912e sp 00007fffe9eb2b20 error 4 in systemd[400000+ed000]
2015-02-19T01:09:32.238037+08:00 ntmph systemd[1]: Caught <SEGV>, dumped core as pid 9526.
2015-02-19T01:09:32.238530+08:00 ntmph systemd[1]: Freezing execution.

I have to reboot with "reboot -f" since it cannot connect to init. I have tried updating to the packages in http://download.opensuse.org/repositories/Base:/System/openSUSE_13.1/, which was mentioned in the bugzilla link in the patch announcement. I decided to try this instead of going back to the last version, which was stable on this machine. Hopefully the 210 version will be stable; if not, I will go back.
Created attachment 623781 [details] core dump of first systemd crash
Yesterday we changed one server that crashed with 208-28.1 to your new 208-31.1; no problem until now. We left one virtual server on 208-23.3; no problem until now either. We changed one server to 210-913.1 two days ago; no problem until now.

Well... it seems that virtual servers are not affected, and it seems that 208-31.1 is fixing the problem. But both of those servers are not really productive, with low usage, and that makes a big difference. 210-913.1 seems to be fine, though: that server is under heavy load and crashed every 2-4 hours with 208-23.3. The 210-913.1 packages are from http://download.opensuse.org/repositories/Base:/System/openSUSE_13.1
Got it here too, on a server (virtual, under VMware) which executes a rather large number of nrpe checks (and thus sees a lot of batched forking). glibc complains via the kernel log:

systemd[1]: segfault at 1010514 ip 000000000047912e sp 00007fff9a1c5670 error 4 in systemd[400000+ed000]

It started happening with the update from systemd-208-23.3.x86_64 to systemd-208-28.1.x86_64. Looking up the IP via addr2line using the debuginfo & debugsource packages, I get:

addr2line -e /usr/lib/systemd/systemd 0x47912e
/usr/src/debug/systemd-208/src/core/unit.c:1682

/usr/src/debug/systemd-208/src/core/unit.c
1677:
1678: void unit_unwatch_pid(Unit *u, pid_t pid) {
1679:         assert(u);
1680:         assert(pid >= 1);
1681:
1682:         hashmap_remove_value(u->manager->watch_pids, LONG_TO_PTR(pid), u);
1683:         set_remove(u->pids, LONG_TO_PTR(pid));
1684: }
1685:

This seems to match the report from comment #8 with a NULL u->manager.

Once systemd has crashed, nrpe zombies start piling up until the kernel refuses more processes (clone() returns -1 with errno=EAGAIN) due to the rlimit on per-user process count. One possible reason why nrpe triggers this bug more than anything else is that it forks a few levels deep for each check and seems to have some of its children reparented to init. nrpe is running as a daemon, not as an xinetd service.
Created attachment 623802 [details] core dump of systemd-208-31.1.x86_64 I got also a core dump from version systemd-208-31.1. The dump is attached. It's however the only dump of several machines over night. This version core dumps much less.
Thanks for the new dumps, still investigating. We seem to have invalid pointers at random places. We might have a duplicate of: https://bugs.freedesktop.org/show_bug.cgi?id=81327 here. I will provide new testpackages asap.
I have also experienced problems with systemd-208-28.1.x86_64. Yesterday I had a server crash, but I can't provide a coredump. Currently the server is running again, but I see many worrying error messages in /var/log/messages:

systemd-logind[875]: Failed to store session release timer fd

I fear that this could result in a new crash. Should I temporarily downgrade to systemd 208-23.3 or test the new systemd package?
Hello, I can confirm, that 208-31.1 from Test-repo does *not* fix the problem.
(In reply to Richard Hammerl from comment #36)
>
> I fear that this could result in a new crash. Should I temporary downgrade
> to systemd 208-23.3 or test the new systemd package?

Yes, the downgrade is the current workaround.
(In reply to Tilman Sandig from comment #28)
> My productive servers die every day now - and must be restarted via hardware
> reset (due to the script-kiddies-ssh-attacks, causing 15k+ sshd-zombies in
> few hours). Perhaps it would be a good idea to distribute VERY SOON a patch
> that just resets the modfications of the last systemd-patch (2015-149?) to
> the prior state?

I second this motion.
I've downgraded via the command in comment #3. Do I also need to reboot?
(In reply to Mathias Homann from comment #39)
> (In reply to Tilman Sandig from comment #28)
> > My productive servers die every day now - and must be restarted via hardware
> > reset (due to the script-kiddies-ssh-attacks, causing 15k+ sshd-zombies in
> > few hours). Perhaps it would be a good idea to distribute VERY SOON a patch
> > that just resets the modfications of the last systemd-patch (2015-149?) to
> > the prior state?
>
> I second this motion.

Agreed, removing the latest patch.
This will give you the hanging ssh sessions at system shutdown back, but you should see no more systemd crashes.
Test packages are available at:

http://download.opensuse.org/repositories/home:/tsaupe:/branches:/openSUSE:/13.1:/Update:/bsc918226/standard/

Please double check and confirm that the systemd crash is gone now.
(In reply to Adam Spiers from comment #40)

As root, running "systemctl daemon-reexec" may help, but this is already done by the rpm postinstall scriptlet.
The issue definitely triggers after a high number of things systemd has had to run. In my case it was two (uncritical....) production servers that receive a nagios check connection once per minute against a systemd socket-activated script. They do not survive for more than a few hours (updated this morning, got the SEGV hang an hour ago).

I am running one of these boxes with the updated 208-32.1 packages and will report if I see the issue again.
(In reply to Thomas Blume from comment #41)
> (In reply to Mathias Homann from comment #39)
> > (In reply to Tilman Sandig from comment #28)
> > > My productive servers die every day now - and must be restarted via hardware
> > > reset (due to the script-kiddies-ssh-attacks, causing 15k+ sshd-zombies in
> > > few hours). Perhaps it would be a good idea to distribute VERY SOON a patch
> > > that just resets the modfications of the last systemd-patch (2015-149?) to
> > > the prior state?
> >
> > I second this motion.
>
> Agreed, removing the latest patch.
> This will give you the hanging ssh sessions at system shutdown back, but you
> should see no more systemd crashes.
> Testpackages are available at:
>
> http://download.opensuse.org/repositories/home:/tsaupe:/branches:/openSUSE:/13.1:/Update:/bsc918226/standard/
>
> Please double check and confirm that the systemd crash is gone now.

I appreciate that - but I think this bug is very critical and should be reset asap. Testing is good, but in this case it needs many hours, and the results (absence of a crash after X hours) may not be reliable.
Is it possible to make a binary diff of the built libs before the patch and after the removal of the patch, to confirm the correctness of the removal, and then release it immediately?
(In reply to Tilman Sandig from comment #44)
> (In reply to Thomas Blume from comment #41)
> > (In reply to Mathias Homann from comment #39)
> > > (In reply to Tilman Sandig from comment #28)
>
> I appreciate that - but I think, this bug is very critical and should be
> reset asap. Testing is good, but in this case it needs many hours and
> results (absence of a crash after X hours) may be not reliable.
> Is it possible to make a binary diff of the built libs before the patch and
> after the removal of the patch to confirm the correctness of the removal and
> then release it immediately?

Sorry, but this time I want to make sure that the update is correct.
For an immediate fix, please downgrade to the previous systemd package, e.g.:

zypper in -f systemd=208-23.3
I also just removed the systemd update from the 13.1 update repository.
Looks like systemd-208-32.1.x86_64 may be working. Our most frequent offender lasted usually around two hours before segfaulting on the previous two packages. We're now up to close to eight hours and still fully working.
I have been running systemd-210-913 from http://download.opensuse.org/repositories/Base:/System/openSUSE_13.1/ for over 24 hours here, and it appears to have solved the problem caused by the update. (Thanks Marcus for pulling that update. First patch in 15 years of running SUSE/openSUSE to have our server fail because of an update). So far, for me, 210-913 seems to be running well, maybe even better than 208-23. I am sticking with 210-913, and was sure glad to see things back to normal again.
My test system, with several once-per-minute cronjobs and incoming socket activation, has now been stable for the last 16 hours, running the 208-32.1 test package.

However, looking at the journal, I just noticed something new with this update that wasn't there before. Each and every one of the once-per-minute cron runs, in addition to the usual cron-related logging, now puts the following into the logs:

Feb 19 16:05:01 dev9 systemd[1]: Starting user-0.slice.
Feb 19 16:05:01 dev9 systemd[1]: Created slice user-0.slice.
Feb 19 16:05:01 dev9 systemd[1]: Starting User Manager for 0...
....
Feb 19 16:05:01 dev9 systemd[436]: Starting Default.
Feb 19 16:05:01 dev9 systemd[436]: Reached target Default.
Feb 19 16:05:01 dev9 systemd[436]: Startup finished in 3ms.
Feb 19 16:05:01 dev9 systemd[1]: Started User Manager for 0.
... cronjobs run ...
Feb 19 16:05:10 dev9 systemd[1]: Stopping User Manager for 0...
Feb 19 16:05:10 dev9 systemd[436]: Stopping Default.
Feb 19 16:05:10 dev9 systemd[436]: Stopped target Default.
Feb 19 16:05:10 dev9 systemd[436]: Starting Shutdown.
Feb 19 16:05:10 dev9 systemd[436]: Reached target Shutdown.
Feb 19 16:05:10 dev9 systemd[436]: Starting Exit the Session...
Feb 19 16:05:10 dev9 systemd[1]: Stopped User Manager for 0.
Feb 19 16:05:10 dev9 systemd[1]: Stopping user-0.slice.
Feb 19 16:05:10 dev9 systemd[1]: Removed slice user-0.slice.

This does not look particularly healthy...
(In reply to Patrick Schaaf from comment #49)
>
> Feb 19 16:05:01 dev9 systemd[1]: Starting user-0.slice.
> Feb 19 16:05:01 dev9 systemd[1]: Created slice user-0.slice.
> Feb 19 16:05:01 dev9 systemd[1]: Starting User Manager for 0...
> ....
> Feb 19 16:05:01 dev9 systemd[436]: Starting Default.
> Feb 19 16:05:01 dev9 systemd[436]: Reached target Default.
> Feb 19 16:05:01 dev9 systemd[436]: Startup finished in 3ms.
> Feb 19 16:05:01 dev9 systemd[1]: Started User Manager for 0.
> ... cronjobs run ...
> Feb 19 16:05:10 dev9 systemd[1]: Stopping User Manager for 0...
> Feb 19 16:05:10 dev9 systemd[436]: Stopping Default.
> Feb 19 16:05:10 dev9 systemd[436]: Stopped target Default.
> Feb 19 16:05:10 dev9 systemd[436]: Starting Shutdown.
> Feb 19 16:05:10 dev9 systemd[436]: Reached target Shutdown.
> Feb 19 16:05:10 dev9 systemd[436]: Starting Exit the Session...
> Feb 19 16:05:10 dev9 systemd[1]: Stopped User Manager for 0.
> Feb 19 16:05:10 dev9 systemd[1]: Stopping user-0.slice.
> Feb 19 16:05:10 dev9 systemd[1]: Removed slice user-0.slice.
>
> This does not look particularly healthy...

Actually, this is a fix; you will see the same with systemd-210. crond starts a user session on each run, and these are the messages from the startup and shutdown of that session. The changes have been implemented with the following upstream patch:

0001-login-Don-t-stop-a-running-user-manager-from-garbage.patch

Without it, you might have stale components of an already closed session lying around.
But that's annoying; I have many cronjobs and the logs fill up with the stuff; at least, it seems to run stable here, too.
(In reply to Johannes Weberhofer from comment #51)
> But that's annoying; I have many cronjobs and the logs fill up with the
> stuff; at least, it seems to run stable here, too.

By default, systemd only logs to its in-memory journal, so this would not affect your on-disk logfiles. Writing the logs to disk is done by rsyslogd. You might want to configure a filter for the cronjob messages in the rsyslog configuration.
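A minimal sketch of such a filter, assuming rsyslog 7 or later with RainerScript support and the usual include of /etc/rsyslog.d (the file name and matched strings below are assumptions taken from the journal excerpt in comment #49; adjust them to the messages you actually see, and make sure the filter is evaluated before the rules that write /var/log/messages):

# /etc/rsyslog.d/ignore-session-spam.conf  (sketch; requires rsyslog >= 7)
# Discard systemd's per-cron-run session start/stop chatter before it reaches disk.
if $programname == 'systemd' and ($msg contains 'User Manager for' or $msg contains 'user-0.slice') then stop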
It would be nice if a fixed systemd would be submitted today:

- either revert to the last known good state
- or add the bugfix
(In reply to Marcus Meissner from comment #53)
> It would be nice if a fixed systemd would be submitted today.
>
> - either revert to the last known good state
> - or added bugfix

Ok, assuming sufficient evidence that the patch removal fixes the crash. Submit request 286938 created.
This is an autogenerated message for OBS integration: This bug (918226) was mentioned in https://build.opensuse.org/request/show/286938 13.1 / systemd
(In reply to Thomas Blume from comment #50)
> (In reply to Patrick Schaaf from comment #49)
> >
> > Feb 19 16:05:01 dev9 systemd[1]: Started User Manager for 0.
> > ... cronjobs run ...
> > Feb 19 16:05:10 dev9 systemd[1]: Stopping User Manager for 0...
> >
> > This does not look particularly healthy...
>
> Actually, this is a fix.
> You will see the same with systemd-210.

Ah, thanks for the explanation. There is a knob to selectively "revert" it per user:

loginctl enable-linger root   # or other usernames

which touches /var/lib/systemd/linger/root and makes that User Manager stay around. I'll just put that into one of our own system config packages...
*** Bug 918585 has been marked as a duplicate of this bug. ***
*** Bug 918507 has been marked as a duplicate of this bug. ***
openSUSE-RU-2015:0347-1: An update that has two recommended fixes can now be installed. Category: recommended (moderate) Bug References: 878853,918226 CVE References: Sources used: openSUSE 13.1 (src): systemd-208-32.1, systemd-mini-208-32.1, systemd-rpm-macros-2-32.1
Which version should be installed now so that there are no segfaults anymore? It's hard to follow.
I released Thomas' update yesterday night.
So far it seems to be running fine.
There's a report on the German ML that it breaks in different ways :-/
http://lists.opensuse.org/opensuse-de/2015-02/msg00401.html
(In reply to Christian Boltz from comment #63)
> There's a report on the german ML that it breaks in different ways :-/
> http://lists.opensuse.org/opensuse-de/2015-02/msg00401.html

The session processing messages are not bugs. This is in line with the behaviour of systemd-210; for an explanation see comment #50. The logs in this report show very frequent session creation (multiple new sessions per minute). In this case, I would recommend activating session lingering as described in comment #56.

The message:

systemd[26233]: Failed to open private bus connection: Failed to connect to socket /run/user/0/dbus/user_bus_socket: No such file or directory

might point to a dead user session. For further investigation, I would need a new bug report with verbose systemd logs from when the problem appears.
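For reference, one way to produce such verbose logs (a sketch; both knobs are standard systemd mechanisms, and the journalctl options may need adjusting to your setup):

# Raise the log level of the running systemd (PID 1) at runtime:
systemd-analyze set-log-level debug

# ...or, to capture an entire boot, add this to the kernel command line:
#   systemd.log_level=debug systemd.log_target=kmsg

# Then collect the messages for attaching to a bug report, e.g.:
journalctl -b -o short-monotonic > systemd-verbose.log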
Btw. I've learned my lesson from this. There will be no more backports from upstream to systemd-208. If you experience limitations with this version, please use systemd-210 (preferably on 13.2) instead.
The systemd logging is excessive IMHO; I get this about once a minute and can't seem to stop it. It's just useless information that's being logged. This is especially not good on an SSD. I tried filtering it in rsyslog.conf but that doesn't seem to stop it, and I tried the linger thing above and it also doesn't stop it. Why is the default like this? I can't think of any reason.

2015-02-25T01:29:01.976221-08:00 erb1 systemd[1]: Starting Session 11299 of user root.
2015-02-25T01:29:01.976977-08:00 erb1 systemd[1]: Started Session 11299 of user root.
2015-02-25T01:29:01.978171-08:00 erb1 systemd[1]: Starting Session 11301 of user xx.
2015-02-25T01:29:01.978636-08:00 erb1 systemd[1]: Started Session 11301 of user xx.
2015-02-25T01:29:01.979727-08:00 erb1 systemd[1]: Starting Session 11300 of user xx.
2015-02-25T01:29:01.980190-08:00 erb1 systemd[1]: Started Session 11300 of user xx.
(In reply to Eric Benton from comment #66)
> The systemd logging is excessive IMHO

100% ACK. That kind of logging is SPAM in my book, unless I explicitly enabled it because I wanted to have it.
(In reply to Eric Benton from comment #66)
> The systemd logging is excessive IMHO, i get this about once a minute and
> cant seem to stop it.
>
> 2015-02-25T01:29:01.976221-08:00 erb1 systemd[1]: Starting Session 11299 of user root.
> 2015-02-25T01:29:01.976977-08:00 erb1 systemd[1]: Started Session 11299 of user root.

This is nothing new, so a bit off-topic for this bug report. Anyway, you can get rid of it (along with any other info or debug messages from systemd) with a call to "systemd-analyze set-log-level notice".

Continuing off-topic :) that then leaves me with useless messages from cron pam_unix(crond:session) for the same events...
I would prefer to see ALL logging (system-wide) default to warning or higher, not info, notice, and debug. I spend a lot of time trying to stop useless logging like this. If I have a need for it, I can go enable that particular item and level, but in general it's really not needed to have so much logging of minutiae as a default setting. Sorry, I don't mean to hijack this; I'll not post more on it. I admit I am getting off topic, but...
Still OT, but: you can suppress the PAM spam by editing /etc/pam.d/common-session-pc.

Add the following _before_ the "session required pam_unix.so" line:

session [success=1 default=ignore] pam_succeed_if.so quiet use_uid service in crond user ingroup root

which means:
- when success then skip one line (the pam_unix one), otherwise ignore
- be quiet and use job UID, not authenticated UID
- success if "service is crond" and "user is in group root"

HTH
For a statement from an upstream developer about the logging, please refer to: https://bugzilla.redhat.com/show_bug.cgi?id=995792#c25
(In reply to Manfred Schwarb from comment #70)
> Still OT, but:
>
> you can suppress the PAM spam by editing /etc/pam.d/common-session-pc :
>
> add the following _before_ the "session required pam_unix.so" line:
> session [success=1 default=ignore] pam_succeed_if.so quiet use_uid service
> in crond user ingroup root
>
> which means:
> - when success then skip one line (the pam_unix one), otherwise ignore
> - be quiet and use job UID, not authenticated UID
> - success if "service is crond" and "user is in group root"
>
> HTH

I'm assuming that the stuff that goes into /etc/pam.d/common-session-pc is one line, not two, right?
> I'm assuming that the stuff that goes into /etc/pam.d/common-session-pc is one
> line, not two, right?

Yes. See also the man page pam_succeed_if(8) or http://www.linux-pam.org/Linux-PAM-html/Linux-PAM_SAG.html

You can of course also omit the second condition triplet if you want, and for testing purposes you can omit "quiet" so you get detailed information about the condition matching in your syslog.
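To make the placement concrete, an illustrative excerpt of how /etc/pam.d/common-session-pc might look with the line added (the surrounding pam_limits/pam_systemd lines are assumptions about a typical 13.1 setup; only the pam_succeed_if line and its position directly before pam_unix.so are taken from comment #70):

# excerpt of /etc/pam.d/common-session-pc -- illustrative only
session required        pam_limits.so
# one single line; on success it skips the next module (pam_unix) for root cron jobs
session [success=1 default=ignore] pam_succeed_if.so quiet use_uid service in crond user ingroup root
session required        pam_unix.so
session optional        pam_systemd.so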
Reverting the upstream commit fixed the systemd crash. The superfluous log messages have been addressed in bug 922536.

Closing.