Bug 918226 - systemd segfaults after updating from 208-23.3 to 208-28.1
Status: RESOLVED FIXED
Duplicates: 918231 918507 918585
Classification: openSUSE
Product: openSUSE 13.1
Component: Basesystem
Version: Final
Hardware: x86-64 openSUSE 13.1
Priority: P5 - None
Severity: Critical (31 votes)
Assigned To: systemd maintainers E-mail List
Reported: 2015-02-17 16:12 UTC by Peter Szaban
Modified: 2015-05-05 06:50 UTC
CC List: 41 users


Attachments
coredump of systemd process (344.85 KB, application/x-gzip), 2015-02-18 11:53 UTC, Roland Bernet
Coredump of systemd crash (1.91 MB, application/x-core), 2015-02-18 12:23 UTC, Ralf Zenklusen
And another Coredump (2.17 MB, application/x-core), 2015-02-18 12:30 UTC, Ralf Zenklusen
core of systemd (3.06 MB, application/x-core), 2015-02-18 12:41 UTC, Martin Schröder
core of systemd (2.88 MB, application/x-core), 2015-02-18 12:42 UTC, Martin Schröder
core dump of first systemd crash (3.41 MB, application/x-core), 2015-02-19 03:00 UTC, Joe Morris
core dump of systemd-208-31.1.x86_64 (190.12 KB, application/x-xz), 2015-02-19 08:54 UTC, Roland Bernet

Description Peter Szaban 2015-02-17 16:12:19 UTC
User-Agent:       Mozilla/5.0 (X11; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0
Build Identifier: 

On two different SuSE 13.1 x86_64 systems, systemd segfaults several hours after installing:
    systemd-208-28.1.x86_64.rpm
    systemd-32bit-208-28.1.x86_64.rpm
    systemd-rpm-macros-2-28.1.noarch.rpm
    systemd-sysvinit-208-28.1.x86_64.rpm
... /var/log/messages contains:
kernel: [680150.869695] systemd[1]: segfault at 137a7020 ip 000000000040e526 sp 00007fffd861c290 error 4 in systemd[400000+ed000]
systemd[1]: Caught <SEGV>, dumped core as pid 32253.
systemd[1]: Freezing execution

    Going back to v208-23.3 seems to solve the problem.

    The first symptom that made me notice this problem is that thousands of defunct processes accumulate on the system. 

    I tried running gdb against the coredump file, but I am not sure how useful this will be given the lack of symbol table information:

  # gdb /bin/systemd core
Reading symbols from /usr/lib/systemd/systemd...Missing separate debuginfo for /usr/lib/systemd/systemd
Try: zypper install -C "debuginfo(build-id)=14e9c2ba2f551a445f792d053a7f9dc593a60a2e"
(no debugging symbols found)...done.
[New LWP 32253]
Core was generated by `/usr/lib/systemd/systemd --system --deserialize 20'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f5644b2d8cb in ?? ()
(gdb) backtrace
#0  0x00007f5644b2d8cb in ?? ()
#1  0x000000000040cbcb in ?? ()
#2  0x00007f5644b2d9f0 in ?? ()
#3  0x0000000000000001 in ?? ()
#4  0x0000000000000000 in ?? ()

    If anyone can provide a link as to how to get symbol table information, I'll certainly try to get it.

    The suggested zypper command doesn't help without the missing repository:

# zypper install -C "debuginfo(build-id)=14e9c2ba2f551a445f792d053a7f9dc593a60a2e"
Loading repository data...
Reading installed packages...
No provider of 'debuginfo(build-id) = 14e9c2ba2f551a445f792d053a7f9dc593a60a2e' found.
Resolving package dependencies...

Nothing to do.
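The missing symbols can usually be pulled in by first adding the matching debug repository; a minimal sketch, assuming the standard openSUSE 13.1 debug repository layout (the repository URL and alias are assumptions and may need adjusting for your mirror):

# Assumed repository URL for openSUSE 13.1 debug packages; adjust if your mirror differs
zypper ar -f http://download.opensuse.org/debug/update/13.1/ repo-debug-update
zypper refresh repo-debug-update
# With the debug repo available, the build-id lookup from above should now resolve:
zypper install -C "debuginfo(build-id)=14e9c2ba2f551a445f792d053a7f9dc593a60a2e"
# Re-running gdb should then show function names and source lines:
gdb /usr/lib/systemd/systemd core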



Reproducible: Always

Steps to Reproduce:
0. HAPPENS EVERY TIME ON SOME COMPUTERS AFTER SEVERAL HOURS OR OVERNIGHT

1. Install systemd 208-28.1
2. wait several hours (overnight)
3. grep segfault /var/log/messages
Actual Results:  
- systemd segfault message in /var/log/messages

- ps -ef | grep defunct | wc -l
  8992

- cron jobs don't run: 
systemd-logind[10804]: Failed to start session scope session-7576.scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken. org.freedesktop.DBus.Error.NoReply

/usr/sbin/cron[32297]: pam_systemd(crond:session): Failed to create session: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2015-02-16T09:41:03.642884-05:00 iggy dbus[920]: [system] Reloaded configuration

/usr/sbin/cron[32296]: pam_systemd(crond:session): Failed to create session: Input/output error



Expected Results:  
I did not expect to see a segfault message in /var/log/messages, or thousands of defunct processes.

I have two computers exhibiting this problem:

 One system exhibiting this problem is an email server running kernel 3.11.10-25-default in multiuser text mode (not running X11).

   The other computer is my KDE desktop, running kernel 3.11.10-25-desktop with NVIDIA-Linux-x86_64-346.35.run.

Another system running in text mode does not exhibit the problem at all (at least not yet).

    Thank you for looking into this!!  Please let me know if I can be of further assistance.
Comment 1 Martin Schröder 2015-02-17 19:12:46 UTC
I also have this problem: The new systemd segfaults (I can provide cores).
Additionally there's no way to safely shutdown/reboot the system after this crash.
Comment 2 Martin Schröder 2015-02-17 19:15:50 UTC
See also Bug 918158 and Bug 918231
Comment 3 alexis Pellicier 2015-02-18 04:14:35 UTC
All my 13.1 machines are affected by this bug.

Here is the command to get back to the previous version.

zypper in --oldpackage systemd-32bit-208-23.3.x86_64 systemd-208-23.3.x86_64 systemd-sysvinit-208-23.3.x86_64
Comment 4 Ralf Zenklusen 2015-02-18 07:29:15 UTC
I can confirm, all our 13.1 servers (IBM x3650 and x3550) are affected. 
Going back to v208-23.3 solves the problem.
Comment 5 Thomas Blume 2015-02-18 07:30:03 UTC
You need the systemd debuginfo packages to see details of the backtrace.
From the core in bug 918231 I see this:

-->--
Core was generated by `/usr/lib/systemd/systemd --system --deserialize 20'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f38df5818cb in raise () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f38df5818cb in raise () from /lib64/libpthread.so.0
#1  0x000000000040cbcb in crash (sig=11) at src/core/main.c:143
#2  <signal handler called>
#3  0x000000000047912e in unit_unwatch_pid (u=u@entry=0xb951a0, pid=12842) at src/core/unit.c:1682
#4  0x000000000040e50f in manager_dispatch_sigchld (m=m@entry=0xae69d0) at src/core/manager.c:1392
#5  0x0000000000413a51 in manager_process_signal_fd (m=<optimized out>) at src/core/manager.c:1636
#6  process_event (ev=0x7fff72a0c4c0, m=0xae69d0) at src/core/manager.c:1661
#7  manager_loop (m=0xae69d0) at src/core/manager.c:1858
#8  0x000000000040ad44 in main (argc=<optimized out>, argv=0x7fff72a0cd28) at src/core/main.c:1652
--<--

Investigating...
Comment 6 Thomas Blume 2015-02-18 07:32:06 UTC
*** Bug 918231 has been marked as a duplicate of this bug. ***
Comment 7 Thomas Blume 2015-02-18 08:03:53 UTC
So, a SIGCHLD is received for PID 12842:

-->--
(gdb) bt full
#0  0x00007f38df5818cb in raise () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000000000040cbcb in crash (sig=11) at src/core/main.c:143
        rl = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        sa = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0 <repeats 16 times>}}, 
          sa_flags = 0, sa_restorer = 0x0}
        pid = 0
        __func__ = "crash"
        __PRETTY_FUNCTION__ = "crash"
#2  <signal handler called>
No symbol table info available.
#3  0x000000000047912e in unit_unwatch_pid (u=u@entry=0xb951a0, pid=12842) at src/core/unit.c:1682
        __PRETTY_FUNCTION__ = "unit_unwatch_pid"
#4  0x000000000040e50f in manager_dispatch_sigchld (m=m@entry=0xae69d0) at src/core/manager.c:1392
        si = {si_signo = 17, si_errno = 0, si_code = 1, _sifields = {_pad = {12842, 40, 1, 0 <repeats 25 times>}, _kill = {
              si_pid = 12842, si_uid = 40}, _timer = {si_tid = 12842, si_overrun = 40, si_sigval = {sival_int = 1, 
                sival_ptr = 0x1}}, _rt = {si_pid = 12842, si_uid = 40, si_sigval = {sival_int = 1, sival_ptr = 0x1}}, 
            _sigchld = {si_pid = 12842, si_uid = 40, si_status = 1, si_utime = 0, si_stime = 0}, _sigfault = {
              si_addr = 0x280000322a, si_addr_lsb = 1}, _sigpoll = {si_band = 171798704682, si_fd = 1}, _sigsys = {
              _call_addr = 0x280000322a, _syscall = 1, _arch = 0}}}
        u = 0xb951a0
        r = <optimized out>
        __PRETTY_FUNCTION__ = "manager_dispatch_sigchld"
        __func__ = "manager_dispatch_sigchld"
#5  0x0000000000413a51 in manager_process_signal_fd (m=<optimized out>) at src/core/manager.c:1636
        sfsi = {ssi_signo = 17, ssi_errno = 0, ssi_code = 1, ssi_pid = 12842, ssi_uid = 40, ssi_fd = 0, ssi_tid = 0, 
          ssi_band = 0, ssi_overrun = 0, ssi_trapno = 0, ssi_status = 1, ssi_int = 0, ssi_ptr = 0, ssi_utime = 0, 
          ssi_stime = 0, ssi_addr = 0, __pad = '\000' <repeats 47 times>}
        sigchld = true
--<--

which is a re-executed (--deserialize) systemd system instance:

(gdb) info proc 12842
exe = '/usr/lib/systemd/systemd --system --deserialize 20'
Comment 8 Thomas Blume 2015-02-18 10:04:56 UTC
The problem seems to be in unit_unwatch_pid, which calls:

1682	        hashmap_remove_value(u->manager->watch_pids, LONG_TO_PTR(pid), u);

u->manager is a null pointer:

-->--
(gdb) down
#3  0x000000000047912e in unit_unwatch_pid (u=u@entry=0xb951a0, pid=12842) at src/core/unit.c:1682
1682	        hashmap_remove_value(u->manager->watch_pids, LONG_TO_PTR(pid), u);
(gdb) print *u->manager
Cannot access memory at address 0x0
(gdb) print *u
$7 = {manager = 0x0, type = UNIT_SERVICE, load_state = UNIT_STUB, merged_into = 0xa0, 
  id = 0x40 <Address 0x40 out of bounds>, instance = 0xb95bd0 "\001\001", names = 0xb95120, dependencies = {0x0, 0x0, 0x0, 
    0x0, 0x0, 0x281, 0xb95ad0, 0xb955d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x771e800000000, 0x0, 0x0, 0x0, 0x0, 
    0x0}, requires_mounts_for = 0x0, description = 0xa0 <Address 0xa0 out of bounds>, documentation = 0x40, 
  fragment_path = 0xb95a10 "\001\001", source_path = 0xb95200 "\320Z\271", dropin_paths = 0x0, fragment_mtime = 0, 
  source_mtime = 0, dropin_mtime = 0, job = 0x0, nop_job = 0x161, job_timeout = 12146960, refs = 0xb95610, conditions = 0x0, 
  condition_timestamp = {realtime = 0, monotonic = 0}, inactive_exit_timestamp = {realtime = 0, monotonic = 0}, 
  active_enter_timestamp = {realtime = 0, monotonic = 0}, active_exit_timestamp = {realtime = 0, monotonic = 0}, 
  inactive_enter_timestamp = {realtime = 2095570378293248, monotonic = 0}, cgroup_path = 0x0, cgroup_mask = (unknown: 0), 
  slice = {unit = 0x0, refs_next = 0x0, refs_prev = 0x0}, units_by_type_next = 0xa0, units_by_type_prev = 0x40, 
  has_requires_mounts_for_next = 0xb95910, has_requires_mounts_for_prev = 0xb952e0, load_queue_next = 0x0, 
  load_queue_prev = 0x0, dbus_queue_next = 0x0, dbus_queue_prev = 0x0, cleanup_queue_next = 0x0, cleanup_queue_prev = 0x41, 
  gc_queue_next = 0xb954b0, gc_queue_prev = 0xb95650, cgroup_queue_next = 0x0, cgroup_queue_prev = 0x0, pids = 0x0, 
  gc_marker = 0, deserialized_job = 0, load_error = 288, unit_file_state = UNIT_FILE_ENABLED, stop_when_unneeded = 64, 
  default_dependencies = false, refuse_manual_start = false, refuse_manual_stop = false, allow_isolate = false, 
  on_failure_isolate = false, ignore_on_isolate = false, ignore_on_snapshot = false, condition_result = 16, transient = 86, 
  in_load_queue = true, in_dbus_queue = false, in_cleanup_queue = false, in_gc_queue = true, in_cgroup_queue = true, 
  sent_dbus_new_signal = true, no_gc = false, in_audit = true, cgroup_realized = false}
--<--
Comment 9 Thomas Blume 2015-02-18 10:30:44 UTC
(In reply to Martin Schröder from comment #1)
> I also have this problem: The new systemd segfaults (I can provide cores).
> Additionally there's no way to safely shutdown/reboot the system after this
> crash.

Please provide your cores to double check the root cause.
Comment 10 Martin Schröder 2015-02-18 10:47:45 UTC
(In reply to Thomas Blume from comment #9)
> Please provide your cores to double check the root cause.

I can't; see bug 918366. :-(
Comment 11 Rolf Eike Beer 2015-02-18 11:02:10 UTC
One core is attached to bug #918231
Comment 12 Roland Bernet 2015-02-18 11:53:29 UTC
Created attachment 623677 [details]
coredump of systemd process

systemd just core dumped on my openSUSE 13.1 machine.
I just attached the core file.
Comment 13 Thomas Blume 2015-02-18 12:11:19 UTC
Found another core in bug 878853.
This one goes further than the one I've analyzed in comment #8.
u->manager is correct:

-->--
(gdb) down
#5  0x000000000047913a in unit_unwatch_pid (u=u@entry=0x1f4f130, pid=16907) at src/core/unit.c:1682
1682	        hashmap_remove_value(u->manager->watch_pids, LONG_TO_PTR(pid), u);
(gdb) print u->manager
$23 = (Manager *) 0x7f1fa9a3fa68 <main_arena+1096>
(gdb) print u->manager->watch_pids
$24 = (Hashmap *) 0x1f60590
--<--

But it crashes a little later in a hashmap function.
Comment 14 Ralf Zenklusen 2015-02-18 12:23:50 UTC
Created attachment 623680 [details]
Coredump of systemd crash
Comment 15 Ralf Zenklusen 2015-02-18 12:30:48 UTC
Created attachment 623682 [details]
And another Coredump
Comment 16 Martin Schröder 2015-02-18 12:41:01 UTC
Created attachment 623683 [details]
core of systemd
Comment 17 Martin Schröder 2015-02-18 12:42:34 UTC
Created attachment 623684 [details]
core of systemd
Comment 18 Karl Thomas Schmidt 2015-02-18 13:45:51 UTC
I have had no segfaults on 13.1.
The machine runs as a desktop, but certainly for longer than 8 hours a day.

(Mo 16 Feb 2015 13:56:43 CET) systemd-rpm-macros
(So 15 Feb 2015 21:46:02 CET) kcm_systemd
(Mo 16 Feb 2015 13:56:42 CET) systemd-32bit
(Mo 16 Feb 2015 13:57:40 CET) systemd
(Di 01 Apr 2014 03:58:07 CEST) systemd-presets-branding-openSUSE
(Do 29 Mai 2014 16:52:37 CEST) systemd-ui
(Mo 16 Feb 2015 13:57:43 CET) systemd-sysvinit

2015-01-18 01:27:42|install|kcm_systemd|0.7.0-1.7|x86_64||KDE4.12extra|b34643b07b85b829ee27bd4310367eaa178ffefa890e8c4f58d02c0ad07067c9|
2015-02-15 21:46:02|install|kcm_systemd|0.7.0-1.8|x86_64||KDE4.12extra|6a696479d6955c916a0ae4ac14c51f02c773390f33135d15e8a8905e1fc09eb7|
2015-02-16 13:56:43|install|systemd-32bit|208-28.1|x86_64||repo-update|322f5c42d35d0229a3e596c0d6da21a473f1a53c7f5337361adb3830b4e5200b|
2015-02-16 13:56:43|install|systemd-rpm-macros|2-28.1|noarch||repo-update|170153237e58549dcb5f7934fa2e37cd7ea2a217a326c7a73dd225d82cb94293|
2015-02-16 13:57:42|install|systemd|208-28.1|x86_64||repo-update|25a12bc00ef463abca468bfb518372ffd9baec8fb49295596708c15ee09ee8ce|
2015-02-16 13:57:43|install|systemd-sysvinit|208-28.1|x86_64||repo-update|b0b85eee3dd9453dd39b45835f634f86dd4faf4a3e12948bcf085de76ad00357|
Comment 19 Sysadmin VBI 2015-02-18 13:50:49 UTC
We have about a dozen 13.1 systems; all experienced this issue a couple of hours after the new systemd was running.

Adding to the earlier comment, here's a temporary workaround we put in until a fixed version is released. Of course, this only works if you still have a functioning systemd:

zypper -n in --oldpackage systemd-32bit-208-23.3.x86_64 systemd-208-23.3.x86_64 systemd-sysvinit-208-23.3.x86_64 && zypper al systemd-32bit systemd systemd-sysvinit
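
Once a fixed update is released, the lock has to be lifted again before it can be installed; a minimal sketch, assuming zypper's removelock command:

# Remove the locks so the fixed systemd packages can be installed again
zypper rl systemd-32bit systemd systemd-sysvinit
zypper up systemd systemd-32bit systemd-sysvinit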
Comment 20 jason ferrer 2015-02-18 14:19:26 UTC
Hi folks,

Not sure it is related, but you can list the old versions of systemd in the update repository and choose one for the --oldpackage option of zypper.


lynx -dump -nonumbers -nolist http://download.opensuse.org/update/13.1/x86_64/ | awk '/systemd/{print $NF}'
Comment 21 Thomas Blume 2015-02-18 15:24:57 UTC
Unfortunately, I still cannot reproduce the issue myself.
It works perfectly in my VM.
However, all cores I've seen so far were related to a call to unit_unwatch_pid from manager.c.
I have now built some testpackages with a proposed fix for this call.
Please find them at:

http://download.opensuse.org/repositories/home:/tsaupe:/branches:/openSUSE:/13.1:/Update:/bsc918226/standard/

It would be good if you could quickly test them and report feedback.
Comment 23 michael dur 2015-02-18 17:06:05 UTC
I have your 208-31.1 packages installed (specifically systemd, systemd-sysvinit, and systemd-32bit) and everything seems fine so far.
Comment 24 Sysadmin VBI 2015-02-18 18:03:46 UTC
Got the proposed updates running on two systems as a test. One received a complete reboot after the update was applied, the second one didn't. Since it took at least two hours for the segfault to happen on either system, we may need to wait a little bit for verification.


Simple recipe

zypper ar -G http://download.opensuse.org/repositories/home:/tsaupe:/branches:/openSUSE:/13.1:/Update:/bsc918226/standard systemd-patch
zypper dup --repo systemd-patch

Note: beware that this will yank out rsyslog if you happen to use it.
Comment 25 Florian Piekert 2015-02-18 21:55:10 UTC
13.1, up_to_date updates & patches. Kernel 3.11. desktop.

Message from syslogd@bhaal at Feb 18 22:36:02 ...
 kernel:[ 6639.687047] systemd[1]: segfault at 1188a120 ip 000000000040e526 sp 00007fffb308bf00 error 4 in systemd[400000+ed000]
 kernel:[ 6639.687047] systemd[1]: segfault at 1188a120 ip 000000000040e526 sp 00007fffb308bf00 error 4 in systemd[400000+ed000]

Feb 18 22:36:02 bhaal kernel: [ 6639.687047] systemd[1]: segfault at 1188a120 ip 000000000040e526 sp 00007fffb308bf00 error 4 in systemd[400000+ed000]
Feb 18 22:36:03 bhaal systemd[1]: Caught <SEGV>, dumped core as pid 11629.
Feb 18 22:36:03 bhaal systemd[1]: Freezing execution.
Feb 18 22:36:01 bhaal systemd-logind[664]: message repeated 1130 times: [ Failed to store session release timer fd]
Feb 18 22:36:03 bhaal systemd-logind[664]: Failed to abandon scope session-1186.scope
Feb 18 22:36:03 bhaal systemd-logind[664]: Failed to abandon session scope: Message did not receive a reply (timeout by message bus)

bhaal:~ # systemctl daemon-reload
Failed to get D-Bus connection: Failed to connect to socket /run/systemd/private: Connection refused

bhaal:~ # ps auxw|grep dbus
message+   663  0.0  0.0  41964  2236 ?        Ss   20:45   0:03 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root     21043  0.0  0.0   9264   928 pts/2    S+   22:39   0:00 /usr/bin/grep --color=auto dbus
Comment 26 Rolf Eike Beer 2015-02-18 21:59:18 UTC
My server machine is monitored by nagios via SSH, so I get tons of new processes during a few hours. Maybe the mass of processes is part of the cause?
Comment 27 Florian Piekert 2015-02-18 22:33:09 UTC
I have had 13k+ zombified processes before the machine bailed out with "can't allocate memory" for even the simplest ls command.
Those processes had been started by e.g. cron (like logfile scanning for intrusions).
But the zombification started only AFTER the segv of systemd...
Comment 28 Tilman Sandig 2015-02-18 23:10:53 UTC
My productive servers die every day now - and must be restarted via hardware reset (due to the script-kiddies-ssh-attacks, causing 15k+ sshd-zombies in few hours). Perhaps it would be a good idea to distribute VERY SOON a patch that just resets the modfications of the last systemd-patch (2015-149?) to the prior state?
Comment 29 Sysadmin VBI 2015-02-18 23:58:35 UTC
(In reply to Sysadmin VBI from comment #24)
> Got the proposed updates running on two systems as a test. One received a
> complete reboot after the update was applied, the second one didn't. Since
> it took at least two hours for the segfault to happen on either system, we
> may need to wait a little bit for verification.

Two hours seems to be the maximum amount of time systemd survives before it segfaults on one of those hosts. This includes both not rebooting after applying the patched version, and rebooting.

Not sure if it helps, but the host in question is quite busy running nagios server, zabbix server and mysqld for zabbix. All other hosts that we've had issues with happen to run NRPE service.
Comment 30 Joe Morris 2015-02-19 02:57:02 UTC
Mine has also segfaulted and core dumped 2x in the last 2 days. First core dump attached, but likely useless since no debug info.

2015-02-19T01:09:32.081094+08:00 ntmph kernel: [61701.439204] systemd[1]: segfault at a8 ip 000000000047912e sp 00007fffe9eb2b20 error 4 in systemd[400000+ed000]
2015-02-19T01:09:32.238037+08:00 ntmph systemd[1]: Caught <SEGV>, dumped core as pid 9526.
2015-02-19T01:09:32.238530+08:00 ntmph systemd[1]: Freezing execution.

I have to reboot with reboot -f since it cannot connect to init. I have tried updating to the packages in http://download.opensuse.org/repositories/Base:/System/openSUSE_13.1/, which was mentioned in the bugzilla link in the patch announcement. I decided to try this rather than going back to the last version, which was stable on this machine. Hopefully the 210 version will be stable; if not, I will go back.
Comment 31 Joe Morris 2015-02-19 03:00:54 UTC
Created attachment 623781 [details]
core dump of first systemd crash
Comment 32 Ralf Zenklusen 2015-02-19 07:51:55 UTC
Yesterday we changed one server that crashed with 208-28.1 to your new 208-31.1. No problems so far.
We left one virtual server on 208-23.3; no problems so far.
We changed one server to 210-913.1 two days ago. No problems so far.

Well...
It seems that virtual servers are not affected.
It seems that 208-31.1 fixes the problem.
But both servers are not really productive, with low usage, and that makes a big difference.

But 210-913.1 seems to be fine. That server is under heavy load and crashed every 2-4 hours with 208-23.3.
This is from http://download.opensuse.org/repositories/Base:/System/openSUSE_13.1
Comment 33 Bruno Prémont 2015-02-19 08:17:30 UTC
Got it here too on a server (virtual, under VMware) which is executing a rather large number of nrpe checks (and thus sees a lot of batched forking).

The kernel log shows:
systemd[1]: segfault at 1010514 ip 000000000047912e sp 00007fff9a1c5670 error 4 in systemd[400000+ed000]


It started happening with the update from systemd-208-23.3.x86_64 to systemd-208-28.1.x86_64.

Looking up the IP via addr2line, using the debuginfo & debugsource packages, I get:
addr2line -e /usr/lib/systemd/systemd 0x47912e
/usr/src/debug/systemd-208/src/core/unit.c:1682

/usr/src/debug/systemd-208/src/core/unit.c
1677: 
1678: void unit_unwatch_pid(Unit *u, pid_t pid) {
1679:         assert(u);
1680:         assert(pid >= 1);
1681:
1682:         hashmap_remove_value(u->manager->watch_pids, LONG_TO_PTR(pid), u);
1683:         set_remove(u->pids, LONG_TO_PTR(pid));
1684: }
1685:

This seems to match the report from comment #8 with a NULL u->manager.


Once systemd has crashed, nrpe zombies start piling up until the kernel refuses to create more processes (clone() returns -1 with errno=EAGAIN) due to the rlimit on the per-user process count.

One possible reason why nrpe triggers this bug more than anything else is that it forks a few levels deep for each check and seems to have some of its children reparented to init. nrpe is running as a daemon and not as an xinetd service.
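
A quick way to spot this state on a host, assuming standard procps tools and the kernel segfault message quoted above:

# Count zombie processes; a steadily growing number indicates PID 1 is no longer reaping children
ps -eo stat= | grep -c '^Z'
# Check whether PID 1 has already crashed
grep 'systemd\[1\]: segfault' /var/log/messages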
Comment 34 Roland Bernet 2015-02-19 08:54:43 UTC
Created attachment 623802 [details]
core dump of systemd-208-31.1.x86_64

I also got a core dump from version systemd-208-31.1. The dump is attached.
It is, however, the only dump from several machines overnight. This version core dumps much less often.
Comment 35 Thomas Blume 2015-02-19 09:17:53 UTC
Thanks for the new dumps, still investigating. We seem to have invalid pointers at random places.
We might have a duplicate of:

https://bugs.freedesktop.org/show_bug.cgi?id=81327

here.
I will provide new testpackages asap.
Comment 36 Richard Hammerl 2015-02-19 10:23:55 UTC
I have also experienced problems with systemd-208-28.1.x86_64. 
Yesterday I had a server crash, but can't provide a coredump. Currently
the server is running again, but I see many worrying error messages in 
/var/log/messages:
systemd-logind[875]: Failed to store session release timer fd

I fear that this could result in a new crash. Should I temporarily downgrade
to systemd 208-23.3 or test the new systemd package?
Comment 37 Tim Ehlers 2015-02-19 10:40:05 UTC
Hello,

I can confirm that 208-31.1 from the test repo does *not* fix the problem.
Comment 38 Thomas Blume 2015-02-19 11:24:15 UTC
(In reply to Richard Hammerl from comment #36)
> 
> I fear that this could result in a new crash. Should I temporarily downgrade
> to systemd 208-23.3 or test the new systemd package?

Yes, the downgrade is the current workaround.
Comment 39 Mathias Homann 2015-02-19 12:06:02 UTC
(In reply to Tilman Sandig from comment #28)
> My productive servers die every day now - and must be restarted via hardware
> reset (due to the script-kiddies-ssh-attacks, causing 15k+ sshd-zombies in
> few hours). Perhaps it would be a good idea to distribute VERY SOON a patch
> that just resets the modfications of the last systemd-patch (2015-149?) to
> the prior state?

I second this motion.
Comment 40 Adam Spiers 2015-02-19 12:48:52 UTC
I've downgraded via the command in comment #3.  Do I also need to reboot?
Comment 41 Thomas Blume 2015-02-19 13:13:38 UTC
(In reply to Mathias Homann from comment #39)
> (In reply to Tilman Sandig from comment #28)
> > My productive servers die every day now - and must be restarted via hardware
> > reset (due to the script-kiddies-ssh-attacks, causing 15k+ sshd-zombies in
> > few hours). Perhaps it would be a good idea to distribute VERY SOON a patch
> > that just resets the modfications of the last systemd-patch (2015-149?) to
> > the prior state?
> 
> I second this motion.

Agreed, removing the latest patch.
This will give you the hanging ssh sessions at system shutdown back, but you should see no more systemd crashes.
Testpackages are available at:

http://download.opensuse.org/repositories/home:/tsaupe:/branches:/openSUSE:/13.1:/Update:/bsc918226/standard/

Please double check and confirm that the systemd crash is gone now.
Comment 42 Dr. Werner Fink 2015-02-19 13:32:07 UTC
(In reply to Adam Spiers from comment #40)

As root

       systemctl daemon-reexec

may help, but this is already done by the rpm postinstall scriptlet.
Comment 43 Patrick Schaaf 2015-02-19 13:37:06 UTC
The issue definitely triggers after a high number of things systemd had to run. In my case it was two (uncritical...) production servers that receive a nagios check connection once per minute against a systemd socket-activated script. They do not survive for more than a few hours (updated this morning, got the segv hang an hour ago).

I am running one of these boxes with the updated 208-32.1 packages and will report if I see the issue again.
Comment 44 Tilman Sandig 2015-02-19 14:23:50 UTC
(In reply to Thomas Blume from comment #41)
> (In reply to Mathias Homann from comment #39)
> > (In reply to Tilman Sandig from comment #28)
> > > My productive servers die every day now - and must be restarted via hardware
> > > reset (due to the script-kiddies-ssh-attacks, causing 15k+ sshd-zombies in
> > > few hours). Perhaps it would be a good idea to distribute VERY SOON a patch
> > > that just resets the modfications of the last systemd-patch (2015-149?) to
> > > the prior state?
> > 
> > I second this motion.
> 
> Agreed, removing the latest patch.
> This will give you the hanging ssh sessions at system shutdown back, but you
> should see no more systemd crashes.
> Testpackages are available at:
> 
> http://download.opensuse.org/repositories/home:/tsaupe:/branches:/openSUSE:/13.1:/Update:/bsc918226/standard/
> 
> Please double check and confirm that the systemd crash is gone now.

I appreciate that - but I think, this bug is very critical and should be reset asap. Testing is good, but in this case it needs many hours and results (absence of a crash after X hours) may be not reliable. 
Is it possible to make a binary diff of the built libs before the patch and after the removal of the patch to confirm the correctness of the removal and then release it immediately?
Comment 45 Thomas Blume 2015-02-19 14:52:48 UTC
(In reply to Tilman Sandig from comment #44)
> (In reply to Thomas Blume from comment #41)
> > (In reply to Mathias Homann from comment #39)
> > > (In reply to Tilman Sandig from comment #28)

> 
> I appreciate that - but I think, this bug is very critical and should be
> reset asap. Testing is good, but in this case it needs many hours and
> results (absence of a crash after X hours) may be not reliable. 
> Is it possible to make a binary diff of the built libs before the patch and
> after the removal of the patch to confirm the correctness of the removal and
> then release it immediately?

Sorry, but this time I want to make sure that the update is correct.
For an immediate fix, please downgrade to the previous systemd package, e.g.:

zypper in -f systemd=208-23.3
Comment 46 Marcus Meissner 2015-02-19 16:22:47 UTC
I also just removed the systemd update from the 13.1 update repository.
Comment 47 Sysadmin VBI 2015-02-19 21:33:19 UTC
Looks like systemd-208-32.1.x86_64 may be working. Our most frequent offender usually lasted around two hours before segfaulting on the previous two packages. We're now up to close to eight hours and still fully working.
Comment 48 Joe Morris 2015-02-20 04:32:02 UTC
I have been running systemd-210-913 from http://download.opensuse.org/repositories/Base:/System/openSUSE_13.1/ for over 24 hours here, and it appears to have solved the problem caused by the update. (Thanks Marcus for pulling that update. First patch in 15 years of running SUSE/openSUSE to have our server fail because of an update). So far, for me, 210-913 seems to be running well, maybe even better than 208-23. I am sticking with 210-913, and was sure glad to see things back to normal again.
Comment 49 Patrick Schaaf 2015-02-20 07:35:15 UTC
My test system, with several once-per-minute cronjobs and incoming socket activation, now has been stable for the last 16 hours, running the 208-32.1 test package.

However, looking at the journal, I just noticed something new with this update that wasn't there before. Each and every of the once-per-minute cron runs, in addition to the usual cron related logging, now puts the following into the logs:

Feb 19 16:05:01 dev9 systemd[1]: Starting user-0.slice.
Feb 19 16:05:01 dev9 systemd[1]: Created slice user-0.slice.
Feb 19 16:05:01 dev9 systemd[1]: Starting User Manager for 0...
....
Feb 19 16:05:01 dev9 systemd[436]: Starting Default.
Feb 19 16:05:01 dev9 systemd[436]: Reached target Default.
Feb 19 16:05:01 dev9 systemd[436]: Startup finished in 3ms.
Feb 19 16:05:01 dev9 systemd[1]: Started User Manager for 0.
... cronjobs run ...
Feb 19 16:05:10 dev9 systemd[1]: Stopping User Manager for 0...
Feb 19 16:05:10 dev9 systemd[436]: Stopping Default.
Feb 19 16:05:10 dev9 systemd[436]: Stopped target Default.
Feb 19 16:05:10 dev9 systemd[436]: Starting Shutdown.
Feb 19 16:05:10 dev9 systemd[436]: Reached target Shutdown.
Feb 19 16:05:10 dev9 systemd[436]: Starting Exit the Session...
Feb 19 16:05:10 dev9 systemd[1]: Stopped User Manager for 0.
Feb 19 16:05:10 dev9 systemd[1]: Stopping user-0.slice.
Feb 19 16:05:10 dev9 systemd[1]: Removed slice user-0.slice.

This does not look particularly healthy...
Comment 50 Thomas Blume 2015-02-20 08:52:13 UTC
(In reply to Patrick Schaaf from comment #49)
> 
> Feb 19 16:05:01 dev9 systemd[1]: Starting user-0.slice.
> Feb 19 16:05:01 dev9 systemd[1]: Created slice user-0.slice.
> Feb 19 16:05:01 dev9 systemd[1]: Starting User Manager for 0...
> ....
> Feb 19 16:05:01 dev9 systemd[436]: Starting Default.
> Feb 19 16:05:01 dev9 systemd[436]: Reached target Default.
> Feb 19 16:05:01 dev9 systemd[436]: Startup finished in 3ms.
> Feb 19 16:05:01 dev9 systemd[1]: Started User Manager for 0.
> ... cronjobs run ...
> Feb 19 16:05:10 dev9 systemd[1]: Stopping User Manager for 0...
> Feb 19 16:05:10 dev9 systemd[436]: Stopping Default.
> Feb 19 16:05:10 dev9 systemd[436]: Stopped target Default.
> Feb 19 16:05:10 dev9 systemd[436]: Starting Shutdown.
> Feb 19 16:05:10 dev9 systemd[436]: Reached target Shutdown.
> Feb 19 16:05:10 dev9 systemd[436]: Starting Exit the Session...
> Feb 19 16:05:10 dev9 systemd[1]: Stopped User Manager for 0.
> Feb 19 16:05:10 dev9 systemd[1]: Stopping user-0.slice.
> Feb 19 16:05:10 dev9 systemd[1]: Removed slice user-0.slice.
> 
> This does not look particularly healthy...

Actually, this is a fix.
You will see the same with systemd-210.
crond starts a user session on each run.
These are the messages from the startup and shutdown of this session.
The changes have been implemented with the following upstream patch:

0001-login-Don-t-stop-a-running-user-manager-from-garbage.patch

Without it, you might have stale components of an already closed session lying around.
Comment 51 Johannes Weberhofer 2015-02-20 08:55:08 UTC
But that's annoying; I have many cronjobs and the logs fill up with the stuff; at least, it seems to run stable here, too.
Comment 52 Thomas Blume 2015-02-20 09:55:07 UTC
(In reply to Johannes Weberhofer from comment #51)
> But that's annoying; I have many cronjobs and the logs fill up with the
> stuff; at least, it seems to run stable here, too.

By default, systemd only logs to its in-memory journal.
This would not affect your on-disk logfiles.
Writing the logs to disk is done by rsyslogd.
You might want to configure a filter for the cronjob messages in the rsyslog configuration.
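
A minimal sketch of such a filter, using legacy rsyslog property-based filters; the file name and message patterns are assumptions and should be adapted to the messages you actually see:

# /etc/rsyslog.d/drop-cron-session-noise.conf (assumed file name)
# Discard the per-cron-run slice/session messages before they reach /var/log/messages
:msg, contains, "Starting user-0.slice" ~
:msg, contains, "Removed slice user-0.slice" ~
:msg, contains, "User Manager for 0" ~
# Restart rsyslog afterwards:  systemctl restart rsyslog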
Comment 53 Marcus Meissner 2015-02-20 11:31:04 UTC
It would be nice if a fixed systemd would be submitted today.

- either revert to the last known good state
- or added bugfix
Comment 54 Thomas Blume 2015-02-20 11:59:36 UTC
(In reply to Marcus Meissner from comment #53)
> It would be nice if a fixed systemd would be submitted today.
> 
> - either revert to the last known good state
> - or added bugfix

OK, assuming there is sufficient evidence that the patch removal fixes the crash.
Submit request 286938 created.
Comment 55 Bernhard Wiedemann 2015-02-20 12:00:12 UTC
This is an autogenerated message for OBS integration:
This bug (918226) was mentioned in
https://build.opensuse.org/request/show/286938 13.1 / systemd
Comment 56 Patrick Schaaf 2015-02-20 12:19:27 UTC
(In reply to Thomas Blume from comment #50)
> (In reply to Patrick Schaaf from comment #49)
> > 
> > Feb 19 16:05:01 dev9 systemd[1]: Started User Manager for 0.
> > ... cronjobs run ...
> > Feb 19 16:05:10 dev9 systemd[1]: Stopping User Manager for 0...
>
> > This does not look particularly healthy...
> 
> Actually, this is a fix.
> You will see the same with systemd-210.

Ah thanks for the explanation.

There is a knob to selectively "revert" it per user:

loginctl enable-linger root # or other usernames

This touches /var/lib/systemd/linger/root and makes that User Manager stay around. I'll just put that into one of our own system config packages...
Comment 57 Thomas Blume 2015-02-20 15:09:36 UTC
*** Bug 918585 has been marked as a duplicate of this bug. ***
Comment 58 Bernhard Wiedemann 2015-02-21 08:05:42 UTC
*** Bug 918507 has been marked as a duplicate of this bug. ***
Comment 59 Swamp Workflow Management 2015-02-22 19:05:16 UTC
openSUSE-RU-2015:0347-1: An update that has two recommended fixes can now be installed.

Category: recommended (moderate)
Bug References: 878853,918226
CVE References: 
Sources used:
openSUSE 13.1 (src):    systemd-208-32.1, systemd-mini-208-32.1, systemd-rpm-macros-2-32.1
Comment 60 Markus Kolb 2015-02-23 07:57:41 UTC
Which version should be installed now that there are no segfaults anymore?
It's hard to follow.
Comment 61 Marcus Meissner 2015-02-23 08:12:34 UTC
I released Thomas' update yesterday night.
Comment 62 Michal Svec 2015-02-23 12:20:23 UTC
So far it seems to be running fine.
Comment 63 Christian Boltz 2015-02-24 22:14:32 UTC
There's a report on the german ML that it breaks in different ways :-/
http://lists.opensuse.org/opensuse-de/2015-02/msg00401.html
Comment 64 Thomas Blume 2015-02-25 08:42:26 UTC
(In reply to Christian Boltz from comment #63)
> There's a report on the german ML that it breaks in different ways :-/
> http://lists.opensuse.org/opensuse-de/2015-02/msg00401.html

The session processing messages are not bugs.
This is in line with the behaviour of systemd-210.
For an explanation see comment #50.

The logs in this report show very frequent session creation (multiple new sessions per minute).
In this case, I would recommend to activate session lingering as described in comment #56.

The message:

systemd[26233]: Failed to open private bus connection: Failed to connect
to socket /run/user/0/dbus/user_bus_socket: No such file or directory

might point to a dead user session.
For further investigation, I would need a new bug report with verbose systemd logs when the problem appears.
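
One way to capture such verbose logs, assuming standard systemd knobs (check that they apply to your installed version):

# Persistent: append to the kernel command line (e.g. via /etc/default/grub) and reboot
#   systemd.log_level=debug systemd.log_target=journal
# At runtime, the log level of PID 1 can also be raised (see comment #68):
systemd-analyze set-log-level debug
# Collect the logs once the problem appears:
journalctl -b > systemd-verbose.log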
Comment 65 Thomas Blume 2015-02-25 09:28:43 UTC
Btw. I've learned my lesson from this.
There will be no more backports from upstream to systemd-208.
If you experience limitations with this version, please use systemd-210 (preferably on 13.2) instead.
Comment 66 Eric Benton 2015-02-25 09:34:18 UTC
The systemd logging is excessive IMHO; I get this about once a minute and can't seem to stop it.
It's just useless information that's being logged. This is especially not good on an SSD.
I tried filtering it in rsyslog.conf but that doesn't seem to stop it, and I tried the linger thing above and it also doesn't stop it.
Why is the default like this? I can't think of any reason.

015-02-25T01:29:01.976221-08:00 erb1 systemd[1]: Starting Session 11299 of user root.
2015-02-25T01:29:01.976977-08:00 erb1 systemd[1]: Started Session 11299 of user root.
2015-02-25T01:29:01.978171-08:00 erb1 systemd[1]: Starting Session 11301 of user xx.
2015-02-25T01:29:01.978636-08:00 erb1 systemd[1]: Started Session 11301 of user xx.
2015-02-25T01:29:01.979727-08:00 erb1 systemd[1]: Starting Session 11300 of user xx.
2015-02-25T01:29:01.980190-08:00 erb1 systemd[1]: Started Session 11300 of user xx.
Comment 67 Mathias Homann 2015-02-25 09:37:05 UTC
(In reply to Eric Benton from comment #66)
> The systemd logging is excessive IMHO

100% ACK.

That kind of logging is SPAM in my book, unless I explicitly enabled it because I wanted to have it.
Comment 68 Patrick Schaaf 2015-02-25 09:50:50 UTC
(In reply to Eric Benton from comment #66)
> The systemd logging is excessive IMHO; I get this about once a minute and
> can't seem to stop it.
> 
> 015-02-25T01:29:01.976221-08:00 erb1 systemd[1]: Starting Session 11299 of
> user root.
> 2015-02-25T01:29:01.976977-08:00 erb1 systemd[1]: Started Session 11299 of
> user root.

This is nothing new, so a bit off-topic for this bug report.

Anyway, you can get rid of it (along with any other info or debug messages from systemd) with a call to "systemd-analyze set-log-level notice".

Continuing off-topic :) that then leaves me with useless messages from cron pam_unix(crond:session) for the same events...
Comment 69 Eric Benton 2015-02-25 09:57:00 UTC
I would prefer to see ALL logging (system-wide) default to warning or higher, not info, notice, information and debug.
I spend a lot of time trying to stop useless logging like this.
If I have a need for it I can go enable that particular item and level, but in general it's really not needed to have so much logging of minutiae "as a default setting".
Sorry, I don't mean to hijack this; I'll not post more on it. I admit I am getting off topic but....
Comment 70 Manfred Schwarb 2015-02-25 14:30:10 UTC
Still OT, but:

you can suppress the PAM spam by editing /etc/pam.d/common-session-pc :

add the following _before_ the "session required pam_unix.so" line:
  session [success=1 default=ignore] pam_succeed_if.so quiet use_uid service in crond user ingroup root

which means:
- when success then skip one line (the pam_unix one), otherwise ignore
- be quiet and use job UID, not authenticated UID
- success if "service is crond" and "user is in group root"

HTH
Comment 71 Thomas Blume 2015-02-25 15:28:41 UTC
For a statement from an upstream developer about the logging, please refer to:

https://bugzilla.redhat.com/show_bug.cgi?id=995792#c25
Comment 72 Mathias Homann 2015-03-02 14:24:54 UTC
(In reply to Manfred Schwarb from comment #70)
> Still OT, but:
> 
> you can suppress the PAM spam by editing /etc/pam.d/common-session-pc :
> 
> add the following _before_ the "session required pam_unix.so" line:
>   session [success=1 default=ignore] pam_succeed_if.so quiet use_uid service
> in crond user ingroup root
> 
> which means:
> - when success then skip one line (the pam_unix one), otherwise ignore
> - be quiet and use job UID, not authenticated UID
> - success if "service is crond" and "user is in group root"
> 
> HTH

I'm assuming that the stuff that goes into /etc/pam.d/common-session-pc is one line, not two, right?
Comment 73 Manfred Schwarb 2015-03-02 16:31:27 UTC
> I'm assuming that the stuff that goes into /etc/pam.d/common-session-pc is one
> line, not two, right?

Yes.
See also the man page pam_succeed_if(8) or 
http://www.linux-pam.org/Linux-PAM-html/Linux-PAM_SAG.html

You can of course also omit the second condition-triplet if you want,
and for testing purposes, you can omit "quiet" so you have detailed
information about the condition matching in your syslog.
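
For illustration, the relevant part of /etc/pam.d/common-session-pc would then look roughly like this; the pam_unix.so options shown are an assumption and may differ on your system:

# excerpt of /etc/pam.d/common-session-pc (suppression line added as a single line, before pam_unix)
session [success=1 default=ignore] pam_succeed_if.so quiet use_uid service in crond user ingroup root
session required       pam_unix.so    try_first_pass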
Comment 74 Thomas Blume 2015-05-05 06:50:33 UTC
Reverting the upstream commit fixed the systemd crash.
The superfluous log messages have been addressed in bug 922536.

Closing.