Bug 1087925 - sddm-helper not working with NIS enabled
sddm-helper not working with NIS enabled
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Basesystem
Current
Other Other
: P5 - None : Major (vote)
: ---
Assigned To: Thomas Blume
E-mail List
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2018-04-03 15:50 UTC by Giacomo Comes
Modified: 2018-07-30 09:10 UTC (History)
7 users (show)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
requested Xorg log (4.54 KB, application/x-xz)
2018-04-04 17:26 UTC, Giacomo Comes
Details
new log from tumbleweed running a real PC (6.19 KB, application/x-xz)
2018-04-04 19:51 UTC, Giacomo Comes
Details
requested strace (229.46 KB, application/x-xz)
2018-04-05 16:48 UTC, Giacomo Comes
Details
sddm+getent trace=socket (7.89 KB, application/x-xz)
2018-04-06 14:11 UTC, Giacomo Comes
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Giacomo Comes 2018-04-03 15:50:17 UTC
If NIS is enabled SDDM shows a black screen.
This is a recent bug. About two weeks ago there was no such issue.

In order to reproduce it do a fresh tumbleweed installation.
Enable NIS with:
  yast nis enable server="server-ip" domain="nis-domain"
restart the display manager:
  systemctl restart xdm

The login screen is black now. This looks like a bad interaction between sddm-helper and NIS.
 systemctl status xdm shows:

● display-manager.service - X Display Manager
   Loaded: loaded (/usr/lib/systemd/system/display-manager.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2018-04-03 11:30:48 AST; 5s ago
  Process: 2981 ExecStop=/usr/lib/X11/display-manager stop (code=exited, status=0/SUCCESS)
  Process: 2995 ExecStart=/usr/lib/X11/display-manager start (code=exited, status=0/SUCCESS)
 Main PID: 3015 (sddm)
    Tasks: 20 (limit: 4915)
   CGroup: /system.slice/display-manager.service
           ├─3015 /usr/bin/sddm
           └─3017 /usr/bin/X -nolisten tcp -auth /run/sddm/{1466d32e-d6a1-4295-a3c0-809dcc8aed5c} -background none -nore

Apr 03 11:30:48 virtual1 sddm[3015]: Loading theme configuration from "/usr/share/sddm/themes/breeze-openSUSE/theme.conf
Apr 03 11:30:48 virtual1 sddm[3015]: Greeter starting...
Apr 03 11:30:48 virtual1 sddm[3015]: Adding cookie to "/run/sddm/{1466d32e-d6a1-4295-a3c0-809dcc8aed5c}"
Apr 03 11:30:48 virtual1 sddm-helper[3063]: [PAM] Starting...
Apr 03 11:30:48 virtual1 sddm-helper[3063]: [PAM] Authenticating...
Apr 03 11:30:48 virtual1 sddm-helper[3063]: [PAM] returning.
Apr 03 11:30:48 virtual1 sddm-helper[3063]: pam_unix(sddm-greeter:session): session opened for user sddm by (uid=0)
Apr 03 11:30:48 virtual1 sddm[3015]: Greeter session started successfully
Apr 03 11:30:51 virtual1 sddm[3015]: Auth: sddm-helper exited with 6
Apr 03 11:30:51 virtual1 sddm[3015]: Greeter stopped.

If I stop ypbind and restart the display manager, the login screen works again.
Both lightdm and kdm do not have this issue.
Comment 1 Fabian Vogt 2018-04-04 07:59:13 UTC
Exit status 6 means that sddm-helper aborted - can you try to get a backtrace and debug log with QT_LOGGING_RULES=*.debug=true?

Can you also try to rcxdm restart after the black screen? It might be a race condition on startup.
Comment 2 Giacomo Comes 2018-04-04 15:47:09 UTC
(In reply to Fabian Vogt from comment #1)
> Exit status 6 means that sddm-helper aborted - can you try to get a
> backtrace and debug log with QT_LOGGING_RULES=*.debug=true?

Can you tell me how to do that?
I added the line:
Environment=QT_LOGGING_RULES=*.debug=true
to /usr/lib/systemd/system/display-manager.service, but the information in the journal did not change:

Apr 04 11:33:20 virtual1 sddm-helper[3553]: pam_unix(sddm-greeter:session): session opened for user sddm by (uid=0)
Apr 04 11:33:20 virtual1 sddm[3495]: Greeter session started successfully
Apr 04 11:33:20 virtual1 sddm-greeter[3561]: High-DPI autoscaling not Enabled
Apr 04 11:33:20 virtual1 sddm-greeter[3561]: Reading from "/usr/share/xsessions/gnome-xorg.desktop"
Apr 04 11:33:20 virtual1 sddm-greeter[3561]: Reading from "/usr/share/xsessions/gnome.desktop"
Apr 04 11:33:20 virtual1 sddm-greeter[3561]: Reading from "/usr/share/xsessions/icewm-session.desktop"
Apr 04 11:33:20 virtual1 sddm-greeter[3561]: Reading from "/usr/share/xsessions/icewm.desktop"
Apr 04 11:33:20 virtual1 sddm-greeter[3561]: Reading from "/usr/share/xsessions/kodi.desktop"
Apr 04 11:33:20 virtual1 sddm-greeter[3561]: Reading from "/usr/share/xsessions/lxde.desktop"
Apr 04 11:33:20 virtual1 sddm-greeter[3561]: Reading from "/usr/share/xsessions/lxqt.desktop"
Apr 04 11:33:20 virtual1 sddm-greeter[3561]: Reading from "/usr/share/xsessions/mwm.desktop"
Apr 04 11:33:20 virtual1 sddm-greeter[3561]: Reading from "/usr/share/xsessions/openbox.desktop"
Apr 04 11:33:20 virtual1 sddm-greeter[3561]: Reading from "/usr/share/xsessions/plasma5.desktop"
Apr 04 11:33:20 virtual1 sddm-greeter[3561]: Reading from "/usr/share/xsessions/twm.desktop"
Apr 04 11:33:20 virtual1 sddm-greeter[3561]: Reading from "/usr/share/xsessions/xfce.desktop"
Apr 04 11:33:20 virtual1 sddm-greeter[3561]: Reading from "/usr/share/wayland-sessions/plasmawayland.desktop"
Apr 04 11:33:27 virtual1 sddm-greeter[3561]: Loading theme configuration from "/usr/share/sddm/themes/breeze-openSUSE/theme.conf"
Apr 04 11:33:27 virtual1 systemd[3555]: Started D-Bus User Message Bus.
Apr 04 11:33:27 virtual1 sddm-greeter[3561]: Socket error:  "QLocalSocket::connectToServer: The socket operation is not supported"
Apr 04 11:33:27 virtual1 sddm-greeter[3561]: Cannot connect to the daemon - is it running?
Apr 04 11:33:27 virtual1 sddm-greeter[3561]: Could not initialize GLX
Apr 04 11:33:27 virtual1 kernel: Core dump to |/usr/lib/systemd/systemd-coredump 3561 173 455 6 1522856007 18446744073709551615 sddm-greeter pipe failed
Apr 04 11:33:27 virtual1 sddm-helper[3553]: pam_unix(sddm-greeter:session): session closed for user sddm
Apr 04 11:33:27 virtual1 sddm-helper[3553]: [PAM] Closing session
Apr 04 11:33:27 virtual1 sddm-helper[3553]: [PAM] Ended.
Apr 04 11:33:27 virtual1 sddm[3495]: Auth: sddm-helper exited with 6
Apr 04 11:33:27 virtual1 sddm[3495]: Greeter stopped.


> Can you also try to rcxdm restart after the black screen? It might be a race
> condition on startup.

I have restarted many many times sddm after I found the bug. The login screen never came back (except when I disabled ypbind). It doesn't look like a race condition.
Comment 3 Fabian Vogt 2018-04-04 17:12:24 UTC
> Could not initialize GLX

Looks like an X related issue.
Please upload /var/log/Xorg.0.log.
Comment 4 Giacomo Comes 2018-04-04 17:26:03 UTC
Created attachment 765994 [details]
requested Xorg log

Xorg looks fine to me.
Comment 5 Fabian Vogt 2018-04-04 17:36:32 UTC
> [    14.840] (II) glamor: OpenGL accelerated X.org driver based.
> [    14.982] (II) glamor: EGL version 1.4 (DRI2):
> [    14.982] EGL_MESA_drm_image required.
> [    14.988] (EE) modeset(0): glamor initialization failed
[...]
> [    14.994] (II) AIGLX: Screen 0 is not DRI2 capable
> [    14.994] (EE) AIGLX: reverting to software rendering
> [    15.021] (II) IGLX: enabled GLX_MESA_copy_sub_buffer
> [    15.023] (II) IGLX: Loaded and initialized swrast
> [    15.023] (II) GLX: Initialized DRISWRAST GL provider for screen 0

@msrb: Any idea how NIS could cause this?
Comment 6 Giacomo Comes 2018-04-04 19:51:27 UTC
Created attachment 766011 [details]
new log from tumbleweed running a real PC

> [    14.988] (EE) modeset(0): glamor initialization failed
[...]
> [    14.994] (II) AIGLX: Screen 0 is not DRI2 capable
> [    14.994] (EE) AIGLX: reverting to software rendering

These two error messages are unrelated to the sddm problem. They are due to the fact that I was running the X server in a virtual machine which of course is not capable of DRI.

Attached are the Xorg log file and the sddm journal entry of a tumbleweed installation running on a real PC.

X does not have errors.

The differences for sddm are:

                           ## no NIS, sddm working ##
Apr 04 19:40:07 virtual1 sddm-greeter[1949]: Loading theme configuration from "/usr/share/sddm/themes/breeze-openSUSE/theme.conf"
Apr 04 19:40:08 virtual1 sddm-greeter[1949]: Connected to the daemon.
Apr 04 19:40:08 virtual1 sddm[1829]: Message received from greeter: Connect

                          ## NIS enabled, sddm not working ##
Apr 04 19:46:23 virtual1 sddm-greeter[3214]: Loading theme configuration from "/usr/share/sddm/themes/breeze-openSUSE/theme.conf"
Apr 04 19:46:23 virtual1 sddm-greeter[3214]: Socket error:  "QLocalSocket::connectToServer: The socket operation is not supported"
Apr 04 19:46:23 virtual1 sddm-greeter[3214]: Cannot connect to the daemon - is it running?
Apr 04 19:46:23 virtual1 sddm-greeter[3214]: Could not initialize GLX
Apr 04 19:46:23 virtual1 kernel: Core dump to |/usr/lib/systemd/systemd-coredump 3214 187 467 6 1522885583 18446744073709551615 sddm-greeter pipe failed
Comment 7 Fabian Vogt 2018-04-04 20:37:46 UTC
> Attached are the Xorg log file and the sddm journal entry of a tumbleweed installation running on a real PC.
> X does not have errors.

Yup, GLX still fails.

> Apr 04 19:46:23 virtual1 sddm-greeter[3214]: Socket error:  "QLocalSocket::connectToServer: The socket operation is not supported"
> Apr 04 19:46:23 virtual1 sddm-greeter[3214]: Cannot connect to the daemon - is it running?
> Apr 04 19:46:23 virtual1 sddm-greeter[3214]: Could not initialize GLX
> Apr 04 19:46:23 virtual1 kernel: Core dump to |/usr/lib/systemd/systemd-coredump 3214 187 467 6 1522885583 18446744073709551615 sddm-greeter pipe failed

This means sddm-greeter aborted because it could not initialize GLX. If GLX worked, it wouldn't crash.

The more helpful part of this error is the "The socket operation is not supported" message.
I assume that both errors share the same root cause, but with the current information I'm not sure where to look.
So it would still be helpful to know why GLX stops working.

I'll try to reproduce the issue tomorrow.
Comment 8 Michal Srb 2018-04-05 07:47:02 UTC
(In reply to Fabian Vogt from comment #7)
> > Apr 04 19:46:23 virtual1 sddm-greeter[3214]: Socket error:  "QLocalSocket::connectToServer: The socket operation is not supported"
> > Apr 04 19:46:23 virtual1 sddm-greeter[3214]: Cannot connect to the daemon - is it running?
> > Apr 04 19:46:23 virtual1 sddm-greeter[3214]: Could not initialize GLX
> > Apr 04 19:46:23 virtual1 kernel: Core dump to |/usr/lib/systemd/systemd-coredump 3214 187 467 6 1522885583 18446744073709551615 sddm-greeter pipe failed
> 
> This means sddm-greeter aborted because it could not initialize GLX. If GLX
> worked, it wouldn't crash.

I don't think the GLX initialization fail is the culprit here. It would be strange that GLX would work without NIS and stopped with it.

When sddm-greeter prints the message "Cannot connect to the daemon - is it running?" it calls `QCoreApplication::exit`. Unlike the `exit` function from stdlib, this method does return. It only sets some flag that the application's event loop should not run.

So even that sddm-greeter is trying to exit, it still proceeds with initialization and also calls `QGuiApplication::exec()`. Then Qt tries to initialize GLX, but it fails because it did not receive any usable configs. My guess is that it did not receive them because it was in the exiting state.

So I think the GLX message is only falloff from sddm-greeter exiting. The main cause is that it failed to talk to sddm. Maybe sddm is busy/frozen/broken because of the NIS?
Comment 9 Michal Srb 2018-04-05 08:11:24 UTC
(In reply to Giacomo Comes from comment #2)
> Apr 04 11:33:27 virtual1 sddm-greeter[3561]: Socket error: 
> "QLocalSocket::connectToServer: The socket operation is not supported"

If I am reading the Qt code correctly, in this situation it can happen only if a call to `socket(PF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0)` returned -1. This could happen for various reasons, for example exhausted system resources. (Maybe the NIS client is creating huge amount of sockets?) It would be good to know what was the errno after this call failed.

Giacomo, could you run sddm in strace? For example after ssh-ing in as root from another machine and running:
# systemctl stop display-manager
# strace -f -o /tmp/sddm.strace sddm

The please attach the sddm.strace here.
Comment 10 Fabian Vogt 2018-04-05 09:20:52 UTC
(In reply to Michal Srb from comment #8)
> (In reply to Fabian Vogt from comment #7)
> > > Apr 04 19:46:23 virtual1 sddm-greeter[3214]: Socket error:  "QLocalSocket::connectToServer: The socket operation is not supported"
> > > Apr 04 19:46:23 virtual1 sddm-greeter[3214]: Cannot connect to the daemon - is it running?
> > > Apr 04 19:46:23 virtual1 sddm-greeter[3214]: Could not initialize GLX
> > > Apr 04 19:46:23 virtual1 kernel: Core dump to |/usr/lib/systemd/systemd-coredump 3214 187 467 6 1522885583 18446744073709551615 sddm-greeter pipe failed
> > 
> > This means sddm-greeter aborted because it could not initialize GLX. If GLX
> > worked, it wouldn't crash.
> 
> I don't think the GLX initialization fail is the culprit here. It would be
> strange that GLX would work without NIS and stopped with it.

It's also strange that sddm-greeter is unable to create a socket...

> When sddm-greeter prints the message "Cannot connect to the daemon - is it
> running?" it calls `QCoreApplication::exit`. Unlike the `exit` function from
> stdlib, this method does return. It only sets some flag that the
> application's event loop should not run.
> 
> So even that sddm-greeter is trying to exit, it still proceeds with
> initialization and also calls `QGuiApplication::exec()`. Then Qt tries to
> initialize GLX, but it fails because it did not receive any usable configs.
> My guess is that it did not receive them because it was in the exiting state.

Yes, this is a bug, but GLX initialization should still succeed after that. I verified that on several systems.

(In reply to Michal Srb from comment #9)
> (In reply to Giacomo Comes from comment #2)
> > Apr 04 11:33:27 virtual1 sddm-greeter[3561]: Socket error: 
> > "QLocalSocket::connectToServer: The socket operation is not supported"
> 
> If I am reading the Qt code correctly, in this situation it can happen only
> if a call to `socket(PF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0)`
> returned -1. This could happen for various reasons, for example exhausted
> system resources. (Maybe the NIS client is creating huge amount of sockets?)
> It would be good to know what was the errno after this call failed.

It's probably the same reason why GLX failed - maybe too many open fds?

> Giacomo, could you run sddm in strace? For example after ssh-ing in as root
> from another machine and running:
> # systemctl stop display-manager
> # strace -f -o /tmp/sddm.strace sddm
> 
> The please attach the sddm.strace here.

Yes, that would be very useful.
Also please provide the output of ps -ef | grep sddm during the black screen.
Comment 11 Fabian Vogt 2018-04-05 09:24:24 UTC
I just tried to reproduce the issue here - and failed.

It works fine here on both Leap 15.0 and Tumbleweed, with a black screen shown for just a short delay while sddm loads the list of all available user accounts.
No errors or warnings appear in the log files.
Comment 12 Giacomo Comes 2018-04-05 16:48:46 UTC
Created attachment 766161 [details]
requested strace

output of ps -ef | grep sddm:

root      9821  2758  2 12:36 pts/0    00:00:06 strace -f -o /tmp/sddm.strace sddm
root      9823  9821  0 12:36 pts/0    00:00:00 sddm
root      9825  9823  0 12:36 tty7     00:00:00 /usr/bin/X -nolisten tcp -auth /run/sddm/{dfe041fe-1cae-474c-9779-b7dca436464a} -background none -noreset -displayfd 17 -seat seat0 vt7
root      9985  9717  0 12:42 pts/1    00:00:00 grep --color=auto sddm

This and the strace is done in a virtual machine. If needed I can also do it on a real PC.
Comment 13 Fabian Vogt 2018-04-05 17:15:47 UTC
(In reply to Giacomo Comes from comment #12)
> Created attachment 766161 [details]
> requested strace
> 
> output of ps -ef | grep sddm:
> 
> root      9821  2758  2 12:36 pts/0    00:00:06 strace -f -o
> /tmp/sddm.strace sddm
> root      9823  9821  0 12:36 pts/0    00:00:00 sddm
> root      9825  9823  0 12:36 tty7     00:00:00 /usr/bin/X -nolisten tcp
> -auth /run/sddm/{dfe041fe-1cae-474c-9779-b7dca436464a} -background none
> -noreset -displayfd 17 -seat seat0 vt7
> root      9985  9717  0 12:42 pts/1    00:00:00 grep --color=auto sddm
> 
> This and the strace is done in a virtual machine. If needed I can also do it
> on a real PC.

The socket connection worked this time, but the GLX error remains, it's due to:

openat(AT_FDCWD, "/lib64/libGLX_mesa.so.0", O_RDONLY|O_CLOEXEC) = -1 EMFILE (Too many open files)

You have pam_limits configured, which sets the limit to 1024. As that is per-process, it should in theory be enough, but it apparently isn't.
So try to increase the nofile limit in /etc/security/limits.conf to 2048.
Comment 14 Giacomo Comes 2018-04-05 17:58:26 UTC
Yes, that was the issue. Rising the soft nofile limit to 2048 made sddm work again. (the threshold was at about 1310).

The question remains about why it is necessary (at least for me) to increase such limit.
Comment 15 Fabian Vogt 2018-04-05 18:29:22 UTC
(In reply to Giacomo Comes from comment #14)
> Yes, that was the issue. Rising the soft nofile limit to 2048 made sddm work
> again. (the threshold was at about 1310).
> 
> The question remains about why it is necessary (at least for me) to increase
> such limit.

It appears to me like the NIS user enumeration (or some other part) leaks sockets:

socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) = 24
[...]  (never closed)
socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) = 1000

Reassigning to tirpc bugowner and raising severity to major, as one leaked fd per user (?) can really cause issues.
Comment 16 Cristian Rodríguez 2018-04-05 21:15:00 UTC
(In reply to Fabian Vogt from comment #15)
> (In reply to Giacomo Comes from comment #14)
> > Yes, that was the issue. Rising the soft nofile limit to 2048 made sddm work
> > again. (the threshold was at about 1310).
> > 
> > The question remains about why it is necessary (at least for me) to increase
> > such limit.
> 
> It appears to me like the NIS user enumeration (or some other part) leaks
> sockets:
> 
> socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) = 24
> [...]  (never closed)
> socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) = 1000
> 
> Reassigning to tirpc bugowner and raising severity to major, as one leaked
> fd per user (?) can really cause issues.

huh..it also does not set SOCK_CLOEXEC..did you identified which library has this behaviour ?
Comment 17 Fabian Vogt 2018-04-05 21:31:25 UTC
(In reply to Cristian Rodríguez from comment #16)
> (In reply to Fabian Vogt from comment #15)
> > (In reply to Giacomo Comes from comment #14)
> > > Yes, that was the issue. Rising the soft nofile limit to 2048 made sddm work
> > > again. (the threshold was at about 1310).
> > > 
> > > The question remains about why it is necessary (at least for me) to increase
> > > such limit.
> > 
> > It appears to me like the NIS user enumeration (or some other part) leaks
> > sockets:
> > 
> > socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) = 24
> > [...]  (never closed)
> > socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) = 1000
> > 
> > Reassigning to tirpc bugowner and raising severity to major, as one leaked
> > fd per user (?) can really cause issues.
> 
> huh..it also does not set SOCK_CLOEXEC..did you identified which library has
> this behaviour ?

No, that it's tirpc is just a first guess, it's the only rpc lib loaded into the process as seen in the trace.

Tomorrow, I'll try to reproduce it with a debugger attached and get a backtrace.
Comment 18 Thorsten Kukuk 2018-04-06 04:27:55 UTC
(In reply to Fabian Vogt from comment #17)

> No, that it's tirpc is just a first guess, it's the only rpc lib loaded into
> the process as seen in the trace.

You can easily find out:
Install libtirpc from Leap 15.0. If that works and don't show the problems,
please try 
https://build.opensuse.org/package/show/home:kukuk:branches:Base:System/libtirpc
Comment 19 Thomas Blume 2018-04-06 06:22:25 UTC
(In reply to Fabian Vogt from comment #17)
> > > It appears to me like the NIS user enumeration (or some other part) leaks
> > > sockets:
> > > 
> > > socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) = 24
> > > [...]  (never closed)
> > > socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) = 1000
> > > 
> > > Reassigning to tirpc bugowner and raising severity to major, as one leaked
> > > fd per user (?) can really cause issues.
> > 
> > huh..it also does not set SOCK_CLOEXEC..did you identified which library has
> > this behaviour ?
> 
> No, that it's tirpc is just a first guess, it's the only rpc lib loaded into
> the process as seen in the trace.
> 
> Tomorrow, I'll try to reproduce it with a debugger attached and get a
> backtrace.

Looks like these are the sockets to connect to the remote portmapper.
libtirpc closes them via a macro, see below extract from the rpc_soc manpage:

-->
void clnt_destroy(CLIENT *clnt)
    A macro that destroys the client's RPC handle. Destruction usually involves deallocation of private data structures, including clnt itself. Use of clnt is undefined after calling clnt_destroy(). If the RPC library opened the associated socket, it will close it also. Otherwise, the socket remains open.
CLIENT *
--<
Comment 20 Fabian Vogt 2018-04-06 07:24:38 UTC
I can't reproduce the socket leak with the domain in this network on TW.
All sockets are closed properly.

Please upload the output file of "strace -fke trace=socket -o /tmp/sddm.strace sddm", it contains backtraces of the leaked socket's open calls.

It should also suffice to run "strace -ke trace=socket -o /tmp/getent.strace getent passwd >/dev/null", which is probably easier to do.

Also try out what comment 18 suggests.
Comment 21 Giacomo Comes 2018-04-06 14:05:52 UTC
(In reply to Thorsten Kukuk from comment #18)
> (In reply to Fabian Vogt from comment #17)
> 
> > No, that it's tirpc is just a first guess, it's the only rpc lib loaded into
> > the process as seen in the trace.
> 
> You can easily find out:
> Install libtirpc from Leap 15.0.

after installing libtirpc from Leap 15.0 sddm worked again.

> If that works and don't show the problems, please try 
> https://build.opensuse.org/package/show/home:kukuk:branches:Base:System/
> libtirpc

I installed https://api.opensuse.org/build/home:kukuk:branches:Base:System/openSUSE_Factory/x86_64/libtirpc/libtirpc3-1.0.3-101.1.x86_64.rpm
and sddm stopped working again.
Comment 22 Giacomo Comes 2018-04-06 14:11:40 UTC
Created attachment 766314 [details]
sddm+getent trace=socket
Comment 23 Thorsten Kukuk 2018-04-06 14:33:28 UTC
(In reply to Giacomo Comes from comment #21)

> after installing libtirpc from Leap 15.0 sddm worked again.

> I installed
> https://api.opensuse.org/build/home:kukuk:branches:Base:System/
> openSUSE_Factory/x86_64/libtirpc/libtirpc3-1.0.3-101.1.x86_64.rpm
> and sddm stopped working again.

Ok, thanks, so the socket leak must be in one of the libtirpc changes, and it is not the bindresvport() replacement.


Thomas, the new __try_protocol_version_2 function does not seem to close "client", while the old code did it via the "done" label.
Comment 24 Thomas Blume 2018-04-06 15:13:34 UTC
(In reply to Thorsten Kukuk from comment #23)
> (In reply to Giacomo Comes from comment #21)
> 
> > after installing libtirpc from Leap 15.0 sddm worked again.
> 
> > I installed
> > https://api.opensuse.org/build/home:kukuk:branches:Base:System/
> > openSUSE_Factory/x86_64/libtirpc/libtirpc3-1.0.3-101.1.x86_64.rpm
> > and sddm stopped working again.
> 
> Ok, thanks, so the socket leak must be in one of the libtirpc changes, and
> it is not the bindresvport() replacement.
> 
> 
> Thomas, the new __try_protocol_version_2 function does not seem to close
> "client", while the old code did it via the "done" label.

Hm, the straces in comment#22 don't show any call to function __try_protocol_version_2. That would be suprising anyway, because it means that the remote portmapper is a very old one that only supports rpc protocol version 2.
Comment 25 Thomas Blume 2018-04-06 15:52:58 UTC
The only recurring socket opening from libtirpc that I can find the straces comes from a function call in clnt_tli_create, e.g.:

-->
2873  socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) = 19
 > /lib64/libc-2.27.so(socket+0x7) [0xfb057]
 > /lib64/libtirpc.so.3.0.0(__rpc_nconf2fd_flags+0x5b) [0x12f2b]
 > /lib64/libtirpc.so.3.0.0(clnt_tli_create+0x50) [0xd2a0]
--<
Comment 26 Fabian Vogt 2018-04-06 18:02:13 UTC
The getent strace confirms that it's a bug somewhere in the nss stack.

There are two different traces which lead to socket, only one of them leaks a fd:

socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) = 9
 > /lib64/libc-2.27.so(socket+0x7) [0xfb057]
 > /lib64/libtirpc.so.3.0.0(__rpc_nconf2fd_flags+0x5b) [0x12f2b]
 > /lib64/libtirpc.so.3.0.0(clnt_tli_create+0x50) [0xd2a0]
 > /lib64/libtirpc.so.3.0.0(getclnthandle+0x3f8) [0x144f8]
 > /lib64/libtirpc.so.3.0.0(__rpcb_findaddr_timed+0x34e) [0x1503e]
 > /lib64/libtirpc.so.3.0.0(clnt_tp_create_timed+0x50) [0xd550]
 > /lib64/libtirpc.so.3.0.0(clnt_create_timed+0xa4) [0xd724]
 > /usr/lib64/libnsl.so.2.0.0(yp_bind_client_create_v3+0x7b) [0x5c2b]
 > /usr/lib64/libnsl.so.2.0.0(__yp_bind.part.1+0x2da) [0x642a]
 > /usr/lib64/libnsl.so.2.0.0(do_ypcall+0x179) [0x6749]
 > /usr/lib64/libnsl.so.2.0.0(do_ypcall_tr+0x9) [0x6859]
 > /usr/lib64/libnsl.so.2.0.0(yp_next+0xe5) [0x6ff5]
 > /lib64/libnss_nis.so.2.0.0(_nss_nis_getpwent_r+0x300) [0x7ae0]
 > /lib64/libnss_compat-2.27.so(getpwent_next_nss.part.3.constprop.9+0x73) [0x2ff3]
 > /lib64/libnss_compat-2.27.so(_nss_compat_getpwent_r+0x119) [0x3649]
 > /lib64/libc-2.27.so(__nss_getent_r+0x127) [0x11d787]
 > /lib64/libc-2.27.so(getpwent_r+0x88) [0xc5958]
 > /lib64/libc-2.27.so(__nss_getent+0x5b) [0x11d2ab]
 > /lib64/libc-2.27.so(getpwent+0x6c) [0xc53ac]
 > /usr/bin/getent() [0x3e9d]
 > /usr/bin/getent() [0x2c2a]
 > /lib64/libc-2.27.so(__libc_start_main+0xe7) [0x21a87]
 > /usr/bin/getent() [0x2cfa]
socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) = 6
 > /lib64/libc-2.27.so(socket+0x7) [0xfb057]
 > /lib64/libtirpc.so.3.0.0(__rpc_nconf2fd_flags+0x5b) [0x12f2b]
 > /lib64/libtirpc.so.3.0.0(clnt_tli_create+0x50) [0xd2a0]
 > /lib64/libtirpc.so.3.0.0(clnt_tp_create_timed+0xa8) [0xd5a8]
 > /lib64/libtirpc.so.3.0.0(clnt_create_timed+0xa4) [0xd724]
 > /usr/lib64/libnsl.so.2.0.0(yp_bind_client_create_v3+0x7b) [0x5c2b]
 > /usr/lib64/libnsl.so.2.0.0(__yp_bind.part.1+0x2da) [0x642a]
 > /usr/lib64/libnsl.so.2.0.0(do_ypcall+0x179) [0x6749]
 > /usr/lib64/libnsl.so.2.0.0(do_ypcall_tr+0x9) [0x6859]
 > /usr/lib64/libnsl.so.2.0.0(yp_next+0xe5) [0x6ff5]
 > /lib64/libnss_nis.so.2.0.0(_nss_nis_getpwent_r+0x300) [0x7ae0]
 > /lib64/libnss_compat-2.27.so(getpwent_next_nss.part.3.constprop.9+0x73) [0x2ff3]
 > /lib64/libnss_compat-2.27.so(_nss_compat_getpwent_r+0x119) [0x3649]
 > /lib64/libc-2.27.so(__nss_getent_r+0x127) [0x11d787]
 > /lib64/libc-2.27.so(getpwent_r+0x88) [0xc5958]
 > /lib64/libc-2.27.so(__nss_getent+0x5b) [0x11d2ab]
 > /lib64/libc-2.27.so(getpwent+0x6c) [0xc53ac]
 > /usr/bin/getent() [0x3e9d]
 > /usr/bin/getent() [0x2c2a]
 > /lib64/libc-2.27.so(__libc_start_main+0xe7) [0x21a87]
 > /usr/bin/getent() [0x2cfa]

So:

 > /lib64/libtirpc.so.3.0.0(clnt_tli_create+0x50) [0xd2a0]
 > /lib64/libtirpc.so.3.0.0(getclnthandle+0x3f8) [0x144f8]
 > /lib64/libtirpc.so.3.0.0(__rpcb_findaddr_timed+0x34e) [0x1503e]
 > /lib64/libtirpc.so.3.0.0(clnt_tp_create_timed+0x50) [0xd550]
 > /lib64/libtirpc.so.3.0.0(clnt_create_timed+0xa4) [0xd724]

 > /lib64/libtirpc.so.3.0.0(clnt_tli_create+0x50) [0xd2a0]
 > /lib64/libtirpc.so.3.0.0(clnt_tp_create_timed+0xa8) [0xd5a8]
 > /lib64/libtirpc.so.3.0.0(clnt_create_timed+0xa4) [0xd724]
Comment 27 Thorsten Kukuk 2018-04-09 08:45:23 UTC
Whoever can reproduce this, please test 
https://build.opensuse.org/package/show/home:kukuk:branches:Base:System/libtirpc
Comment 28 Thomas Blume 2018-04-09 09:25:33 UTC
(In reply to Fabian Vogt from comment #26)
> The getent strace confirms that it's a bug somewhere in the nss stack.
>
> There are two different traces which lead to socket, only one of them leaks
> a fd:
>
[...]
> 
> So:
>  
>  > /lib64/libtirpc.so.3.0.0(clnt_tli_create+0x50) [0xd2a0]
>  > /lib64/libtirpc.so.3.0.0(getclnthandle+0x3f8) [0x144f8]
>  > /lib64/libtirpc.so.3.0.0(__rpcb_findaddr_timed+0x34e) [0x1503e]
>  > /lib64/libtirpc.so.3.0.0(clnt_tp_create_timed+0x50) [0xd550]
>  > /lib64/libtirpc.so.3.0.0(clnt_create_timed+0xa4) [0xd724]
>  
>  > /lib64/libtirpc.so.3.0.0(clnt_tli_create+0x50) [0xd2a0]
>  > /lib64/libtirpc.so.3.0.0(clnt_tp_create_timed+0xa8) [0xd5a8]
>  > /lib64/libtirpc.so.3.0.0(clnt_create_timed+0xa4) [0xd724]

Thanks for the hint Fabian.
The code in clnt_tp_create_timed is:

-->
clnt_tp_create_timed(const char *hostname, rpcprog_t prog, rpcvers_t vers,
    const struct netconfig *nconf, const struct timeval *tp)
{
        struct netbuf *svcaddr;                 /* servers address */
        CLIENT *cl = NULL;                      /* client handle */

        if (nconf == NULL) {
                rpc_createerr.cf_stat = RPC_UNKNOWNPROTO;
                return (NULL);
        }

        /* 
         * Get the address of the server
         */
        if ((svcaddr = __rpcb_findaddr_timed(prog, vers, 
                        (struct netconfig *)nconf, (char *)hostname,
                        &cl, (struct timeval *)tp)) == NULL) {
                /* appropriate error number is set by rpcbind libraries */
                return (NULL);
        }
        if (cl == NULL) {
                cl = clnt_tli_create(RPC_ANYFD, nconf, svcaddr,
                                        prog, vers, 0, 0);
--<

Obviously the call to __rpcb_findaddr_timed is skipped in the second trace and I'm wondering  how this can happen.
Comment 29 Giacomo Comes 2018-04-09 15:02:38 UTC
I have bisected libtirpc and found the commit that introduced the regression:

commit 5e7b57bc20bd9cadfffd6de52c3331a926ebdf92
Author: Thomas Blume <Thomas.Blume@suse.com>
Date:   Wed Mar 14 09:51:51 2018 -0400

    rpcinfo: change order of version to be tried to 4, 3, 2
    
    When specifying TCP as transport on rpcinfo getport command (-T tcp),
    the initial RPC getport packet is still sent as a UDP packet.
    This is due to rpc protocol version 2 is tried first and
    function getpmaphandle() states:
    
    /*
    * Try UDP only - there are some portmappers out
    * there that use UDP only.
    */
    
    Even on systems with newer rpc versions, this will hang when
    the UDP port is blocked.  That is a quite artifical limitation,
    because UDP only portmappert should be really rare by now.
    The attached code changes the order of versions to be tried
    to 4, then 3 and use version 2 only as fallback.
    
    Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
    Signed-off-by: Thomas Blume <Thomas.Blume@suse.com>
    Signed-off-by: Steve Dickson <steved@redhat.com>

This is the same commit that was reversed in https://api.opensuse.org/build/home:kukuk:branches:Base:System/openSUSE_Factory/x86_64/libtirpc/libtirpc3-1.0.3-102.1.x86_64.rpm
Therefore such rpm also fixes the regression.
Comment 30 Thomas Blume 2018-04-10 07:25:52 UTC
(In reply to Giacomo Comes from comment #29)
> I have bisected libtirpc and found the commit that introduced the regression:
> 
> commit 5e7b57bc20bd9cadfffd6de52c3331a926ebdf92

Ok, so it seems Thorsten was right in comment#23.
I have added a patch that should address the problem mentioned.
Unfortunately, I still cannot reproduce the issue inhouse.
Couly you please try:

https://download.opensuse.org/repositories/home:/tsaupe:/branches:/Base:/System:/libtirpc/openSUSE_Factory/

and report whether it fixes the leak?
Comment 31 Giacomo Comes 2018-04-10 16:40:25 UTC
Unfortunately installing libtirpc3-1.0.3-3.1.x86_64.rpm did not fix the problem.
Comment 32 Thomas Blume 2018-04-11 10:21:06 UTC
(In reply to Giacomo Comes from comment #31)
> Unfortunately installing libtirpc3-1.0.3-3.1.x86_64.rpm did not fix the
> problem.

I really need to reproduce the issue inhouse.
Could you please give details about the remote portmapper at 192.65.176.45 that libtirpc is contacting?
What system is installed on this machine and which rpcbind version?
Comment 33 Giacomo Comes 2018-04-11 13:11:08 UTC
We have several master/slave nis servers running redhat 9 and centos 5/6/7
I did a test with all of them and the issue arises when talking to nis running on:
redhat 9 and centos 5 (portmapper 4.0)
no issue with:
centos 6 and centos 7 (rpcbind 0.2.0)
Comment 34 Thomas Blume 2018-04-12 08:52:20 UTC
(In reply to Giacomo Comes from comment #33)
> We have several master/slave nis servers running redhat 9 and centos 5/6/7
> I did a test with all of them and the issue arises when talking to nis
> running on:
> redhat 9 and centos 5 (portmapper 4.0)
> no issue with:
> centos 6 and centos 7 (rpcbind 0.2.0)

Thanks for the feedback. I could reproduce the issue now and found the culprit.
Packages with a proposed patch are here:

https://build.opensuse.org/package/binaries/home:tsaupe:branches:Base:System/libtirpc/openSUSE_Factory

Please test and report.
Comment 35 Giacomo Comes 2018-04-12 11:29:26 UTC
(In reply to Thomas Blume from comment #34)
> Packages with a proposed patch are here:
> 
> https://build.opensuse.org/package/binaries/home:tsaupe:branches:Base:System/
> libtirpc/openSUSE_Factory
> 
> Please test and report.

The patch works. Issue solved.
Comment 36 Thomas Blume 2018-04-18 14:03:40 UTC
Patch was accepted upstream.
Submitting to factory.
Comment 37 Giacomo Comes 2018-05-26 15:55:06 UTC
(In reply to Thomas Blume from comment #36)
> Patch was accepted upstream.
> Submitting to factory.

The fix still has not appeared in tumbleweed. Is that normal?
Comment 38 Fabian Vogt 2018-05-26 16:02:58 UTC
The request got declined but nobody noticed:

https://build.opensuse.org/request/show/602530
Comment 39 Thomas Blume 2018-05-28 12:39:13 UTC
(In reply to Fabian Vogt from comment #38)
> The request got declined but nobody noticed:
> 
> https://build.opensuse.org/request/show/602530

The error is cause by the previous submit to the development repo:
https://build.opensuse.org/request/show/593768
that wasn't forwarded to factory.
Preparing a new one.
Comment 40 Thomas Blume 2018-07-30 09:10:44 UTC
patch is in tumbleweed, closing