Bug 1141597 - [Build 20190713] systemd networkd_dhcp fails to bring up the network
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Basesystem
Version: Current
Hardware: Other Other
Priority: P5 - None
Severity: Normal
Assigned To: systemd maintainers
QA Contact: E-mail List
URL: https://openqa.opensuse.org/tests/983...
Reported: 2019-07-16 08:34 UTC by Dominique Leuenberger
Modified: 2021-10-06 08:37 UTC (History)
fbui: needinfo?


Attachments
0001-bring-lower-layer-up-before-running-networkd-setup.patch (748 bytes, patch)
2019-07-25 13:32 UTC, Thomas Blume
Description Dominique Leuenberger 2019-07-16 08:34:11 UTC
## Observation

openQA test in scenario opensuse-Tumbleweed-DVD-x86_64-systemd-networkd@64bit fails in
[networkd_dhcp](https://openqa.opensuse.org/tests/983710/modules/networkd_dhcp/steps/31)

## Test suite description
Maintainer: dheidler@suse.de, okurz@suse.de
Test for systemd-networkd. These tests run on a single testmachine and spawn systemd-nspawn containers.


## Reproducible

Fails since (at least) Build [20190713](https://openqa.opensuse.org/tests/983420)


## Expected result

Last good: [20190711](https://openqa.opensuse.org/tests/983071) (or more recent)


## Further details

Always latest result in this scenario: [latest](https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=opensuse&flavor=DVD&machine=64bit&test=systemd-networkd&version=Tumbleweed)
Comment 1 Franck Bui 2019-07-16 09:05:22 UTC
Additional info from Dominique:

The regression only happened in snapshot 0713 (where no changes in systemd have been submitted).

Changes can be found at https://lists.opensuse.org/opensuse-commit/2019-07/date2.html
Comment 2 Franck Bui 2019-07-18 08:18:33 UTC
I tried to reproduce the issue locally but it worked as expected.

@Dominique, it would be nice if I can get access to the system running the test. 

Do you know if that's doable ?
Comment 3 Dominique Leuenberger 2019-07-18 09:07:24 UTC
(In reply to Franck Bui from comment #2)
> I tried to reproduce the issue locally but it worked as expected.
> 
> @Dominique, it would be nice if I can get access to the system running the
> test. 
> 
> Do you know if that's doable ?

openQA runs all on discardable VMs: a qemu is fired up, openSUSE installed and the tests run. 

Well, in this case the HDD image is actually generated by a different test and re-used, so you can download it from openQA and boot it. The test code is a simple script that can be 'replayed' manually in a VM.

HDD image:
https://openqa.opensuse.org/tests/983710/asset/hdd/opensuse-Tumbleweed-x86_64-20190713-textmode@64bit.qcow2

Test code:
networkd_init: https://openqa.opensuse.org/tests/983710/modules/networkd_init/steps/1/src
networkd_dhcp: https://openqa.opensuse.org/tests/983710/modules/networkd_dhcp/steps/1/src
Comment 4 Dominik Heidler 2019-07-18 09:10:53 UTC
And the functions to setup and run commands within nspawn containers can be found here:

https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/lib/networkdbase.pm
Comment 5 Franck Bui 2019-07-23 09:24:42 UTC
Just started looking into this and have one question about the test: why duplicate systemd-nspawn@.service rather than use a container settings file (see systemd.nspawn(5))?

Also, the test starts a transient unit for each simple command it needs to execute... why not use "machinectl shell node1 <my-command>" instead?
Comment 6 Thomas Blume 2019-07-25 12:23:30 UTC
Issue reproduced on local test server.
The NIC host0 inside the container stays in state "configuring" and never reaches state "configured".
Therefore the test fails when grepping for this state.
So the question is why the NIC setup gets stuck that way.
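
For anyone replaying the check by hand, here is a minimal sketch of the kind of state test described above. It assumes the state is read from a `networkctl status host0` style line such as `State: routable (configured)`. How the openQA module actually greps for the state is not shown in this report, and the function names are made up for illustration.

```python
import re

# Assumed layout of the relevant networkctl line:
#   "State: <operational state> (<setup state>)"
_STATE_RE = re.compile(r"^\s*State:\s*(\S+)\s*\((\w+)\)", re.MULTILINE)


def link_states(status_text):
    """Return (operational_state, setup_state), or None if no State line is found."""
    match = _STATE_RE.search(status_text)
    if match is None:
        return None
    return match.group(1), match.group(2)


def is_configured(status_text):
    """True only when the setup state is exactly 'configured'."""
    states = link_states(status_text)
    return states is not None and states[1] == "configured"


if __name__ == "__main__":
    # Illustrative text only, not captured from the failing system.
    stuck = "       State: degraded (configuring)\n"
    print(link_states(stuck), is_configured(stuck))
```

Matching the setup state exactly matters here because "configuring" and "configured" share a prefix, so a loose substring check for "config" would not tell the stuck case from the good one.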
Comment 7 Thomas Blume 2019-07-25 13:32:20 UTC
Created attachment 811608 [details]
0001-bring-lower-layer-up-before-running-networkd-setup.patch
Comment 9 Dominik Heidler 2019-07-26 11:04:35 UTC
I did some manual tests on a real hardware system and it looks like a real bug to me.

I installed the latest TW (20190724) with textmode.
Then I stopped and disabled wicked and wickedd.
Then I created a config for networkd:

/etc/systemd/network/50-dhcp.network:
[Match]
Name=em1

[Network]
DHCP=ipv4


Then I ran "systemctl enable systemd-networkd" and rebooted.

The system did NOT get an IP address but stayed forever in state "configuring".

The systemd-networkd journal shows the line:

Jul 26 12:55:43 localhost systemd-networkd[1118]: em1: Could not bring up interface: Invalid argument


When I then run "ip li set em1 up; systemctl restart systemd-networkd" the network is coming up and networkd requests an IP address using DHCP.

So this seems to be a real bug.
Comment 10 Franck Bui 2019-07-26 12:34:02 UTC
(In reply to Dominik Heidler from comment #9)
> 
> So this seems to be a real bug.

Maybe, but you're commenting on the wrong bug. Yours looks similar to https://bugzilla.opensuse.org/show_bug.cgi?id=1142901

This bug is about 2 containers, each using a veth and connected via a bridge (on the host). One of the containers is running a DHCP server that can't be reached from the second one.
Comment 11 Thomas Blume 2019-07-26 14:22:42 UTC
(In reply to Franck Bui from comment #10)
> (In reply to Dominik Heidler from comment #9)
> > 
> > So this seems to be a real bug.
> 
> Maybe but you're commenting in the wrong one. Your looks similar to
> https://bugzilla.opensuse.org/show_bug.cgi?id=1142901
> 
> This bug is about 2 containers using veth each and connected via a bridge
> (on the host). One of the container is running a DHCP server that can't be
> reached from the second one.

Yeah, this is a different bug, I've tested it with a package that contains the networkd ifup fix, see:

http://emerson.suse.de/tests/1460#step/networkd_dhcp/35
http://emerson.suse.de/tests/1460#step/networkd_dhcp/41

but still:

http://emerson.suse.de/tests/1460#step/networkd_dhcp/46
Comment 12 Thomas Blume 2019-07-29 12:01:19 UTC
Setting up the network manually in container2, I can ping container1, so the network connection itself is ok.
It seems that the DHCP server doesn't work in container1.
Comment 18 Dominik Heidler 2019-07-30 09:35:22 UTC
So maybe the failing vlan and bridge tests were related to the other bug as they don't fail with recent builds anymore.
Comment 19 Thomas Blume 2019-08-01 07:53:50 UTC
(In reply to Dominik Heidler from comment #18)
> So maybe the failing vlan and bridge tests were related to the other bug as
> they don't fail with recent builds anymore.

Finally, it seems we've found the culprit.
The firewall was blocking the dhcp communication between the containers.
After running:

systemctl stop firewalld

on my local testmachine, node2 was set up via dhcp correctly.
Can you please add that to your test setup and check whether that works?

I've also got some recommendations from Franck for optimizing the systemd commands that manage the container.
I'll provide a PR to openQA for that.
Comment 20 Dominik Heidler 2019-08-01 09:12:22 UTC
I retried my steps from https://bugzilla.suse.com/show_bug.cgi?id=1141597#c9 with build 20190730 of TW. This time the network came up as expected. I didn't have to stop firewalld though.
Comment 21 Franck Bui 2019-08-01 10:19:05 UTC
It would be nice to understand what changes made the regression go away.

Was the testing system installed with firewall enabled this time ?
Comment 22 Dominik Heidler 2019-08-01 11:23:17 UTC
yes - maybe that kernel/systemd fix resolved it.
Comment 24 Franck Bui 2019-08-01 12:24:31 UTC
(In reply to Dominik Heidler from comment #22)
> yes - maybe that kernel/systemd fix resolved it.

No since running the latest version of systemd didn't help.
Comment 25 Franck Bui 2019-08-01 12:27:27 UTC
So I'm not sure what should be done next as this seems an issue with the way the firewall is configured in the test.

I'm assigning this to Dominik so the test is modified to ensure that the firewall is setup to let the communication work between the 2 containers.
Comment 26 Thomas Blume 2019-08-01 12:33:01 UTC
(In reply to Franck Bui from comment #25)
> So I'm not sure what should be done next as this seems an issue with the way
> the firewall is configured in the test.
> 
> I'm assigning this to Dominik so the test is modified to ensure that the
> firewall is setup to let the communication work between the 2 containers.

At least the official openQA tests still fail, even with the latest Tumbleweed build, see:

https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=opensuse&flavor=DVD&machine=64bit&test=systemd-networkd&version=Tumbleweed#next_previous

So, I'd bet the firewall is active there.
Comment 27 Dominik Heidler 2019-08-01 12:33:28 UTC
Fix that will stop firewalld:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/8094
Comment 28 Franck Bui 2020-05-11 10:40:15 UTC
Unfortunately this bug resurfaced in bsc#1153159, which is not a big surprise given that we never completely understood why it was suddenly fixed without the need to stop the firewall, see comment #20.

BTW I don't really understand why a new bug was opened since a bug for this specific test was already open...

Anyway, after spending some time to reproduce this issue locally, I noticed the following warnings in container node1:

 # journalctl -b -Mnode1 -pwarning
 May 11 04:53:15 node1 systemd-networkd[20]: host0: Failed to determine timezone: Invalid argument
 May 11 04:53:15 node1 systemd-networkd[20]: Could not process link message, ignoring: Invalid argument

This indicates something wrong with the timezone setting in node1, and indeed looking at this setting in node1 revealed that the /etc/localtime symlink was missing.

This symlink is created during the installation of the timezone package, which is missing in node1. After installing this package and restarting networkd, the DHCP server was correctly started and the test passed as expected.

The timezone is important in this test because the DHCP server has "EmitTimezone=" option implicitly set to "yes", see man systemd.network for the meaning. And if the timezone setting can't be retrieved by networkd then it won't start the DHCP server.

Dominik, can you please fix the test to make sure that timezone is installed in the container running the DHCP server ?
Comment 29 Franck Bui 2020-05-11 10:51:33 UTC
Also please consider my comment #5 where I pointed out that duplicating systemd-nspawn@.service is really not a good idea for obvious reasons.

Also, the name chosen for the duplicated service, "systemd-nspawn-openqa@.service", is not a good choice because by convention all units starting with the "systemd-" prefix belong to systemd itself.

That said, you don't need the duplication at all: you should rely on .nspawn configuration files instead, which were introduced exactly to avoid that. Please see the systemd.nspawn man page.

Should I open a bug for that ?
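
To make the suggestion concrete, here is a hedged sketch that renders a minimal settings file for a container such as node1. `Boot=` in `[Exec]` and `Bridge=` in `[Network]` are options documented in systemd.nspawn(5). The bridge name `br0` and the helper `render_nspawn` are placeholders, and whether these two options cover everything the duplicated unit changes cannot be told from this report.

```python
import configparser
import io


def render_nspawn(bridge):
    """Return the text of a minimal .nspawn settings file (illustrative only)."""
    cfg = configparser.ConfigParser()
    cfg.optionxform = str                # keep option names case-sensitive
    cfg["Exec"] = {"Boot": "yes"}        # boot the container's own init
    cfg["Network"] = {"Bridge": bridge}  # attach the container's veth to a host bridge
    out = io.StringIO()
    cfg.write(out, space_around_delimiters=False)
    return out.getvalue()


if __name__ == "__main__":
    # "br0" is a placeholder; the bridge name used by the test is not given here.
    print(render_nspawn("br0"), end="")
```

Saved as /etc/systemd/nspawn/node1.nspawn, such a file should be picked up by the stock systemd-nspawn@node1.service, since settings files are looked up by machine name.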
Comment 30 Dominik Heidler 2020-05-11 14:35:38 UTC
Fix to install the timezone rpm as well:
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/10230


Regarding the duplicate unit file:
As it is only a cosmetic issue, feel free to open a progress ticket and provide a sample .nspawn file along with info on how to use it and where to place it.
Comment 31 Dominique Leuenberger 2020-05-11 14:41:40 UTC
(In reply to Franck Bui from comment #28)
> The timezone is important in this test because the DHCP server has
> "EmitTimezone=" option implicitly set to "yes", see man systemd.network for
> the meaning. And if the timezone setting can't be retrieved by networkd then
> it won't start the DHCP server.
> 
> Dominik, can you please fix the test to make sure that timezone is installed
> in the container running the DHCP server ?

Having the test code take care of that instead of the package only masks the problem: users will still run into the exact same bug. timezone is not required by the test, it's required by networkd (for whatever reason - timedatectl can make smart assumptions about being UTC if nothing is defined)
Comment 32 Dominik Heidler 2020-05-11 14:49:18 UTC
This indeed fixes the issue, but I have to agree with dimstar's comment on my PR that this would be just a workaround and that systemd should then depend on timezone.
Comment 33 Franck Bui 2020-05-11 15:51:27 UTC
(In reply to Dominique Leuenberger from comment #31)
> Having the test code take care of that instead of the package only masks the
> problem: users will still run into the exact same bug. timezone is not
> required by the test, it's required by networkd (for whatever reason -
> timedatectl can make smart assumptions about being UTC if nothing is defined)

I considered making networkd require timezone but in the end couldn't convince myself, because a) starting a DHCP server is optional (and this option is turned off by default) and b) sending timezone information to clients is also optional (although turned on by default).

OTOH timezone looks more like a system package which should be installed at least by the pattern used to install basic systems such as containers.
Comment 34 Franck Bui 2020-05-11 15:53:35 UTC
(In reply to Dominik Heidler from comment #30)
> Regarding the duplicate unit file:
> As it is only a cosmetic issue, feel free to open a  progress ticket and
> provide some sample .nspawn file and the info how to use it and where to
> place it.

It's definitely not a "cosmetic" issue.

Duplicating a systemd unit file like you did is asking for future trouble since you'll miss any bug fixes or any evolution of this unit. BTW the 2 versions have already diverged...
Comment 35 Dominique Leuenberger 2020-05-11 15:54:50 UTC
(In reply to Franck Bui from comment #34)
> (In reply to Dominik Heidler from comment #30)
> > Regarding the duplicate unit file:
> > As it is only a cosmetic issue, feel free to open a  progress ticket and
> > provide some sample .nspawn file and the info how to use it and where to
> > place it.
> 
> It's definitively not a "cosmetic" issue.
> 
> Duplicating a systemd unit file like you did is a call for future troubles
> since you'll miss any bug fixes or any evolution of this unit. BTW the 2
> versions already diverged...

Not all containers are installed by means of patterns (e.g. the one tested here); the most correct solution would of course be for networkd/DHCP to fall back to UTC, just as timedatectl does. There is no reason to handle this differently.
Comment 36 Franck Bui 2020-05-11 16:01:22 UTC
Falling back to UTC in case networkd fails to figure out the timezone used by the system might indeed be an acceptable fallback.

I'll try to improve that part.
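
As a rough illustration of the two behaviours discussed here, the sketch below derives a zone name from the target of the /etc/localtime symlink (as comment #28 implies networkd does) and either fails or returns a caller-supplied fallback such as "UTC". This is not networkd's code. The function names and the `fallback` parameter are hypothetical.

```python
import os

ZONEINFO_MARKER = "zoneinfo/"


def zone_from_link_target(target, fallback=None):
    """Map the target of the localtime symlink to a zone name.

    `target` is the symlink target, or None when the symlink is missing.
    Without a fallback this raises, mirroring the strict behaviour that kept
    the DHCP server from starting; with one, the fallback is returned.
    """
    if target is not None:
        _, marker, zone = target.partition(ZONEINFO_MARKER)
        if marker and zone:
            return zone
    if fallback is not None:
        return fallback
    raise ValueError("cannot determine timezone from %r" % (target,))


def local_zone(path="/etc/localtime", fallback=None):
    """Read the symlink at `path` and return the zone name it points to."""
    try:
        target = os.readlink(path)
    except OSError:  # missing file, or not a symlink
        target = None
    return zone_from_link_target(target, fallback)


if __name__ == "__main__":
    print(local_zone(fallback="UTC"))
```

In a container without the timezone package, `local_zone()` would raise while `local_zone(fallback="UTC")` would return "UTC", which is the difference between the failure seen in node1 and the lenient behaviour proposed above.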
Comment 39 Dominik Heidler 2020-06-15 10:08:37 UTC
Any update on this one?
Comment 40 Franck Bui 2020-06-18 07:16:37 UTC
I still haven't had time to look at it.

But since the test is configuring the DHCP server to emit the timezone, it would still make sense to install the timezone package so this information is correct.

It's not clear to me currently if it's correct to assume UTC if this info can't be retrieved from the local host.
Comment 47 Dominik Heidler 2020-10-28 09:52:45 UTC
Any update?
Comment 48 Franck Bui 2020-10-29 13:30:46 UTC
Unfortunately not yet. I'll try to work on that next month though.
Comment 49 Franck Bui 2020-10-29 14:13:28 UTC
I just checked and it should be fixed by commit bc9ecd484f1ebfe0de8b567c9 [1], which has been included since v244.

So this bug can be closed I think... or maybe not as this probably needs to be fixed in Leap/SLE.

Let's re-assign this bug to systemd maintainers until the other distros are fixed.

[1] https://github.com/openSUSE/systemd/commit/bc9ecd484f1ebfe0de8b567c90f6cd867fbd5894
Comment 50 Franck Bui 2020-10-29 14:33:07 UTC
Commit backported to SUSE/v228 and SUSE/v234, hence closing.
Comment 58 Swamp Workflow Management 2021-01-27 14:16:26 UTC
SUSE-RU-2021:0233-1: An update that has 7 recommended fixes can now be installed.

Category: recommended (moderate)
Bug References: 1141597,1174436,1175458,1177490,1179363,1179824,1180225
CVE References: 
JIRA References: 
Sources used:
SUSE Linux Enterprise Module for Basesystem 15-SP2 (src):    systemd-234-24.67.1
SUSE Linux Enterprise Installer 15-SP1 (src):    systemd-234-24.67.1

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
Comment 60 Swamp Workflow Management 2021-01-30 14:26:01 UTC
openSUSE-RU-2021:0196-1: An update that has 7 recommended fixes can now be installed.

Category: recommended (moderate)
Bug References: 1141597,1174436,1175458,1177490,1179363,1179824,1180225
CVE References: 
JIRA References: 
Sources used:
openSUSE Leap 15.1 (src):    systemd-234-lp151.26.34.1, systemd-mini-234-lp151.26.34.1
Comment 61 Swamp Workflow Management 2021-01-30 23:19:13 UTC
openSUSE-RU-2021:0210-1: An update that has 7 recommended fixes can now be installed.

Category: recommended (moderate)
Bug References: 1141597,1174436,1175458,1177490,1179363,1179824,1180225
CVE References: 
JIRA References: 
Sources used:
openSUSE Leap 15.2 (src):    systemd-234-lp152.31.16.1, systemd-mini-234-lp152.31.16.1
Comment 62 Swamp Workflow Management 2021-02-10 14:25:40 UTC
SUSE-RU-2021:0358-1: An update that has 7 recommended fixes can now be installed.

Category: recommended (moderate)
Bug References: 1141597,1174436,1179363,1179824,1180020,1180596,1180885
CVE References: 
JIRA References: 
Sources used:
SUSE Linux Enterprise Software Development Kit 12-SP5 (src):    systemd-228-157.21.2
SUSE Linux Enterprise Server 12-SP5 (src):    systemd-228-157.21.2

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
Comment 64 Oliver Kurz 2021-02-27 06:06:04 UTC
This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: systemd-networkd
https://openqa.opensuse.org/tests/1317512

To prevent further reminder comments one of the following options should be followed:
1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
2. The openQA job group is moved to "Released"
3. The label in the openQA scenario is removed
Comment 65 Oliver Kurz 2021-03-13 06:06:49 UTC
This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: systemd-networkd
https://openqa.opensuse.org/tests/1317512

To prevent further reminder comments one of the following options should be followed:
1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
2. The openQA job group is moved to "Released"
3. The label in the openQA scenario is removed
Comment 67 Swamp Workflow Management 2021-04-16 16:38:53 UTC
SUSE-RU-2021:1247-1: An update that has 11 recommended fixes can now be installed.

Category: recommended (important)
Bug References: 1141597,1174436,1178219,1179363,1179824,1180020,1180083,1180596,1180885,1183094,1183790
CVE References: 
JIRA References: 
Sources used:
SUSE OpenStack Cloud Crowbar 9 (src):    systemd-228-150.95.1
SUSE OpenStack Cloud Crowbar 8 (src):    systemd-228-150.95.1
SUSE OpenStack Cloud 9 (src):    systemd-228-150.95.1
SUSE OpenStack Cloud 8 (src):    systemd-228-150.95.1
SUSE Linux Enterprise Server for SAP 12-SP4 (src):    systemd-228-150.95.1
SUSE Linux Enterprise Server for SAP 12-SP3 (src):    systemd-228-150.95.1
SUSE Linux Enterprise Server 12-SP4-LTSS (src):    systemd-228-150.95.1
SUSE Linux Enterprise Server 12-SP3-LTSS (src):    systemd-228-150.95.1
SUSE Linux Enterprise Server 12-SP3-BCL (src):    systemd-228-150.95.1
SUSE Linux Enterprise Server 12-SP2-LTSS-SAP (src):    systemd-228-150.95.1
SUSE Linux Enterprise Server 12-SP2-LTSS-ERICSSON (src):    systemd-228-150.95.1
SUSE Linux Enterprise Server 12-SP2-BCL (src):    systemd-228-150.95.1
HPE Helion Openstack 8 (src):    systemd-228-150.95.1

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.