Bug 1193889 - [Build 20211215] openQA test fails in reconnect_mgmt_console because yast.ssh can not restart the operating system
Status: VERIFIED FIXED
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Kernel
Version: Current
Hardware: S/390-64 Other
Priority: P2 - High
Severity: Normal
Target Milestone: ---
Assigned To: openSUSE Kernel Bugs
QA Contact: E-mail List
URL: https://openqa.opensuse.org/tests/208...
Depends on:
Blocks:
Reported: 2021-12-18 16:13 UTC by Sarah Kriesch
Modified: 2022-04-04 21:39 UTC
CC List: 15 users

See Also:
Found By: openQA
Services Priority:
Business Priority:
Blocker: Yes
Marketing QA Status: ---
IT Deployment: ---


Attachments
390 error info during boot process (64.71 KB, text/plain)
2021-12-21 12:19 UTC, WEI GAO
Console log of LINUX211 guest booting up (35.17 KB, text/plain)
2022-01-04 17:07 UTC, Mark Post
Tarball with contents of /proc/config.gz from both a SLES15 SP4 system, and a Tumbleweed system (51.37 KB, application/gzip)
2022-01-10 05:40 UTC, Mark Post
dmesg (7.07 KB, text/plain)
2022-02-23 16:15 UTC, Sarah Kriesch
messages (47.68 KB, text/plain)
2022-02-23 16:15 UTC, Sarah Kriesch
memsample (275.63 KB, application/x-gzip)
2022-02-23 16:16 UTC, Sarah Kriesch

Description Sarah Kriesch 2021-12-18 16:13:21 UTC
## Observation

openQA test in scenario opensuse-Tumbleweed-DVD-s390x-textmode@s390x-zVM-vswitch-l2 fails in
[reconnect_mgmt_console](https://openqa.opensuse.org/tests/2086854/modules/reconnect_mgmt_console/steps/16)
Packages appear to install successfully on s390x, but yast.ssh cannot reconnect after the reboot. I know that something was changed recently because of password security.
I have also tested with my s390x-specific workaround removed, but it fails in the same way:
https://openqa.opensuse.org/tests/2090752

## Test suite description
Installation in textmode or textmode-server and selecting the textmode "desktop" during installation.


## Reproducible

Fails since (at least) Build [20211127](https://openqa.opensuse.org/tests/2061533)


## Expected result

The reconnection via yast.ssh should work after the successful installation.

Last good: [20211123](https://openqa.opensuse.org/tests/2054786) (or more recent)


## Further details

Always latest result in this scenario: [latest](https://openqa.opensuse.org/tests/latest?arch=s390x&distri=opensuse&flavor=DVD&machine=s390x-zVM-vswitch-l2&test=textmode&version=Tumbleweed)
Comment 1 Sarah Kriesch 2021-12-18 17:18:10 UTC
The log files end after the reboot. The connection worked, but it is disconnected after the attempt to start YaST.
Comment 2 Stefan Hundhammer 2021-12-20 12:36:46 UTC
The usual obvious problems with that are:

- ssh not installed and enabled
- ssh port blocked by the firewall

But AFAICS that's correct in this setup: ssh is installed and enabled, and the ssh port is enabled in the firewall. OK so far.
Comment 3 Stefan Hundhammer 2021-12-20 12:40:56 UTC
I know there was a change with remote root logins via password no longer enabled; but when I look at those openQA screenshots, I see

  ssh: connect to host ... port 22: No route to host

so this appears to be a network problem on a lower level; or the machine didn't boot at all or not properly.
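
For illustration, a minimal Python sketch of how the error behind a failed TCP connect can be mapped to the layer that probably failed. The function name and the category labels are made up for this example. "No route to host" corresponds to EHOSTUNREACH, which on Linux can result from a host that is down or never booted, but also from an ICMP reject sent by a firewall, so it does not by itself distinguish the two theories in this bug.

```python
import errno

def classify_connect_error(err: int) -> str:
    """Map a connect() errno to a rough failure category (hypothetical labels)."""
    if err in (errno.EHOSTUNREACH, errno.ENETUNREACH):
        # "No route to host": target down or not booted, routing/ARP failure,
        # or an ICMP unreachable/prohibited reject from a firewall.
        return "unreachable"
    if err == errno.ECONNREFUSED:
        # Host answered, but nothing is listening on the port.
        return "refused"
    if err == errno.ETIMEDOUT:
        # Packets were silently dropped, or the host never answered.
        return "timeout"
    return "other"

print(classify_connect_error(errno.EHOSTUNREACH))
```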
Comment 4 Sarah Kriesch 2021-12-20 14:45:33 UTC
(In reply to Stefan Hundhammer from comment #3)
> I know there was a change with remote root logins via password no longer
> enabled; but when I look at those openQA screenshots, I see
> 
>   ssh: connect to host ... port 22: No route to host
> 
> so this appears to be a network problem on a lower level; or the machine
> didn't boot at all or not properly.
That is why I wanted to watch two different builds, to see whether the problem persists.
At first I suspected a temporary network issue.
Then I thought about the "no remote login via password any more" change and tested for compatibility issues. I removed my workaround function from the openQA code and tested that with one job based on the PR; the failure persists.

Did the team develop anything else along with the "no password any more" change that should be used with this yast.ssh execution? It seems that the start works and the connection is closed after executing yast.ssh.
Comment 5 Stefan Hundhammer 2021-12-20 15:07:21 UTC
This looks like a problem:

https://openqa.opensuse.org/tests/2093164#step/reconnect_mgmt_console/7

  'Performing \'kexec -e -x\'01: HCPGSP2629I The virtual machine is placed in CP mode',
  ' due to a SIGP stop from                                                        ',
  ' CPU 00.                                                                        '
Comment 6 Stefan Hundhammer 2021-12-20 15:17:07 UTC
I don't know what that message means. It might be s/390 host related; it might be related to the kernel.

Mark, do you have a good idea how to continue here?
Comment 7 Stefan Hundhammer 2021-12-20 15:26:18 UTC
For completeness:

(In reply to Sarah Kriesch from comment #4)
> Did the team develop anything additionally with the "no password any more",
> that should have been used with this yast.ssh execution? It seems, that the
> start would work and the connection will be closed after executing yast.ssh.

It's a bit different.

For an ssh installation, you ssh to the machine, then invoke "yast.ssh". That starts the installation workflow which then runs the normal way until packages are installed, Dracut built the initrd, some more one-time initializations are done; and then, as the final step, it boots the newly installed system.

In some cases, it might start a "firstboot" workflow, but that's something very exotic, usually done only for certain business products that need to do more setup after the base installation is done.

OpenQA however will always try to monitor that boot to test if everything went as expected; so it pings the new system and tries to establish a connection for those additional tests in the installed system. And that is what went wrong here: It could not connect.

I suspected that the system didn't even boot correctly, and that seems to be confirmed with the message from comment #5.
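
The "wait for the installed system to come up, then connect" step described above could be sketched roughly as follows. This is not openQA's actual implementation, only an illustration in Python; the function names, the retry count and the delay are assumptions.

```python
import socket
import time
from typing import Callable

def port_is_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def wait_for_boot(probe: Callable[[], bool], attempts: int = 30,
                  delay: float = 10.0,
                  sleep: Callable[[float], None] = time.sleep) -> int:
    """Call probe() until it succeeds.

    Returns the 1-based attempt number that succeeded, or 0 if none did.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return 0
```

A caller would use something like `wait_for_boot(lambda: port_is_open("guest.example"))`, where the host name is a placeholder. A return value of 0 corresponds to the situation in this bug: the system never became reachable.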
Comment 8 Michal Filka 2021-12-20 15:32:43 UTC
(In reply to Stefan Hundhammer from comment #3)
> I know there was a change with remote root logins via password no longer
> enabled; but when I look at those openQA screenshots, I see
> 
>   ssh: connect to host ... port 22: No route to host

In 99% of cases this means the firewall is blocking the port. So, I'd start with checking the firewall(d) configuration, for example with

firewall-cmd --list-services

if we have access to a command line.
Comment 9 Stefan Hundhammer 2021-12-20 15:45:20 UTC
Michal, I fear right now the machine doesn't boot at all. See comment #5.
Comment 10 Sarah Kriesch 2021-12-20 20:25:41 UTC
That is the IBM documentation for this error type with z/VM: https://www.ibm.com/docs/en/zvm/7.1?topic=messages-hcp2629i

But we do not use z/VM in this case...
Comment 12 WEI GAO 2021-12-21 12:16:16 UTC
A similar issue happens in the following case; a lot of error messages pop up during the boot process:
https://openqa.suse.de/tests/7888594#step/reconnect_mgmt_console/7
Comment 13 WEI GAO 2021-12-21 12:19:34 UTC
Created attachment 854728 [details]
390 error info during boot process
Comment 14 Sarah Kriesch 2021-12-22 14:12:05 UTC
It seems that SLE has got a bootloader problem, because it can not identify a btrfs file system.
Comment 15 WEI GAO 2021-12-23 07:55:50 UTC
(In reply to Sarah Kriesch from comment #14)
> It seems that SLE has got a bootloader problem, because it can not identify
> a btrfs file system.

I have created another issue for SLE https://bugzilla.suse.com/show_bug.cgi?id=1193972
Comment 16 Sarah Kriesch 2021-12-23 11:31:43 UTC
I am not authorized to view that bug.
In general, both error messages (Tumbleweed and SLE) seem to occur before the bootloader can start. I am reassigning this bug from YaST to bootloader now.
Please correct it if I am wrong.
Comment 17 Sarah Kriesch 2021-12-27 16:59:16 UTC
Complete output:
last output:
[
  '      Use the ^ and v keys to select which entry is highlighted. Press          ',
  '       enter to boot the selected OS, `e\' to edit the commands before           ',
  '       booting or `c\' for a command-line.                                       ',
  '  *(1) openSUSE Tumbleweed                                                      ',
  '   (2) Advanced options for openSUSE Tumbleweed                                 ',
  '   (s) Start bootloader from a read-only snapshot                               ',
  'Welcome to GRUB!                                                                ',
  '                            GNU GRUB  version 2.06                              ',
  '      Use the ^ and v keys to select which entry is highlighted. Press          ',
  '       enter to boot the selected OS, `e\' to edit the commands before           ',
  '       booting or `c\' for a command-line.                                       ',
  '  *(1) openSUSE Tumbleweed                                                      ',
  '   (2) Advanced options for openSUSE Tumbleweed                                 ',
  '   (s) Start bootloader from a read-only snapshot                               ',
  ' Loading Linux 5.15.8-1-default ...                                             ',
  ' Loading initial ramdisk ...                                                    ',
  ' Performing \'kexec -la /boot/image-5.15.8-1-default                             ',
  ' --initrd=/boot/initrd-5.15.8-1-default                                         ',
  ' --command-line=root=UUID=6bfd9444-e956-49d5-9fbf-c2b575a8cb68                  ',
  ' hvc_iucv=8 TERM=dumb crashkernel=195M mitigations=auto\'                        ',
  ' Performing \'systemctl kexec\' (just-in-case) Running in chroot, ignoring command',
  ' \'kexec\'                                                                        ',
  'Performing \'kexec -e -x\'01: HCPGSP2629I The virtual machine is placed in CP mode',
  ' due to a SIGP stop from                                                        ',
  ' CPU 00.                                                                        '
]
Comment 18 Sarah Kriesch 2021-12-28 10:14:45 UTC
@Mark: Can it be that this bug is related to the latest update of s390-tools together with the new kernel?
Comment 19 Mark Post 2022-01-04 17:07:59 UTC
Created attachment 854968 [details]
Console log of LINUX211 guest booting up

I just got back from vacation and started looking at this. The messages you're showing all look very normal. I'm attaching a z/VM console log from a SLES15 SP3 system that I just booted, for reference.

I'm somewhat confused by the openQA session referenced in comment#0. It appears that the system was installed, and then rebooted, and then the SSH session was re-established, and when yast.ssh was executed, the ssh connection was disconnected? Is that the case? Or are those images "left over" from the original connection to start the install?
Comment 20 Mark Post 2022-01-04 17:15:07 UTC
(In reply to Sarah Kriesch from comment #1)
> Log files are ending after the reboot. The connection has worked, but it is
> disconnecting after the try to start Yast.

This would imply that the kernel and bootloader are working OK, and something is wrong with YaST.
Comment 21 Mark Post 2022-01-04 17:15:59 UTC
(In reply to Stefan Hundhammer from comment #5)
> This looks like a problem:
> 
> https://openqa.opensuse.org/tests/2093164#step/reconnect_mgmt_console/7
> 
>   'Performing \'kexec -e -x\'01: HCPGSP2629I The virtual machine is placed
> in CP mode',
>   ' due to a SIGP stop from                                                 
> ',
>   ' CPU 00.                                                                 
> '

It's not a problem, per se. Whatever is going wrong is happening after this.
Comment 22 Mark Post 2022-01-04 18:55:45 UTC
(In reply to Sarah Kriesch from comment #10)
> That is the IBM documentation for this error type with z/VM:
> https://www.ibm.com/docs/en/zvm/7.1?topic=messages-hcp2629i
> 
> But we do not use z/VM in this case...

You most certainly are using z/VM. CP (z/VM's Control Program) is the source of that HCP message.
Comment 23 Mark Post 2022-01-04 22:43:56 UTC
I just finished a test install of openSUSE-Tumbleweed-DVD-s390x-Build2380.4-Media. The install went just fine, the reboot seemed to go OK, but then after the kexec command, nothing appeared to be happening.

I started a z/VM instruction trace, and what I got was this:
CP TR I RUN
00: HCPTRI1027I An active trace set has turned RUN off.
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1

That went on as long as I was willing to let it. So, this definitely looks like a kernel problem of some sort. The relevant part of the System.map file looks like this:
000000000187cf68 b inet_rcv_compat
000000000187cf70 b sock_diag_handlers
000000000187d0e0 b broadcast_wq
000000000187d0e8 B reuseport_lock
000000000187d0ec b fib_notifier_net_id

So, the kernel is somewhere in the sock_diag_handlers routine.
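
The address-to-symbol lookup done by hand above can be reproduced with a small Python sketch (the helper names are made up for this example). It finds the last System.map entry at or below the traced address. Note that in nm-style listings the type letters "b" and "B" denote symbols in the BSS (uninitialized data) section, which is worth keeping in mind when interpreting the result.

```python
import bisect

SYSTEM_MAP = """\
000000000187cf68 b inet_rcv_compat
000000000187cf70 b sock_diag_handlers
000000000187d0e0 b broadcast_wq
000000000187d0e8 B reuseport_lock
000000000187d0ec b fib_notifier_net_id
"""

def parse_system_map(text):
    """Return a sorted list of (address, type, name) tuples."""
    entries = []
    for line in text.splitlines():
        addr, sym_type, name = line.split()
        entries.append((int(addr, 16), sym_type, name))
    return sorted(entries)

def lookup(entries, address):
    """Return (name, type, offset) of the last symbol at or below address."""
    addresses = [entry[0] for entry in entries]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None
    base, sym_type, name = entries[index]
    return name, sym_type, address - base

entries = parse_system_map(SYSTEM_MAP)
name, sym_type, offset = lookup(entries, 0x187D074)
print(f"{name}+{offset:#x} (type {sym_type})")
```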

If I'm reading the z/Architecture Principles of Operation correctly, that instruction is saying "Branch to the location 0 bytes relative from here." So, an infinite loop.
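
That reading can be checked with a short Python sketch that decodes the traced bytes as a RIL-format instruction (8-bit opcode, 4-bit register, 4-bit opcode extension, signed 32-bit immediate counted in halfwords). The function name is made up for this example.

```python
def decode_ril_branch(insn_hex: str, address: int):
    """Decode a 6-byte RIL-format instruction; return (opcode, r1, target)."""
    raw = int(insn_hex, 16)
    op_high = (raw >> 40) & 0xFF   # first opcode byte, 0xC0 here
    r1 = (raw >> 36) & 0xF         # register operand (link register for BRASL)
    op_low = (raw >> 32) & 0xF     # opcode extension nibble, 0x5 for BRASL
    i2 = raw & 0xFFFFFFFF          # 32-bit relative immediate
    if i2 & 0x80000000:            # sign-extend to a Python int
        i2 -= 1 << 32
    target = address + 2 * i2      # the immediate counts halfwords
    return ((op_high << 4) | op_low), r1, target

opcode, r1, target = decode_ril_branch("C0E500000000", 0x187D074)
print(f"opcode={opcode:#x} r1={r1} target={target:#x}")
```

With an immediate of zero the target equals the instruction's own address, which matches the endlessly repeating trace line.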
Comment 24 Sarah Kriesch 2022-01-05 10:27:24 UTC
(In reply to Mark Post from comment #22)
> (In reply to Sarah Kriesch from comment #10)
> > That is the IBM documentation for this error type with z/VM:
> > https://www.ibm.com/docs/en/zvm/7.1?topic=messages-hcp2629i
> > 
> > But we do not use z/VM in this case...
> 
> You most certainly are using z/VM. CP (z/VM's Control Program) is the source
> of that HCP message.

That was my mistake. I was thinking of the new containerized test environment for openQA and forgot at that moment that z/VM is its foundation. Bugzilla does not offer a way to remove wrong entries. Sorry for that.
Comment 25 Sarah Kriesch 2022-01-05 10:33:51 UTC
(In reply to Mark Post from comment #23)
> I just finished a test install of
> openSUSE-Tumbleweed-DVD-s390x-Build2380.4-Media. The install went just fine,
> the reboot seemed to go OK, but then after the kexec command, nothing
> appeared to be happening.
...
> That went on as long as I was willing to let it. So, this definitely looks
> like a kernel problem of some sort. The relevant part of the System.map file
> looks like this:
> 000000000187cf68 b inet_rcv_compat
> 000000000187cf70 b sock_diag_handlers
> 000000000187d0e0 b broadcast_wq
> 000000000187d0e8 B reuseport_lock
> 000000000187d0ec b fib_notifier_net_id
> 
> So, the kernel is somewhere in the sock_diag_handlers routine.
> 
> If I'm reading the z/Architecture Principles of Operation correctly, that
> instruction is saying "Branch to the location 0 bytes relative from here."
> So, an infinite loop.

Then I switch it from bootloader to kernel.
I had a discussion with Stefan Hundhammer about what will be done with yast.ssh.
Comment 26 Sarah Kriesch 2022-01-05 10:39:13 UTC
(In reply to Mark Post from comment #23)
> I just finished a test install of
> openSUSE-Tumbleweed-DVD-s390x-Build2380.4-Media. The install went just fine,
> the reboot seemed to go OK, but then after the kexec command, nothing
> appeared to be happening.
...
> That went on as long as I was willing to let it. So, this definitely looks
> like a kernel problem of some sort. The relevant part of the System.map file
> looks like this:
> 000000000187cf68 b inet_rcv_compat
> 000000000187cf70 b sock_diag_handlers
> 000000000187d0e0 b broadcast_wq
> 000000000187d0e8 B reuseport_lock
> 000000000187d0ec b fib_notifier_net_id
> 
> So, the kernel is somewhere in the sock_diag_handlers routine.
> 
> If I'm reading the z/Architecture Principles of Operation correctly, that
> instruction is saying "Branch to the location 0 bytes relative from here."
> So, an infinite loop.

Then I am switching it to a kernel bug.
I had a discussion with Stefan Hundhammer about what happens during yast.ssh.
Several processes are involved in finishing the installation (including starting the bootloader).
As he explains (this should be the important part):

> For an ssh installation, you ssh to the machine, then invoke "yast.ssh".
> That starts the installation workflow which then runs the normal way until
> packages are installed, Dracut built the initrd, some more one-time
> initializations are done; and then, as the final step, it boots the newly
> installed system.
> 
> In some cases, it might start a "firstboot" workflow, but that's something
> very exotic, usually done only for certain business products that need to do
> more setup after the base installation is done.
> 
> OpenQA however will always try to monitor that boot to test if everything
> went as expected; so it pings the new system and tries to establish a
> connection for those additional tests in the installed system. And that is
> what went wrong here: It could not connect.
> 
> I suspected that the system didn't even boot correctly, and that seems to be
> confirmed with the message from comment #5.
Comment 27 Mark Post 2022-01-06 00:40:08 UTC
After thinking about this some more, I realized that the kernel that is used to run grub2 is the exact same kernel that grub2 tries to boot with the kexec command. So, I played around with the zipl config so that grub would not be started. The system came up fine. By using the z/VM CP TRACE command, I could tell that the routine that seems to be involved in this, was not executed during the boot process.

So, something about grub being started and then kexec being called is causing this problem. I have no idea what that might be.
Comment 28 Sarah Kriesch 2022-01-06 14:08:40 UTC
Thank you for your thorough analysis and work!
So my initial suspicion of the bootloader was correct? I was a little surprised that openSUSE did not come up, because the Linux kernel looked as if it had started correctly.

Can we solve this problem on our own, or do we need support from IBM?
Besides this issue, something similar (a file system issue instead) has happened with SLES. Is there any relationship between these bugs? As a community member, I cannot see the status of that bug from outside...
Comment 29 Mark Post 2022-01-06 20:53:45 UTC
I don't know if IBM will be willing to help. SUSE added grub2 into the boot process, not IBM. So if the problem is somewhere in grub, that's all on us.

Is the grub package in Tumbleweed from openSUSE, or from SLES? It might be worth experimenting with an older version to see if that helps at all.
Comment 30 Sarah Kriesch 2022-01-07 10:43:24 UTC
Grub2 is a core component provided/maintained by SUSE also for openSUSE:
https://build.opensuse.org/package/show/Base:System/grub2

If I have got issues or questions in this direction, I would ask Ruediger Oertel
or one of the maintainers of Grub2.
Comment 31 Sarah Kriesch 2022-01-07 10:47:28 UTC
Perhaps Michael Chang is the better choice for support in this case.
Comment 32 Mark Post 2022-01-10 05:40:32 UTC
Created attachment 855094 [details]
Tarball with contents of /proc/config.gz from both a SLES15 SP4 system, and a Tumbleweed system

OK, things keep getting weirder. I re-ran the Tumbleweed installation, and afterwards it failed to reboot, as expected.

I booted the system from another disk that had SLES installed on it, and then
replaced the kernel and initrd in /boot/zipl/ with files from a SLES15 SP4 system, re-ran zipl, and then tried to boot Tumbleweed again. It worked just fine.

So, using image-5.14.21-150400.3-default and initrd-5.14.21-150400.3-default from SLES15 SP4 I was able to start grub2, and then boot the image-5.15.8-1-default and initrd-5.15.8-1-default from Tumbleweed. Using image-5.15.8-1-default and initrd-5.15.8-1-default from Tumbleweed does not succeed in getting to the point of starting grub2. I think that means we're back to looking at the kernel, and not the bootloader.

I'm attaching a tarball that has the /proc/config.gz file from both the Tumbleweed system, and the SLES15 SP4 system that I used to copy the kernel and initrd. Hopefully there might be some hint in there as to what might be causing this.
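
Comparing the two configurations by hand is tedious, so a small Python sketch like the following could list the differing options. The helper names are made up, and the sample option values below are invented for illustration; they are not taken from the attached tarball.

```python
import gzip

def parse_kernel_config(text):
    """Parse kernel config text into {option: value}; unset options map to 'n'."""
    options = {}
    suffix = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            key, value = line.split("=", 1)
            options[key] = value
        elif line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
    return options

def load_config(path):
    """Read a gzip-compressed config such as a copy of /proc/config.gz."""
    with gzip.open(path, "rt") as handle:
        return parse_kernel_config(handle.read())

def diff_configs(first, second):
    """Return {option: (value_in_first, value_in_second)} for differing options."""
    keys = sorted(set(first) | set(second))
    return {k: (first.get(k), second.get(k))
            for k in keys if first.get(k) != second.get(k)}

# Invented sample values, only to show the output shape.
sles = parse_kernel_config("CONFIG_KEXEC=y\nCONFIG_EXAMPLE_A=y\n")
tumbleweed = parse_kernel_config("CONFIG_KEXEC=y\n# CONFIG_EXAMPLE_A is not set\nCONFIG_EXAMPLE_B=m\n")
for option, (old, new) in diff_configs(sles, tumbleweed).items():
    print(option, old, "->", new)
```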
Comment 33 Sarah Kriesch 2022-01-10 20:16:23 UTC
We discussed this topic today in the kick-off meeting of the Linux Distributions Working Group of the Open Mainframe Project, because Debian also has boot problems after package installation.

IBM hinted that this problem also exists in Fedora Rawhide and that the issue there is based on systemd:
https://bugzilla.redhat.com/show_bug.cgi?id=1986176

They did the same analysis, additionally found a kernel bug, and took it upstream that way.

Perhaps I should switch it to Basesystem.
Comment 34 openQA Review 2022-01-25 00:35:34 UTC
This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: install_ltp_s390x
https://openqa.opensuse.org/tests/2107785

To prevent further reminder comments one of the following options should be followed:
1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
3. The bugref in the openQA scenario is removed or replaced, e.g. `label:wontfix:boo1234`
Comment 35 Sarah Kriesch 2022-01-26 07:57:15 UTC
I am changing the topic to kernel after the hint that there are similar issues in https://bugzilla.opensuse.org/show_bug.cgi?id=1194501.
Comment 36 Martin Wilck 2022-01-26 08:15:26 UTC
(In reply to Sarah Kriesch from comment #35)
> I change the topic to kernel after the hint with
> https://bugzilla.opensuse.org/show_bug.cgi?id=1194501, that there are equal
> issues.

Could you please explain why you think this is the same issue as bug 1194501? I don't see the words "BTF" or "BPF", which are the reason for the failure in 1194501, anywhere here.
Comment 37 Sarah Kriesch 2022-01-28 17:24:52 UTC
I thought that it could be a similar issue, since openSUSE cannot be booted after the restart.

Info for IBM:
- You can download latest Tumbleweed iso images (not released because of this bug) from here:
https://build.opensuse.org/package/binaries/openSUSE:Factory:zSystems/000product:openSUSE-dvd5-dvd-s390x/images


The kernel is here:
https://build.opensuse.org/package/binaries/openSUSE:Factory:zSystems/kernel-default/standard
Comment 38 Sarah Kriesch 2022-01-29 10:41:53 UTC
This bug is preventing us from releasing openSUSE Tumbleweed for s390x and, as our joint meeting showed, it is affecting all major Linux distributions.
Comment 39 LTC BugProxy 2022-02-14 16:00:39 UTC
------- Comment From Ulrich.Weigand@de.ibm.com 2022-02-14 10:54 EDT-------
@Andreas:

This looks like a relocation was not resolved:
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1
00:  -> 000000000187D074   BRASL   C0E500000000 -> 000000000187D074     CC 1

I think we recently had similar problems due to new PIE relocations that were not handled.  Could this be the same root cause?
Comment 40 LTC BugProxy 2022-02-15 21:20:43 UTC
------- Comment From Andreas.Krebbel@de.ibm.com 2022-02-15 16:14 EDT-------
Looks very much like it. I see PLT32DBL relocs all over the place. Since GCC 11.3 these are also generated for local calls, and the kernel kexec code was not able to deal with them.

https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=fixes&id=abf0e8e4ef25478a4390115e6a953d589d1f9ffd
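
To illustrate why an unhandled relocation produces exactly the branch-to-self seen in the trace, here is a Python sketch of the arithmetic for a PC-relative 32-bit "DBL" relocation, (S + A - P) >> 1. The assumptions are labeled in the comments: the callee address is hypothetical, and the addend of 2 is the conventional value for a field that starts 2 bytes into the branch instruction. This is not the kernel's code, only the calculation it has to perform.

```python
def pc32dbl_field(symbol: int, addend: int, place: int) -> int:
    """Value stored by the relocation: (S + A - P) >> 1 as unsigned 32-bit."""
    return ((symbol + addend - place) >> 1) & 0xFFFFFFFF

def brasl_target(insn_address: int, field: int) -> int:
    """Branch target for a BRASL whose immediate field holds `field`."""
    if field & 0x80000000:          # sign-extend the 32-bit field
        field -= 1 << 32
    return insn_address + 2 * field

insn = 0x187D074          # address of the looping instruction from the trace
place = insn + 2          # the immediate starts 2 bytes into the instruction
callee = 0x187E000        # hypothetical address of the intended callee

resolved = pc32dbl_field(callee, 2, place)
print(hex(brasl_target(insn, resolved)))  # relocation applied: reaches callee
print(hex(brasl_target(insn, 0)))         # relocation skipped: branches to itself
```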

We probably should assign this to someone in the kernel team.
Comment 41 LTC BugProxy 2022-02-17 15:20:40 UTC
------- Comment From Ulrich.Weigand@de.ibm.com 2022-02-17 10:17 EDT-------
Thanks Andreas!   I guess that could indeed explain the hang in the kexec.

Mark/Sarah, can you check whether that initial kernel (the one directly loaded by grub2, which then does the kexec to load another kernel) has the patch applied that Andreas mentioned?
Comment 42 Sarah Kriesch 2022-02-17 16:43:11 UTC
Our SUSE kernel developers are on CC and they know more about which patches are included.

@SUSE Kernel Developers: Is this patch included in our Kernel?
https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=fixes&id=abf0e8e4ef25478a4390115e6a953d589d1f9ffd
Comment 43 Takashi Iwai 2022-02-21 09:39:22 UTC
(In reply to Sarah Kriesch from comment #42)
> Our SUSE kernel developers are on CC and they know more about which patches
> are included.
> 
> @SUSE Kernel Developers: Is this patch included in our Kernel?
> https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/
> ?h=fixes&id=abf0e8e4ef25478a4390115e6a953d589d1f9ffd

This commit was included in 5.16-rc6, so the current TW 5.16.x kernel already contains it, at least.

Or was the question about other SLE/Leap kernels?
Comment 44 LTC BugProxy 2022-02-21 10:30:48 UTC
------- Comment From Ulrich.Weigand@de.ibm.com 2022-02-21 05:26 EDT-------
(In reply to comment #15)
> (In reply to Sarah Kriesch from comment #42)
> > Our SUSE kernel developers are on CC and they know more about which patches
> > are included.
> >
> > @SUSE Kernel Developers: Is this patch included in our Kernel?
> > https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/
> > ?h=fixes&id=abf0e8e4ef25478a4390115e6a953d589d1f9ffd
>
> This commit was included in 5.16-rc6, so the current TW 5.16.x kernel
> already contains it, at least.
>
> Or was the question about other SLE/Leap kernels?

The question was about the initial kernel loaded by grub2 (which then performs the kexec to the final kernel) - see the comment on 2021-12-27 16:59:16.  I'm not sure how exactly this initial boot process is set up in OpenSUSE, but if the root cause of the problem is that kexec bug described above, then the specific question is whether whatever kernel is used to perform that kexec has the fix or not.
Comment 45 Sarah Kriesch 2022-02-21 11:00:59 UTC
We are speaking about Tumbleweed. This issue has been happening in openSUSE Tumbleweed since around Christmas.
It seems that the patch was included in our repo 28 days ago:
https://github.com/openSUSE/kernel/commit/f3b7e73b2c6619884351a3a0a7468642f852b8a2

It then has to be built and included in Tumbleweed.
I have to check whether it is fixed. Thank you all for the support!
(Another openSUSE bug with s390-tools is blocking us at the moment. We are working on it.)
Comment 46 Sarah Kriesch 2022-02-23 09:59:51 UTC
The restart is still failing, and it seems that there are more problems with the kernel.

In the logs (memsample.zcat) I can identify this part of the process list:
075 ?           15416    14407     1008     2192     1200  1645     sshd: root@pts/3
 2081 pts/3        8472     7424     1047     1548     1936  2075       -bash
 2118 pts/3        4540     3492     1047     1092      612  2081         /bin/bash /sbin/yast.ssh
 2119 pts/3        7628     6580     1047     1544     1132  2118           /bin/bash /usr/lib/YaST2/startup/YaST2.First-Stage
 3267 pts/3        7768     6720     1047     1544     1272  2119             /bin/bash /usr/lib/YaST2/startup/YaST2.call installation initial
 3688 pts/3      678452   678448        3   260064   350692  3267               /usr/bin/ruby.ruby3.1 --encoding=utf-8 /usr/lib/YaST2/bin/y2start installation --arg initial ncurses
43560 pts/3           0        0        0        0        0  3688                 [get_kernel_vers] <defunct>
43627 pts/3           0        0        0        0        0  3688                 [get_kernel_vers] <defunct>
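
The defunct entries in such a listing can be picked out with a small Python sketch. It assumes, based on the tree shown above, that the first column is the PID and the eighth column is the parent PID; the meaning of the numeric columns in between is not needed here. The function name is made up for this example. Defunct children only mean that the parent has not yet collected their exit status, so they are not necessarily the cause of a failure.

```python
SAMPLE = """\
 3688 pts/3      678452   678448        3   260064   350692  3267               /usr/bin/ruby.ruby3.1 --encoding=utf-8 /usr/lib/YaST2/bin/y2start installation --arg initial ncurses
43560 pts/3           0        0        0        0        0  3688                 [get_kernel_vers] <defunct>
43627 pts/3           0        0        0        0        0  3688                 [get_kernel_vers] <defunct>
"""

def find_defunct(text):
    """Return (pid, ppid, parent_command) for every <defunct> process."""
    commands = {}
    zombies = []
    for line in text.splitlines():
        fields = line.split(None, 8)   # 8 columns, then the command
        if len(fields) < 9 or not fields[0].isdigit() or not fields[7].isdigit():
            continue
        pid, ppid, command = int(fields[0]), int(fields[7]), fields[8].strip()
        commands[pid] = command
        if command.endswith("<defunct>"):
            zombies.append((pid, ppid))
    return [(pid, ppid, commands.get(ppid, "?")) for pid, ppid in zombies]

for pid, ppid, parent in find_defunct(SAMPLE):
    print(pid, "is defunct; parent", ppid, "runs", parent.split()[0])
```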

Mark or IBM: We need somebody with mainframe system access to reproduce and debug the kernel in this case now. 

You can find the download links in https://bugzilla.opensuse.org/show_bug.cgi?id=1193889#c37
Comment 47 Sarah Kriesch 2022-02-23 10:00:52 UTC
The new openQA result is here: https://openqa.opensuse.org/tests/2204125#step/reconnect_mgmt_console/9
Comment 48 Sarah Kriesch 2022-02-23 16:15:20 UTC
Created attachment 856495 [details]
dmesg
Comment 49 Sarah Kriesch 2022-02-23 16:15:45 UTC
Created attachment 856496 [details]
messages
Comment 50 Sarah Kriesch 2022-02-23 16:16:18 UTC
Created attachment 856497 [details]
memsample
Comment 51 LTC BugProxy 2022-02-23 16:50:36 UTC
------- Comment From Ulrich.Weigand@de.ibm.com 2022-02-23 11:42 EDT-------
From what I can see, this new problem doesn't have anything to do with the kernel, at least not directly.  It's simply a user-space program crashing with a divide-by-zero exception it doesn't recover from.  (From the code and register contents shown, it does indeed divide by zero at the point of the exception.)
Comment 52 Mark Post 2022-02-23 19:09:32 UTC
I'm assuming you're referring to this:
[ T3679] User process fault: interruption code 0009 ilc:2 in libc.so.6[3ff8f900000+1be000]
[ T3679] CPU: 1 PID: 3679 Comm: read_values Not tainted 5.16.10-1-default #1 openSUSE Tumbleweed b7a195a619e720887b07574f94894806a00f4b63
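
For reference, the fault line can be decoded with a short Python sketch. In the z/Architecture list of program interruption codes, 0009 is the fixed-point divide exception, which is consistent with the divide-by-zero diagnosis in comment 51. The table below is deliberately partial, and the function name is made up for this example.

```python
import re

# Partial table of z/Architecture program interruption codes (not exhaustive).
PROGRAM_INTERRUPTION_CODES = {
    0x0001: "operation",
    0x0004: "protection",
    0x0005: "addressing",
    0x0006: "specification",
    0x0007: "data",
    0x0008: "fixed-point overflow",
    0x0009: "fixed-point divide",
}

FAULT_RE = re.compile(
    r"User process fault: interruption code (?P<code>[0-9a-fA-F]{4}) "
    r"ilc:(?P<ilc>\d+) in (?P<object>[^\[\s]+)"
    r"\[(?P<base>[0-9a-fA-F]+)\+(?P<size>[0-9a-fA-F]+)\]"
)

def parse_user_fault(line):
    """Extract code, instruction length code, and mapping from a fault line."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    code = int(match["code"], 16)
    return {
        "code": code,
        "meaning": PROGRAM_INTERRUPTION_CODES.get(code, "unknown"),
        "ilc": int(match["ilc"]),
        "object": match["object"],
        "base": int(match["base"], 16),
        "size": int(match["size"], 16),
    }

line = ("[ T3679] User process fault: interruption code 0009 ilc:2 "
        "in libc.so.6[3ff8f900000+1be000]")
print(parse_user_fault(line))
```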

If so, that command is part of the s390-tools package, and the source code for it hasn't been changed in quite a while. Sigh.

But, I don't know if that is what is causing the problem. The installation process continues for another 6 minutes or so. I don't know if the end of the log means there was no more logging, or something else.

I'll take a look, just to rule in or rule out that read_values is causing a problem.
Comment 53 Mark Post 2022-02-23 19:49:26 UTC
Just an update... I still don't know if read_values is causing the problem, but it definitely does have a problem:
# read_values -s
Floating point exception

Copying that file to a working SLES15 SP2 system, and I get this:
# ./read_values -s
read_values: dl-call-libc-early-init.c:37: _dl_call_libc_early_init: Assertion `sym != NULL' failed.
Aborted (core dumped)

Time to break out gdb I guess.
Comment 54 Mark Post 2022-02-24 19:16:46 UTC
I think I've figured out what the problem is, although not its cause. It appears that the guest is not bringing up the enc800 interface. It seems that NetworkManager is having issues.

This may be because NetworkManager is trying to use dhclient to get an IP address for the interface, instead of setting a static IP and route. I have no idea why this might be happening, but perhaps someone else will.

So, over to the NetworkManager people to figure this out.
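
One way to check which of the two behaviours a connection profile asks for is to look at the `method` key in the `[ipv4]` section of a NetworkManager keyfile: `manual` means static addressing, `auto` means DHCP. The Python sketch below assumes the profile is stored in keyfile format; the profile contents and addresses are invented for illustration and are not taken from the affected system, and the fallback to `auto` for a missing key is likewise an assumption of this sketch.

```python
import configparser

# Invented example profile; the addresses come from a documentation range.
STATIC_PROFILE = """\
[connection]
id=enc800
type=ethernet
interface-name=enc800

[ipv4]
method=manual
address1=192.0.2.10/24,192.0.2.1
"""

def ipv4_method(keyfile_text: str) -> str:
    """Return the ipv4 method of a keyfile profile ('auto' if not present)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(keyfile_text)
    return parser.get("ipv4", "method", fallback="auto")

print(ipv4_method(STATIC_PROFILE))
```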

In the meantime, the problem with read_values is puzzling, but not contributing to this problem. I will keep looking at it, however, since rebuilding it from source on the system under test results in a working binary.
Comment 55 Mark Post 2022-02-24 21:38:21 UTC
Hmm. I don't know if this means anything, but I installed s390-tools-2.19.0-3.1 from openSUSE:Factory:zSystems, and read_values now works as expected.

I would pursue this further, but first I would need to know how to get my hands on all the various debuginfo and debugsource RPMs for things like s390-tools, qclib, glibc, etc.
Comment 56 Sarah Kriesch 2022-02-26 15:50:57 UTC
@Stefan Hundhammer:
I found only this small error message in the y2log:
2022-02-26 03:09:34 <1> s390linux146(3679) [Ruby] clients/inst_finish.rb(block (2 levels) in report_hooks):208 Hook file: /var/lib/YaST2/hooks/installation/before_instsys_cleanup_10_zram_swap
2022-02-26 03:09:34 <1> s390linux146(3679) [Ruby] clients/inst_finish.rb(block (2 levels) in report_hooks):209 Hook output: STDERR: ++ wc -l
+ '[' 3 '!=' 2 ']'
+ swapoff /dev/zram1
+ echo 1

Can you tell me what exactly is being counted here and does not match?
Thank you!
Comment 57 Sarah Kriesch 2022-03-07 10:41:30 UTC
Thank you to everyone involved for the collaboration!
This bug is resolved: Berthold identified that the static IP address was not being used after the restart, and that was fixed with bug #1196582. We have a new Tumbleweed release now.
Comment 58 Ihno Krumreich 2022-04-04 21:39:33 UTC
Verified.