Bug 1026344 - [Build 20170220] openQA test fails in grub_test
[Build 20170220] openQA test fails in grub_test
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Installation
Current
Other Other
: P2 - High : Major (vote)
: ---
Assigned To: Michael Chang
Jiri Srain
http://openqa.opensuse.org/tests/3596...
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2017-02-21 19:49 UTC by Dominique Leuenberger
Modified: 2021-09-23 18:36 UTC (History)
5 users (show)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
Proposed upstream patch (2.42 KB, patch)
2017-02-24 06:45 UTC, Andrei Borzenkov
Details | Diff
Proper fix for this problem (856 bytes, patch)
2017-02-24 08:31 UTC, Andrei Borzenkov
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Dominique Leuenberger 2017-02-21 19:49:39 UTC
This is marked as blocker (holding back snapshots for TW until understood)

A system, using UEFI, installed from the DVD image copied on a USB stick results in a non-booting system. Other scenarios with UEFI seem to work - but as our coverage for UEFI is rather low, the risk of releasing this snapshot appears to high


We need at least an analysis to understand what is going on - why this scenario fails.

Please CC relevant people you can think of

## Observation

openQA test in scenario opensuse-Tumbleweed-DVD-x86_64-uefi@USBboot_64 fails in
[grub_test](http://openqa.opensuse.org/tests/359620/modules/grub_test/steps/4)


## Reproducible

Fails since (at least) Build [20170220](http://openqa.opensuse.org/tests/359267)


## Expected result

Last good: [20170219](http://openqa.opensuse.org/tests/358222) (or more recent)


## Further details

Always latest result in this scenario: [latest](http://openqa.opensuse.org/tests/latest?arch=x86_64&version=Tumbleweed&test=uefi&flavor=DVD&machine=USBboot_64&distri=opensuse)
Comment 1 Dominique Leuenberger 2017-02-21 19:51:37 UTC
CC Gary - your knowledge about UEFI can surely be of help
Comment 2 Dominique Leuenberger 2017-02-21 21:48:04 UTC
I suspect this submission to have caused it:

https://build.opensuse.org/request/show/458606

it has been reverted for now - further debugging pending (but with hackweek I suspect people want to give this, rightfully, a lower prio; hence the revert)
Comment 3 Gary Ching-Pang Lin 2017-02-22 02:10:58 UTC
Hi Michael,

Could you help me to check my request for shim-leap?
https://build.opensuse.org/request/show/458606

It's based on your submission:
https://build.opensuse.org/request/show/457353

Maybe I missed something.
Comment 4 Michael Chang 2017-02-22 04:16:19 UTC
Hi Dominique,

Would you confirm reverting that submission fixed it ? I find no obvious clue that would cause the failure .. 

Thanks.
Comment 5 Dominique Leuenberger 2017-02-22 07:08:56 UTC
Hi Michael,

a new snapshot should be ready for openQA shortly - I will monitor it and report back; neither did I see anything obviosiy wrong with this SR, but nevertheless it was about the only thing I could find that would impact UEFI boot, checked in in the last days.

Of course if this won't fix things, I'll revert the revert...

Stay tuned for results
Comment 6 Dominique Leuenberger 2017-02-22 09:09:13 UTC
https://openqa.opensuse.org/tests/359909#step/grub_test/4

Test still fails; so somewhat good, the sr in question did not break it

bad thing: we still have a snapshot that does not have working uefi when installed from dvdimage on usb stick
Comment 7 Dominique Leuenberger 2017-02-22 09:15:15 UTC
Michael,

I  just see that 0220 was also the snapshot that got the latest grub2 changes implemented, based on sr 458616

Potential candidate?
Comment 8 Dominique Leuenberger 2017-02-22 10:12:13 UTC
https://build.opensuse.org/request/show/458616

This did touch some efi parts at least according to the .changes file:
Comment 9 Michael Chang 2017-02-22 11:25:56 UTC
Hi Dominique,

Is it possible to access failed vm remotely ? Thanks.
Comment 10 Dominique Leuenberger 2017-02-22 11:33:18 UTC
(In reply to Michael Chang from comment #9)
> Hi Dominique,
> 
> Is it possible to access failed vm remotely ? Thanks.

no; openQA disposes all VMs post testing; the easiest is to grab an iso from openQA and perform a quick local install.

Ensure to use the same params though as openQA does for the tests, specifically to use the iso as usb stick
Comment 11 Michael Chang 2017-02-23 03:14:12 UTC
Looked recorded video on openQA, the attempt to "Boot from Hard Disk" somehow ended up with staying in grub command line. It could be telling us that "chainloader" actually successfully in loading target grub.efi on hard disk, but after that the chainloaded grub.efi on the disk cannot find it's config and forth to proceed.

That actually confused me because the (reverted) sr should not cause the side effect. It did have some changes to chainloader, but is more to the point to get it "work", and in this case it does work in loading and running efi image. It seems more likely a regression on how it passed over needed parameters to the loaded images, in which the change did not reveal ever done that.

So, it turns out to be this commit upstream:

 http://git.savannah.gnu.org/cgit/grub.git/commit/?id=ce95549cc54b5d6f494608a7c390dba3aab4fba7

   fp->path_name[size++] = '\0';
   fp->header.length = size * sizeof (grub_efi_char16_t) + sizeof (*fp);

The outcome could be a non null terminated for utf-16, depending on what remains in last char of fp->path_name. If is also '\0', we are fine, but else would lead to trouble as it constructs a (random) utf-16 character in the end, which would result in either loading or loaded image to fail in looking up image or determine the boot location (in Secure Boot Enabled loading path ..)
Comment 12 Dominique Leuenberger 2017-02-23 11:26:28 UTC
Thanks Michael

I wonder how this surface even now then, as this must have been in TW for a while already - but openQA only recently tripped over this test scenario

I also tested reverting grub2 to rev 153 (the one just before the submission mentioned) and openQA still stumbles over this issue.

Any idea how to progress short-term to get TW snapshots out again? Currently we're blocked on this issue
Comment 13 Michael Chang 2017-02-23 14:59:24 UTC
(In reply to Dominique Leuenberger from comment #12)
> Thanks Michael
> 
> I wonder how this surface even now then, as this must have been in TW for a
> while already - but openQA only recently tripped over this test scenario

The change was introduced in 2.02~rc1, and we recently updated to that version. 

> I also tested reverting grub2 to rev 153 (the one just before the submission
> mentioned) and openQA still stumbles over this issue.

I think it did not go beyond to revert the 2.02~rc1 version ..

> Any idea how to progress short-term to get TW snapshots out again? Currently
> we're blocked on this issue

As a temporary fix, we could include a revert patch to that upstream commit ce95549cc54b5d6f494608a7c390dba3aab4fba7 and see if it pass the staging test. Still we could revert that revert patch if it did not solve the problem at all.

I will create SR for that shortly.
Comment 14 Dominique Leuenberger 2017-02-23 15:45:32 UTC
(In reply to Michael Chang from comment #13)
> (In reply to Dominique Leuenberger from comment #12)
> > Thanks Michael
> > 
> > I wonder how this surface even now then, as this must have been in TW for a
> > while already - but openQA only recently tripped over this test scenario
> 
> The change was introduced in 2.02~rc1, and we recently updated to that
> version. 

the RC1 update was in TW since snapshot 0213 - Still make me wonder why we only saw an issue since 0220
Comment 15 Andrei Borzenkov 2017-02-23 15:56:12 UTC
(In reply to Dominique Leuenberger from comment #10)
> no; openQA disposes all VMs post testing; the easiest is to grab an iso from
> openQA and perform a quick local install.
> 
> Ensure to use the same params though as openQA does for the tests,
> specifically to use the iso as usb stick

Could you give some pointers how to do it?

(In reply to Michael Chang from comment #11)
> 
> So, it turns out to be this commit upstream:
> 
>  http://git.savannah.gnu.org/cgit/grub.git/commit/
> ?id=ce95549cc54b5d6f494608a7c390dba3aab4fba7
> 
>    fp->path_name[size++] = '\0';
>    fp->header.length = size * sizeof (grub_efi_char16_t) + sizeof (*fp);
> 
> The outcome could be a non null terminated for utf-16, depending on what
> remains in last char of fp->path_name. If is also '\0', we are fine, but
> else would lead to trouble as it constructs a (random) utf-16 character in
> the end,

Well, fp->path_name is grub_uint16_t, not char, so we assign 16 bit zero here.
Comment 16 Dominique Leuenberger 2017-02-23 16:02:54 UTC
(In reply to Andrei Borzenkov from comment #15)
> (In reply to Dominique Leuenberger from comment #10)
> > no; openQA disposes all VMs post testing; the easiest is to grab an iso from
> > openQA and perform a quick local install.
> > 
> > Ensure to use the same params though as openQA does for the tests,
> > specifically to use the iso as usb stick
> 
> Could you give some pointers how to do it?

Sure; the test run linked in comment #0 (https://openqa.opensuse.org/tests/359620#step/grub_test/4) contains all this info. On the tab 'Logs & Assets' you can find a file called autoinst.log

This contains the information about the parameters used to start qemu:

/usr/bin/qemu-kvm -serial file:serial0 -soundhw ac97 -global isa-fdc.driveA= -vga cirrus -m 1024 -cpu qemu64 -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -device virtio-scsi-pci,id=scsi0 -device virtio-blk,drive=hd1 -drive file=raid/l1,cache=unsafe,if=none,id=hd1,format=qcow2 -drive if=none,id=usbstick,file=/var/lib/openqa/share/factory/iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170220-Media.iso,snapshot=on -device usb-ehci,id=ehci -device usb-storage,bus=ehci.0,drive=usbstick,id=devusb,bootindex=0 -drive if=pflash,format=qcow2,file=ovmf.bin -device usb-ehci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :94,share=force-shared -qmp unix:qmp_socket,server,nowait -monitor unix:hmp_socket,server,nowait -S -monitor telnet:127.0.0.1:20042,server,nowait
Comment 17 Andrei Borzenkov 2017-02-23 16:14:08 UTC
(In reply to Dominique Leuenberger from comment #16)
>file=/var/lib/openqa/share/factory/iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170220-Media.iso

How can I fetch this ISO? I try http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-i586-x86_64-Snapshot20170220 but get "Forbidden".
Comment 18 Dominique Leuenberger 2017-02-23 16:22:31 UTC
(In reply to Andrei Borzenkov from comment #17)
> (In reply to Dominique Leuenberger from comment #16)
> >file=/var/lib/openqa/share/factory/iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170220-Media.iso
> 
> How can I fetch this ISO? I try
> http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-i586-x86_64-
> Snapshot20170220 but get "Forbidden".

Take the newer one: https://openqa.opensuse.org/tests/360632/asset/iso/openSUSE-Tumbleweed-DVD-x86_64-Snapshot20170222-Media.iso

the old test has 'expired' as newer snapshots were tested (with the same failure)

note: the newer snapshot had grun2 reverted to rev 152 and shim-leap reverted (those were the two that were assumed to cause the issue in snapshot 0220)
Comment 19 Andrei Borzenkov 2017-02-24 06:18:39 UTC
We get wrong $cmdpath with filename (only directory)

1
2
3
/efi/opensuse

I cannot easily test it (as it requires changing bootloader in ISO) but if reverting my commit fixes it, it sounds like firmware bug really. I think the only proper fix here is to use single File Path node.
Comment 20 Andrei Borzenkov 2017-02-24 06:45:49 UTC
Created attachment 715356 [details]
Proposed upstream patch

@Michael, as far as I can test it, this patch works (after reverting your revert patch). WDYT? It was never clear to me why we use two File Path nodes in the first place.
Comment 21 Michael Chang 2017-02-24 06:48:19 UTC
I didn't catch that, sorry. That's indeed a very cursory inspection from me as we are working on our own hackweek project (only once or twice a year, so people usually dont want to spend time on other stuff.)

But neverthelast, it's a blocker awaiting a (rather urgent) fix and also much likely a rc1 regression, reverting that patch as it also seems to be the only notable changes between beta2 and rc1, so that's the best option for me to get things going.
Comment 22 Michael Chang 2017-02-24 06:57:19 UTC
Oh My previous comment was meat to reply comment#18.

(In reply to Andrei Borzenkov from comment #20)
> Created attachment 715356 [details]
> Proposed upstream patch
> 
> @Michael, as far as I can test it, this patch works (after reverting your
> revert patch). WDYT? It was never clear to me why we use two File Path nodes
> in the first place.

Actually I just setup a test and came to the same conclusion with you (I'm not really confident a firmware issue though, but the cmdpath did really go wrong, and reverting that patch helps). Using single device path is definitely helpful here, as its less confused to process in general ...

Anyway if that works for you, feel free to submit, the upstream could continue as well.

Thanks.
Comment 23 Michael Chang 2017-02-24 06:58:35 UTC
(In reply to Michael Chang from comment #22)
> Oh My previous comment was meat to reply comment#18.

Sorry again, comment #15.
Comment 24 Michael Chang 2017-02-24 07:09:01 UTC
(In reply to Andrei Borzenkov from comment #20)
> Created attachment 715356 [details]
> Proposed upstream patch
> 
> @Michael, as far as I can test it, this patch works (after reverting your
> revert patch). WDYT? It was never clear to me why we use two File Path nodes
> in the first place.

Maybe firmware would just concatenating the FilePath Nodes without taking care the null string, in this case like

 /efi/opensuse\0/grub2.efi\0

But who really knows. :)
Comment 25 Andrei Borzenkov 2017-02-24 08:31:50 UTC
Created attachment 715365 [details]
Proper fix for this problem

(In reply to Michael Chang from comment #22)
> I'm not really confident a firmware issue though

You are right, grub did not strip final NULL when building load image path.
Comment 26 Dominique Leuenberger 2017-02-24 11:00:39 UTC
Just a note: The submission as done yesterday by Michael has been checked in last night (bypassing most of the process) - and test results for a snapshot with thus grub package are in:

https://openqa.opensuse.org/tests/361159 - PASSED

I'm thus removing the 'blocker' flag of the bug - seems TW is unblocked with this workaround
Comment 27 Josef Reidinger 2017-02-28 09:54:42 UTC
ok, reassigning to mchang, as it do not look like yast issue ( if I am wrong, please write what exactly is wrong and reassign back ).
Thanks
Comment 28 Michael Chang 2017-03-01 15:22:29 UTC
Hi Andrei,

Please check srid#461364, I forwarded your upstream patch to openSUSE Factory.
Thanks.
Comment 30 Michael Chang 2017-03-02 08:05:34 UTC
Case closed.