Bug 1186256 - qemu-linux-user: hardcoded binfmt handler doesn't play well with containers
Status: RESOLVED FIXED
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: KVM
Version: Current
Hardware: Other Other
Importance: P5 - None : Normal
Target Milestone: ---
Assignee: Dario Faggioli
QA Contact: E-mail List
Reported: 2021-05-19 15:16 UTC by Martin Wilck
Modified: 2022-10-27 18:38 UTC

Attachments
Proposed patch for qemu-binfmt-conf.sh (2.29 KB, patch)
2021-05-19 15:16 UTC, Martin Wilck
Proposed patch for qemu-binfmt-conf.sh (2.58 KB, patch)
2021-05-19 15:27 UTC, Martin Wilck

Description Martin Wilck 2021-05-19 15:16:20 UTC
Created attachment 849481
Proposed patch for qemu-binfmt-conf.sh

Since abbc0ce ("qemu-binfmt-conf: use qemu-ARCH-binfmt"),
qemu-binfmt-conf.sh under openSUSE automatically replaces the default qemu binfmt wrapper "qemu-$ARCH" with "qemu-$ARCH-binfmt" in order to ensure that argv[0] is preserved; qemu-$ARCH-binfmt is a link to qemu-binfmt, which is just a simple wrapper that mangles argv to achieve the desired result.
This is a SUSE-specific modification which isn't used upstream.
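For illustration, the kind of argv mangling such a wrapper has to perform can be sketched in shell. This is not the actual SUSE qemu-binfmt source; the function name `mangle_argv` is hypothetical, and the sketch only prints the command line it would exec. It assumes the kernel's "P" (preserve-argv[0]) behaviour, where the interpreter receives the full path of the binary followed by the original argv[0], and it relies on the `-0` option of the qemu user-mode emulators to set argv[0] for the guest program.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the real SUSE wrapper.
# With the binfmt_misc "P" flag the interpreter is invoked as:
#   <wrapper> <full path of binary> <original argv[0]> <remaining args...>
mangle_argv() {
    wrapper=$1   # e.g. /usr/bin/qemu-ppc64le-binfmt
    path=$2      # full path of the foreign binary
    argv0=$3     # argv[0] as typed by the caller
    shift 3
    # Strip the "-binfmt" suffix to find the real emulator.
    emulator=${wrapper%-binfmt}
    # Print (rather than exec) the resulting command line, one word per line.
    printf '%s\n' "$emulator" -0 "$argv0" "$path" "$@"
}
```

A real wrapper would exec the resulting command instead of printing it, which is why the emulator binary has to be reachable at exec time.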

This approach is inconvenient in some situations. In particular for running
foreign-arch containers, it's useful to use the binfmt_misc "F" ("fix
binary") flag to pre-load the qemu wrapper in the kernel. That way,
foreign-arch containers can be run just like native containers, without
having to bind-mount interpreters into the container. But that's impossible
with the SUSE binfmt wrapper that needs to exec() a different (native)
executable.
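As a rough sketch of what such a registration looks like: the kernel's binfmt_misc interface takes a line of the form `:name:type:offset:magic:mask:interpreter:flags`, and the "F" flag is simply part of the last field. The helper names `hex_to_escapes` and `binfmt_line` below are made up for illustration, and the function only prints the line; writing it to /proc/sys/fs/binfmt_misc/register (as root) is left out on purpose.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Turn a plain hex string (as shown in /proc/sys/fs/binfmt_misc/<name>)
# into the \xNN escapes expected in the register string.
hex_to_escapes() {
    printf '%s' "$1" | sed 's/../\\x&/g'
}

# Compose a registration line: :name:type:offset:magic:mask:interpreter:flags
# Type "M" means "match on magic bytes"; the offset is left empty (defaults to 0).
binfmt_line() {
    name=$1 magic=$2 mask=$3 interp=$4 flags=$5
    printf ':%s:M::%s:%s:%s:%s\n' \
        "$name" "$(hex_to_escapes "$magic")" "$(hex_to_escapes "$mask")" \
        "$interp" "$flags"
}
```

With flags "PF" and the plain emulator as interpreter, the kernel opens the emulator at registration time, so nothing needs to exist inside the container. With the -binfmt wrapper as interpreter, only the wrapper would be pre-opened.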

In the openSUSE default mode of qemu-binfmt-conf.sh, the user needs to bind-mount both the -binfmt executable and the actual emulator into the container:

> $ podman run -it --rm \
>       -v /usr/bin/qemu-ppc64le-binfmt:/usr/bin/qemu-ppc64le-binfmt \
>       -v /usr/bin/qemu-ppc64le:/usr/bin/qemu-ppc64le \
>       ppc64le/busybox uname -m
> ppc64le

Otherwise, they get:

> $ podman run -t --rm ppc64le/busybox uname -m
> standard_init_linux.go:219: exec user process caused: no such file or directory

If qemu-binfmt-conf.sh is used with the --persistent flag, qemu-ppc64le-binfmt is loaded into the kernel, but qemu-ppc64le must still be bind-mounted.
If qemu-ppc64le were used directly as the persistent binfmt_misc helper, it would be sufficient to run the container as if it were a native one:

> $ podman run -it --rm  ppc64le/busybox uname -m
> ppc64le

I can see why it makes sense to try to preserve argv[0], but for me at least, the "foreign container" use case is more important. Therefore I'd like to be able to switch the behavior of the qemu binfmt_misc helper back to the upstream default.

So far I've worked around the issue by simply using the upstream container "docker.io/multiarch/qemu-user-static", but I'd like to be able to do this easily with openSUSE on-board tools.

The attached patch allows the user to override the default "-binfmt" suffix by running "qemu-binfmt-conf.sh --qemu-suffix ''".

(Note: "qemu-binfmt-conf.sh -F ''" doesn't work; that's a different issue.)
Comment 1 Martin Wilck 2021-05-19 15:27:08 UTC
Created attachment 849483
Proposed patch for qemu-binfmt-conf.sh
Comment 2 Martin Wilck 2021-05-19 15:28:35 UTC
Regarding "-F": I just posted a patch to qemu-devel, subject "qemu-binfmt-conf.sh: fix -F option".
Comment 3 Martin Wilck 2021-05-19 15:29:29 UTC
Note: I tried to create an OBS request with these two patches, but I failed to make update_git.sh work.
Comment 4 José Ricardo Ziviani 2021-08-27 17:39:32 UTC
Hello Martin,

Just added your patch in our stage repo (https://build.opensuse.org/package/revisions/Virtualization/qemu). I'll send a SR to Factory as soon as they finish the QEMU v6.1 update. (https://build.opensuse.org/request/show/914458).

Thank you!

Jose
Comment 5 Martin Wilck 2021-08-27 19:21:33 UTC
Great, thank you!
Comment 6 OBSbugzilla Bot 2021-09-09 06:40:06 UTC
This is an autogenerated message for OBS integration:
This bug (1186256) was mentioned in
https://build.opensuse.org/request/show/917638 Factory / qemu
Comment 7 Martin Wilck 2021-09-12 21:46:03 UTC
José,

we're not there yet because an upstream bot rejected my -F patch (comment 2) because of a style issue which was definitely not my fault. The overlong line was there before my patch already. I never got this reply (spam folder? no idea), so I was also never able to fix this non-issue.

https://lists.gnu.org/archive/html/qemu-devel/2021-05/msg06012.html

I'll re-post the patch and cc you. I'd be glad if you could pull it into opensuse before upstream gets to it.
Comment 8 José Ricardo Ziviani 2021-09-13 13:52:33 UTC
(In reply to Martin Wilck from comment #7)
> José,
> 
> we're not there yet because an upstream bot rejected my -F patch (comment 2)
> because of a style issue which was definitely not my fault. The overlong
> line was there before my patch already. I never got this reply (spam folder?
> no idea), so I was also never able to fix this non-issue.
> 
> https://lists.gnu.org/archive/html/qemu-devel/2021-05/msg06012.html
> 
> I'll re-post the patch and cc you. I'd be glad if you could pull it into
> opensuse before upstream gets to it.

Hello Martin,

Sure, I'll add it here.

By the way, your -F patch is in Factory and should be available in the next update.

Thanks
Comment 10 OBSbugzilla Bot 2021-09-20 20:40:07 UTC
This is an autogenerated message for OBS integration:
This bug (1186256) was mentioned in
https://build.opensuse.org/request/show/920365 Factory / qemu
Comment 13 Martin Wilck 2021-11-29 13:52:05 UTC
The upstream v2 submission fell through the cracks again, it seems. Trying once more. Perhaps an Acked-by: from one of you guys might help...
Comment 14 Martin Wilck 2021-11-29 15:45:19 UTC
Laurent has reviewed my -F patch now ... 
https://lists.gnu.org/archive/html/qemu-devel/2021-11/msg05530.html
Comment 15 Martin Wilck 2021-11-29 15:46:19 UTC
But FTR, the patch from comment 1 is not yet in factory (qemu-linux-user-6.1.0-34.1.x86_64).
Comment 16 Dario Faggioli 2021-12-03 19:01:40 UTC
(In reply to Martin Wilck from comment #15)
> But FTR, the patch from comment 1 is not yet in factory
> (qemu-linux-user-6.1.0-34.1.x86_64).
>
Mmm... indeed. And I'm not sure I understand why. Especially, I don't know why https://build.opensuse.org/request/show/920365 contains:

* Patches dropped:
  qemu-binfmt-conf.sh-allow-overriding-SUS.patch

So, it seems that the patch was there (I'm guessing added by Jose?) and was removed.

In any case, I am adding/reinstating it. It will appear here: https://build.opensuse.org/package/show/home:dfaggioli:devel:Virtualization/qemu and I'll submit to Factory after a quick test.
Comment 17 Dario Faggioli 2021-12-03 19:11:31 UTC
(In reply to Martin Wilck from comment #14)
> Laurent has reviewed my -F patch now ... 
> https://lists.gnu.org/archive/html/qemu-devel/2021-11/msg05530.html
>
Interesting. I have to look properly at the code, but the change mentioned here ("linux-user: manage binfmt-misc preserve-arg[0] flag") is already in 6.0.0 and 6.1.0.

I think that means we can adjust things so that we could then drop both the old patch from Alex and yours from comment 1, at least for distros that have a > 5.12 kernel (i.e., TW and 15.4).

What do you think?

Your -F patch, on the other hand, will stay until we ship a QEMU version that has it, I guess.
Comment 18 Martin Wilck 2021-12-03 20:00:38 UTC
Thanks, I finally start to understand. I have to say I only partially understood Laurent's response so far.

If Alex's patch is dropped, my patch from comment 1 almost certainly won't be necessary any more.

I don't care about preserving argv[0]. All I'm interested in is not having to bind-mount a qemu executable into foreign-arch containers. But if you drop Alex's patch, you may have to talk to some of the people who are interested in argv[0] preservation.
Comment 19 Dario Faggioli 2021-12-07 12:59:25 UTC
(In reply to Martin Wilck from comment #18)
> Thanks, I finally start to understand. I have to say I only partially
> understood Laurent's response so far.
> 
Hey, so, sorry this is taking a while. As said in comment 16, I have a build with both your patches in.

I have installed qemu and qemu-linux-user from that repo, and I can see the patches there. I.e., from the changelog (obtained with `rpm -q --changelog qemu-linux-user`):

* Fri Dec 03 2021 Dario Faggioli <dfaggioli@suse.com>
* Patches added:
  qemu-binfmt-conf.sh-allow-overriding-SUS.patch

- Replace patch to fix hardcoded binfmt handler
  (bsc#1186256)
  * Patches dropped:
  qemu-binfmt-conf.sh-allow-overriding-SUS.patch
  * Patches added:
  qemu-binfmt-conf.sh-should-use-F-as-shor.patch

[1]

I also have manually checked the /usr/sbin/qemu-binfmt-conf.sh file that is installed on the system, and it looks correct (it has both patches applied).

Now I'm doing the following:

virt136:~ # qemu-binfmt-conf.sh -F ''                                                                                   
Setting /usr/bin/qemu-alpha as binfmt interpreter for alpha 
Setting /usr/bin/qemu-arm as binfmt interpreter for arm
Setting /usr/bin/qemu-armeb as binfmt interpreter for armeb
... ... ...

Which results in:

virt136:~ # cat /proc/sys/fs/binfmt_misc/qemu-ppc64le 
enabled
interpreter /usr/bin/qemu-ppc64le
flags: P
offset 0
magic 7f454c4602010100000000000000000002001500
mask ffffffffffffff00fffffffffffffffffeffff00

But I still see this:

virt136:~ # podman run -t --rm ppc64le/busybox uname -m
standard_init_linux.go:228: exec user process caused: no such file or directory

Do you happen to see what I might be doing wrong?

[1] I finally think I understand what happened... It seems like Jose had added the "override SUSE workaround patch" but then he misunderstood one of the comments and, instead of just adding the "fix -F" patch, he replaced the previously added "override SUSE workaround patch" with it. Well, that does not matter much now, but just FTR...
Comment 20 Dario Faggioli 2021-12-07 13:06:08 UTC
In fact, if I do just:

virt136:~ # qemu-binfmt-conf.sh

(i.e., without taking advantage of your patches), I then see this:

virt136:~ # cat /proc/sys/fs/binfmt_misc/qemu-ppc64le 
enabled
interpreter /usr/bin/qemu-ppc64le-binfmt
flags: P
offset 0
magic 7f454c4602010100000000000000000002001500

And, consistently:

virt136:~ # podman run -t --rm ppc64le/busybox uname -m
standard_init_linux.go:228: exec user process caused: no such file or directory

virt136:~ # podman run -it --rm \
  -v /usr/bin/qemu-ppc64le-binfmt:/usr/bin/qemu-ppc64le-binfmt \
  -v /usr/bin/qemu-ppc64le:/usr/bin/qemu-ppc64le \
  ppc64le/busybox uname -m
ppc64le
Comment 21 Dario Faggioli 2021-12-07 13:10:13 UTC
Mmm... I also see this:

virt136:~ # ls /usr/bin/qemu-ppc64le* -l
-rwxr-xr-x 1 root root 3940664 Dec  6 14:35 /usr/bin/qemu-ppc64le
lrwxrwxrwx 1 root root      11 Dec  6 14:33 /usr/bin/qemu-ppc64le-binfmt -> qemu-binfmt

Not sure if/how it matters yet, I need to check...
Comment 22 Martin Wilck 2021-12-07 15:58:01 UTC
What I want to achieve (being able to simply start a foreign-arch container without having to bind-mount anything from the native environment into it) only works with "fix binary" settings, where the statically linked interpreter binary is loaded into the kernel (--persistent flag of qemu-binfmt-conf.sh, "F" flag in the kernel).

So you need to run e.g. 

qemu-binfmt-conf.sh --systemd s390x --persistent yes --qemu-suffix ""

to make this work. The result looks like this:

 # cat /proc/sys/fs/binfmt_misc/qemu-s390x 
enabled
interpreter /usr/bin/qemu-s390x
flags: PF
offset 0
magic 7f454c4602020100000000000000000000020016
mask ffffffffffffff00fffffffffffffffffffeffff

Hope this makes sense.
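A quick way to check for the "F" flag from a script is sketched below. The function name `has_fix_binary` is hypothetical; it reads an entry in the format shown above on stdin and succeeds only if the "flags:" line contains "F".

```shell
#!/bin/sh
# Hypothetical helper: succeed if a binfmt_misc entry (read from stdin)
# has the "F" (fix binary) flag set.
has_fix_binary() {
    # Print only what follows "flags:", then look for an F in it.
    sed -n 's/^flags: *//p' | grep -q F
}

# Example use on a live system (entry name taken from the output above):
#   has_fix_binary < /proc/sys/fs/binfmt_misc/qemu-s390x && echo "fix-binary set"
```

An entry showing only "flags: P", as in comment 19, fails this check, which matches the "no such file or directory" error seen there.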

(In reply to Dario Faggioli from comment #21)
> Mmm... I also see this:
> 
> virt136:~ # ls /usr/bin/qemu-ppc64le* -l
> -rwxr-xr-x 1 root root 3940664 Dec  6 14:35 /usr/bin/qemu-ppc64le
> lrwxrwxrwx 1 root root      11 Dec  6 14:33 /usr/bin/qemu-ppc64le-binfmt ->
> qemu-binfmt

This is the normal SUSE setup.
Comment 23 Dario Faggioli 2021-12-07 18:52:41 UTC
(In reply to Martin Wilck from comment #22)
> So you need to run e.g. 
> 
> qemu-binfmt-conf.sh --systemd s390x --persistent yes --qemu-suffix ""
> 
Ah, right!

> to make this work. The result looks like this:
> 
>  # cat /proc/sys/fs/binfmt_misc/qemu-s390x 
> enabled
> interpreter /usr/bin/qemu-s390x
> flags: PF
>
Indeed, I was missing one of the flags.

> offset 0
> magic 7f454c4602020100000000000000000000020016
> mask ffffffffffffff00fffffffffffffffffffeffff
> 
> Hope this makes sense.
> 
It does. Sorry again, but I don't have much experience with qemu-binfmt-conf.sh. In fact, I used to set things up manually, and am only now getting familiar with the code.

Like you said, it works. SR coming!
Comment 24 Dario Faggioli 2021-12-09 16:21:53 UTC
SR 936373 (https://build.opensuse.org/request/show/936373) is in Factory now. It contains both patches and, according to my tests, things now work as intended, so I'm closing this.

Thanks for the patches and for the help reproducing and debugging this!
Comment 25 OBSbugzilla Bot 2022-10-07 16:05:04 UTC
This is an autogenerated message for OBS integration:
This bug (1186256) was mentioned in
https://build.opensuse.org/request/show/1008827 Factory / qemu