Bug 1071521 - [Build 20171205] openQA test fails in force_cron_run: stalling process
Status: VERIFIED FIXED
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Kernel
Version: Current
Hardware: Other
OS: Other
Priority: P5 - None
Severity: Major
Assigned To: E-mail List
QA Contact: E-mail List
URL: http://openqa.opensuse.org/tests/5521...
Reported: 2017-12-06 12:46 UTC by Dominique Leuenberger
Modified: 2018-02-09 15:30 UTC
CC List: 3 users



Attachments
snapper.log covering all the entries since the machine was updated (9.46 KB, text/plain)
2017-12-12 08:57 UTC, Dominique Leuenberger

Description Dominique Leuenberger 2017-12-06 12:46:24 UTC
## Observation

openQA test in scenario opensuse-Tumbleweed-DVD-x86_64-update_Leap_42.1_gnome@64bit fails in
[force_cron_run](http://openqa.opensuse.org/tests/552148/modules/force_cron_run/steps/10)


## Reproducible

Fails since (at least) Build [20171205](http://openqa.opensuse.org/tests/551940)


## Expected result

Last good: [20171204](http://openqa.opensuse.org/tests/550866) (or more recent)


## Further details

Always latest result in this scenario: [latest](http://openqa.opensuse.org/tests/latest?distri=opensuse&arch=x86_64&machine=64bit&version=Tumbleweed&flavor=DVD&test=update_Leap_42.1_gnome)


There is no clear answer yet, but the PackageKit service is a reasonable candidate for causing this block (which does not explain why we never saw this until now).

PK does this in the cron script:

# wait a random amount of time to avoid hammering the servers
[ -z "$SLEEP_MAX" ] && SLEEP_MAX=$RANDOM
sleep $(( $RANDOM % $SLEEP_MAX + 1 ))

SLEEP_MAX is set to 3600 in the default configuration.

This sleep is not necessary on openSUSE, since our cron.daily tasks are not scheduled for 0:00 but rather '24 hours after the last execution', which depends largely on the system itself.
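
For illustration, a minimal shell sketch of the delay range the snippet above produces with the default SLEEP_MAX of 3600, i.e. anything up to an hour per run:

# Illustrative only: sample a few delays the cron snippet could pick
# with the default SLEEP_MAX of 3600 seconds.
SLEEP_MAX=3600
for i in 1 2 3; do
    echo "possible delay: $(( RANDOM % SLEEP_MAX + 1 )) seconds"
done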
Comment 1 Bernhard Wiedemann 2017-12-08 11:40:09 UTC
This is an autogenerated message for OBS integration:
This bug (1071521) was mentioned in
https://build.opensuse.org/request/show/555203 Factory / PackageKit
Comment 2 Dominique Leuenberger 2017-12-11 20:27:25 UTC
The PK suspicion was 'too easy and obvious' - and it is not the culprit.

I have now reproduced this locally in a VM, and the stalling cron task is actually
suse.de-snapper, which hangs at:

snapper --config=root --quiet cleanup empty-pre-post

Looking at 'ps', the process state is S+ (interruptible sleep, foreground process).

Attaching to 'snapper' using 'strace' shows:
strace: Process 4054 attached
restart_syscall(<... resuming interrupted restart_syscall ...>

ltrace shows nothing happening.
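
A minimal sketch of the diagnostic steps described above, with <PID> standing in for the stalled snapper process:

# Find the stalled snapper cleanup process started from cron.daily.
pgrep -af 'snapper --config=root'
# Check its state: 'S+' means interruptible sleep in the foreground.
ps -o pid,stat,wchan,cmd -p <PID>
# Attach strace/ltrace to see whether any syscalls or library calls happen.
strace -p <PID>
ltrace -p <PID>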
Comment 3 Dominique Leuenberger 2017-12-12 08:46:39 UTC
Attaching strace to the snapperd process, I see:

futex(0x5575b37c7c68, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1513068326, tv_nsec=406057000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x5575b37c7c18, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x5575b37b4798, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1513068326, tv_nsec=411450000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x5575b37b4748, FUTEX_WAKE_PRIVATE, 1) = 0
poll([{fd=4, events=POLLIN}, {fd=3, events=POLLIN}], 2, 1000) = 0 (Timeout)
Comment 4 Dominique Leuenberger 2017-12-12 08:57:48 UTC
Created attachment 752531
snapper.log covering all the entries since the machine was updated
Comment 5 Dominique Leuenberger 2017-12-12 08:59:36 UTC
After discussion with Arvin, the belief is that this is kernel/btrfs related.

This is somewhat supported by the fact that the issue has appeared since the 1205 snapshot, which is when kernel 4.14.3 landed (4.14.4 later made no difference).
Comment 6 Oliver Kurz 2018-01-07 07:22:05 UTC
I conducted a statistical analysis with openQA tests locally and could confirm that this issue appears in 9 out of 15 runs, i.e. with roughly 60% probability. In the no-error case, the execution of all the cron jobs finishes within 2 seconds; in the failure case we give up after 1000s, but Dimstar confirmed that the call also stalls over night, i.e. for multiple hours.
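
A rough sketch of how such statistics can be gathered locally; the run-crons path and the 1000s timeout are assumptions based on the description above:

# Trigger the daily cron jobs repeatedly and count runs that do not finish
# within 1000 seconds (the run-crons path is an assumption).
fails=0
for run in $(seq 1 15); do
    timeout 1000 /usr/lib/cron/run-crons || fails=$((fails + 1))
done
echo "stalled in $fails of 15 runs"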
Comment 7 Oliver Kurz 2018-01-09 07:03:19 UTC
Would it help to find the stack trace of snapperd? Or what else would be necessary to investigate this?
Comment 8 Jiri Slaby 2018-02-08 14:33:48 UTC
(In reply to Oliver Kurz from comment #7)
> Would it help to find the stack trace of snapperd? Or what else would be
> necessary to investigate this?

Definitely -- content of /proc/<$$$>/stack
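
A minimal sketch of how those stacks could be captured, assuming the processes of interest are snapperd and the stalled snapper client:

# Dump the kernel-side stacks of snapperd and the stalled snapper client
# (needs root; /proc/<pid>/stack requires CONFIG_STACKTRACE).
for pid in $(pidof snapperd) $(pgrep -f 'snapper --config=root'); do
    echo "== PID $pid =="
    cat /proc/$pid/stack
done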

The only commit about btrfs in 4.14.3 was:
commit f1117628310079bbdd4a0fa951602419f258bead
Author: Josef Bacik <jbacik@fb.com>
Date:   Tue Aug 22 16:00:39 2017 -0400

    btrfs: change how we decide to commit transactions during flushing
    

Does this still happen with 4.15?
Comment 9 Dominique Leuenberger 2018-02-08 20:02:31 UTC
This issue had been observed reliably until snapshot 20180119, and has not been observed a single time since 0120, as can be seen at

https://openqa.opensuse.org/tests/604030?version=Tumbleweed&flavor=DVD&machine=64bit&arch=x86_64&distri=opensuse&limit_previous=50&test=update_Leap_42.1_gnome#previous

0120 happens to be the snapshot that brought these possibly related updates:

  btrfsmaintenance (0.3.1 -> 0.4)
  kernel-source (4.14.13 -> 4.14.14)
  snapper (0.5.2 -> 0.5.3)

Since the problem has not been observed since Jan 20, I think we will have a hard time examining more closely what exactly solved the issue - or providing stacks at this point.
Comment 10 Jiri Slaby 2018-02-09 08:02:22 UTC
Which is, on the other hand, good, as the bug is fixed.
Comment 11 Oliver Kurz 2018-02-09 15:30:15 UTC
One can say so :)

https://openqa.opensuse.org/tests/604741