Bug 1079747 - Kernel 4.15 seems often to stall the OBS workers
Kernel 4.15 seems often to stall the OBS workers
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Kernel
Current
Other Other
: P5 - None : Normal (vote)
: ---
Assigned To: E-mail List
E-mail List
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2018-02-07 07:43 UTC by Dominique Leuenberger
Modified: 2022-07-21 17:20 UTC (History)
4 users (show)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
Complete log from OBS for Staging:B/tar/standard/i586 (322.46 KB, text/plain)
2018-02-09 14:05 UTC, Dominique Leuenberger
Details
genfile used in the tests (152.79 KB, application/x-executable)
2018-02-09 16:33 UTC, Jiri Slaby
Details
backtraces of all processes (179.53 KB, text/plain)
2018-02-20 14:12 UTC, Jiri Slaby
Details
[PATCH] ext4: Fix offset overflow on 32-bit archs in ext4_iomap_begin() (1.51 KB, patch)
2018-02-20 17:22 UTC, Jan Kara
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Dominique Leuenberger 2018-02-07 07:43:16 UTC
Since the upgrade to Kernel 4.15, I see a much increased number of 'stalled' OBS workers.

The stall happens in various packages, most prominently, llvm5, ceph, libreoffice and java-9-openjdk. Most of the time, they are 'fine' after several attempts.

There seems to be one package reliably reproducing it: tar on i586

https://build.opensuse.org/package/live_build_log/openSUSE:Factory/tar/standard/i586

I'm CCing Adrian (OBS Admin) here, as he will likely be the only person that can provide traces from the VM, at the time this happens.
Comment 1 Jiri Slaby 2018-02-07 16:19:16 UTC
Yes, without some traces, we cannot do much about that.
Comment 2 Adrian Schröter 2018-02-07 16:24:03 UTC
I will enable project owners to send (some) sysrq to the running builds for cases like these ... code is almost ready.
Comment 3 Adrian Schröter 2018-02-09 08:39:41 UTC
build.opensuse.org is ready for handling some sysrq now..

You can use the latest osc (unstable build) from devel:tools:scm project and run eg:

  osc sendsysrq $project $package $repo $arch 9
  osc sendsysrq $project $package $repo $arch t 
  osc sendsysrq $project $package $repo $arch w

to trigger sysrequests (note, only a subset is allowed via a whitelist, tell if you miss one).

Alternative is to do it via the api, eg:

 osc api -X POST '/build/science:unstable?cmd=sendsysrq&arch=x86_64&sysrq=9&repository=openSUSE_Tumbleweed&package=FreeCAD'

or use your favourite tool to speak HTTP...

/me hopes that I am not the only one anymore who can debug guest kernels :)
Comment 4 Adrian Schröter 2018-02-09 08:40:34 UTC
ah, please note that this will only work when the build has started within last 15minutes, otherwise it will still use a tool old worker....
Comment 5 Dominique Leuenberger 2018-02-09 14:03:45 UTC
I gave this a try on Staging:B/tar/standard/i586, and the log spit out:

[  598s] 141: storing long sparse file names                  ok
[17760s] 142: listing sparse files bigger than 2^33 B        [17744.267539] sysrq: SysRq : This sysrq operation is disabled.


Send sysrq 9 to Job
[17796s] [17780.647309] sysrq: SysRq : Show State
[17796s] [17780.648108]   task                PC stack   pid father
[17796s] [17780.649201] build           S    0     1      0 0x00000000
[17796s] [17780.650363] Call Trace:
[17796s] [17780.650885]  ? __schedule+0x2a5/0x920
[17796s] [17780.651642]  ? schedule+0x2d/0x80
[17796s] [17780.652332]  ? do_wait+0x1af/0x220
[17796s] [17780.653038]  ? kernel_wait4+0x70/0x110
[17796s] [17780.653810]  ? task_stopped_code+0x60/0x60
[17796s] [17780.654666]  ? do_int80_syscall_32+0x51/0x100
[17796s] [17780.655560]  ? entry_INT80_32+0x36/0x36
[17796s] [17780.656351] kthreadd        S    0     2      0 0x80000000
[17796s] [17780.658023] Call Trace:
[17796s] [17780.658555]  ? __schedule+0x2a5/0x920
[17796s] [17780.659312]  ? schedule+0x2d/0x80
[17796s] [17780.660000]  ? kthreadd+0x195/0x1b0
[17796s] [17780.660720]  ? kthread_create_on_cpu+0xa0/0xa0
[17796s] [17780.661635]  ? ret_from_fork+0x2e/0x38
[17796s] [17780.662447] kworker/0:0H    I    0     4      2 0x80000000
[17796s] [17780.663578] Call Trace:
[17796s] [17780.664098]  ? __schedule+0x2a5/0x920
[17796s] [17780.664871]  ? schedule+0x2d/0x80
[17796s] [17780.665569]  ? worker_thread+0xa6/0x400
[17796s] [17780.666402]  ? kthread+0xf0/0x110
[17796s] [17780.667096]  ? process_one_work+0x3d0/0x3d0
[17796s] [17780.668750]  ? kthread_create_worker_on_cpu+0x20/0x20
[17796s] [17780.669795]  ? ret_from_fork+0x2e/0x38
[17796s] [17780.670607] mm_percpu_wq    I    0     6      2 0x80000000
[17796s] [17780.671743] Call Trace:
[17796s] [17780.672273]  ? __schedule+0x2a5/0x920
[17796s] [17780.673043]  ? schedule+0x2d/0x80
[17796s] [17780.673743]  ? rescuer_thread+0x2ce/0x310
[17796s] [17780.674608]  ? preempt_schedule_common+0x11/0x30
[17796s] [17780.675567]  ? kthread+0xf0/0x110
[17796s] [17780.676262]  ? cancel_delayed_work_sync+0x20/0x20
[17796s] [17780.677239]  ? kthread_create_worker_on_cpu+0x20/0x20
[17796s] [17780.678574]  ? ret_from_fork+0x2e/0x38
[17796s] [17780.679351] ksoftirqd/0     S    0     7      2 0x80000000
[17796s] [17780.680475] Call Trace:
[17796s] [17780.680998]  ? __schedule+0x2a5/0x920
[17796s] [17780.681754]  ? schedule+0x2d/0x80
[17796s] [17780.682459]  ? smpboot_thread_fn+0x1b4/0x200
[17796s] [17780.683348]  ? kthread+0xf0/0x110
[17796s] [17780.684044]  ? sort_range+0x20/0x20
[17796s] [17780.684779]  ? kthread_create_worker_on_cpu+0x20/0x20
[17796s] [17780.685825]  ? ret_from_fork+0x2e/0x38
[17796s] [17780.686917] rcu_preempt     I    0     8      2 0x80000000
[...]
Comment 6 Dominique Leuenberger 2018-02-09 14:05:24 UTC
Created attachment 759586 [details]
Complete log from OBS for Staging:B/tar/standard/i586
Comment 7 Jiri Slaby 2018-02-09 14:58:24 UTC
The interesting part is:

tar             R  running task        0 24780  24590 0x00000000 
Call Trace:
 ? ext4_map_blocks+0x2fa/0x5f0
 ? ext4_map_blocks+0x2fa/0x5f0
 ? ext4_map_blocks+0x348/0x5f0
 ? ext4_iomap_begin+0x39b/0x4d0
 ? ext4_evict_inode+0x5c0/0x5c0
 ? iomap_apply+0x61/0x160
 ? iomap_to_fiemap+0xd0/0xd0
 ? iomap_seek_data+0xac/0x110
 ? iomap_seek_hole_actor+0x80/0x80
 ? ext4_llseek+0x13a/0x140 
 ? ext4_file_mmap+0x90/0x90
 ? SyS_llseek+0x7b/0xc0
 ? do_int80_syscall_32+0x51/0x100
 ? entry_INT80_32+0x36/0x36

Honza, any idea, what could take so long in ext4_map_blocks in 4.15?
Comment 8 Jiri Slaby 2018-02-09 16:33:53 UTC
Created attachment 759609 [details]
genfile used in the tests

It can be easily reproduced on a 32bit kernel using the attached binary:
./genfile --length 1000 --file begin
./genfile --length 1000 --file end
./genfile --sparse --file sparsefile --block-size 512 8G A
tar -c -f archive --sparse begin sparsefile end

It has to be some 32-bit overflow in the newly added code in 4.15 makes it spinning for a while.
Comment 9 Dominique Leuenberger 2018-02-09 17:22:04 UTC
https://build.opensuse.org/package/live_build_log/openSUSE:Factory:Staging:G/kernel-vanilla/standard/x86_64

The kernel stall case is less promising:

[ 7976s] RPMLINT report:
[ 7976s] ===============


Send sysrq 9 to Job


Send sysrq t to Job


Send sysrq w to Job
[13451s] qemu-system-x86_64: terminating on signal 15 from pid 12851 ()
[13451s] qemu-system-x86_64: Failed to unlink socket /var/cache/obs/worker/root_6/root.monitor: Permission denied


Killed Job
[13451s] ### VM INTERACTION END ###
[13451s] No buildstatus set, either the base system is broken (kernel/initrd/udev/glibc/bash/perl)
[13451s] or the build host has a kernel or hardware problem...

There were no further reactions on the sysreq calls
Comment 10 Jiri Slaby 2018-02-12 13:41:57 UTC
And the winner is:

--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3523,7 +3523,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
                iomap->flags |= IOMAP_F_DIRTY;
        iomap->bdev = inode->i_sb->s_bdev;
        iomap->dax_dev = sbi->s_daxdev;
-       iomap->offset = first_block << blkbits;
+       iomap->offset = offset;
        iomap->length = (u64)map.m_len << blkbits;
 
        if (ret == 0) {


first_block is ulong. On 32bit, it is 32 bit. And it indeed got cropped. I will push something to our trees when upstream comments on this (reported already).

https://marc.info/?l=linux-kernel&m=151844270631097&w=2
Comment 11 Jan Kara 2018-02-14 14:44:40 UTC
You are correct, the above is a bug and I think it can lead to us effectively end up in a loop trying to reach the end of file which never happens. Thanks for debugging this!
Comment 12 Adrian Schröter 2018-02-19 17:56:42 UTC
We have now a memory dump from KVM in the situation where the build did hang and even no sysrq did work anymore.

You can find the dump in my NFS home as ~adrian/lamb72_8.gz

It got created via "migrate exec:pigz>/tmp/file.gz" on qemu-kvm console.
Comment 13 Adrian Schröter 2018-02-20 08:12:21 UTC
another attempt with a leap 15.0 i586 kernel. Same situation that process is hanging and no response on sysrq.

The local files including memory dump, root, swap, kernel and initrd can be found in my nfs home in ~adrian/lamb04_4/

qemu command line used to start the VM:

/usr/bin/qemu-system-x86_64 -machine accel=kvm -nodefaults -no-reboot -nographic -vga none -object rng-random,filename=/dev/random,id=rng0 -device virtio-rng-pci,rng=rng0 -runas qemu -cpu host -net none -kernel /var/cache/obs/worker/root_4/.mount/boot/kernel -initrd /var/cache/obs/worker/root_4/.mount/boot/initrd -append root=/dev/disk/by-id/virtio-0 rootfstype=ext3 rootflags=data=writeback,nobarrier,commit=150,noatime panic=1 quiet no-kvmclock nmi_watchdog=0 rw rd.driver.pre=binfmt_misc elevator=noop console=ttyS0 init=/.build/build -m 8192 -drive file=/var/cache/obs/worker/root_4/root,format=raw,if=none,id=disk,serial=0,cache=unsafe -device virtio-blk-pci,drive=disk -drive file=/var/cache/obs/worker/root_4/swap,format=raw,if=none,id=swap,serial=1,cache=unsafe -device virtio-blk-pci,drive=swap -serial stdio -chardev socket,id=monitor,server,nowait,path=/var/cache/obs/worker/root_4/root.monitor -mon chardev=monitor,mode=readline -smp 8
Comment 14 Adrian Schröter 2018-02-20 08:19:21 UTC
root.monitor is just a fifo to talk to qemu process
Comment 15 Jiri Slaby 2018-02-20 14:12:26 UTC
Created attachment 760799 [details]
backtraces of all processes

At last, I was able to extract stack traces of all proces via crash -- attaching.
Comment 16 Jiri Slaby 2018-02-20 14:39:24 UTC
I don't see anything abnormal in there. There is 'build' calling 'sleep'. There is 'rpm' installing some packages and waiting for 'fillup' to finish I/O.
Comment 17 Jan Kara 2018-02-20 17:22:36 UTC
Created attachment 760901 [details]
[PATCH] ext4: Fix offset overflow on 32-bit archs in ext4_iomap_begin()

This is a fix for ext4 failure as tracked down by Jiri. I'll take care of pushing it upstream.
Comment 21 Jiri Slaby 2018-06-16 12:27:12 UTC
The original issue (ext4 bug causing stall on 32bit) was resolved. If there are any other issues, they deserve a new report.
Comment 24 Swamp Workflow Management 2018-07-18 06:07:08 UTC
This is an autogenerated message for OBS integration:
This bug (1079747) was mentioned in
https://build.opensuse.org/request/show/623532 15.0 / kernel-source
Comment 25 Swamp Workflow Management 2018-07-27 17:53:51 UTC
SUSE-SU-2018:2092-1: An update that solves 22 vulnerabilities and has 246 fixes is now available.

Category: security (important)
Bug References: 1046303,1046305,1046306,1046307,1046540,1046542,1046543,1048129,1050242,1050252,1050529,1050536,1050538,1050545,1050549,1050662,1051510,1052766,1055968,1056427,1056643,1056651,1056653,1056657,1056658,1056662,1056686,1056787,1058115,1058513,1058659,1058717,1060463,1061024,1061840,1062897,1064802,1065600,1066110,1066129,1068032,1068054,1071218,1071995,1072829,1072856,1073513,1073765,1073960,1074562,1074578,1074701,1074741,1074873,1074919,1075006,1075007,1075262,1075419,1075748,1075876,1076049,1076115,1076372,1076830,1077338,1078248,1078353,1079152,1079747,1080039,1080542,1081599,1082485,1082504,1082869,1082962,1083647,1083900,1084001,1084570,1085308,1085539,1085626,1085933,1085936,1085937,1085938,1085939,1085941,1086282,1086283,1086286,1086288,1086319,1086323,1086400,1086652,1086739,1087078,1087082,1087084,1087092,1087205,1087210,1087213,1087214,1087284,1087405,1087458,1087939,1087978,1088354,1088690,1088704,1088722,1088796,1088804,1088821,1088866,1089115,1089268,1089467,1089608,1089663,1089664,1089667,1089669,1089752,1089753,1089878,1090150,1090457,1090605,1090643,1090646,1090658,1090734,1090888,1090953,1091158,1091171,1091424,1091594,1091666,1091678,1091686,1091781,1091782,1091815,1091860,1091960,1092100,1092472,1092710,1092772,1092888,1092904,1092975,1093023,1093027,1093035,1093118,1093148,1093158,1093184,1093205,1093273,1093290,1093604,1093641,1093649,1093653,1093655,1093657,1093663,1093721,1093728,1093904,1093990,1094244,1094356,1094420,1094541,1094575,1094751,1094825,1094840,1094912,1094978,1095042,1095094,1095115,1095155,1095265,1095321,1095337,1095467,1095573,1095735,1095893,1096065,1096480,1096529,1096696,1096705,1096728,1096753,1096790,1096793,1097034,1097105,1097234,1097356,1097373,1097439,1097465,1097468,1097470,1097471,1097472,1097551,1097780,1097796,1097800,1097941,1097961,1098016,1098043,1098050,1098174,1098176,1098236,1098401,1098425,1098435,1098599,1098626,1098706,1098983,1098995,1099029,1099041,1099109,1099142,1099183,1099715,1099792,1099918,1099924,1099966,1100132,1100209,1100340,1100362,1100382,1100394,1100416,1100418,1100491,1100602,1100633,1100843,1101296,1101315,1101324,971975,975772
CVE References: CVE-2017-5715,CVE-2017-5753,CVE-2018-1000200,CVE-2018-1000204,CVE-2018-10087,CVE-2018-10124,CVE-2018-1092,CVE-2018-1093,CVE-2018-1094,CVE-2018-1118,CVE-2018-1120,CVE-2018-1130,CVE-2018-12233,CVE-2018-13053,CVE-2018-13405,CVE-2018-13406,CVE-2018-3639,CVE-2018-5803,CVE-2018-5848,CVE-2018-7492,CVE-2018-8781,CVE-2018-9385
Sources used:
SUSE Linux Enterprise Workstation Extension 15 (src):    kernel-default-4.12.14-25.3.1
SUSE Linux Enterprise Module for Live Patching 15 (src):    kernel-default-4.12.14-25.3.1, kernel-livepatch-SLE15_Update_1-1-1.3.1
SUSE Linux Enterprise Module for Legacy Software 15 (src):    kernel-default-4.12.14-25.3.1
SUSE Linux Enterprise Module for Development Tools 15 (src):    kernel-docs-4.12.14-25.3.1, kernel-obs-build-4.12.14-25.3.1, kernel-source-4.12.14-25.3.1, kernel-syms-4.12.14-25.3.1, kernel-vanilla-4.12.14-25.3.1
SUSE Linux Enterprise Module for Basesystem 15 (src):    kernel-default-4.12.14-25.3.1, kernel-source-4.12.14-25.3.1, kernel-zfcpdump-4.12.14-25.3.1
SUSE Linux Enterprise High Availability 15 (src):    kernel-default-4.12.14-25.3.1
Comment 26 Swamp Workflow Management 2018-07-28 13:29:25 UTC
openSUSE-SU-2018:2119-1: An update that solves 23 vulnerabilities and has 283 fixes is now available.

Category: security (important)
Bug References: 1022476,1046303,1046305,1046306,1046307,1046540,1046542,1046543,1048129,1050242,1050252,1050529,1050536,1050538,1050545,1050549,1050662,1051510,1052766,1055117,1055186,1055968,1056427,1056643,1056651,1056653,1056657,1056658,1056662,1056686,1056787,1058115,1058513,1058659,1058717,1059336,1060463,1061024,1061840,1062897,1064802,1065600,1065729,1066110,1066129,1068032,1068054,1068546,1071218,1071995,1072829,1072856,1073513,1073765,1073960,1074562,1074578,1074701,1074741,1074873,1074919,1074984,1075006,1075007,1075262,1075419,1075748,1075876,1076049,1076115,1076372,1076830,1077338,1078248,1078353,1079152,1079747,1080039,1080157,1080542,1081599,1082485,1082504,1082869,1082962,1083647,1083684,1083900,1084001,1084570,1084721,1085308,1085341,1085400,1085539,1085626,1085933,1085936,1085937,1085938,1085939,1085941,1086224,1086282,1086283,1086286,1086288,1086319,1086323,1086400,1086467,1086652,1086739,1087084,1087088,1087092,1087205,1087210,1087213,1087214,1087284,1087405,1087458,1087939,1087978,1088273,1088354,1088374,1088690,1088704,1088713,1088722,1088796,1088804,1088821,1088866,1088872,1089074,1089086,1089115,1089141,1089198,1089268,1089271,1089467,1089608,1089644,1089663,1089664,1089667,1089669,1089752,1089753,1089762,1089878,1089889,1089977,1090098,1090150,1090457,1090522,1090534,1090535,1090605,1090643,1090646,1090658,1090717,1090734,1090818,1090888,1090953,1091101,1091158,1091171,1091264,1091424,1091532,1091543,1091594,1091666,1091678,1091686,1091781,1091782,1091815,1091860,1091960,1092100,1092289,1092472,1092566,1092710,1092772,1092888,1092904,1092975,1093023,1093027,1093035,1093118,1093148,1093158,1093184,1093205,1093273,1093290,1093604,1093641,1093649,1093653,1093655,1093657,1093663,1093721,1093728,1093904,1093990,1094244,1094356,1094420,1094541,1094575,1094751,1094825,1094840,1094978,1095042,1095094,1095104,1095115,1095155,1095265,1095321,1095337,1095467,1095573,1095735,1095893,1096065,1096480,1096529,1096696,1096705,1096728,1096753,1096790,1096793,1097034,1097105,1097234,1097356,1097373,1097439,1097465,1097468,1097470,1097471,1097472,1097551,1097780,1097796,1097800,1097941,1097961,1098016,1098043,1098050,1098174,1098176,1098236,1098401,1098425,1098435,1098599,1098626,1098706,1098983,1098995,1099029,1099041,1099109,1099142,1099183,1099715,1099792,1099918,1099924,1099966,1100132,1100209,1100340,1100362,1100382,1100416,1100418,1100491,1100602,1100633,1100734,1100843,1101296,1101315,1101324,971975,975772
CVE References: CVE-2017-5715,CVE-2017-5753,CVE-2018-1000200,CVE-2018-1000204,CVE-2018-10087,CVE-2018-10124,CVE-2018-10323,CVE-2018-1092,CVE-2018-1093,CVE-2018-1094,CVE-2018-1108,CVE-2018-1118,CVE-2018-1120,CVE-2018-1130,CVE-2018-12233,CVE-2018-13053,CVE-2018-13405,CVE-2018-13406,CVE-2018-5803,CVE-2018-5848,CVE-2018-7492,CVE-2018-8781,CVE-2018-9385
Sources used:
openSUSE Leap 15.0 (src):    kernel-debug-4.12.14-lp150.12.7.1, kernel-default-4.12.14-lp150.12.7.1, kernel-docs-4.12.14-lp150.12.7.1, kernel-kvmsmall-4.12.14-lp150.12.7.1, kernel-obs-build-4.12.14-lp150.12.7.1, kernel-obs-qa-4.12.14-lp150.12.7.1, kernel-source-4.12.14-lp150.12.7.1, kernel-syms-4.12.14-lp150.12.7.1, kernel-vanilla-4.12.14-lp150.12.7.1