Bug 1073212 - kernel 4.14.6 hard freeze
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Kernel
Version: Current
Hardware: Other Other
Priority: P5 - None
Severity: Major
Assigned To: E-mail List
Reported: 2017-12-17 20:28 UTC by Bruno Friedmann
Modified: 2017-12-28 10:08 UTC (History)

Attachments
Kernel config (47.69 KB, application/octet-stream)
2017-12-18 06:22 UTC, Bruno Friedmann

Description Bruno Friedmann 2017-12-17 20:28:09 UTC
Just after installing kernel 4.14.6 (TW snapshot 20171215), the first reboot failed with a kernel trace and the computer rebooted (NMI interrupt?).

On the second reboot, I booted into multi-user mode rather than X11. While trying to access /var/tmp to remove the Plasma cache there, I got this backtrace:

Dec 17 21:05:45 kernel: ------------[ cut here ]------------
Dec 17 21:05:45 kernel: kernel BUG at ../mm/slab.c:2972!
Dec 17 21:05:45 kernel: invalid opcode: 0000 [#1] PREEMPT SMP
Dec 17 21:05:45 kernel: Modules linked in: fuse af_packet vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) bnep msr snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic arc4 nls_iso8859_1 nls_cp437 vfat fat cdc_mbim cdc_wdm cdc_ncm qcserial usbnet usb_wwan usbserial mii snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device dell_wmi mxm_wmi sparse_keymap wmi_bmof ppdev iTCO_wdt iTCO_vendor_support mei_wdt iwlmvm dell_laptop snd_hda_intel snd_hda_codec dell_smbios dcdbas snd_hda_core mac80211 snd_hwdep snd_pcm uvcvideo dell_smm_hwmon intel_rapl x86_pkg_temp_thermal hci_uart intel_powerclamp videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 coretemp rtsx_pci_ms btusb btrtl kvm_intel iwlwifi snd_timer serdev btbcm videobuf2_core kvm btqca btintel videodev e1000e bluetooth parport_pc joydev irqbypass
Dec 17 21:05:45 kernel:  memstick snd ptp i2c_i801 cfg80211 processor_thermal_device pcspkr pps_core mei_me int3403_thermal ecdh_generic mei parport intel_pch_thermal soundcore intel_soc_dts_iosf video dell_smo8800 shpchp pinctrl_sunrisepoint intel_lpss_acpi ie31200_edac pinctrl_intel intel_lpss tpm_tis int3400_thermal tpm_tis_core int3402_thermal dell_rbtn int340x_thermal_zone acpi_thermal_rel thermal tpm wmi battery rfkill acpi_als acpi_pad kfifo_buf ac industrialio button dm_crypt algif_skcipher af_alg hid_generic hid_logitech_hidpp hid_logitech_dj usbhid nvidia_drm(PO) nvidia_modeset(PO) rtsx_pci_sdmmc mmc_core crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel nvidia_uvm(PO) nvidia(PO) aesni_intel aes_x86_64 crypto_simd cryptd glue_helper serio_raw nvme drm_kms_helper nvme_core syscopyarea sysfillrect
Dec 17 21:05:45 kernel:  sysimgblt xhci_pci fb_sys_fops rtsx_pci xhci_hcd usbcore drm i2c_hid dm_mirror dm_region_hash dm_log dm_mod l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppox ppp_generic slhc sg efivarfs
Dec 17 21:05:45 kernel: CPU: 3 PID: 3590 Comm: cut Tainted: P           O    4.14.6-1-default #1
Dec 17 21:05:45 kernel: Hardware name: Dell Inc. Precision 7510/0YH43H, BIOS 1.14.4 07/28/2017
Dec 17 21:05:45 kernel: task: ffff936c7650c040 task.stack: ffffb7d386940000
Dec 17 21:05:45 kernel: RIP: 0010:kmem_cache_alloc_trace+0x544/0x5a0
Dec 17 21:05:45 kernel: RSP: 0018:ffffb7d386943c28 EFLAGS: 00010086
Dec 17 21:05:45 kernel: RAX: 000000000000007c RBX: fffff197218bbea0 RCX: dead000000000200
Dec 17 21:05:45 kernel: RDX: ffff9367d7c01088 RSI: 0000000000000000 RDI: ffff9367d7c01080
Dec 17 21:05:45 kernel: RBP: ffff9367d7c01080 R08: fffff19721882760 R09: 0000000000024a88
Dec 17 21:05:45 kernel: R10: 000000000000003c R11: fffff19721882760 R12: ffff9367d7c00400
Dec 17 21:05:45 kernel: R13: 0000000000000032 R14: ffff936ec44e4a88 R15: fffff1971d8b8000
Dec 17 21:05:45 kernel: FS:  0000000000000000(0000) GS:ffff936ec44c0000(0000) knlGS:0000000000000000
Dec 17 21:05:45 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 17 21:05:45 kernel: CR2: 00007ff3924c4008 CR3: 000000062e806004 CR4: 00000000003606e0
Dec 17 21:05:45 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 17 21:05:45 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 17 21:05:45 kernel: Call Trace:
Dec 17 21:05:45 kernel:  apparmor_file_alloc_security+0x47/0x240
Dec 17 21:05:45 kernel:  security_file_alloc+0x2e/0x40
Dec 17 21:05:45 kernel:  get_empty_filp+0x8d/0x1b0
Dec 17 21:05:45 kernel:  path_openat+0x2d/0x1550
Dec 17 21:05:45 kernel:  ? walk_component+0x38/0x310
Dec 17 21:05:45 kernel:  ? path_init+0x19c/0x330
Dec 17 21:05:45 kernel:  ? terminate_walk+0x62/0x100
Dec 17 21:05:45 kernel:  ? path_lookupat+0x9b/0x1d0
Dec 17 21:05:45 kernel:  do_filp_open+0x8c/0xf0
Dec 17 21:05:45 kernel:  ? __alloc_fd+0xaf/0x160
Dec 17 21:05:45 kernel:  ? do_sys_open+0x1a6/0x230
Dec 17 21:05:45 kernel:  do_sys_open+0x1a6/0x230
Dec 17 21:05:45 kernel:  entry_SYSCALL_64_fastpath+0x1e/0xa9
Dec 17 21:05:45 kernel: RIP: 0033:0x7ff3922bbf80
Dec 17 21:05:45 kernel: RSP: 002b:00007ffc6333d628 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
Dec 17 21:05:45 kernel: RAX: ffffffffffffffda RBX: 0000000000000050 RCX: 00007ff3922bbf80
Dec 17 21:05:45 kernel: RDX: 0000000000080000 RSI: 00007ff3922c064b RDI: ffffffffffffff9c
Dec 17 21:05:45 kernel: RBP: 00007ffc6333de10 R08: 0000000000000000 R09: 0000000000000000
Dec 17 21:05:45 kernel: R10: 0000000000000000 R11: 0000000000000287 R12: 0000000000000008
Dec 17 21:05:45 kernel: R13: 0000000000000000 R14: 00007ff3924c8c00 R15: 0000000000000000
Dec 17 21:05:45 kernel: Code: 24 28 48 8b 44 24 30 49 89 5b 08 4d 89 5f 20 49 89 47 28 48 89 5d 18 41 f6 44 24 23 40 74 8a 49 c7 47 10 00 00 00 00 eb 80 0f 0b <0f> 0b 48 85 c0 0f 84 b3 fd ff ff e9 98 fd ff ff 48 8b 74 24 08
Dec 17 21:05:45 kernel: RIP: kmem_cache_alloc_trace+0x544/0x5a0 RSP: ffffb7d386943c28
Comment 1 John Johansen 2017-12-17 23:32:21 UTC
Can you provide some more information?

What is your kernel config?
Did you see this in a previous 4.14 kernel?
What is your mount info?

Can you reliably reproduce?

Does this reproduce with apparmor completely disabled?
  apparmor=0 on the kernel grub boot line?

If it does not reproduce with apparmor=0, what about with apparmor enabled at the kernel level but no policy loaded?
  systemctl disable apparmor.service
Comment 2 Bruno Friedmann 2017-12-18 06:22:52 UTC
Created attachment 753426
Kernel config

Hi John

I didn't see any problems with previous kernels in the 4.14.x series (I had 4.14.3, 4.14.4 and 4.14.5 installed before I purged them yesterday).

cmdline is 

BOOT_IMAGE=/vmlinuz-4.14.6-1-default root=/dev/mapper/vg0-lvtwroot root=LABEL=TWROOT rootfstype=ext4 rootflags=data=writeback noresume crashkernel=512M-:256M rd.vconsole.keymap=ch-fr rd.vconsole.font=ter-v32b.psfu rd.locale.LANG=en_US.UTF-8 rd.luks.allow-discard luks=yes quiet audit=0 plymouth.enabled=0 blacklist=nouveau nvidia-drm.modeset=1 libahci.ignore_sss=1

As I will reboot 2 or 3 times today, I will let you know whether I can reproduce it or get a variant of the crash.
apparmor.service is already disabled here. But if I reproduce it, I will try apparmor=0 on the cmdline.
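
For anyone checking a similar setup: whether a parameter such as apparmor=0 is really on the boot line can be verified by splitting the command line into key/value pairs. Below is a minimal Python sketch. The helper name parse_cmdline is made up for illustration, the string is a shortened copy of the cmdline above, and the kernel's own parsing (quoted values, for instance) is more involved than this.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict of key -> list of values.

    Hypothetical helper for illustration; it splits on whitespace only
    and does not handle quoted values the way the kernel does.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Flags without '=' (e.g. "quiet") are stored with value None.
        params.setdefault(key, []).append(value if sep else None)
    return params


# Shortened copy of the command line from this comment.
cmdline = (
    "BOOT_IMAGE=/vmlinuz-4.14.6-1-default root=/dev/mapper/vg0-lvtwroot "
    "root=LABEL=TWROOT rootfstype=ext4 rootflags=data=writeback quiet audit=0"
)
params = parse_cmdline(cmdline)

apparmor_disabled = params.get("apparmor") == ["0"]
repeated = sorted(k for k, v in params.items() if len(v) > 1)
print("apparmor disabled on cmdline:", apparmor_disabled)
print("parameters given more than once:", repeated)
```

On a running system the same function could be fed the contents of /proc/cmdline instead of a pasted string.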
Comment 3 Jiri Slaby 2017-12-18 07:09:10 UTC
(In reply to John Johansen from comment #1)
> What is your kernel config?

He uses the standard openSUSE kernel, so this would be:
https://github.com/openSUSE/kernel-source/blob/stable/config/x86_64/default
Comment 4 Jiri Slaby 2017-12-18 07:15:36 UTC
(In reply to Bruno Friedmann from comment #0)
> Dec 17 21:05:45 kernel: kernel BUG at ../mm/slab.c:2972!

static __always_inline int alloc_block(struct kmem_cache *cachep,
                struct array_cache *ac, struct page *page, int batchcount)
{
        /*
         * There must be at least one object available for
         * allocation.
         */
        BUG_ON(page->active >= cachep->num);

Michal, am I smelling memory corruption correctly here?
Comment 5 Vlastimil Babka 2017-12-18 07:23:16 UTC
(In reply to Jiri Slaby from comment #4)
> (In reply to Bruno Friedmann from comment #0)
> > Dec 17 21:05:45 kernel: kernel BUG at ../mm/slab.c:2972!
> 
> static __always_inline int alloc_block(struct kmem_cache *cachep,
>                 struct array_cache *ac, struct page *page, int batchcount)
> {
>         /*
>          * There must be at least one object available for
>          * allocation.
>          */
>         BUG_ON(page->active >= cachep->num);
> 
> Michal, am I smelling memory corruption correctly here?

Typically these things are due to slab object double-free. If it's an apparmor-specific slab cache, then apparmor would be the main suspect.
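
To illustrate one way a double free could end in this BUG_ON, here is a deliberately simplified Python model. It is not kernel code: the class SlabPage and its methods are hypothetical, and the model assumes the per-page counter behaves like an unsigned 32-bit integer that wraps when decremented below zero. Whether this is what happened on the reporter's machine is not known.

```python
UINT_MAX = 0xFFFFFFFF  # the model assumes a 32-bit unsigned counter


class SlabPage:
    """Toy model of one slab page: `num` objects, `active` of them in use."""

    def __init__(self, num):
        self.num = num
        self.active = 0
        # freelist[i] is the object index handed out when active == i.
        self.freelist = list(range(num))

    def alloc(self):
        # Mirrors the quoted check: at least one object must be available.
        if self.active >= self.num:
            raise RuntimeError("BUG: active >= num")
        obj = self.freelist[self.active]
        self.active += 1
        return obj

    def free(self, obj):
        # No ownership check: a second free of the same object is accepted.
        self.active = (self.active - 1) & UINT_MAX  # wraps below zero
        # The model skips the freelist write when the index is out of range.
        if self.active < self.num:
            self.freelist[self.active] = obj


page = SlabPage(num=4)
obj = page.alloc()
page.free(obj)
page.free(obj)  # double free: the counter wraps to a huge value
print(hex(page.active))

try:
    page.alloc()
except RuntimeError as err:
    print(err)
```

In this model the second free leaves the counter far above num, so the next allocation from that page trips the same condition as the BUG_ON quoted above.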
Comment 6 John Johansen 2017-12-18 07:58:55 UTC
Right now in the upstream kernel there aren't any apparmor specific slab caches. So this is going to be in the generic slab cache. Nor am I aware of any apparmor changes in 4.14.6 that should affect this code path.

This does look like memory corruption (double free or otherwise). If this is repeatable, it would be good if we could try a kernel with some slab debugging enabled.

I do have some apparmor patches that set up some apparmor-specific slabs, and we could use those to test for an apparmor double free, but the generic slab debugging would be a better way to go if possible.
Comment 7 Bruno Friedmann 2017-12-18 08:45:16 UTC
I have perhaps another clue, which may be specific to my setup.

When 4.14 entered TW, I rebuilt the nvidia driver on my own OBS (same spec as used for openSUSE).

This morning I forced a reinstall of all nvidia and virtualbox packages, so the modules were compiled against the only kernel I have, 4.14.6.

Just to be able to do my day's work, I have added apparmor=0 to the boot line.
The following 2 boots did not crash.

I will try tonight to see what happens if I remove it. I guess that if it is the slab cache or something like that, I shouldn't be the only one affected. (I've reported this case on the factory mailing list, which should help direct any other reports to this bug entry.)
Comment 8 Bruno Friedmann 2017-12-19 06:24:54 UTC
Yesterday evening, when shutting down the system, I also got a trace with RIP in jbd2_journal_destroy_revoke_table.
(Sorry, I took a picture with my phone, but it is stuck there for the moment.)

I feel that Jiri is right about memory corruption: I've found that my 1TB Samsung PM951 NVMe disk is dying. Some config files contain binary blocks, and the speed has dropped from the initial 1800Mbps to between 55 at worst and 500 at best.

I've asked Dell for a replacement part, which will come in a few days.
I will reinstall everything from scratch, as I can't be 100% sure that all the data is good.


The strange thing was the SMART report, with 24.5TB read and 135TB written in 22 months. The ratio worries me a bit. I was using a LUKS-encrypted VG with LVM/ext4 on top.
Perhaps I should avoid luks.discard, LV discard and discard on mount.
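
To put those SMART figures in perspective, the ratio and average write rate can be worked out directly. This is only a small Python calculation on the numbers quoted above; the 30.44-day average month is an assumption made for the per-day estimate.

```python
# Figures taken from the SMART report quoted above.
read_tb = 24.5
written_tb = 135.0
months = 22

write_read_ratio = written_tb / read_tb
written_per_month_tb = written_tb / months
# Assumption for illustration: an average month of 30.44 days.
written_per_day_gb = written_per_month_tb * 1000 / 30.44

print(f"write/read ratio: {write_read_ratio:.2f}")
print(f"written per month: {written_per_month_tb:.2f} TB")
print(f"written per day: {written_per_day_gb:.0f} GB")
```

That is roughly 5.5 times as much data written as read, a little over 6 TB written per month, or on the order of 200 GB per day.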

In conclusion, it seems the kernel is the victim of bad hardware here. I will close this if I can't reproduce it once reinstalled.

If you want any other information about the hardware or setup, just ask before I remove the part.

Thanks for your comments and advice.
Comment 9 Jiri Slaby 2017-12-19 07:31:49 UTC
And if it is not HW, retry with the nvidia and virtualbox modules disabled. Both are suspects in cases like these.
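
The modules in question can be picked out of the oops itself: in the "Modules linked in:" list from comment 0, out-of-tree modules carry an (O) suffix and proprietary ones a P, which matches the "Tainted: P O" line. A small Python sketch, assuming a hypothetical helper flagged_modules and a shortened excerpt of that list:

```python
import re


def flagged_modules(modules_line):
    """Return {module: flags} for entries like 'nvidia(PO)' or 'vboxdrv(O)'.

    Hypothetical helper for illustration. In the oops module list,
    P marks a proprietary module and O an out-of-tree module.
    """
    return dict(re.findall(r"(\w+)\(([A-Z]+)\)", modules_line))


# Shortened excerpt of the "Modules linked in:" lines from comment 0.
excerpt = (
    "fuse af_packet vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) bnep "
    "nvidia_drm(PO) nvidia_modeset(PO) rtsx_pci_sdmmc nvidia_uvm(PO) "
    "nvidia(PO) aesni_intel"
)

for name, flags in sorted(flagged_modules(excerpt).items()):
    print(f"{name}: {flags}")
```

Run against the full list, this would show only the vbox* and nvidia* modules as flagged, which are the ones to remove for a retest.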
Comment 10 Bruno Friedmann 2017-12-28 10:08:01 UTC
OK, I will close it.
After 3 days of running on the new NVMe drive supplied by Dell

Model Number:                       THNSN51T02DUK NVMe TOSHIBA 1024GB
Serial Number:                      874S10AHTAVT
Firmware Version:                   5KDA4103

I don't see any kind of kernel backtrace with the new hardware (hopefully)
Thanks for your ideas.