Bugzilla – Bug 919836
failsafe (nosmp) required to avoid unrecoverable kernel panic during boot
Last modified: 2016-02-23 10:22:05 UTC
Without selecting the failsafe boot option, or otherwise including nosmp on the kernel cmdline, boot either floods tty1 perpetually with a mix of normal messages and call traces, or floods for a while and then announces an automatic reboot in 90 seconds, which never occurs. Recovery requires the reset button or power switch.
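For anyone trying to confirm which way a given boot was started, here is a minimal Python sketch that checks the running kernel's command line for nosmp. The helper names (has_kernel_param, running_cmdline) are made up for illustration and are not part of any openSUSE tool; it only assumes a Linux system with /proc mounted.

```python
import os
from pathlib import Path


def has_kernel_param(cmdline, name):
    """Return True if `name` appears as a bare flag or as name=value."""
    for token in cmdline.split():
        if token == name or token.startswith(name + "="):
            return True
    return False


def running_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    return Path(path).read_text().strip()


if __name__ == "__main__":
    try:
        cmdline = running_cmdline()
    except OSError:
        cmdline = ""  # not Linux, or /proc is not mounted
    print("nosmp on cmdline:", has_kernel_param(cmdline, "nosmp"))
    # With nosmp in effect, an SMP kernel should report a single CPU here.
    print("CPUs visible to Python:", os.cpu_count())
```

Matching whole tokens rather than substrings avoids a false positive from an unrelated parameter that merely contains the text "nosmp".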
Initial opensuse-kernel mailing list post that generated no response except my own follow-ups:
Reproducible here only on one of more than 20 13.2 and Tumbleweed installations on multiboot systems, of which only this and one other boot from or otherwise use mdraid. The other raid system uses a slightly newer E7600 Wolfdale Core2Duo on Intel 82801G/G31/NM10/ICH7 vs. this E6700 Conroe Core2Duo on Intel 82801HR/82P965/ICH8R. Knoppix 7.4.2/kernel 3.16.3 DVD, installed 13.1 with 3.11.x and 3.12.36, and TW with vanilla 3.19.0 work as expected. $SUBJECT occurs with installed 3.16.x, 3.18.3, 3.18.6 and 3.19.0 desktop kernels on 13.2 and TW.
The system on which this occurs is my main LAN-server/web-server/everyday 24/7 system, so availability for testing is necessarily limited, and delay of feedback requiring rebooting for data collection or testing is to be expected.
Created attachment 624716
y2logs from TW
Created attachment 624717
tgz of output from journalctl -k [-b -1|-b] (kernel-desktop|kernel-vanilla)
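Kernel messages for the current and previous boot can be pulled from the systemd journal with journalctl's -k and --boot options. The following is only a sketch of how such logs might be collected from Python; kernel_log_cmd and save_kernel_log are hypothetical helper names, the output file name is arbitrary, and a persistent journal is assumed so that the previous boot is still available.

```python
import subprocess


def kernel_log_cmd(boot=0):
    """Build a journalctl command for the kernel messages of one boot.

    boot=0 selects the current boot, boot=-1 the previous one.
    """
    return ["journalctl", "-k", "--boot={}".format(boot), "--no-pager"]


def save_kernel_log(boot, dest):
    """Write the kernel log of the selected boot to the file `dest`."""
    with open(dest, "w") as out:
        subprocess.run(kernel_log_cmd(boot), stdout=out, check=True)
```

For example, save_kernel_log(-1, "kernel-desktop-prev-boot.txt") would capture the previous boot, provided journalctl is installed and the caller is allowed to read the journal.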
I tried to upgrade RAM from the original 4G to all-new 8G. It passed memtest with no problem beyond the 2+ hours consumed running it, but 13.1, 13.2 and TW would all lock up eventually, anywhere between several seconds into init and after having started SeaMonkey and Firefox in KDE, even after reinstalling the original RAM. I swapped in a motherboard, CPU and NIC from another host, using the brand new RAM (4G max supported by the motherboard). All problems reported here are apparently gone now.
I do wonder how 13.1 managed to mask hardware failure exposed by both 13.2 and TW.
Created attachment 648046
32 bit 13.2 3.16.7-24 panic screen photograph
I no longer believe this is invalid. In June the comment 0 machine ceased to be able to POST. I gave its behavioral history some more thought yesterday, and "re-engineered" its P9657AB-8EKRS2H motherboard: 3 of its 4 2200uF 16V voltage regulator capacitors, which I had previously replaced with new electrical equivalents, are now replaced with 3300uF 16V parts.
I was then able to boot its 32 bit 13.1, 13.2 and TW installations as when I filed this, still needing nosmp on cmdline in 13.2's 3.16.7-7 and TW's 3.19.0 in order to be able to boot without panic. Then I did zypper up on all three installations. 13.2's 3.16.7-24 still needs nosmp to be able to boot without panic, but, like Kubecek's 13.1 3.12.44, TW's 4.1.6-3 produces no sign of any boot time errors that I can see.
With a HD out of another machine in the comment 0 machine, I booted 64 bit 13.1 with 3.12.44, 13.2 with 3.16.7-24, and TW with 4.1.6-1. None of them required or require nosmp to boot without any apparent errors.
I did some disk swapping in order to try other 32 bit kernels between 3.12.44 and 4.1.6. I was unable to reproduce need for nosmp with Mageia 4's 3.14.41 or Fedora 21's 3.17.7. I then installed kernel-vanilla-3.16.7-24. It too works without nosmp, so it looks like this is both valid and specific to openSUSE's 32 bit kernel-desktop.
Is this still worth tracking, or have you already switched to a newer system?
When I filed this, I had just replaced a RAID1 320G HD pair, hosting 11.0, 11.2 and 11.4, with a 1T RAID1 pair hosting 13.1, 13.2 and TW. All the behaviors then described were with the new RAID.
In recent months, the motherboard has been working OK without using nosmp, with a single HD and 64 bit kernels desktop-3.16.7-24 & 29 with a different HD and 13.2 installation. I had to loan out its 500G RAID1 pair for burn-in of the Haswell motherboard destined for this machine with the 1T RAID with which the problem arose. The 500G RAID1 pair was configured to be able to substitute for the 1T RAID1, with 13.1, 13.2 and TW. While in the temporary home of the Haswell, I replaced its TW installation with a 64 bit 13.2. Now that I've returned the 500G RAID to the problem motherboard, *and* done more zypper ups, this is continuing to be non-reproducible in 13.2 in either 32 bit or 64 bit.
All that said, I still think there could be a bug, but one that's either BIOS-setting-related, or in the BIOS itself. I've had to spend a lot of time in BIOS setup in order to acquire stability. Simply setting BIOS to global failsafe isn't sufficient. The problem with tweaking BIOS is I never logged what I was doing, so can't be sure which BIOS setting(s) need to be non-default without disturbing the now comfortable dog. Selected current settings are:
RAM fully manual 6-6-6-18 @ 800MHz
Smooth over clock disabled
Intelligent stepping manual
VCore over voltage +0.0125
RAM voltage +0.200
On-Chip serial ATA setting: SATA Mode
JMB361 controller disabled
Azalia/HD Audio enabled
Floppy controller enabled
Serial port disabled
IrDA port disabled
Parallel port disabled
All USB enabled except keyboard
If I ever stumble onto which setting(s) matter, and think the problem is in software rather than BIOS or hardware, then I can revisit this, but I'm not planning to spend more time changing BIOS settings before the new battery goes dead. Discovery probably won't happen, as I'm about to replace the 32 bit 13.2 with Leap, and the comment 0 RAID will probably not be coupled with the comment 0 motherboard again anyway. So given that the many OS updates, the parts & HD swapping, and the BIOS tweaking have masked or eliminated the trouble:
For the record, with currently installed kernel-desktop-3.16.7-7.1.i686, kernel-desktop-3.16.7-29.1.i686 & kernel-desktop-3.16.7-35.1.i686, and just removed kernel-desktop-3.16.7-24.1.i686, nosmp continues to be required to prevent lockup (or with -7.1, endless loop of Call Traces) trying to boot.
All 13.1 i686 desktop kernels and TW i686 default kernels, and all 13.1, 13.2, 42.1 and TW x86_64 kernels have no such need. Neither does 13.2's kernel-default-3.16.7-35.1.i586, and neither does 13.2's kernel-pae-3.16.7-35.1.i686.