Bug 919836 - failsafe (nosmp) required to avoid unrecoverable kernel panic during boot
Status: RESOLVED WORKSFORME
Classification: openSUSE
Product: openSUSE Distribution
Component: Kernel
Version: 13.2
Hardware: x86 Other
Importance: P5 - None : Normal
Target Milestone: ---
Assigned To: E-mail List
QA Contact: E-mail List
Depends on:
Blocks:
Reported: 2015-02-26 23:20 UTC by Felix Miata
Modified: 2016-02-23 10:22 UTC (History)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
y2logs from TW (1.10 MB, application/octet-stream)
2015-02-26 23:22 UTC, Felix Miata
tgz of output from journalctl -k [-b -1|-b] (kernel-desktop|kernel-vanilla) (38.51 KB, application/octet-stream)
2015-02-26 23:26 UTC, Felix Miata
32 bit 13.2 3.16.7-24 panic screen photograph (120.97 KB, image/jpeg)
2015-09-20 01:58 UTC, Felix Miata

Description Felix Miata 2015-02-26 23:20:53 UTC
Without selecting the failsafe boot option, or otherwise including nosmp on the cmdline, boot either perpetually floods tty1 with a mix of normal messages and call traces, or floods for a while and then announces an automatic reboot in 90 seconds, which never occurs. Recovery requires the reset button or power switch.
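
When comparing boots, it can help to confirm whether nosmp was actually in effect for the running kernel. The following is a minimal sketch, assuming a Linux system that exposes the boot command line at /proc/cmdline; the helper names has_kernel_param and running_cmdline are hypothetical and not part of any tool mentioned in this report.

```python
from pathlib import Path


def has_kernel_param(cmdline: str, name: str) -> bool:
    """Return True if `name` appears as a bare flag or as name=value."""
    for token in cmdline.split():
        # "nosmp" is a bare flag; other parameters look like "root=/dev/md0".
        key = token.split("=", 1)[0]
        if key == name:
            return True
    return False


def running_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    return Path(path).read_text().strip()


if __name__ == "__main__":
    try:
        cmdline = running_cmdline()
    except OSError:
        # Not on Linux, or /proc is not mounted.
        print("could not read /proc/cmdline")
    else:
        state = "present" if has_kernel_param(cmdline, "nosmp") else "absent"
        print(f"nosmp is {state} on the running kernel's cmdline")
```

Matching whole tokens rather than substrings avoids a false positive from an unrelated parameter whose value merely contains the text "nosmp".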

Initial opensuse-kernel mailing list post that generated no response except my own follow-ups:
http://lists.opensuse.org/opensuse-kernel/2015-02/msg00003.html

Reproducible here only on one of more than 20 13.2 and Tumbleweed installations on multiboot systems, of which only this and one other boot from or otherwise use mdraid. The other raid system uses a slightly newer E7600 Wolfdale Core2Duo on Intel 82801G/G31/NM10/ICH7 vs. this E6700 Conroe Core2Duo on Intel 82801HR/82P965/ICH8R. Knoppix 7.4.2/kernel 3.16.3 DVD, installed 13.1 with 3.11.x and 3.12.36, and TW with vanilla 3.19.0 work as expected. $SUBJECT occurs with installed 3.16.x, 3.18.3, 3.18.6 and 3.19.0 desktop kernels on 13.2 and TW.

The system on which this occurs is my main LAN-server/web-server/everyday 24/7 system, so availability for testing is necessarily limited, and delay of feedback requiring rebooting for data collection or testing is to be expected.
Comment 1 Felix Miata 2015-02-26 23:22:52 UTC
Created attachment 624716 [details]
y2logs from TW
Comment 2 Felix Miata 2015-02-26 23:26:22 UTC
Created attachment 624717 [details]
tgz of output from journalctl -k [-b -1|-b] (kernel-desktop|kernel-vanilla)
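
For anyone wanting to gather comparable kernel logs, here is a minimal sketch. It assumes the -k and -b options in the attachment title are the journalctl ones (--dmesg and --boot), and that the journal is stored persistently so that the previous boot is still available. The function names and the output file name are hypothetical.

```python
import subprocess
from typing import List


def kernel_log_cmd(boot_offset: int = 0) -> List[str]:
    """Build a journalctl command for the kernel messages of one boot.

    boot_offset 0 selects the current boot, -1 the previous one.
    --dmesg is the long form of -k, --boot the long form of -b.
    """
    return ["journalctl", "--no-pager", "--dmesg", f"--boot={boot_offset}"]


def save_kernel_log(boot_offset: int, out_path: str) -> None:
    """Run journalctl and write the kernel messages to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(kernel_log_cmd(boot_offset), stdout=out, check=True)
```

Calling save_kernel_log(-1, "kernel-prev-boot.txt") after a failed boot and a successful nosmp boot would capture the failed attempt, provided the panic left anything in the journal at all.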
Comment 3 Felix Miata 2015-03-05 04:40:42 UTC
I tried to upgrade RAM from the original 4G to all-new 8G. The new RAM passed memtest, at a cost of the 2+ hours consumed running it, but 13.1, 13.2 and TW all would eventually lock up, anywhere from several seconds into init to after having started SeaMonkey and Firefox in KDE, even after I reinstalled the original RAM. I then swapped in a motherboard, CPU and NIC from another host, using the brand new RAM (4G max supported by that motherboard). All problems reported here are apparently gone now.

I do wonder how 13.1 managed to mask hardware failure exposed by both 13.2 and TW.
Comment 4 Felix Miata 2015-09-20 01:58:26 UTC
Created attachment 648046 [details]
32 bit 13.2 3.16.7-24 panic screen photograph

I no longer believe this invalid. In June the comment 0 machine ceased to be able to POST. I gave its behavioral history some more thought yesterday, and "re-engineered" its P9657AB-8EKRS2H motherboard: of its 4 2200uF 16V voltage regulator capacitors, which I had previously replaced with new electrical equivalents, I replaced 3 with 3300uF 16V parts.

I was then able to boot its 32 bit 13.1, 13.2 and TW installations as when I filed this, still needing nosmp on cmdline in 13.2's 3.16.7-7 and TW's 3.19.0 in order to be able to boot without panic. Then I did zypper up on all three installations. 13.2's 3.16.7-24 still needs nosmp to be able to boot without panic, but, like Kubecek's 13.1 3.12.44, TW's 4.1.6-3 produces no sign of any boot time errors that I can see.

With a HD out of another machine in the comment 0 machine, I booted 64 bit 13.1 with 3.12.44, 13.2 with 3.16.7-24, and TW with 4.1.6-1. None of them required or require nosmp to boot without any apparent errors.
Comment 5 Felix Miata 2015-09-20 22:29:12 UTC
I did some disk swapping in order to try other 32 bit kernels between 3.12.44 and 4.1.6. I was unable to reproduce need for nosmp with Mageia 4's 3.14.41 or Fedora 21's 3.17.7. I then installed kernel-vanilla-3.16.7-24. It too works without nosmp, so it looks like this is both valid and specific to openSUSE's 32 bit kernel-desktop.
Comment 6 Takashi Iwai 2015-11-19 13:30:00 UTC
Is this still worth tracking, or have you already switched to a newer system?
Comment 7 Felix Miata 2015-11-19 23:38:02 UTC
When I filed this, I had just replaced a RAID1 320G HD pair, hosting 11.0, 11.2 and 11.4, with a 1T RAID1 pair hosting 13.1, 13.2 and TW. All the behaviors then described were with the new RAID.

In recent months, the motherboard has been working OK without using nosmp, with a single HD and 64 bit kernels desktop-3.16.7-24 & 29 with a different HD and 13.2 installation. I had to loan out its 500G RAID1 pair for burn-in of the Haswell motherboard destined for this machine with the 1T RAID with which the problem arose. The 500G RAID1 pair was configured to be able to substitute for the 1T RAID1, with 13.1, 13.2 and TW. While in the temporary home of the Haswell, I replaced its TW installation with a 64 bit 13.2. Now that I've returned the 500G RAID to the problem motherboard, *and* done more zypper ups, this is continuing to be non-reproducible in 13.2 in either 32 bit or 64 bit.

All that said, I still think there could be a bug, but one that's either BIOS-setting-related, or in the BIOS itself. I've had to spend a lot of time in BIOS setup in order to acquire stability. Simply setting BIOS to global failsafe isn't sufficient. The problem with tweaking BIOS is I never logged what I was doing, so can't be sure which BIOS setting(s) need to be non-default without disturbing the now comfortable dog. Selected current settings are:

RAM fully manual 6-6-6-18 @ 800MHz
Sooth over clock disabled
Intelligent stepping manual
VCore over voltage +0.0125
RAM voltage +0.200
APIC enabled
On-Chip serial ATA setting: SATA Mode
JMB361 controller disabled
Azalia/HD Audio enabled
1394 disabled
Floppy controller enabled
Serial port disabled
IrDA port disabled
Parallel port disabled
All USB enabled except keyboard

If I ever stumble onto which setting(s) matter, and think the problem is in software rather than BIOS or hardware, then I can revisit this, but I'm not planning to spend more time changing BIOS settings before the new battery goes dead. Discovery probably won't happen, as I'm about to replace the 32 bit 13.2 with Leap, and the comment 0 motherboard will probably not be coupled with the comment 0 1T RAID again anyway. So, given that the many OS updates, the parts & HD swapping, and the BIOS tweaking have masked or eliminated the trouble:
	->
		worksforme
Comment 8 Felix Miata 2016-02-23 10:22:05 UTC
For the record, with the currently installed kernel-desktop-3.16.7-7.1.i686, kernel-desktop-3.16.7-29.1.i686 & kernel-desktop-3.16.7-35.1.i686, and the just removed kernel-desktop-3.16.7-24.1.i686, nosmp continues to be required to prevent lockup (or, with -7.1, an endless loop of Call Traces) when trying to boot.

All 13.1 i686 desktop kernels and TW i686 default kernels, and all 13.1, 13.2, 42.1 and TW x86_64 kernels, have no such need. Neither does 13.2's kernel-default-3.16.7-35.1.i586, and neither does 13.2's kernel-pae-3.16.7-35.1.i686.
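
The 13.2 results above can be tabulated by kernel flavor. The following is a rough sketch that assumes the package names follow the kernel-FLAVOR-VERSION-RELEASE.ARCH pattern seen in this comment; the regular expression and helper names are hypothetical, and the data is limited to the six 13.2 packages named here.

```python
import re
from typing import Dict, NamedTuple, Set, Tuple


class KernelPkg(NamedTuple):
    flavor: str
    version: str
    release: str
    arch: str


# Assumed naming pattern, e.g. kernel-desktop-3.16.7-35.1.i686
_PKG_RE = re.compile(
    r"kernel-(?P<flavor>[a-z]+)-(?P<version>[\d.]+)"
    r"-(?P<release>[\d.]+)\.(?P<arch>\w+)"
)


def parse_kernel_pkg(name: str) -> KernelPkg:
    """Split a kernel package name into flavor, version, release, arch."""
    m = _PKG_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"unrecognized kernel package name: {name}")
    return KernelPkg(**m.groupdict())


# True means nosmp was needed to boot, per this comment (32 bit 13.2 only).
OBSERVATIONS: Dict[str, bool] = {
    "kernel-desktop-3.16.7-7.1.i686": True,
    "kernel-desktop-3.16.7-24.1.i686": True,
    "kernel-desktop-3.16.7-29.1.i686": True,
    "kernel-desktop-3.16.7-35.1.i686": True,
    "kernel-default-3.16.7-35.1.i586": False,
    "kernel-pae-3.16.7-35.1.i686": False,
}


def flavors_by_result(obs: Dict[str, bool]) -> Tuple[Set[str], Set[str]]:
    """Return (flavors that needed nosmp, flavors that booted without it)."""
    needs: Set[str] = set()
    fine: Set[str] = set()
    for name, needed in obs.items():
        (needs if needed else fine).add(parse_kernel_pkg(name).flavor)
    return needs, fine


if __name__ == "__main__":
    needs, fine = flavors_by_result(OBSERVATIONS)
    print("needs nosmp:", sorted(needs))
    print("boots without nosmp:", sorted(fine))
```

Grouped this way, only the desktop flavor falls in the "needs nosmp" set, while default and pae at the same 3.16.7-35.1 level do not.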