Bug 1028575 - System randomly freezes or crashes to the login screen, glitches until rebooted
Status: RESOLVED DUPLICATE of bug 1084767
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Kernel
Version: Current
Hardware: x86-64 Other
Priority: P2 - High
Severity: Critical
Target Milestone: ---
Assigned To: Michal Srb
Reported: 2017-03-08 23:04 UTC by Mircea Kitsune
Modified: 2018-04-17 10:57 UTC (History)

Attachments
Xorg.0.log.old (48.94 KB, text/plain)
2017-03-21 19:53 UTC, Mircea Kitsune
Xorg.0.log (106.66 KB, text/plain)
2017-03-21 19:53 UTC, Mircea Kitsune
xsession-errors-:0 (137.64 KB, text/plain)
2017-03-21 19:54 UTC, Mircea Kitsune
journalctl (902.02 KB, text/x-vhdl)
2017-03-21 19:56 UTC, Mircea Kitsune
dmesg (209.44 KB, text/plain)
2017-03-21 19:56 UTC, Mircea Kitsune
lspci (5.81 KB, text/plain)
2017-03-21 19:57 UTC, Mircea Kitsune
zypper history (169.74 KB, text/plain)
2017-05-10 19:48 UTC, Mircea Kitsune
Photo of the corrupt image on the screen (1.75 MB, image/jpeg)
2017-06-18 18:59 UTC, Mircea Kitsune
Screenshot of "top" (204.40 KB, image/png)
2017-07-05 13:14 UTC, Mircea Kitsune
Memtest86 screenshot (2.34 MB, image/jpeg)
2017-08-04 12:32 UTC, Mircea Kitsune
Output of "dmesg -w" (89.04 KB, text/plain)
2017-08-05 12:24 UTC, Mircea Kitsune
Output of "dmesg -w" (full) (463.67 KB, text/plain)
2017-08-07 20:52 UTC, Mircea Kitsune

Description Mircea Kitsune 2017-03-08 23:04:15 UTC
Approximately once every 1 to 3 days of uptime, the system experiences a sudden and inexplicable crash: The image completely freezes in place, although unlike similar crashes in the past I can keep moving the mouse pointer around. A few seconds afterward, I find myself in a black console... and a few seconds after that, I'm back at the login screen. If I attempt to log back in however, the image either freezes again or desktop effects are no longer working without any error message as to why. Not even forcefully restarting X11 (control + alt + backspace twice) fixes the remaining glitches, and the only way to truly recover the system is to also reboot.

Although the crashes are completely random, I vaguely get the impression they might be happening when an event triggers certain desktop effects. Several times the crash occurred as a system tray notification popped up, whereas just now the system crashed while I was switching desktops in the middle of the desktop cube animation. Certain games might also have a probability of causing this.

I use the free video drivers and default system packages, all latest versions of openSUSE Tumbleweed. My card is a Radeon R7 370, GCN 1.0 on RadeonSI.
Comment 1 Mircea Kitsune 2017-03-14 13:27:37 UTC
This continues to be a huge issue with the latest Tumbleweed packages! The system now freezes and crashes on an hourly basis. It's likely something triggering a GPU crash, as it only seems to happen the moment certain desktop effects occur or certain panels pop up; even bringing up the taskbar (from auto-hide) or clicking the arrow which expands the system tray can cause it.

The only way to recover the system without restarting is to immediately hit "control + alt + backspace" to kill X11 upon noticing the freeze, before it has time to become permanent or the glitches I mentioned take place. Can anyone else confirm this please, and let us know whether a solution is being worked on?
Comment 2 Mircea Kitsune 2017-03-18 02:12:45 UTC
The issue has persisted from kernel 4.10.1 through 4.10.3. The new kernel has updated drivers, including radeon, which is what my card uses. This implies it might not be a driver issue, although I'm not sure where else such a trigger may be hidden.
Comment 3 Mircea Kitsune 2017-03-21 17:11:22 UTC
I discovered an important detail today: The crash is not limited to KDE desktop effects, unlike most crashes of this sort and what I initially suspected! I've had compositing disabled for several hours (alt + shift + F12) yet the exact same crash occurred just now: All windows and buttons froze in place while only the mouse cursor could be moved, the monitor then turned itself on and off a few times, and I found myself back to the login manager. Upon killing X11 and logging back in, I was presented with a window asking to check my compositor settings... indicating that KDE might have detected compositing as a source, although it's been entirely turned off so that couldn't be true.

Could a maintainer or developer familiar with this part of the system please be dispatched to the issue? I reported it two weeks ago, and so far there has been no official response here. While I understand the developers are busy with other concerns, this is a major problem due to which I can't keep my system from being shut down at random daily intervals. Considering that such a crash may occur anytime, including during a package update or while other processes are handling data, it leaves my system at risk of data corruption!

Additionally, could I please request that updates for x11 and the AMD video driver are prioritized? http://tumbleweed.boombatower.com indicates that Devel has the following versions pending: xorg-x11-server 1.19.3 (currently 1.19.2), xf86-video-ati 7.9.0 (currently 7.8.0), Mesa 17.0.2 (currently 17.0.1). Yesterday's Factory snapshot appears to have ignored these packages. As Kernel updates don't seem to affect the issue, I'm hoping a new version of one of these components might include a solution. Thank you.
Comment 4 Mircea Kitsune 2017-03-21 19:53:27 UTC
Created attachment 718222
Xorg.0.log.old
Comment 5 Mircea Kitsune 2017-03-21 19:53:56 UTC
Created attachment 718223
Xorg.0.log
Comment 6 Mircea Kitsune 2017-03-21 19:54:53 UTC
Created attachment 718224
xsession-errors-:0
Comment 7 Mircea Kitsune 2017-03-21 19:56:11 UTC
Created attachment 718225
journalctl
Comment 8 Mircea Kitsune 2017-03-21 19:56:59 UTC
Created attachment 718226
dmesg
Comment 9 Mircea Kitsune 2017-03-21 19:57:33 UTC
Created attachment 718227
lspci
Comment 10 Mircea Kitsune 2017-03-21 20:09:53 UTC
A note regarding the logs I just posted: The last crash of this sort (with desktop effects disabled) occurred around 18:00 on 21-03-2017 in my local time. Look around that hour in the logs that have timestamps available.
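For anyone reading the journal on a live system rather than the attached dump, a window around the crash time can be requested directly. The following is a minimal sketch, assuming the systemd journal is in use; the helper name `journalctl_window` and the 30 minute margin are illustrative choices, and the crash time is only the approximate one mentioned above.

```python
import shlex
from datetime import datetime, timedelta


def journalctl_window(crash_time, margin_minutes=30):
    """Build a journalctl command line covering crash_time +/- the margin."""
    margin = timedelta(minutes=margin_minutes)
    fmt = "%Y-%m-%d %H:%M:%S"
    return [
        "journalctl",
        "--since", (crash_time - margin).strftime(fmt),
        "--until", (crash_time + margin).strftime(fmt),
        "--no-pager",
    ]


if __name__ == "__main__":
    # Approximate local time of the last crash; adjust as needed.
    crash = datetime(2017, 3, 21, 18, 0)
    cmd = journalctl_window(crash)
    # Print a shell-ready command instead of running it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The printed command can be pasted into a terminal, or the list can be handed to `subprocess.run` if the output should be captured from a script.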
Comment 11 Mircea Kitsune 2017-03-21 20:46:55 UTC
I apologize for the chain of comments! I'm also discussing this issue on IRC (#dri-devel on irc.freenode.org) and someone seemingly familiar with the code took a look at the logs I posted. Apparently this is a GPU lockup after all. They suggested I tell the llvm packagers to revert r280589. Someone else also mentioned a regression in llvm 3.9.1.

The log of our conversation can also be found online, in case I happened to miss any significant details: https://people.freedesktop.org/~cbrill/dri-log/?channel=dri-devel&date=2017-03-21
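To make the "GPU lockup" diagnosis easier to check against a saved dmesg or journal dump, the log can be scanned for the messages the radeon kernel driver usually prints in that situation. This is only a sketch: the patterns below are assumptions about the message wording and may need adjusting, and the sample lines are made up for illustration rather than taken from the attachments.

```python
import re

# Assumed message shapes from the radeon kernel driver; adjust to your logs.
LOCKUP_PATTERNS = [
    re.compile(r"GPU lockup"),
    re.compile(r"ring \d+ stalled"),
    re.compile(r"GPU (soft)?reset"),
]


def find_lockup_lines(log_text):
    """Return (line_number, line) pairs that look like GPU lockup reports."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in LOCKUP_PATTERNS):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Made-up sample lines, not copied from the attached logs.
    sample = (
        "[  100.000000] usb 1-1: new high-speed USB device number 2\n"
        "[  200.000000] radeon 0000:01:00.0: ring 0 stalled for more than 10000msec\n"
        "[  200.000100] radeon 0000:01:00.0: GPU lockup (current fence id 0x1 last fence id 0x2 on ring 0)\n"
    )
    for number, line in find_lockup_lines(sample):
        print(number, line)
```

Running the function over the text of a saved dmesg file gives the line numbers to look at first, which is quicker than reading several hundred kilobytes of log by hand.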
Comment 12 Takashi Iwai 2017-03-27 13:36:58 UTC
OK, let's reassign to the LLVM maintainer.  Feel free to assign it back if it turns out to be a kernel problem.
Comment 13 Mircea Kitsune 2017-03-29 19:42:53 UTC
I'm happy to announce that since xorg-x11-server 1.19.3 and / or xf86-video-ati 7.9.0 the issue appears to have been remedied. I now have over 5 days of uptime, with not a single GPU freeze caused by the desktop. If by chance the problem returns, I'll definitely post an update and let everyone know. It would be much appreciated if in the future, the openSUSE team and driver developers could please consider more in-depth testing for GPU lockups, to prevent this sort of thing from repeating.

I imagine this issue can be marked as resolved. I will reopen it if I see the problem happening again. Feel free to otherwise reopen it if anyone believes there's still something to investigate.
Comment 14 Mircea Kitsune 2017-04-02 15:04:33 UTC
To my stupefaction, the exact same issue has been re-introduced and is happening again. After another openSUSE Tumbleweed snapshot and several package updates, I believe approximately 2 days ago, the freeze occurs once more. I can however confirm that for approximately 8 days before that, the problem went away as I had one week of uninterrupted uptime. Can someone please analyze what changes fit this time pattern?
Comment 15 Mircea Kitsune 2017-04-06 12:09:56 UTC
I feel that at this point, I should express my disappointment regarding the lack of attention this report has received over the course of a month. I marked it as high priority (which I believe it is), posted about this on the forum, and wrote about it to the Factory mailing list (where it was completely ignored). To this day, I still have no idea what this is or when and how it might be fixed.

While I understand that openSUSE is a free project and its team has many things to deal with, this is a major stability issue that literally makes it impossible to keep a machine running safely! I'm not asking anyone to fix it right this instant... however I expected the developers to have at least taken a look through the logs, potentially suggested something safe I could try, and offered insight as to what a plausible suspected cause might be.

I'll also link the same bug report on freedesktop.org, in case any info posted there might be helpful: http://bugs.freedesktop.org/show_bug.cgi?id=100306
Comment 16 Ismail Dönmez 2017-04-06 12:18:04 UTC
Hi Mircea,

(In reply to Mircea Kitsune from comment #15)
> I feel that at this point, I should express my disappointment regarding the
> lack of attention this report has received over the course of a month. I
> marked it as high priority (which I believe it is), posted about this on the
> forum, and wrote about it to the Factory mailing list (where it was
> completely ignored). To this day, I still have no idea what this is or when
> and how it might be fixed.

Sorry for my silence but I am already working on upgrading llvm to 4.0 which I hope will fix this issue, if not then we can ask for a revert on the llvm mailing list for the problematic commit. Randomly reverting commits on llvm can have bad consequences. Sadly the Open Build Service seems to be slow these days and it takes a lot of back and forth to see if my commits work or not. You can see the progress in https://build.opensuse.org/package/show/devel:tools:compiler/llvm4

Thanks,
ismail
Comment 17 Mircea Kitsune 2017-04-06 13:03:00 UTC
(In reply to Ismail Donmez from comment #16)

Thanks for the info! An llvm update sounds like something that might fix this issue, let's hope llvm 4.0 does. If not, I don't know what the problematic commit might be, so apart from safely doing a reversion we'd also have to find that out. The system takes at least 3 days of uninterrupted uptime to even slightly confirm the absence of this freeze, so testing it is also hard.

One of my suggestions in the other bug report was dumping my package installation history over the last month. That way we could see which package changes match my timeline (problem first starting around 08-03-2017, stopping near 22-03-2017, then happening again on 02-04-2017). However there doesn't seem to be a zypper command to dump package installation history over a fixed amount of time... if anyone knows of one please let me know.
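In the absence of a dedicated zypper command, the history log that libzypp keeps in /var/log/zypp/history can be filtered by date with a short script. The sketch below assumes the usual pipe-separated layout (timestamp, action, package, version, ...) with comment lines starting with "#"; the function name `history_between` and the sample entries are illustrative, not taken from a real system.

```python
from datetime import date, datetime


def history_between(lines, start, end, actions=("install", "remove")):
    """Yield (timestamp, action, package, version) for entries dated start..end."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("|")
        if len(fields) < 4:
            continue
        action = fields[1].strip()  # the action field may be space-padded
        if action not in actions:
            continue
        try:
            stamp = datetime.strptime(fields[0].strip(), "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped entry
        if start <= stamp.date() <= end:
            yield stamp, action, fields[2], fields[3]


if __name__ == "__main__":
    # Made-up entries in the assumed format, for illustration only.
    sample = [
        "# example history\n",
        "2017-03-07 10:00:00|install|Mesa|17.0.0-1.1|x86_64|root@host|repo-oss|abc|\n",
        "2017-03-08 21:14:02|install|Mesa|17.0.1-1.1|x86_64|root@host|repo-oss|abc|\n",
    ]
    for stamp, action, package, version in history_between(
        sample, date(2017, 3, 8), date(2017, 3, 22)
    ):
        print(stamp, action, package, version)
```

On a real system the same function could be fed the open history file instead of the sample list; reading that file may require root privileges.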
Comment 18 kevin Zhu 2017-04-14 04:35:48 UTC
Hello 

Since I installed Tumbleweed last month I have experienced the same issue at least 3 times: my monitor screen would randomly freeze while I was using Firefox, but the mouse cursor could still move. Ctrl+Alt+F1 didn't work.

Fingers crossed.

Regards,
Kevin
Comment 19 Mircea Kitsune 2017-04-14 12:24:52 UTC
(In reply to kevin Zhu from comment #18)

I'm sorry to hear that you're also dealing with this, but glad to know I'm not alone here. Still waiting for LLVM 4.0 myself, but so far it hasn't been pulled from Devel into Factory and is therefore not planned for Tumbleweed in the foreseeable future.

Not much has changed on my end otherwise: I actually managed to get 3 days of uptime, then last night it happened again. I think updates to various packages slightly change the frequency of the freeze, whereas the issue itself has never been fixed since my report.
Comment 20 Mircea Kitsune 2017-04-14 14:01:21 UTC
Very important note: While investigating a completely unrelated bug, I remembered that KWin was configured to use egl over glx on my machine. I believe glx is the older OpenGL platform interface, whereas egl is the newer one and is visibly a lot faster.

Considering that desktop activity was always the cause of the freezes in some form, I have a strong suspicion that this might have something to do with the GPU freezes as well! It's a very likely candidate because egl involves experimental OpenGL rendering, and since it's not enabled by default that would also explain why few other people are able to reproduce the GPU lockups. I have just now switched back to glx and therefore haven't had the time to confirm this, but I'm willing to bet it just might make the problem go away... I will immediately post an update if and when I'm proven wrong obviously.

If anyone else wants to test this theory and help in reproducing the system freeze, consider switching to egl rendering. Obviously this means your machine might start freezing as well, so only do this if there's no risk of data loss or major annoyance. The switch is done by opening ~/.config/kwinrc in a text editor and changing:

GLPlatformInterface=glx

to

GLPlatformInterface=egl
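Switching between the two values can also be scripted, which helps when flipping back and forth while testing. This is a rough sketch that only rewrites an existing `GLPlatformInterface` line in the text of kwinrc; the function name is made up for illustration, and the `[Compositing]` group in the sample is an assumption about where the key normally lives.

```python
import re

KEY = "GLPlatformInterface"


def set_gl_platform_interface(config_text, value):
    """Return config_text with the existing GLPlatformInterface line set to value."""
    if value not in ("egl", "glx"):
        raise ValueError("expected 'egl' or 'glx'")
    pattern = re.compile(r"^" + KEY + r"=.*$", flags=re.MULTILINE)
    new_text, count = pattern.subn(KEY + "=" + value, config_text)
    if count == 0:
        # Do not guess where the key belongs; leave that to a manual edit.
        raise KeyError(KEY + " not found in the given text")
    return new_text


if __name__ == "__main__":
    # Illustrative fragment; a real kwinrc contains many more groups and keys.
    sample = "[Compositing]\nBackend=OpenGL\nGLPlatformInterface=egl\n"
    print(set_gl_platform_interface(sample, "glx"))
```

Keeping a backup copy of ~/.config/kwinrc before writing the result back is advisable, and KWin most likely has to be restarted (or the session logged out and back in) before the new value takes effect.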

Note that if you're using an Aurorae theme, you might experience the bug I mentioned above, which involves KWin no longer rendering window decorations. Here are the reports for that in case anyone is curious:

https://bugzilla.opensuse.org/show_bug.cgi?id=1033598
https://bugs.kde.org/show_bug.cgi?id=378663
Comment 21 Mircea Kitsune 2017-04-18 01:31:25 UTC
Nope, still happens in glx as well. Switching away from egl rendering does not make the freezes go away.
Comment 22 Ismail Dönmez 2017-04-18 10:02:15 UTC
llvm4 is now accepted into Factory and should be in the next update.
Comment 23 kevin Zhu 2017-04-18 10:12:06 UTC
(In reply to Ismail Donmez from comment #22)
> llvm4 is now accepted into Factory and should be in the next update.

Thanks Ismail. As an enthusiastic non-developer Linux user, I wonder whether there is any way to apply the patch myself without waiting for a build? Is there any guide, or just roughly the direction, that you may kindly show me? :)

Many thanks,
Kevin
Comment 24 Ismail Dönmez 2017-04-18 10:13:43 UTC
(In reply to kevin Zhu from comment #23)
> (In reply to Ismail Donmez from comment #22)
> > llvm4 is now accepted into Factory and should be in the next update.
> 
> Thanks Ismail. As an enthusiastic non-developer Linux user, I wonder whether
> there is any way to apply the patch myself without waiting for a build? Is
> there any guide, or just roughly the direction, that you may kindly show me?
> :)

Since more than one package is involved here (llvm, Mesa, ...) I would really prefer that you wait for the next update. It should happen this week; if not, I'll create a test repo for you people.
Comment 25 Mircea Kitsune 2017-04-18 12:59:52 UTC
(In reply to Ismail Donmez from comment #22)
> llvm4 is now accepted into Factory and should be in the next update.

Thank you very much, that is great news! I'll monitor its progress at http://tumbleweed.boombatower.com and immediately update once it hits Tumbleweed. Let's just hope it truly makes the lockup go away... it's been happening for so long now, I'll only draw a verdict after I see it gone for at least a month.
Comment 26 Mircea Kitsune 2017-04-20 13:23:12 UTC
llvm 4.0.0 is now in openSUSE Tumbleweed: I have performed a 'zypper dup', installed it, and restarted. Now it's time to see if this really makes the freeze go away.

Please allow me to keep this issue open for another month to make sure it's truly gone: If through some stretch of imagination I see the problem again, I'll immediately post a reply here and let everyone know! I need to take at least several days to have no doubt it has been solved, due to how far its probability seems to have ranged. Thank you.
Comment 27 Ismail Dönmez 2017-04-20 13:25:25 UTC
(In reply to Mircea Kitsune from comment #26)
> llvm 4.0.0 is now in openSUSE Tumbleweed: I have preformed a 'zypper dup',
> installed it, and restarted. Now it's time to see if this really makes the
> freeze go away.

Sadly Mesa was still built with llvm 3.9. I asked the Factory maintainers to recompile all llvm dependencies, and that should hopefully be available soon.

Thank you!
Comment 28 Mircea Kitsune 2017-04-20 13:43:08 UTC
(In reply to Ismail Donmez from comment #27)

Oh... thank you for this info. I also noticed that Mesa 17.0.4 is waiting in Devel... maybe an update on that can also be pulled in during the process? Anyway I'll wait until the recompilation happens, and will ignore the freeze until then if I see it again.
Comment 29 Mircea Kitsune 2017-04-24 18:11:33 UTC
A 'zypper dup' earlier today upgraded my machine to Kernel 4.10.10 and reinstalled Mesa 17.0.3. I understand this means that llvm 4.0.0 should be in effect on both of these components, and if llvm was at fault the issue should now disappear. I've had more freezes during the last few days, and if somehow they continue even after this we'll seriously need to dig further. I will immediately let everyone know if I see another freeze from this point on.
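Whether Mesa is really using the new llvm can be checked rather than assumed. One way, sketched below, relies on the OpenGL renderer string (as printed by `glxinfo`) containing the LLVM version, which radeonsi builds of this period typically did; that is an assumption, and the sample string is illustrative rather than copied from this machine.

```python
import re


def llvm_version_from_renderer(renderer):
    """Extract the version after 'LLVM' from a renderer string, or None."""
    match = re.search(r"LLVM (\d+(?:\.\d+)*)", renderer)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Illustrative renderer string in the assumed radeonsi format.
    sample = "Gallium 0.4 on AMD PITCAIRN (DRM 2.49.0 / 4.10.10-1-default, LLVM 4.0.0)"
    print(llvm_version_from_renderer(sample))
```

If the function returns a 3.9.x version after the update, Mesa was probably not rebuilt against llvm 4.0.0 yet.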
Comment 30 Ismail Dönmez 2017-04-25 12:24:07 UTC
(In reply to Mircea Kitsune from comment #29)
> A 'zypper dup' earlier today upgraded my machine to Kernel 4.10.10 and
> reinstalled Mesa 17.0.3. I understand this means that llvm 4.0.0 should be
> in effect on both of these components, and if llvm was at fault the issue
> should now disappear. I've had more freezes during the last few days, and if
> somehow they continue even after this we'll seriously need to dig further. I
> will immediately let everyone know if I see another freeze from this point
> on.

Yes, please.
Comment 31 Mircea Kitsune 2017-05-09 22:16:54 UTC
The same freeze happened again today, after a two week period of not seeing the problem any more. This is reaching the point where it's becoming outright ridiculous: I've used openSUSE for years, and have never seen such a thing spanning over such a long time period.

Unfortunately I can't risk breaking my system by installing custom versions of llvm or the system wide Mesa. I can only hope the developers know what to bisect to find out where this fault lies, based on the logs I've attached here. Let me know if there is more info I could somehow help with.
Comment 32 kevin Zhu 2017-05-09 23:53:32 UTC
I 'downgraded' the system to 42.2 GNOME and it was OK for the first few days, until there were a couple of freezes.

I guess the issue has to do with the NVIDIA video card's open source video driver. I will try to set up the proprietary driver and see what happens.
Comment 33 Mircea Kitsune 2017-05-10 19:48:07 UTC
Created attachment 724597
zypper history

I'm going to include one additional piece of information: this month's entries of /var/log/zypp/history. Like I said, the problem does seem to have gone away once llvm 4.0 was installed and Mesa was recompiled with it... it's yesterday that it suddenly came back, after over 2 weeks of no problems. Feel free to look at what was updated in approximately the last 3 days, which I imagine might hold a clue as to where this keeps being solved and then reintroduced.
Comment 34 Mircea Kitsune 2017-05-19 12:57:57 UTC
Once again, I'm dealing with at least one system crash per day. The latest one happens even after upgrading to the 4.11.0 Kernel, meaning the error was ported to it as well.

Why are the developers so incapable this time? This has been happening for nearly 3 darn months! The problem has been fixed twice, and each time it's returned after over 2 weeks. Can someone please explain what the heck is going on here? At this stage, it feels like someone is actively developing and updating this freeze against the latest system components... I don't even see how it could survive through this many package updates by chance alone, it is ridiculous.

And please don't lecture me on how this is free software, and I should only be complaining if I was actually paying the developers: There is a limit to how badly an important piece of software, be it a free Linux distribution or component, can break and stay broken. To literally be unable to keep a system running without the image suddenly freezing and the monitor shutting down every single day, for over a quarter of a year... that goes far beyond that limit.

I'm sorry for the outburst, but at this stage I think something needed to be said. I did not expect something like this to get dragged so far, and that I'd be unable to keep my system running for months. I'm going to bump the severity of this issue again, in hopes that someone can please take a look at it so I can run my system normally again! Thank you.
Comment 35 Ismail Dönmez 2017-05-19 13:08:43 UTC
(In reply to Mircea Kitsune from comment #34)
> Once again, I'm dealing with at least one system crash per day. The latest
> one happens even after upgrading to the 4.11.0 Kernel, meaning the error was
> ported to it as well.

I understand your frustration but basically logs show that you are getting random GPU lockups, searching Google for "radeon r7 370 lockup" shows a lot of results even from Windows. So at this point I believe this might be a hardware problem.
Comment 36 Takashi Iwai 2017-05-19 13:16:36 UTC
If it's a GPU lockup, and it's in the fairly recent upstream code, could you rather raise the report to upstream (e.g. bugzilla.freedesktop.org)?  There you may get more attention.

Also, please don't touch the priority field of bugzilla.  It's not the field the reporter may modify.  You can adjust the severity, but not the priority.  Thanks.
Comment 37 Mircea Kitsune 2017-05-19 13:21:37 UTC
(In reply to Ismail Donmez from comment #35)
> I understand your frustration but basically logs show that you are getting
> random GPU lockups, searching Google for "radeon r7 370 lockup" shows a lot
> of results even from Windows. So at this point I believe this might be a
> hardware problem.

I considered a hardware problem, but that seems unlikely for many reasons: The video card is still very new (about a year now) and a solid Gigabyte model... I've had another Radeon / Gigabyte card before it, which lasted me for nearly a decade. This issue is also triggered only by the KDE desktop, when certain desktop effects play or I select some windows (even with effects turned off)... I play high-end games on my machine, yet even they don't trigger this exact crash! The behavior of the crash also closely indicates a software problem... hardware issues usually cause a total freeze (Kernel panic), whereas this only freezes the desktop (without mouse pointer) then causes the monitor to shut down.

(In reply to Takashi Iwai from comment #36)
> If it's a GPU lockup, and it's in the fairly recent upstream code, could you
> rather raise the report to upstream (e.g. bugzilla.freedesktop.org)?  There
> you may get more attention.
> 
> Also, please don't touch the priority field of bugzilla.  It's not the field
> the reporter may modify.  You can adjust the severity, but not the priority.
> Thanks.

I have a report there as well, mirroring this one: https://bugs.freedesktop.org/show_bug.cgi?id=100306 Since only windows in KDE seem to trigger this specific crash, I'll probably make a third mirror on KDE's bug tracker as well... need to consider if it's related closely enough to KDE for that.
Comment 38 Ismail Dönmez 2017-05-19 13:29:05 UTC
(In reply to Mircea Kitsune from comment #37)
> (In reply to Ismail Donmez from comment #35)
> > I understand your frustration but basically logs show that you are getting
> > random GPU lockups, searching Google for "radeon r7 370 lockup" shows a lot
> > of results even from Windows. So at this point I believe this might be a
> > hardware problem.
> 
> I considered a hardware problem, but that seems unlikely for many reasons:
> The video card is still very new (about an year now) and a solid Gigabyte
> model... I've had another Radeon / Gigabyte card before it, which lasted me
> for nearly a decade. This issue is also triggered only by the KDE desktop,
> when certain desktop effects play or I select some windows (even with
> effects turned off)... I play high-end games on my machine, yet even they
> don't trigger this exact crash! The behavior of the crash also closely
> indicates a software problem... hardware issues usually cause a total freeze
> (Kernel panic), whereas this only freezes the desktop (without mouse
> pointer) then causes the monitor to shut down.

I can't see it in this bug report, but did you try to set kwin to use XRender instead of OpenGL?
Comment 39 Mircea Kitsune 2017-05-19 13:44:27 UTC
(In reply to Ismail Donmez from comment #38)
> I can't see if in this bug report but did you try to set kwin to use XRender
> instead of OpenGL?

If you're referring to desktop compositing, I tried disabling it altogether. This might make the problem slightly rarer, but it is not the root cause: The freezes can still occur even without effects enabled at all.

The trigger always seems to be selecting another window or popping up a new panel: If I put the mouse pointer near a panel to make it appear, or alt-tab switch to another window, or switch to a different desktop... that's when there's a chance that it happens. It never happens when I do something in any application, only when I do something that changes the desktop.
Comment 40 Mircea Kitsune 2017-05-21 11:29:16 UTC
More info after the latest crash: In ~/.config/kwinrc I tried setting GLCore=false and GLColorCorrection=false, however neither of these seems to affect the problem although I found them suspicious. I wonder if other such settings, such as vsync (tearing prevention) could make a difference... there are many combinations to try, and I'm not sure which also affect the system when compositing is off.

When the system doesn't completely freeze, the behavior of the problem can be very strange at times; Last night after the desktop froze, I quickly hit Control + Alt + Backspace several times to kill X11... the system went into a console, which started flashing continuously together with the HDD led. I couldn't do anything on my keyboard and mouse... but once I pressed the power button it stopped and the machine shut itself down quickly and cleanly, meaning it still received the power off signal and managed to recover from that.
Comment 41 Mircea Kitsune 2017-05-29 20:41:46 UTC
In weekly news, it appears a recent openSUSE Tumbleweed snapshot (which among other changes upgraded Mesa 17.0.5 to 17.1.0) made the issue a lot rarer for the time being: Until this snapshot it was even happening twice a day, hence why it was starting to drive me nuts... now I only seem to get this freeze every 2-3 days of uptime, which is so much more bearable! I worry whether the next snapshot will make the issue more frequent again rather than better... clearly it's one of the important system packages that's messing with it, but I still have no idea which and how.
Comment 42 Mircea Kitsune 2017-06-07 21:04:08 UTC
After another two weeks of absence, the issue was apparently reimplemented on top of Kernel 4.11.3 + Mesa 17.1.1 + Plasma 5.10.0, likely sometime during the last few days. The behavior is once again identical, with alt-tab switching or desktop effects causing everything but the mouse pointer to freeze then after 10 seconds the monitor shuts down. Other unrelated GPU crashes (such as those caused by some games) behave by the classic model, where the entire system simply freezes in place at once... that's a very different result from this freeze, and likely confirms this is a different type of crash.

At this point I have almost no doubt this is an attack that's being deliberately programmed, and manually reimplemented on top of new drivers once it gets fixed. The cycle seems to be that a kernel or driver update resolves the issue, then the creators of the crash require about two weeks to patch it and reimplement the exact same functionality. This is the 4th time the story repeats.

I tried steering away from this possibility until I was sure, as I didn't want it triggering any unnecessary arguments... if this is an attack then investigating it as such might help in finding its source more quickly. There's simply no way something this precise could happen by itself for nearly half a year, always coming back with the exact same effects after a period of absence... all despite radical changes to nearly every driver and system component, which would have no doubt altered the behavior of the initial problem in some form. Therefore I hope everyone can see why I'm now going with this theory and greatly considering the option of malicious intent.

I have no idea how the virus (?) could be updated on my computer, as it's likely not through the package update system directly. However I suspect it's using a constant series of vulnerabilities in one or more system components, which should be fixed by the developers if they exist. I would appreciate any ideas on both how the malicious code might be inserted into the computer, as well as finding the vulnerability within radeon / Mesa / X11 / etc that it exploits. Please let me know what your thoughts are!
Comment 43 Mircea Kitsune 2017-06-08 23:44:37 UTC
An update: The latest reimplementation of the freeze appears to be worse than all others. My system can now be taken down after only 4 hours of uptime! This is a huge difference from all previous versions of the crash, which required that the system had at least been running for a day before it could be crashed.
Comment 44 Mircea Kitsune 2017-06-18 18:59:21 UTC
Created attachment 729307
Photo of the corrupt image on the screen

I have discovered some very important details today. Everyone following up on the report, please see this comment!

Recently I realized that a useful test would be to switch to a different virtual console once I notice the crash, in order to see how the system behaves there. A few minutes ago another freeze took place, so I instantly hit Control + Alt + F1 to go to a console. What I noticed was pretty remarkable and sheds light on a few aspects:

I could keep typing in the console for nearly 10 seconds, but after that the exact same behavior still took place (monitor turned itself on and off two times then the image froze). This time however I was able to toggle the NumLock LED a minute after the crash, while also seeing the HDD LED still working; that means this is not (always) a total system freeze such as a Kernel panic... instead it appears to be the image output corrupting and staying that way, freezing only specific components with it (I was still unable to issue a blind reboot command for instance). To put everything into an approximate timeline, this is what happened:

00 seconds in: The crash occurs.
02 seconds in: I notice and instantly hit Control + Alt + F1.
05 seconds in: I'm taken to a console where everything works fine: I see the blinking cursor, can write my login and password, etc.
12 seconds in: Suddenly the monitor turns off and back on several times, then the image remains frozen in place.

This time however, the screen did not remain turned off or black. Instead it stayed stuck in a state showing corrupt lines and rectangles of random colors. I took a photo of my screen with my smartphone, which I attached to this issue.
Comment 45 Mircea Kitsune 2017-07-05 13:14:29 UTC
Created attachment 731245 [details]
Screenshot of "top"

Lots of important new information on this freeze, which was of course ported to the latest openSUSE Tumbleweed system packages and still works:

First and foremost, the problem does not happen in every session, and this is not always influenced by updates! During an interval in which I installed absolutely no relevant package changes, the following has happened: The freeze occurred after about just 8 hours of uptime... after that I restarted the machine, but then I had 4 days of uptime with no freeze! This leads me to believe that certain applications or system actions prepare the system with a "time bomb", which then causes switching between windows or desktops to produce the freeze... however I have no way to know what mines the system and what doesn't yet, as I use too many applications at once to figure out which might be responsible.

Anyway, another crash happened today. Once more I quickly hit Control + Alt + F1 to switch to a virtual console; this caused the image to become corrupted on the monitor, however the system remained responsive and didn't actually freeze. So I went to my mother's computer and logged in via SSH, which indeed still worked. I was able to issue a reboot command, which caused the image to briefly unfreeze as the monitor turned on and off a few more times... I could see a few KDE error messages about applications crashing, before the system actually went ahead and rebooted successfully! However, this is only possible if I switch to a console quickly enough when noticing the freeze start to happen; if not, the whole machine freezes and not even SSH responds from other devices!

While I was in SSH, I decided to run "top" and take a screenshot of my processes (while the computer was frozen and with corrupt image stuck on the screen). I can't tell if anything is out of the ordinary such as a memory leak, but I'm attaching a screenshot of it here.
Comment 46 Mircea Kitsune 2017-07-05 18:21:49 UTC
Thought I'd also post another detail that might be useful, I'm not sure how much it relates to the freeze but better be safe than sorry; I have the following two environment variables added to my ~/.profile file, which basically tell Mesa to post errors to a log file:

export MESA_DEBUG=1
export MESA_LOG_FILE=/home/mircea/.mesa_stderr

There's one reoccurring line which keeps getting printed in there. It's added periodically with no side effects, but I imagine it could still have some relation to the trigger of the freeze:

Mesa: User error: GL_INVALID_OPERATION in glTexSubImage2D(invalid texture image)
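One way to check whether that line is the only recurring message, and how often it appears, would be to count the distinct lines in the log. The following is a minimal Python sketch, assuming the log is plain text with one message per line; the helper name count_mesa_messages is hypothetical, and the sample list merely stands in for the real contents of the MESA_LOG_FILE path above.

```python
from collections import Counter


def count_mesa_messages(lines):
    """Count how often each distinct non-empty line occurs."""
    return Counter(line.strip() for line in lines if line.strip())


# Illustrative sample standing in for the real log file contents.
sample = [
    "Mesa: User error: GL_INVALID_OPERATION in glTexSubImage2D(invalid texture image)",
    "Mesa: User error: GL_INVALID_OPERATION in glTexSubImage2D(invalid texture image)",
    "",
]

counts = count_mesa_messages(sample)
for message, occurrences in counts.most_common():
    print(occurrences, message)
```

To run it against the real log, the sample list could be replaced with an open file handle, since iterating over a text file yields its lines.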
Comment 47 Mircea Kitsune 2017-08-02 13:31:00 UTC
After months of careful testing and experimentation, I have discovered what seems to be the primary trigger of this freeze at last. It's not what triggers it per se, but what "rigs" the system and causes it to crash within the course of the next hours... the actual trigger is alt-tab switching between windows, or certain desktop effects playing.

The freeze is mined into the system when you disable and re-enable KDE desktop compositing. If I hit Alt + Shift + F12 to turn off desktop effects, then hit the key combo to turn them back on... there is a great chance that within a few hours the crash occurs. If I don't toggle compositing on the run and just leave it enabled after the system has started, I seem to be fine... this only happens if I turn it off and back on during runtime. It's uncertain whether anything else mines the system, but this is almost always what seems to do it for me.

Notice: I use OpenGL 3.1 for desktop compositing. I remember selecting OpenGL 2.0 long ago, but that still caused the freeze at that time. I can't use Xrender on a daily basis as many effects don't work with it. No other compositor options seem to affect the problem either.

It would be highly appreciated if at least after this information, the developers and maintainers could finally look at this issue! It has taken me months to confirm this as a cause, and I really hope this information (alongside dozens of comments and logs I have posted) can finally be put to use.
Comment 48 Mircea Kitsune 2017-08-03 13:18:40 UTC
Today I discovered that even when not toggling desktop effects at runtime, the freeze can still be mined into the system. I got a crash after 1 day of uptime, no toggling of desktop compositing required.

I find it remarkable how the cause of the crash appears to have immediately changed after me making the comment above yesterday; I tested my theory that desktop effects are the root for 2 months, yet the moment I publish my observations the behavior changes in less than a day. This further makes me concerned that someone might be deliberately programming this crash using vulnerabilities in system components, solely for how strange this coincidence is. I'm still waiting for the developers to help investigate this further whatever the case, as I cannot find any explanation at this point.
Comment 49 Mircea Kitsune 2017-08-04 12:32:08 UTC
Created attachment 735281 [details]
Memtest86 screenshot

To rule out the possibility of a hardware issue, I ran two Memtest86 5.01 sessions from a Clonezilla bootable CD. The first was in the day for 5 hours, the second was during the night for over 10 hours: The program only registered 3 passes in total, but it did not find any errors. I'll attach a picture just in case any useful information is printed there.
Comment 50 Mircea Kitsune 2017-08-05 12:24:34 UTC
Created attachment 735385 [details]
Output of "dmesg -w"

This is perhaps the most important piece of information I managed to gather on the problem thus far. If you have a technical understanding of this data, please take a look at the log and let us know what it says!

I was able to run an SSH session on my computer from another machine. In it I left the command "dmesg -w" running. I toggled desktop effects last night to provoke a crash today, which happened as expected and allowed me to conduct the test. This is basically what dmesg is seeing in real time as the system is crashing.

I can't make sense of the information, but it definitely looks descriptive. Although the computer seemed completely frozen locally, the output continued flowing on the other machine printing new information every few seconds. I had to wait in order to catch some of the red lines in the console.
Comment 51 Mircea Kitsune 2017-08-05 13:31:39 UTC
I briefly discussed the above log (output of "dmesg -w") on IRC with someone who seemed to have an understanding of the issue. They pointed out something important which I thought to highlight:

The problem appears to start from 'radeon_vm_bo_invalidate' and is most likely a GPU locking bug. Looking at the stack trace I can see it, alongside explicit mentions of spin lock / CPU soft lockup / stall on CPU. I've also noticed a potentially important message, which although marked as a warning seems to point to a line of source code from the radeon driver:

[58857.640890] WARNING: CPU: 3 PID: 2549 at ../drivers/gpu/drm/radeon/radeon_object.c:84 radeon_ttm_bo_destroy+0xec/0xf0 [radeon]
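For anyone sifting through a long dmesg capture, the source file, line number and function named in such a warning can be pulled out with a regular expression. This is only a sketch in Python, assuming warnings follow the same layout as the line quoted above; the pattern and the helper name parse_warning are illustrative and not taken from any kernel tooling.

```python
import re

# Pattern modelled on the single WARNING line quoted in this comment (an assumption).
WARNING_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\] WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) "
    r"at (?P<file>\S+):(?P<line>\d+) (?P<func>\w+)\+"
)


def parse_warning(text):
    """Return the named fields of a kernel WARNING line, or None if it does not match."""
    match = WARNING_RE.search(text)
    return match.groupdict() if match else None


line = (
    "[58857.640890] WARNING: CPU: 3 PID: 2549 at "
    "../drivers/gpu/drm/radeon/radeon_object.c:84 "
    "radeon_ttm_bo_destroy+0xec/0xf0 [radeon]"
)
info = parse_warning(line)
print(info["file"], info["line"], info["func"])
```

The extracted file and line number can then be looked up in the kernel source tree matching the running kernel version.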
Comment 52 Michal Srb 2017-08-07 08:23:19 UTC
This bug was assigned to the llvm maintainer to build llvm package with the commit r280589 reverted for testing. I have inherited it now.

However, based on the additional information it doesn't look like it is a bug in llvm. User-space can do whatever it wants; if it causes lockups, it is a bug in the kernel.

The dmesg you posted is useful. It definitely looks like a locking problem. The other warnings are probably just caused by the earlier errors. However, the dmesg doesn't look complete - there may have been some errors before the ones in the log. Can you try to get dmesg again, this time complete? Ideally showing everything from boot until the lockup.

To address some of your earlier comments:
(In reply to Mircea Kitsune from comment #15)
> I feel that at this point, I should express my disappointment regarding the
> lack of attention this report has received over the course of a month. I
> marked it as high priority (which I believe it is), posted about this on the
> forum, and wrote about it to the Factory mailing list (where it was
> completely ignored). To this day, I still have no idea what this is or when
> and how it might be fixed.

You are running Tumbleweed - bleeding edge. And this issue is likely hardware specific - if we don't have your hardware, all we can do is guess what may be wrong.

(In reply to Mircea Kitsune from comment #46)
> Mesa: User error: GL_INVALID_OPERATION in glTexSubImage2D(invalid texture
> image)

Some application is using OpenGL badly. It shouldn't cause an issue like this.

(In reply to Mircea Kitsune from comment #48)
> I find it remarkable how the cause of the crash appears to have immediately
> changed after me making the comment above yesterday; I tested my theory that
> desktop effects are the root for 2 months, yet the moment I publish my
> observations the behavior changes in less than a day. This further makes me
> concerned that someone might be deliberately programming this crash using
> vulnerabilities in system components, solely for how strange this
> coincidence is.

Nah, locking issues are typically random and rare. That makes them hard to reproduce and debug. Nobody is hacking you.

(In reply to Mircea Kitsune from comment #49)
> Created attachment 735281 [details]
> Memtest86 screenshot

Your system memory is ok, that's good, but it says nothing about the GPU.
Comment 53 Mircea Kitsune 2017-08-07 12:36:04 UTC
(In reply to Michal Srb from comment #52)

How do I get the full log? "dmesg -w" prints the data in the console, which has a character limit by default. If it's simply the output of "dmesg" after rebooting and logging back in, I already attached that here... however it is pretty old, so next time it happens I will make a new one.

I'm eagerly looking to get to the bottom of this, so if any further data can be translated to pinpoint the source I would greatly appreciate the help! Please also keep the upstream report in mind for this, as I assume it will have to be brought up with the developers of the core components:

https://bugs.freedesktop.org/show_bug.cgi?id=101672
Comment 54 Mircea Kitsune 2017-08-07 12:37:11 UTC
Sorry about that, wrong link in the previous comment:

https://bugs.freedesktop.org/show_bug.cgi?id=100306
Comment 55 Michal Srb 2017-08-07 12:46:23 UTC
(In reply to Mircea Kitsune from comment #53)
> How do I get the full log? "dmesg -w" prints the data in the console, which
> has a character limit by default. If it's simply the output of "dmesg" after
> rebooting and logging back in, I already attached that here... however it is
> pretty old, so next time it happens I will make a new one.

You can redirect the output to a file, then get the file after reboot:
  dmesg -w > dmesg.txt

Alternatively use journalctl to retrieve some of the older logs:
  journalctl --list-boots
  journalctl --boot=<ID> --dmesg > dmesg.txt
Comment 56 Mircea Kitsune 2017-08-07 20:52:32 UTC
Created attachment 735590 [details]
Output of "dmesg -w" (full)

Full output of "dmesg -w", recorded by running "dmesg -w > filename.txt". The previous one was incomplete as it was subject to console line limitations, cutting off the moment when the crash actually occurs. I left the command running in a different virtual console; this time the crash didn't shut down the monitor after switching to it (Control + Alt + F1), so I was able to cleanly shut down dmesg then issue a normal reboot. I waited there for about 5 minutes before doing so, to give dmesg time to record as much information as possible. The crash appears to start at the following lines:

[112873.658950] radeon 0000:03:00.0: ring 4 stalled for more than 10024msec
[112873.658953] radeon 0000:03:00.0: GPU lockup (current fence id 0x000000000072f6bd last fence id 0x000000000072f6c1 on ring 4)
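To compare lockups across several captures, the ring number, stall time and fence ids in these two lines can be extracted programmatically. The Python sketch below assumes the messages keep the exact wording shown above. The helper name summarize_lockup is hypothetical, and reading the difference between the two fence ids as work submitted but not yet completed is an interpretation, not something stated in the log.

```python
import re

STALL_RE = re.compile(r"ring (\d+) stalled for more than (\d+)msec")
FENCE_RE = re.compile(
    r"current fence id (0x[0-9a-fA-F]+) last fence id (0x[0-9a-fA-F]+) on ring \d+"
)


def summarize_lockup(lines):
    """Collect ring, stall time and fence gap from radeon lockup messages."""
    summary = {}
    for line in lines:
        stall = STALL_RE.search(line)
        if stall:
            summary["ring"] = int(stall.group(1))
            summary["stalled_msec"] = int(stall.group(2))
        fence = FENCE_RE.search(line)
        if fence:
            current = int(fence.group(1), 16)
            last = int(fence.group(2), 16)
            # Assumed meaning: fences emitted but not yet signalled.
            summary["pending_fences"] = last - current
    return summary


log = [
    "[112873.658950] radeon 0000:03:00.0: ring 4 stalled for more than 10024msec",
    "[112873.658953] radeon 0000:03:00.0: GPU lockup (current fence id "
    "0x000000000072f6bd last fence id 0x000000000072f6c1 on ring 4)",
]
summary = summarize_lockup(log)
print(summary)
```

For the two lines quoted here the fence ids differ by 4.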
Comment 57 Mircea Kitsune 2017-08-07 21:30:09 UTC
I randomly decided to google parts of my dmesg output. I was surprised to discover that someone else has reported a very similar issue, which looks like it might have the same root as mine!

https://bugs.freedesktop.org/show_bug.cgi?id=101325

The dmesg output they provided almost perfectly matches my last log, and they also have a RadeonSI card which further narrows down the problem. The main difference is that they experience this with Unreal Engine 4 Editor, whereas for me the trigger is the Plasma desktop.

That report seems to contain a fair amount of logs, so hopefully bringing it and this together can help produce a solution at long last.
Comment 58 Mircea Kitsune 2017-08-31 18:16:54 UTC
I have important new information. After yet more weeks of testing, I seem to have found both of the common triggers for this issue. The crash happens a few hours after either of the following actions is performed:

1 - Desktop effects are toggled at runtime. Pressing Alt + Shift + F12 twice to turn compositing off then back on will mine the system with this crash.

2 - I insert my USB stick or external drive into a USB port, mount it and access it in Dolphin, then unmount and remove it. A few hours after I've inserted / removed my drive, the freeze occurs. I suspect this has to do with the device notifier popping up in the system tray, asking what action to perform on the device or telling me the device is safe to unplug.

I'm not sure if the themes I'm using might have any relevancy. Considering this is a graphics problem, I figured I'd share this info as well so others can test them if they wish. I'm using the Plasma / KWin theme Freeze with the default Breeze icons / cursor / widget style:

https://www.opendesktop.org/p/998653/
https://www.opendesktop.org/p/1002663/
Comment 59 Mircea Kitsune 2017-08-31 18:38:18 UTC
Furthermore, I suspect I now know what the culprit component is. It's very likely that the problem lies within Mesa itself, and was introduced in the switch between 13.0 and 17.0.

This was confirmed by the bug report I linked previously, which I strongly believe is related to the issue I'm experiencing here: Another person there was able to verify that their crash happens with Mesa 17 but not 13. Looking at the dates, I realize that I started experiencing this problem precisely when openSUSE Tumbleweed upgraded from Mesa 13.0 to 17.0: Mesa 17 landed in early March 2017, it was a few days later that the issues began, which I then reported the following week (08 March 2017). See my comment in the other bug for more info on this:

https://bugs.freedesktop.org/show_bug.cgi?id=101325#c22

I also seem to have confirmed that the issue only affects RadeonSI cards but not R600: My laptop has a Mobility Radeon HD 5470 card (R600) whereas my desktop has a Radeon R7 370 card (RadeonSI). I've been away for two weeks and have been using my laptop exclusively during this time, which has the exact same OS and configuration as my desktop. I was able to perform every task I do on my desktop from my laptop, including the triggers I described above... I have never experienced this freeze with the laptop.
Comment 60 Mircea Kitsune 2017-11-12 15:04:22 UTC
I'm sorry for having taken so long to get back to this issue: I needed to be sure that what I'm mentioning is correct, which at this point took months of verification to be certain the issue is gone for good.

The problem has finally gone away; it has not happened once during 3 months, in which I was able to achieve well over a week of uptime! It disappeared after I performed the following 3 changes on my system:

- Modifying my system GTK theme.
- Disabling KMix at startup.
- Uninstalling IBus.

I'm convinced the culprit here was IBus... more specifically its system tray icon. That icon has caused odd glitches in the past, such as making random menus pop up or crashing. It was likely also causing a graphical glitch that introduced this infinite GPU loop. As such the ingredients you should need are:

- A GCN 1.0 RadeonSI AMD card, running on the "radeon" driver.
- A KDE (Plasma 5) Linux OS.
- The IBus input system, with the option to show the system tray icon.

If others can reproduce this, please comment on the issue and let us know! If the problem does not return, I will mostly just be watching this bug from now on; I don't plan on spending days to do more odd tests... especially after receiving nearly no support from the FreeDesktop crew for almost a year, despite giving them a ton of data and how major this issue was.
Comment 61 Michal Srb 2018-04-17 10:57:45 UTC
As far as I can tell this is a duplicate of bug #1084767. Good luck solving the issue with upstream, they know the drivers best.

*** This bug has been marked as a duplicate of bug 1084767 ***