Bug 1099494

Summary: zypper calls snapper, snapshot creation failed while snapper already running, still stalling system
Product: [openSUSE] openSUSE Tumbleweed
Reporter: Oliver Kurz <okurz>
Component: Other
Assignee: Arvin Schnell <aschnell>
Status: RESOLVED WORKSFORME
QA Contact: E-mail List <qa-bugs>
Severity: Major
Priority: P3 - Medium
CC: aschnell, kanderssen, kukuk, okurz
Version: Current
Target Milestone: ---
Hardware: Other
OS: Other
URL: https://trello.com/c/4Hslnlyf
See Also: https://bugzilla.suse.com/show_bug.cgi?id=1063638
          http://bugzilla.suse.com/show_bug.cgi?id=1028286
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: y2logs, snapper.log, system journal, snapper cleanup journal, zypper.log

Description Oliver Kurz 2018-06-28 12:13:20 UTC
Created attachment 775554 [details]
y2logs, snapper.log, system journal, snapper cleanup journal, zypper.log

## Observation

On an openSUSE Tumbleweed installation on an HDD (btrfs, encrypted, i.e. slow performance), calling zypper dup proposed to install around 3000 packages - as expected for openSUSE Tumbleweed upgrades. During that process it seems snapper cleanup was triggered in parallel but zypper also tried to create a snapshot, which failed, effectively stalling the system for a very long time. The log files show quite a few errors.

E.g. from zypper.log:

```
2018-06-28 10:54:26 <2> linux-9akz(17693) [PLUGIN] PluginScript.cc(~PluginDumpStderr):75 ! INFO:root:creating post snapshot
2018-06-28 10:54:26 <5> linux-9akz(17693) [zypp] Exception.cc(log):166 PluginExecutor.cc(doSend):133 CAUGHT:   Not ready to read within timeout.
2018-06-28 10:54:26 <2> linux-9akz(17693) [zypp::plugin] PluginExecutor.cc(doSend):134 Not ready to read within timeout.
2018-06-28 10:54:26 <2> linux-9akz(17693) [zypp::plugin] PluginExecutor.cc(doSend):140 Bad plugin response from PluginScript[4691] /usr/lib/zypp/plugins/commit/snapper.py: PluginF
rame[](0){0}
2018-06-28 10:54:26 <2> linux-9akz(17693) [zypp::plugin] PluginExecutor.cc(doSend):141 (Expected ACK or _ENOMETHOD)
2018-06-28 10:54:26 <1> linux-9akz(17693) [zypp::plugin++] PluginScript.cc(close):229 Close:PluginScript[4691] /usr/lib/zypp/plugins/commit/snapper.py
2018-06-28 10:54:26 <1> linux-9akz(17693) [zypp::plugin++] PluginScript.cc(send):272 PluginScript[4691] /usr/lib/zypp/plugins/commit/snapper.py ->send PluginFrame[_DISCONNECT](0){0}
2018-06-28 10:54:30 <2> linux-9akz(17693) [PLUGIN] PluginScript.cc(~PluginDumpStderr):75 ! ERROR:root:creating snapshot failed:
2018-06-28 10:54:30 <2> linux-9akz(17693) [PLUGIN] PluginScript.cc(~PluginDumpStderr):75 ! ERROR:root:  org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
```

whereas snapper-cleanup reports

```
2018-06-28 10:54:03 MIL libsnapper(4747) Snapper.cc(Snapper):91 - Snapper constructor
2018-06-28 10:54:03 MIL libsnapper(4747) Snapper.cc(Snapper):92 - libsnapper version 0.5.4
2018-06-28 10:54:03 MIL libsnapper(4747) Snapper.cc(Snapper):93 - config_name:root disable_filters:false
2018-06-28 10:54:03 MIL libsnapper(4747) AsciiFile.cc(reload):114 - loading file /etc/snapper/configs/root
2018-06-28 10:54:03 MIL libsnapper(4747) AsciiFile.cc(getValue):235 - key:SUBVOLUME value:/
2018-06-28 10:54:03 MIL libsnapper(4747) AsciiFile.cc(getValue):235 - key:FSTYPE value:btrfs
2018-06-28 10:54:03 MIL libsnapper(4747) AsciiFile.cc(getValue):235 - key:QGROUP value:1/0
2018-06-28 10:54:03 MIL libsnapper(4747) AsciiFile.cc(getValue):235 - key:SYNC_ACL value:no
2018-06-28 10:54:03 MIL libsnapper(4747) Snapper.cc(Snapper):125 - subvolume:/ filesystem:btrfs
2018-06-28 10:54:03 MIL libsnapper(4747) Snapper.cc(loadIgnorePatterns):174 - number of ignore patterns:8
2018-06-28 10:54:03 MIL libsnapper(4747) Snapshot.cc(read):245 - found 16 snapshots
2018-06-28 10:54:03 MIL libsnapper(4747) SystemCmd.cc(SystemCmd):46 - constructor SystemCmd:"/usr/lib/snapper/plugins/grub --refresh"
2018-06-28 10:54:03 MIL libsnapper(4747) SystemCmd.cc(execute):82 - SystemCmd Executing:"/usr/lib/snapper/plugins/grub --refresh"
2018-06-28 10:54:05 MIL libsnapper(4747) SystemCmd.cc(doExecute):267 - stopwatch 1.439245s for "/usr/lib/snapper/plugins/grub --refresh"
2018-06-28 10:54:05 MIL libsnapper(4747) SystemCmd.cc(doExecute):287 - system() Returns:0
2018-06-28 10:54:05 WAR libsnapper(4747) FileUtils.cc(SDir):88 - THROW: open failed path://.snapshots errno:2 (No such file or directory)
2018-06-28 10:55:01 MIL libsnapper(4747) SystemCmd.cc(SystemCmd):46 - constructor SystemCmd:"/usr/lib/snapper/plugins/grub --refresh"
2018-06-28 10:55:01 MIL libsnapper(4747) SystemCmd.cc(execute):82 - SystemCmd Executing:"/usr/lib/snapper/plugins/grub --refresh"
2018-06-28 10:55:03 MIL libsnapper(4747) SystemCmd.cc(doExecute):267 - stopwatch 1.765667s for "/usr/lib/snapper/plugins/grub --refresh"
```

so it looks like snapper was starting up just before that; coincidence?
Comment 1 Stefan Hundhammer 2018-07-19 10:28:48 UTC
I wouldn't expect anything other than the system hanging in this situation.

- Creating a snapshot for a lot of changes
  -> triggering a snapshot cleanup
- Requesting yet another snapshot while all this is running

What exactly do you expect to happen?

Creating the first snapshot needs more disk space, so that disk space needs to be freed up first. That cleanup takes time, and only after it finishes can the first snapshot really be started. If yet another snapshot is requested while all of this is still running, what should happen?
Comment 2 Michael Andres 2018-07-19 11:22:27 UTC
(In reply to Oliver Kurz from comment #0)
> it seems snapper cleanup was triggered in parallel but zypper also tried to
> create a snapshot...

Well, ZYPP does execute installed commit plugins, but we must rely on them behaving well. If the snapper plugin misbehaves, it's a snapper issue.

Assigning to Arvin.
Comment 3 Arvin Schnell 2018-07-19 20:42:07 UTC
In general snapper can handle running a cleanup (which can require the very slow
operation of comparing snapshots) and creating new snapshots at the same time.
Normally creating a snapshot only takes a fraction of a second. If the system
is under such high load that this operation takes more than 30 seconds, there is not
much snapper can do about that.
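
One rough way to check which side is slow on an affected machine is to time a plain snapshot creation while a cleanup run keeps the disk busy. A minimal sketch (assuming a standard "root" snapper config; the 30 second figure is the libzypp plugin timeout mentioned above):

```
#!/usr/bin/env python3
# Rough timing sketch (assumes a snapper config named "root"): measure how
# long one snapshot creation takes, to see whether it stays well below the
# ~30 second window libzypp grants its commit plugin.

import subprocess
import time

start = time.monotonic()
result = subprocess.run(
    ["snapper", "-c", "root", "create",
     "--description", "timing test", "--cleanup-algorithm", "number"],
    capture_output=True, text=True,
)
elapsed = time.monotonic() - start

print(f"snapper create took {elapsed:.1f}s (exit code {result.returncode})")
if result.returncode != 0:
    print(result.stderr.strip())
```

If creation itself stays fast even under load, the timeout has to come from the caller side rather than from snapper.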
Comment 4 Arvin Schnell 2018-07-26 17:26:33 UTC
Thorsten, Lukas wants you to decide here as the architect how to proceed.
Comment 5 Thorsten Kukuk 2018-07-30 08:09:22 UTC
(In reply to Stefan Hundhammer from comment #1)

> Creating the first snapshot needs more disk space, 

Creating a snapshot does not need disk space with btrfs. Updating the RPMs later needs disk space. 

(In reply to Arvin Schnell from comment #4)
> Thorsten, Lukas wants you to decide here as the architect how to proceed.

For me, the problem isn't really clear. If I understand this correctly, the problem is that this is already a slow disk, and the timeouts happen because too much is accessing the hard disk at once?

In this case I agree, there is not much we can do.
Comment 6 Oliver Kurz 2018-07-30 20:54:25 UTC
Well, the original issue is not about the system being slow due to snapshot handling but that the snapshot creation actually *failed*. AFAIK my system fulfills the system requirements, so I expect that snapshot handling might slow down my system but not fail.
Comment 7 Knut Alejandro Anderssen González 2018-08-09 08:30:33 UTC
@Arvin, taking into account the last comments, I will add this bug to our Incoming Trello board so it can be prioritized during the next sprints.
Comment 8 Lukas Ocilka 2018-09-12 14:36:53 UTC
(In reply to Oliver Kurz from comment #6)
> Well, the original issue is not about the system being slow due to snapshot
> handling but that the snapshot creation actually *failed*.

And that is the question: did snapper really fail, or did the caller (of snapper)
give up and decide to time out?

As I can read in the description:
PluginScript.cc(~PluginDumpStderr):75 ! ERROR:root:creating snapshot failed:
2018-06-28 10:54:30 <2> linux-9akz(17693) [PLUGIN] PluginScript.cc(~PluginDumpStderr):75 ! ERROR:root:  org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

A.) If snapper failed, then we need to fix snapper
B.) But if it just did not send a reply soon enough, we might need to extend
    timeouts or periodically check whether snapper is still doing something.

Arvin?
Comment 9 Arvin Schnell 2018-09-12 14:42:55 UTC
AFAIS snapper does not fail and libzypp simply runs into the timeout.
Comment 10 Lukas Ocilka 2018-09-12 15:17:43 UTC
Reassigning according to the previous comment.
Comment 11 Michael Andres 2018-09-20 17:37:27 UTC
Messages containing '[PLUGIN] PluginScript.cc(~PluginDumpStderr):' simply log the lines a plugin writes to its stderr. So the message

> ! ERROR:root:  org.freedesktop.DBus.Error.NoReply: Did not receive 
> a reply. Possible causes include: the remote application did not send 
> a reply, the message bus security policy blocked the reply, the reply 
> timeout expired, or the network connection was broken.

was written by snapper.py not by libzypp.



zypp timeout: According to the log:

> snapper.py ->send PluginFrame[COMMITEND](0){370884}
> THROW:    Not ready to read within timeout.

It's true that snapper.py does not acknowledge the receipt of the COMMITEND message (within 30 sec.). Due to this, zypp stops serving the plugin.

> snapper.py ->send PluginFrame[_DISCONNECT](0){0}
> snapper.py <-PluginFrame[ACK](0){0}
> snapper.py -> [0] 

Zypp sends a _DISCONNECT to inform the plugin and the plugin acknowledges the receipt. If _DISCONNECT is acknowledged, zypp does not kill the script but leaves it running in the background [1].

AFAICS the plugin is not killed by libzypp. OTOH it is not able to delay the transaction until it has completed (but that's actually intended).
So what to change?


[1] https://doc.opensuse.org/projects/libzypp/HEAD/zypp-plugins.html
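
For illustration, here is a minimal sketch of a commit plugin following the STOMP-like frame protocol documented in [1]. It is not the actual snapper.py and all names in it are illustrative; it only shows the part that matters for this report: each frame libzypp sends must be acknowledged on stdout within the ~30 second window, so any slow work (such as a D-Bus call to snapperd) done before answering risks exactly the timeout quoted above.

```
#!/usr/bin/env python3
# Minimal sketch of a libzypp commit plugin (illustrative, not the real
# /usr/lib/zypp/plugins/commit/snapper.py). It reads the NUL-terminated
# frames libzypp sends on stdin and answers each one with an ACK on stdout.

import sys

def read_frame(stream):
    """Read one frame: command line, 'key:value' header lines, an empty
    line, then the body, terminated by a NUL byte."""
    raw = b""
    while True:
        ch = stream.read(1)
        if not ch:                 # EOF: libzypp closed the pipe
            return None
        if ch == b"\0":
            break
        raw += ch
    head, _, body = raw.partition(b"\n\n")
    lines = head.decode().splitlines()
    command = lines[0]
    headers = dict(line.split(":", 1) for line in lines[1:])
    return command, headers, body

def write_frame(stream, command, headers=None, body=b""):
    """Send a frame back to libzypp (e.g. an ACK)."""
    out = command + "\n"
    for key, value in (headers or {}).items():
        out += f"{key}:{value}\n"
    stream.write(out.encode() + b"\n" + body + b"\0")
    stream.flush()

def main():
    stdin, stdout = sys.stdin.buffer, sys.stdout.buffer
    while True:
        frame = read_frame(stdin)
        if frame is None:
            break
        command, headers, body = frame
        # A real plugin would create the pre/post snapshot here for
        # COMMITBEGIN/COMMITEND; this sketch only acknowledges promptly.
        write_frame(stdout, "ACK")
        if command == "_DISCONNECT":
            break

if __name__ == "__main__":
    main()
```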
Comment 12 Arvin Schnell 2018-09-21 09:23:50 UTC
Maybe python dbus adds another timeout. I will check that.

But with the issue being only P3 for openSUSE, it might take months until
I have time.
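
For what it's worth, dbus-python applies its own reply timeout (the libdbus default, roughly 25 seconds) unless the caller passes one explicitly, so a busy snapperd can trip that limit independently of the libzypp plugin timeout. A hedged sketch of what an explicit per-call timeout could look like; the org.opensuse.Snapper names and the method signature below are assumptions based on snapper's D-Bus service, not a copy of the actual plugin code:

```
#!/usr/bin/env python3
# Illustrative only: raise the dbus-python reply timeout for a single call.
# The snapper D-Bus names below are assumptions, not verified plugin code.

import dbus

bus = dbus.SystemBus()
proxy = bus.get_object("org.opensuse.Snapper", "/org/opensuse/Snapper")
snapper = dbus.Interface(proxy, "org.opensuse.Snapper")

# dbus-python accepts a per-call 'timeout' keyword (in seconds); without it
# the libdbus default of about 25 seconds applies, which is easy to exceed
# while a cleanup run keeps the disk busy.
number = snapper.CreateSingleSnapshot(
    "root",                                   # config name
    "zypper dup",                             # description
    "number",                                 # cleanup algorithm
    dbus.Dictionary({}, signature="ss"),      # userdata
    timeout=300,
)
print("created snapshot", int(number))
```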
Comment 13 Arvin Schnell 2018-09-21 10:43:00 UTC
A Trello card to prioritise the issue among the other tasks already exists.
Changing the assignee to yast-internal to follow the process.
Comment 14 Lukas Ocilka 2018-09-21 15:30:19 UTC
Oliver, how often does this change? Do we see that also in some customers'
scenarios? Is this an openQA-seen-only issue?
Comment 15 Oliver Kurz 2018-09-22 06:42:14 UTC
Hi Lukas,

(In reply to Lukas Ocilka from comment #14)
> Oliver, how often does this change?

I am sorry, I do not know what you refer to with "change". Do you mean how often this would "happen"? Then I can say that I have *explicitly* investigated this situation in detail only once to find out this cause, as the investigation effort is pretty high. However, the domain of btrfs snapshots in general is a problematic one, e.g. see https://bugzilla.suse.com/show_bug.cgi?id=1063638 with all the non-SUSE comments there.

> Do we see that also in some customers' scenarios?

I have not directly seen confirmations by customers, but yes, I assume this is very likely to also show up in customers' scenarios, given that I did not do anything "synthetic", just a normal "zypper dup".

> Is this an openQA-seen-only issue?

This issue has *not* been seen by openQA but by me during manual operation.
Comment 16 Oliver Kurz 2018-10-12 09:15:50 UTC
I created https://github.com/openSUSE/snapper/pull/437 which might help a bit in similar situations, even though it is not quite the same case: it concerns the "timeline" service while this bug involves the "cleanup" service.
Comment 17 Arvin Schnell 2020-08-24 08:54:22 UTC
The snapper plugin for zypper was completely rewritten in C++ last
year. Does the problem still happen?
Comment 18 Oliver Kurz 2020-08-28 07:59:25 UTC
At least I have not observed that myself since then, so let's assume it's behaving differently after the rewrite :)