Bug 1183990

Summary: I think I found the cause of a kernel lock when attempting hibernation
Product: [openSUSE] openSUSE Distribution Reporter: Carlos Robinson <carlos.e.r>
Component: KernelAssignee: openSUSE Kernel Bugs <kernel-bugs>
Status: RESOLVED WONTFIX QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: mbenes, tiwai
Version: Leap 15.2   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: syslolg message file of session

Description Carlos Robinson 2021-03-25 11:57:43 UTC
Hi,

For months I have been experiencing a kernel "trouble" in _some_ of my attempts to hibernate. Sometimes the hibernation would stall and not proceed. Issuing "systemctl hibernate" replied that there was one in progress, but there was no progress. I would attempt to halt the machine, but this would also stall at some point. If I pressed ctrl-alt-del several times, fast, I would see the message that it had detected the keys seven times and would halt immediately, but it did not.

I had no way out but hit the power switch, and suffer the long fsck the next morning.

Nothing in the logs whatsoever.

Well, one day I noticed in "atop" that one of my disks went to 100% busy when this happened. So I left running another instance of gkrellm, displaying the i/o state of all my partitions in /dev/sdc, and experimenting with "sync" I noticed it was sdc9 which was active, at something like 400 Kbps.

I noticed that "sync" would take sometimes a minute to complete.

/dev/sdc9 is a reiserfs, and has several bind mounts:

  /dev/sdc9 on /data/Lareiserfs type reiserfs (rw,relatime,lazytime,user_xattr,acl)

bind mounts:

  /dev/sdc9 on /data/homedvl type reiserfs (rw,relatime,lazytime,user_xattr,acl)
  /dev/sdc9 on /usr/share/flightgear type reiserfs (rw,relatime,lazytime,user_xattr,acl)
  /dev/sdc9 on /var/spool/news type reiserfs (rw,relatime,lazytime,user_xattr,acl)
  /dev/sdc9 on /home/cer/terrasync type reiserfs (rw,relatime,lazytime,user_xattr,acl)
  /dev/sdc9 on /usr/src type reiserfs (rw,relatime,lazytime,user_xattr,acl)


Now, the directory that is active is "/var/spool/news". I use leafnode nntp proxy server. It contains 1.2 million files in about 3 GB of space.


I found that if I run this sequence:

time sync
time sync && systemctl hibernate

the machine hibernates successfully - 13 days so far, a record.


Most days it takes a minute to sync, but one day, I noticed it took several minutes. Why? Well, it happened that at the same time "/usr/sbin/texpire" was working (a cronjob triggers it). This task runs for about half an hour daily expunging old posts, meaning it examines a million files.


Maybe with this (tentative) report you can improve the kernel response so that it doesn't stall when trying to hibernate, assumedly when taking too long to sync. Me would think that the kernel should stop tasks before doing the sync :-?

At least, running sync manually I can detect the situation and kill the busy task before suffering the crash.




On the other hand, maybe there is an issue on reiserfs with "lazytime" (which is default), delaying the writes of "something" till forced to. My wild guess, it delays the timestamp that registers a file was touched. Each time I read a post, or Thunderbird scans a post, the timestamp (sorry, I don't remember which exact timestamp it is) is written, but it is not actually written but delayed "for ever".

I use reiserfs for this mount because in theory it should work better than others with millions of small files.
Comment 1 Carlos Robinson 2021-03-25 22:59:38 UTC
I forgot to paste some command output when the disk is "busy" - interesting lines are marked with arrows (not necessarily from the same day, it takes some time to copy paste):

iotop:

Total DISK READ :       0.00 B/s | Total DISK WRITE :     368.01 K/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:     269.56 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
25530 be/4 root        0.00 B/s  332.85 K/s  0.00 % 97.58 % [kworker/u64:1+flush-8:32]  <==
25104 be/4 root        0.00 B/s    4.69 K/s  0.00 %  1.41 % [kworker/8:1-events]
 1408 be/3 root        0.00 B/s   11.72 K/s  0.00 %  1.25 % [jbd2/sdc10-8]
25944 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.04 % [kworker/0:2-events]
25637 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.04 % [kworker/0:0-events]
11191 be/4 cer         0.00 B/s   18.75 K/s  0.00 %  0.00 % firefox -P Small [localStorage DB]


atop:

PRC |  sys    0.84s   |  user   1.73s   |  #proc    625  |   #zombie    1  |   no  procacct  |
CPU |  sys      10%   |  user     29%   |  irq       1%  |   idle   1065%  |   wait     96%  |
CPL |  avg1    0.86   |  avg5    0.45   |  avg15   0.41  |   csw   102167  |   intr   58821  |
MEM |  tot    31.3G   |  free   13.2G   |  cache   2.8G  |   buff    1.4G  |   slab    1.2G  |
SWP |  tot   100.0G   |  free   94.8G   |                |   vmcom  29.4G  |   vmlim 115.7G  |
LVM |     cr_cripta   |  busy      1%   |  read       0  |   write      2  |   avio 34.0 ms  |
DSK |           sdc   |  busy     99%   |  read       0  |   write    399  |   avio 24.9 ms  | <==
NET |  transport      |  tcpi      11   |  tcpo      17  |   udpi       0  |   udpo       0  |
NET |  network        |  ipi       12   |  ipo       17  |   ipfrw      0  |   deliv     12  |
NET |  eth0      0%   |  pcki      13   |  pcko      18  |   si    1 Kbps  |   so    1 Kbps  |



  PID SYSCPU USRCPU  VGROW  RGROW  RUID     EUID     ST EXC   THR S CPUNR  CPU  CMD        1/6
10990  0.06s  0.22s     0K   172K  cer      cer      --   -    64 S     7   3%  Web Content
10799  0.08s  0.19s     0K     0K  cer      cer      --   -    81 S     5   3%  firefox
11012  0.02s  0.23s     0K   -84K  cer      cer      --   -    60 S     8   2%  Web Content
 9585  0.03s  0.20s     0K     0K  root     root     --   -     1 S     0   2%  iotop
11096  0.04s  0.18s  1024K  1784K  cer      cer      --   -    58 S     2   2%  Web Content
 4288  0.19s  0.00s 14356K     0K  root     root     --   -    21 S     3   2%  X
Comment 2 Takashi Iwai 2021-03-26 09:22:00 UTC
Could you verify whether it's Leap 15.2 kernel specific or happening in general, by using the recent upstream kernel from OBS Kernel:stable on top of your system?

If it happens with the upstream, it's worth to check further, and at best, report to upstream and diagnose there.
Comment 3 Carlos Robinson 2021-03-27 11:03:21 UTC
Hi,

I added repo <http://download.opensuse.org/repositories/Kernel:/stable/standard/>, but I get this conflict when trying to update to it:


#### YaST2 conflicts list - generated 2021-03-27 11:58:58 ####

nothing provides libc.so.6(GLIBC_2.33)(64bit) needed by kernel-default-devel-5.11.10-1.1.gfccd69b.x86_64

    [ ] break kernel-default-devel-5.11.10-1.1.gfccd69b.x86_64 by ignoring some of its dependencies

    [ ] do not install kernel-default-devel-5.11.10-1.1.gfccd69b.x86_64



nothing provides libc.so.6(GLIBC_2.33)(64bit) needed by kernel-default-devel-5.11.10-1.1.gfccd69b.x86_64

    [ ] do not install kernel-syms-5.11.10-1.1.gfccd69b.x86_64

    [ ] break kernel-default-devel-5.11.10-1.1.gfccd69b.x86_64 by ignoring some of its dependencies




#### YaST2 conflicts list END ###
Comment 4 Takashi Iwai 2021-03-27 15:56:57 UTC
Do you need kernel-devel, i.e. you have to build your module like Nivida driver?
If not, just download kernel-default.rpm from the OBS download URL, and install it.
Comment 5 Carlos Robinson 2021-03-27 17:10:28 UTC
I used Nvidia in the past, yes, and more in the past I sometimes recompiled the kernel. So the package is there although I don't use it this moment. 
Ok, updating kernel-default.rpm and the automatics:

Telcontar:~ # rpm -qa --last | head
kernel-firmware-realtek-20210315-35.1.noarch  Sat Mar 27 18:07:21 2021
kernel-firmware-radeon-20210315-35.1.noarch   Sat Mar 27 18:07:21 2021
kernel-firmware-platform-20210315-35.1.noarch Sat Mar 27 18:07:21 2021
kernel-firmware-amdgpu-20210315-35.1.noarch   Sat Mar 27 18:07:20 2021
kernel-default-5.11.10-2.1.gfa5dcf1.x86_64    Sat Mar 27 18:04:04 2021
Comment 6 Carlos Robinson 2021-03-28 01:21:28 UTC
Yes, I did reproduce the problem. With a change: id did not try to powerdown when hibernate "failed", and it did succeed after about half an hour.

I issued the hibernation request at about 10 minutes to midnight, just a few minutes after the texpire job starts to run and activity in /var/spool/news ramps up and continues for half an hour.

I will attach the /var/kog/message file out of syslog (by the way, the kernel boot messages are missing, this started happening with 15.2)
I have an allmessages file if you want it too, but it is way verbose and perhaps sensitive (may contain nntp and mail entries). Ask for it if you need it.
I have also saved "journalctl -b > journalparabugzilla", ask for it if you need it.


   Time points, two attempts - watch for line wraps in report:

2021-03-27 18:13:08+01:00 - Booting the system now  ================================================================================  Linux Telcontar 5.11.10-2.gfa5dcf1-default #1 SMP Fri Mar 26 08:43:28 UTC 2021 (fa5dcf1) x86_64 x86_64 x86_64 GNU/Linux

   boots up

   normal afternoon.

   cron job start

<9.4> 2021-03-27T23:45:01.556746+01:00 Telcontar texpire - - -  CER: running leafnode's texpire without ionice

   I request hibernate ("systemctl hibernate" as root):

<3.6> 2021-03-27T23:52:08.166881+01:00 Telcontar systemd 1 - -  Reached target Sleep.
<3.6> 2021-03-27T23:52:08.167292+01:00 Telcontar systemd 1 - -  Starting Hibernate...
<3.4> 2021-03-27T23:52:08.226198+01:00 Telcontar systemd-sh - - -  Hibernating the system now...


   and hibernation happens uneventfully.

<3.6> 2021-03-27T23:52:10.144155+01:00 Telcontar systemd-sleep 31043 - -  INFO: Done.
<3.6> 2021-03-27T23:52:13.262746+01:00 Telcontar systemd-sleep 31043 - -  Suspending system...
<0.6> 2021-03-27T23:52:13.266649+01:00 Telcontar kernel - - - [20367.676939] PM: hibernation: hibernation entry
<0.6> 2021-03-28T00:20:10.274394+01:00 Telcontar kernel - - - [20368.195058] Filesystems sync: 0.517 seconds
<0.6> 2021-03-28T00:20:10.274493+01:00 Telcontar kernel - - - [20368.195061] Freezing user space processes ... (elapsed 0.001 seconds) done.
<0.6> 2021-03-28T00:20:10.274494+01:00 Telcontar kernel - - - [20368.196906] OOM killer disabled.



   I think that it was a success. But sometime later I wake up the machine and try again, this time when I see continuous write activity on sdc9 aka /var/spool/news.


   Thawing the system:

<0.6> 2021-03-28T00:20:10.281155+01:00 Telcontar kernel - - - [20374.643302] PM: hibernation: hibernation exit
<0.5> 2021-03-28T00:20:10.281156+01:00 Telcontar kernel - - - [20374.643572] sd 10:0:0:0: [sde] Synchronizing SCSI cache
<3.6> 2021-03-28T00:20:10.284616+01:00 Telcontar systemd-sleep 31043 - -  System resumed.
<3.6> 2021-03-28T00:20:10.284827+01:00 Telcontar pulseaudio 6878 - -  W: [alsa-sink-ALC892 Analog] alsa-util.c: Could not recover alsa device from SUSPENDED state, trying to restart PCM

<3.6> 2021-03-28T00:20:10.303628+01:00 Telcontar systemd 1 - -  Starting Do daily mandb update...
<3.6> 2021-03-28T00:20:10.304331+01:00 Telcontar systemd 1 - -  Starting Rotate log files...
<3.6> 2021-03-28T00:20:10.305693+01:00 Telcontar systemd 1 - -  Starting Update locate database...

<3.6> 2021-03-28T00:20:10.389183+01:00 Telcontar systemd-sleep 31043 - -  INFO: Running grub-once-restore ..
<3.4> 2021-03-28T00:20:10.418684+01:00 Telcontar systemd-sh - - -  Thawing the system now...



   I issue systemctl hibernate. The system is busy with cronjobs, and I see continuous write activity on sdc9. The worst moment to hibernate.

<3.6> 2021-03-28T00:20:10.459479+01:00 Telcontar systemd 1 - -  Started Hibernate.
<3.6> 2021-03-28T00:20:10.459582+01:00 Telcontar systemd 1 - -  sleep.target: Unit not needed anymore. Stopping.
<3.6> 2021-03-28T00:20:10.459658+01:00 Telcontar systemd 1 - -  Stopped target Sleep.
<3.6> 2021-03-28T00:20:10.459727+01:00 Telcontar systemd 1 - -  Reached target Hibernate.
<4.6> 2021-03-28T00:20:10.459800+01:00 Telcontar systemd-logind 1731 - -  Operation 'sleep' finished.
<3.6> 2021-03-28T00:20:10.460173+01:00 Telcontar systemd 1 - -  hibernate.target: Unit not needed anymore. Stopping.
<3.6> 2021-03-28T00:20:10.460269+01:00 Telcontar systemd 1 - -  Stopped target Hibernate.

<0.6> 2021-03-28T00:20:10.525146+01:00 Telcontar kernel - - - [20374.886301] sd 10:0:0:0: [sde] Synchronize Cache(10) failed: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK

<3.6> 2021-03-28T00:20:10.909232+01:00 Telcontar do_mandb 32042 - -  find: ‘standard output’: Broken pipe
<3.6> 2021-03-28T00:20:10.909441+01:00 Telcontar do_mandb 32042 - -  find: write error

   Notice that in the middle of a failed request to hibernate, the system is still starting cron jobs adding more delay

<3.6> 2021-03-28T00:20:14.076079+01:00 Telcontar systemd 1 - -  Reloading System Logging Service.
<3.6> 2021-03-28T00:20:14.076923+01:00 Telcontar systemd 1 - -  Reloaded System Logging Service.
<3.6> 2021-03-28T00:20:16.575734+01:00 Telcontar systemd 1 - -  Started Rotate log files.


<3.6> 2021-03-28T00:23:04.377712+01:00 Telcontar systemd 1 - -  Reached target Sleep.
<3.6> 2021-03-28T00:23:04.378315+01:00 Telcontar systemd 1 - -  Starting Hibernate...

<3.6> 2021-03-28T00:23:04.409724+01:00 Telcontar systemd-sleep 24504 - -  INFO: Running prepare-grub ..
<3.4> 2021-03-28T00:23:04.436985+01:00 Telcontar systemd-sh - - -  Hibernating the system now...


<3.6> 2021-03-28T00:23:04.674795+01:00 Telcontar systemd-sleep 24504 - -    preparing boot-loader: selecting entry main-os, kernel /boot/5.11.10-2.gfa5dcf1-default
<3.6> 2021-03-28T00:23:44.081034+01:00 Telcontar systemd 1 - -  Starting Backup RPM database...
<3.6> 2021-03-28T00:24:07.216108+01:00 Telcontar systemd 1 - -  Started Backup RPM database.
<3.4> 2021-03-28T00:24:34.406964+01:00 Telcontar systemd-sleep 24504 - -  system-sleep terminated by signal ALRM.
<3.3> 2021-03-28T00:24:34.407204+01:00 Telcontar systemd-sleep 24504 - -  Execution failed: Protocol error
<3.6> 2021-03-28T00:24:34.407299+01:00 Telcontar systemd-sleep 24504 - -  Suspending system...
<0.6> 2021-03-28T00:24:34.409141+01:00 Telcontar kernel - - - [20638.773346] PM: hibernation: hibernation entry

<3.6> 2021-03-28T00:35:12.724847+01:00 Telcontar systemd 1 - -  Started Update locate database.

<5.6> 2021-03-28T00:41:06.345537+01:00 Telcontar rsyslogd - - -  -- MARK --

<9.4> 2021-03-28T00:45:26.408094+01:00 Telcontar texpire - - -  CER: finished

   This is when /dev/sdc9 activity would stop, but I was not looking, I had gone from the room tired of waiting.

<1.6> 2021-03-28T00:45:26.435402+01:00 Telcontar run-crons 30712 - -  leafnode: OK
<9.6> 2021-03-28T00:45:26.473139+01:00 Telcontar run-crons.daily - - -  Considering my daily cron jobs.
<9.6> 2021-03-28T00:45:26.502372+01:00 Telcontar run-crons.daily - - -  Disabling google-chrome cron job (removing its execute permission).
<1.4> 2021-03-28T00:45:26.530153+01:00 Telcontar run-crons 30712 - -  mio.daily returned 1
<1.6> 2021-03-28T00:45:26.563622+01:00 Telcontar run-crons 30712 - -  storebackup: OK
<1.6> 2021-03-28T00:45:26.849819+01:00 Telcontar run-crons 30712 - -  suse-texlive: OK

   Time discontinuity here, which means that now is when machine actually hibernates, about half an hour after I issued the order. Imagine I was in an emergency, like on UPS battery. Normally I panic much earlier and try to powerdown, which fails, then I hit the power switch and pray for a good recovery later.

<0.6> 2021-03-28T00:58:54.286380+01:00 Telcontar kernel - - - [21931.825420] Filesystems sync: 1293.045 seconds
<0.6> 2021-03-28T00:58:54.286458+01:00 Telcontar kernel - - - [21931.825426] Freezing user space processes ... (elapsed 0.001 seconds) done.
<0.6> 2021-03-28T00:58:54.286460+01:00 Telcontar kernel - - - [21931.827418] OOM killer disabled.
<0.7> 2021-03-28T00:58:54.286460+01:00 Telcontar kernel - - - [21931.827620] PM: hibernation: Marking nosave pages: [mem 0x00000000-0x00000fff]


<3.4> 2021-03-28T00:58:54.337671+01:00 Telcontar systemd-sh - - -  Thawing the system now...

<3.6> 2021-03-28T00:58:54.375645+01:00 Telcontar systemd 1 - -  Started Hibernate.
<3.6> 2021-03-28T00:58:54.375759+01:00 Telcontar systemd 1 - -  sleep.target: Unit not needed anymore. Stopping.
<3.6> 2021-03-28T00:58:54.375855+01:00 Telcontar systemd 1 - -  Stopped target Sleep.
<3.6> 2021-03-28T00:58:54.375937+01:00 Telcontar systemd 1 - -  Reached target Hibernate.
<3.6> 2021-03-28T00:58:54.376069+01:00 Telcontar systemd 1 - -  hibernate.target: Unit not needed anymore. Stopping.
<3.6> 2021-03-28T00:58:54.376156+01:00 Telcontar systemd 1 - -  Stopped target Hibernate.
<4.6> 2021-03-28T00:58:54.376213+01:00 Telcontar systemd-logind 1731 - -  Operation 'sleep' finished.


   I will now hibernate the machine and go to sleep myself. If you want me to run some command on the same session, tell me; on Monday I'll reboot to the Leap standard kernel.
Comment 7 Takashi Iwai 2021-03-28 08:02:20 UTC
Could you try hibernating directly via kernel interface:
  echo disk > /sys/power/state
instead of systemd?  If the problem is seen there, it's a bit easier to see what happening.  The next step would be testing as described in
  Documentation/power/basic-pm-debugging.rst
Comment 8 Carlos Robinson 2021-03-28 10:17:48 UTC
Be aware that I can only test once a day, at 23:55 hours, while texpire is writing.

Forcing texpire to run more times may not work, as there will have been no nntp activity enough (for writing) since it was triggered before.
Comment 9 Carlos Robinson 2021-03-28 10:18:58 UTC
Created attachment 847737 [details]
syslolg message file of session
Comment 10 Carlos Robinson 2021-03-28 22:13:44 UTC
recording data, 10 minutes after issuing command. Saving.

Total DISK READ :       0.00 B/s | Total DISK WRITE :      33.02 K/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:     180.42 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                           
14628 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.94 % [kworker/6:2+reiserfs/sdc9]
 1437 be/3 root        0.00 B/s    0.00 B/s  0.00 % 73.67 % [jbd2/sdc5-8]
  493 be/3 root        0.00 B/s   14.15 K/s  0.00 %  0.27 % [jbd2/nvme0n1p5-]
 4234 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.09 % [kworker/0:3-events_power_efficient]
15038 be/4 cer         0.00 B/s   18.87 K/s  0.00 %  0.00 % firefox --sm-clien~ [localStorage DB]
Comment 11 Carlos Robinson 2021-03-28 22:29:40 UTC
Ok, I did:

Telcontar:~ # time echo disk > /sys/power/state

real	24m59.993s
user	0m0.014s
sys	0m3.949s
Telcontar:~ # 


9.4> 2021-03-28T23:45:01.294828+02:00 Telcontar texpire - - -  CER: running leafnode's texpire without ionice
...
<0.6> 2021-03-28T23:56:24.272664+02:00 Telcontar kernel - - - [70697.016128] PM: hibernation: hibernation entry
...
<9.4> 2021-03-29T00:14:21.258312+02:00 Telcontar texpire - - -  CER: finished
<1.6> 2021-03-29T00:14:21.287049+02:00 Telcontar run-crons 4014 - -  leafnode: OK
<9.6> 2021-03-29T00:14:21.316525+02:00 Telcontar run-crons.daily - - -  Considering my daily cron jobs.
<1.6> 2021-03-29T00:14:21.344056+02:00 Telcontar run-crons 4014 - -  mio.daily: OK
<1.6> 2021-03-29T00:14:21.376993+02:00 Telcontar run-crons 4014 - -  storebackup: OK
<1.6> 2021-03-29T00:14:21.688651+02:00 Telcontar run-crons 4014 - -  suse-texlive: OK
<0.6> 2021-03-29T00:21:24.260731+02:00 Telcontar kernel - - - [71780.466285] Filesystems sync: 1083.443 seconds
<0.6> 2021-03-29T00:21:24.260822+02:00 Telcontar kernel - - - [71780.466288] Freezing user space processes ... (elapsed 0.002 seconds) done.
...
<0.6> 2021-03-29T00:21:24.260965+02:00 Telcontar kernel - - - [71790.090320] OOM killer enabled.
<0.6> 2021-03-29T00:21:24.265114+02:00 Telcontar kernel - - - [71790.090320] Restarting tasks ... done.
<0.6> 2021-03-29T00:21:24.265118+02:00 Telcontar kernel - - - [71790.092322] PM: hibernation: hibernation exit


As you can see the problem is the same. If I trigger hibernation while there is intense i/o on /dev/sdc9, hibernation stalls for 15..20 minutes. The disk activity takes precedence over the hibernation order.

Tomorrow night I can test something else, please tell me what command to run.
Comment 12 Takashi Iwai 2021-03-30 12:31:33 UTC
As already mentioned, the basic checks for debugging the PM issue is described in
  Documentation/power/basic-pm-debugging.rst
or the web version of the same content
  https://www.kernel.org/doc/html/latest/power/basic-pm-debugging.html

Could you perform those testing?
Comment 13 Miroslav Beneš 2022-02-26 09:09:25 UTC
Closing, no response. Feel free to reopen if the issue persists with Leap 15.3 and provide the information Takashi asked about.
Comment 14 Carlos Robinson 2022-03-10 20:48:42 UTC
Sorry, I forgot about your request.

I did, however, did many tests, reported on the openSUSE user mail list. The problem was indeed related to syncing the single reiserfs partition, I assume writing the time stamp metadata: a news server updates access times of thousand or millions of files, but the actual writing is delayed when the partition is mounted with relatime,lazytime.

The problem disappeared mostly when I moved the news spool from rotating rust to a new reiserfs partition on SSD (indicating that the long delay time is related to head access time), but while the one on rotating rust was mounted (although unused) I occasionally still saw long sync delays which I could not explain.

The problem disappeared completely when I reformated said reiserfs rotating rust partition to ext4, and put the news spool back on it (no, btrfs and XFS are both reported as a bad alternatives for this usage, specially btrfs).


The problem can not be evaluated on 15.3, because on 15.3 reiserfs is not supported at all (but this is undocumented):

   Unknown filesystem 'reiserfs'

The kernel modules have to be installed, package "reiserfs-kmp-default" with kernel module for it (plus libreiserfs-0_3-0, libreiserfscore0 and reiserfs).

Once that module is added, an attempt to mount the partition results in:

   mount: /somepartition: WARNING: source write-protected, mount read only.

and in syslog it can be read:

  ... kernel - - - [ 7297.208721] reiserfs: read-write mode is unsupported.
  ... kernel - - - [ 7297.208724] reiserfs: load module with
allow_unsupported=1 to enable read-write mode.


To enable r/w (with unknown consequences aka risks), create file
"/etc/modprobe.d/50-reiserfs.conf":

   options reiserfs       allow_unsupported=1 

and reboot (or perhaps unload and reload the module).


Thus I'm sorry, but I can not do further investigation.