Bug 1151839 - Starting clamd through systemd gives time out.
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Other
Version: Current
Hardware: x86-64 All
Priority: P5 - None
Severity: Minor
Assigned To: Reinhard Max
Reported: 2019-09-24 10:56 UTC by Ivan Topolsky
Modified: 2020-12-09 17:22 UTC (History)

Attachments
Switching type to simple and starting in foreground (501 bytes, patch)
2019-09-24 10:56 UTC, Ivan Topolsky

Description Ivan Topolsky 2019-09-24 10:56:19 UTC
Created attachment 819302 [details]
Switching type to simple and starting in foreground

Over time, ClamAV's virus signature database has grown substantially, to over 6 million entries as of this writing.

The signatures themselves aren't simple string pattern matches anymore; some of them require bytecode (which, depending on the clamd installation, might be JITed when clamd loads).

So over the three years since bug #962227, diminutive low-power devices such as the Raspberry Pi aren't the only ones where clamd's systemd .service times out.
It now also happens on some old-ish Intel CPUs.

I would kindly ask you to reconsider the "WONTFIX" timeout situation.

This could be done either by increasing the time-out limits locally for clamd.service (see the older Raspberry Pi thread for a suggestion),
or alternatively by running clamd in the foreground and switching the unit type to "simple" (see attached diff).

A signal for reloading the database (instead of plain restarting the daemon) could also help (yet another situation where (re-)startup timeout could happen).

Thank you very much for your consideration.
Comment 1 Reinhard Max 2019-09-26 11:16:00 UTC
I agree that something should be done about this, but using Type=simple is not the right solution, because it causes clamd.service to be considered "up" by systemd when it is not yet able to handle client requests, which might cause follow-up services to fail.

Also, just increasing the timeout in the packaged service file would not be more than a workaround, because you'll always find a machine that needs more time and OTOH it might cause problems on fast machines if clamd actually hangs, but has to wait for ages before it times out.

The perfect solution for this case would be Type=notify, so that clamd can extend the timeout if it needs more time to process the database, but that requires changes to clamd (see sd_notify(3)).
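
Until clamd itself speaks sd_notify, the idea can be roughly approximated from outside the daemon. The following is only a sketch of a wrapper approach, not the change proposed to upstream and not something the package ships: the helper names, the 30-second extension, the 10-second poll interval and the use of the local socket file as a readiness indicator are all assumptions. It relies on systemd-notify(1) and the EXTEND_TIMEOUT_USEC= message from sd_notify(3) (systemd 236 or later); the unit would also need Type=notify and NotifyAccess=all.

```shell
# Hypothetical helper: build the sd_notify(3) message that asks systemd
# for N more seconds before the start job times out.
extend_timeout_msg() {
    printf 'EXTEND_TIMEOUT_USEC=%s' "$(( $1 * 1000000 ))"
}

# Hypothetical helper: keep extending the start timeout until the given
# socket file exists, then report readiness.
# Assumption: clamd creates its local socket only once it can serve clients.
wait_until_ready() {
    sock=$1
    while [ ! -S "$sock" ]; do
        systemd-notify "$(extend_timeout_msg 30)"
        sleep 10
    done
    systemd-notify --ready
}
```

A wrapper used as ExecStart= could launch clamd in the foreground (the Foreground option in clamd.conf) as a background job, call wait_until_ready with the LocalSocket path from clamd.conf, and then wait on the clamd process. A hung clamd would still time out, because the extensions stop as soon as the wrapper stops sending them.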

I've just suggested this change to upstream, so let's see what they say about it.

For the time being I suggest you check how long clamd takes for starting up in your case and set the timeout accordingly to avoid the problems that might arise from using Type=simple.
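
One possible way to do this measurement, as a minimal sketch: systemd records monotonic timestamps (in microseconds) for when a unit left the inactive state and when it became active, and the difference is the startup time. The function names and the headroom rule (50% extra, rounded up to a multiple of 30 seconds) are assumptions made for illustration, not a recommendation from the package; `systemctl show --value` needs systemd 230 or later.

```shell
# Print how long the last successful start of a unit took, in whole seconds.
# Uses the unit's monotonic timestamps, which systemd reports in microseconds.
unit_start_seconds() {
    unit=$1
    began=$(systemctl show -p InactiveExitTimestampMonotonic --value "$unit")
    ready=$(systemctl show -p ActiveEnterTimestampMonotonic --value "$unit")
    echo $(( (ready - began) / 1000000 ))
}

# Hypothetical rule of thumb: add 50% headroom to the measured time,
# then round up to the next multiple of 30 seconds.
suggest_timeout_sec() {
    measured=$1
    padded=$(( measured * 3 / 2 ))
    echo $(( (padded + 29) / 30 * 30 ))
}
```

For example, `suggest_timeout_sec "$(unit_start_seconds clamd.service)"` prints a candidate value for TimeoutStartSec=. The measurement is only meaningful after a start that actually succeeded, so on an affected machine the timeout may have to be raised generously first.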
Comment 2 Arjen de Korte 2019-10-09 17:25:27 UTC
Since July, the startup time for the clamd service has gone up by more than 50%, so two days ago I was hit by this too. I kindly ask you to reconsider adding TimeoutStartSec=180 to the service file. I fully agree this is not an optimal solution, but we'll probably see more and more people running into this, even on not-so-minimal hardware.

If you're reading this bugreport because you searched Bugzilla for why clamd will suddenly no longer start, the following will fix this for now:

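# Run as root. -p avoids an error if the drop-in directory already exists.
mkdir -p /etc/systemd/system/clamd.service.d
cat > /etc/systemd/system/clamd.service.d/override.conf << EOF
[Service]
TimeoutStartSec=180
EOF
# Make systemd pick up the new drop-in, then start the service again.
systemctl daemon-reload
systemctl start clamd.service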

Increase the timeout if necessary (default is 90 seconds).
Comment 3 Arjen de Korte 2019-10-10 19:10:12 UTC
(In reply to Ivan Topolsky from comment #0)

> A signal for reloading the database (instead of plain restarting the daemon)
> could also help (yet another situation where (re-)startup timeout could
> happen).

I just tried the above. This doesn't really help much:

systemctl reload clamd.service  - 77 seconds to completion
systemctl restart clamd.service - 78 seconds to completion

At best reloading will save a second. I don't think this is worth the effort. The culprit here is (re)loading the databases, which is the vast majority of the reload/restart time.
Comment 4 Ivan Topolsky 2019-10-11 08:44:01 UTC
(In reply to Arjen de Korte from comment #3)
> (In reply to Ivan Topolsky from comment #0)
> I just tried the above. This doesn't really help much:
>
> systemctl reload clamd.service  - 77 seconds to completion
> systemctl restart clamd.service - 78 seconds to completion

Sorry, my mistake: I didn't formulate my post clearly enough.

My point isn't to *shave a few seconds* by using a signal. (Again, most of the time is spent loading and parsing all the .cvd/.cld files and JITing all the bytecode. The POSIX process teardown/start-up is insignificant, as your tests clearly show.)

But instead...

> At best reloading will save a second. {...} which is the vast majority of the reload/restart time.


...systemd makes a clear distinction between *reload* and *restart*.

Restart is basically shorthand for *stop and then start*.
If loading the databases times out, you are again at risk of the "start" failing.

Reload is basically "send some signal to the service and let it do its thing".
If loading the database takes a long time, there is no systemd "start" job that could fail.

TL;DR: it's not about being faster, it's about avoiding running yet another "start" command that is at risk of timing out.

Sorry for not being clear enough the first time.
Comment 5 Ivan Topolsky 2019-10-11 08:55:57 UTC
(In reply to Reinhard Max from comment #1)
> The perfect solution for this case would be Type=notify, so that clamd can
> extend the timeout if it needs more time to process the database, but that
> requires changes to clamd (see sd_notify(3)).
> 
> I've just suggested this change to upstream, so let's see what they say
> about it.


That would be great, it's indeed the perfect solution.

I hope that upstream will think about it.

On the other hand, Cisco Talos, the current owner of ClamAV, is in the big server business (e.g. milters). Most of their target use cases are much beefier server machines that probably load the database in a blink. Adding sd_notify might not be at the top of their priority list. :-/
Comment 7 Swamp Workflow Management 2019-10-25 17:40:06 UTC
This is an autogenerated message for OBS integration:
This bug (1151839) was mentioned in
https://build.opensuse.org/request/show/742982 Factory / clamav
Comment 8 Reinhard Max 2019-11-18 15:00:47 UTC
I am closing this for now as we have workarounds in all relevant code streams.
I won't get to implement the systemd improvements myself in the foreseeable future, but if upstream does so, we will of course adopt it.
Comment 10 Swamp Workflow Management 2019-11-25 20:15:53 UTC
SUSE-SU-2019:3053-1: An update that solves two vulnerabilities and has one errata is now available.

Category: security (moderate)
Bug References: 1144504,1149458,1151839
CVE References: CVE-2019-12625,CVE-2019-12900
Sources used:
SUSE Linux Enterprise Module for Basesystem 15-SP1 (src):    clamav-0.100.3-3.14.1
SUSE Linux Enterprise Module for Basesystem 15 (src):    clamav-0.100.3-3.14.1

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
Comment 11 Swamp Workflow Management 2019-11-26 14:11:38 UTC
SUSE-SU-2019:3066-1: An update that solves two vulnerabilities and has one errata is now available.

Category: security (moderate)
Bug References: 1144504,1149458,1151839
CVE References: CVE-2019-12625,CVE-2019-12900
Sources used:
SUSE OpenStack Cloud Crowbar 8 (src):    clamav-0.100.3-33.26.1
SUSE OpenStack Cloud 8 (src):    clamav-0.100.3-33.26.1
SUSE OpenStack Cloud 7 (src):    clamav-0.100.3-33.26.1
SUSE Linux Enterprise Server for SAP 12-SP3 (src):    clamav-0.100.3-33.26.1
SUSE Linux Enterprise Server for SAP 12-SP2 (src):    clamav-0.100.3-33.26.1
SUSE Linux Enterprise Server for SAP 12-SP1 (src):    clamav-0.100.3-33.26.1
SUSE Linux Enterprise Server 12-SP4 (src):    clamav-0.100.3-33.26.1
SUSE Linux Enterprise Server 12-SP3-LTSS (src):    clamav-0.100.3-33.26.1
SUSE Linux Enterprise Server 12-SP3-BCL (src):    clamav-0.100.3-33.26.1
SUSE Linux Enterprise Server 12-SP2-LTSS (src):    clamav-0.100.3-33.26.1
SUSE Linux Enterprise Server 12-SP2-BCL (src):    clamav-0.100.3-33.26.1
SUSE Linux Enterprise Server 12-SP1-LTSS (src):    clamav-0.100.3-33.26.1
SUSE Linux Enterprise Desktop 12-SP4 (src):    clamav-0.100.3-33.26.1
SUSE Enterprise Storage 5 (src):    clamav-0.100.3-33.26.1
HPE Helion Openstack 8 (src):    clamav-0.100.3-33.26.1

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
Comment 12 Reinhard Max 2019-11-26 15:47:40 UTC
Just for completeness: Starting with version 0.102.1 ClamAV comes with a new and much faster signature loading algorithm.
Comment 13 Arjen de Korte 2019-11-26 15:57:18 UTC
In addition to the faster loading of signatures in ClamAV 0.102.1, there is also ongoing development on threaded reloading of signatures.

At the cost of at most double the amount of memory in use, it allows scanning to continue with the previous set of signatures while a new set is being loaded (the existing scanner blocks scanning while a reload is in progress). Once the new signatures are loaded, new scanner processes start using the new set. As soon as all processes using the old set have finished, the old set is freed and memory usage returns to normal.

It does mean however that you need to allow for a peak memory usage of around 2 GB. See https://build.opensuse.org/package/show/home:adkorte/clamav
Comment 14 Swamp Workflow Management 2019-11-30 23:13:37 UTC
openSUSE-SU-2019:2595-1: An update that solves two vulnerabilities and has one errata is now available.

Category: security (moderate)
Bug References: 1144504,1149458,1151839
CVE References: CVE-2019-12625,CVE-2019-12900
Sources used:
openSUSE Leap 15.1 (src):    clamav-0.100.3-lp151.2.3.1
Comment 15 Swamp Workflow Management 2019-12-01 05:11:17 UTC
openSUSE-SU-2019:2597-1: An update that solves two vulnerabilities and has one errata is now available.

Category: security (moderate)
Bug References: 1144504,1149458,1151839
CVE References: CVE-2019-12625,CVE-2019-12900
Sources used:
openSUSE Leap 15.0 (src):    clamav-0.100.3-lp150.2.13.1
Comment 17 Swamp Workflow Management 2020-12-09 17:22:29 UTC
SUSE-SU-2020:3729-1: An update that solves 8 vulnerabilities, contains one feature and has one errata is now available.

Category: security (important)
Bug References: 1118459,1119353,1144504,1149458,1151839,1157763,1171981,1174250,1174255
CVE References: CVE-2019-12625,CVE-2019-12900,CVE-2019-15961,CVE-2020-3123,CVE-2020-3327,CVE-2020-3341,CVE-2020-3350,CVE-2020-3481
JIRA References: ECO-3010
Sources used:
SUSE Linux Enterprise Server 12-SP5 (src):    clamav-0.103.0-3.3.1

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.