Bug 913695

Summary: possible memory leak in kernel 3.16.7
Product: [openSUSE] openSUSE Distribution Reporter: Miroslav Beneš <mbenes>
Component: Kernel Assignee: Michal Hocko <mhocko>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: jslaby, mbenes, mhocko, mhocko
Version: 13.2   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: meminfo, slabinfo, zoneinfo, processes

Description Miroslav Beneš 2015-01-19 14:12:12 UTC
An openSUSE 13.2 machine running the latest kernel, 3.16.7-7, reached a state where not enough free memory was available. Swap was full, yet at first glance no process was holding the memory. The system had been running for several weeks. The log contained no page allocation failure or out-of-memory message.

/proc/meminfo, slabinfo and zoneinfo are attached, along with the list of processes.

According to Michal Hocko, this could be fixed by upstream kernel commit 5ddacbe92b806cd5b4f8f154e8e46ac267fff55c, which is not present in 3.16.7.

Unfortunately, I had to turn off the computer because of a power shutdown in the building, so I cannot currently reproduce the problem.
Comment 1 Miroslav Beneš 2015-01-19 14:12:35 UTC
Created attachment 620006 [details]
meminfo
Comment 2 Miroslav Beneš 2015-01-19 14:13:05 UTC
Created attachment 620007 [details]
slabinfo
Comment 3 Miroslav Beneš 2015-01-19 14:13:22 UTC
Created attachment 620008 [details]
zoneinfo
Comment 4 Miroslav Beneš 2015-01-19 14:14:04 UTC
Created attachment 620009 [details]
processes
Comment 5 Michal Hocko 2015-01-19 16:15:07 UTC
MemTotal:        8131708 kB
MemFree:         1889300 kB
[...]
Buffers:           23424 kB
[...]
SwapCached:        31980 kB
Active:          3174752 kB
Inactive:        1886984 kB
[...]
Unevictable:          80 kB
Mlocked:              80 kB
SwapTotal:       2103292 kB
SwapFree:              8 kB
[...]
Slab:             782384 kB
[...]
KernelStack:        8144 kB
PageTables:        41880 kB

echo $((8131708-(1889300+23424+31980+3174752+1886984+80+782384+8144+41880)))
292780

So almost 300M is unaccounted for. This might indicate a memory leak, or some kernel component may be allocating directly from the page allocator and using that memory.
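
The same back-of-the-envelope calculation can be sketched in Python so that it can be rerun against a fresh /proc/meminfo dump. This is only an illustration: the helper names (parse_meminfo, unaccounted_kb, ACCOUNTED_FIELDS) are hypothetical, and the set of fields subtracted simply mirrors the sum above rather than a complete accounting of kernel memory.

```python
import re

# Fields subtracted from MemTotal. This mirrors the manual sum above and is
# an assumption of this sketch, not a complete accounting of kernel memory.
ACCOUNTED_FIELDS = (
    "MemFree", "Buffers", "SwapCached", "Active", "Inactive",
    "Unevictable", "Slab", "KernelStack", "PageTables",
)

# Excerpt of the attached meminfo, as quoted in this comment.
SAMPLE = """\
MemTotal:        8131708 kB
MemFree:         1889300 kB
Buffers:           23424 kB
SwapCached:        31980 kB
Active:          3174752 kB
Inactive:        1886984 kB
Unevictable:          80 kB
Mlocked:              80 kB
SwapTotal:       2103292 kB
SwapFree:              8 kB
Slab:             782384 kB
KernelStack:        8144 kB
PageTables:        41880 kB
"""


def parse_meminfo(text):
    """Return {field name: integer value} for lines like 'Name:   123 kB'."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"^(\S+):\s+(\d+)", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def unaccounted_kb(values):
    """MemTotal minus the accounted fields, in kB."""
    return values["MemTotal"] - sum(values[name] for name in ACCOUNTED_FIELDS)


if __name__ == "__main__":
    # Prints 292780 for the excerpt, matching the shell arithmetic above.
    print(unaccounted_kb(parse_meminfo(SAMPLE)))
```

On a live system the text could be read from /proc/meminfo instead of SAMPLE. A nonzero result is only a hint, since memory taken directly from the page allocator is not listed in any of these fields.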

5ddacbe92b80 (mm: free compound page with correct order) sounds like a good fit as well, because you've told me (off-bugzilla) that the machine had been running for a long time and that you had run kvm with 4GB of memory many times. I can imagine that the THP zero page would be mapped many times under that load, so the leak could build up continually until it becomes noticeable.

I will push the patch to git even though we are not 100% sure this is the real fix for this issue. Having /proc/vmstat would be ideal, because we could check thp_zero_page_alloc, which is present in neither meminfo nor zoneinfo.
Comment 6 Michal Hocko 2015-01-19 16:25:42 UTC
Pushed to the openSUSE-13.2 branch with a note that the culprit might be different. The fix addresses a real leak regardless, and since the stable kernel for 3.16 is already dead, we need it anyway.
Comment 7 Michal Hocko 2015-01-19 16:32:33 UTC
Pushed the patch to openSUSE-13.1 as well. The SLE branches are fine: SLE12 has the fix from stable, and SLE11-SP3 and older do not have 97ae17497e99, which introduced the issue.
Comment 8 Miroslav Beneš 2015-02-02 15:21:23 UTC
So I am in the same situation again after two weeks. This is the relevant part of /proc/vmstat:

thp_fault_alloc 9500
thp_fault_fallback 107474
thp_collapse_alloc 3284
thp_collapse_alloc_failed 17798
thp_split 2167
thp_zero_page_alloc 21
thp_zero_page_alloc_failed 1751

I am going to install the kernel with the fix and see what happens.
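
For reference, the THP counters quoted above can be pulled out of a /proc/vmstat dump with a few lines of Python. This is a minimal sketch with hypothetical helper names (parse_vmstat, failure_ratio). The ratios only show how often an allocation attempt failed or fell back; they do not confirm or rule out the leak on their own.

```python
# THP counters from this comment, in the "name value" layout of /proc/vmstat.
SAMPLE_VMSTAT = """\
thp_fault_alloc 9500
thp_fault_fallback 107474
thp_collapse_alloc 3284
thp_collapse_alloc_failed 17798
thp_split 2167
thp_zero_page_alloc 21
thp_zero_page_alloc_failed 1751
"""


def parse_vmstat(text):
    """Return {counter name: integer value} for well-formed 'name value' lines."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def failure_ratio(succeeded, failed):
    """Fraction of attempts that failed; 0.0 when there were no attempts."""
    total = succeeded + failed
    return failed / total if total else 0.0


if __name__ == "__main__":
    c = parse_vmstat(SAMPLE_VMSTAT)
    print("huge zero page allocations:", c["thp_zero_page_alloc"])
    print("zero page failure ratio:",
          failure_ratio(c["thp_zero_page_alloc"], c["thp_zero_page_alloc_failed"]))
    print("fault fallback ratio:",
          failure_ratio(c["thp_fault_alloc"], c["thp_fault_fallback"]))
```

Running the same script before and after installing the fixed kernel, with the text read from /proc/vmstat, would give comparable numbers for both kernels.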
Comment 9 Swamp Workflow Management 2015-04-13 12:13:55 UTC
openSUSE-SU-2015:0713-1: An update that solves 13 vulnerabilities and has 52 fixes is now available.

Category: security (important)
Bug References: 867199,893428,895797,900811,901925,903589,903640,904899,905681,907039,907818,907988,908582,908588,908589,908592,908593,908594,908596,908598,908603,908604,908605,908606,908608,908610,908612,909077,909078,909477,909634,910150,910322,910440,911311,911325,911326,911356,911438,911578,911835,912061,912202,912429,912705,913059,913466,913695,914175,915425,915454,915456,915577,915858,916608,917830,917839,918954,918970,919463,920581,920604,921313,922542,922944
CVE References: CVE-2014-8134,CVE-2014-8160,CVE-2014-8559,CVE-2014-9419,CVE-2014-9420,CVE-2014-9428,CVE-2014-9529,CVE-2014-9584,CVE-2014-9585,CVE-2015-0777,CVE-2015-1421,CVE-2015-1593,CVE-2015-2150
Sources used:
openSUSE 13.2 (src):    bbswitch-0.8-3.6.6, cloop-2.639-14.6.6, crash-7.0.8-6.6, hdjmod-1.28-18.7.6, ipset-6.23-6.6, kernel-docs-3.16.7-13.2, kernel-obs-build-3.16.7-13.7, kernel-obs-qa-3.16.7-13.1, kernel-obs-qa-xen-3.16.7-13.1, kernel-source-3.16.7-13.1, kernel-syms-3.16.7-13.1, pcfclock-0.44-260.6.2, vhba-kmp-20140629-2.6.2, virtualbox-4.3.20-10.2, xen-4.4.1_08-12.2, xtables-addons-2.6-6.2
Comment 10 Swamp Workflow Management 2015-04-13 12:19:38 UTC
openSUSE-SU-2015:0714-1: An update that solves 11 vulnerabilities and has 5 fixes is now available.

Category: security (important)
Bug References: 903640,904899,907988,909078,910150,911325,911326,912202,912654,912705,913059,913695,914175,915322,917839,920901
CVE References: CVE-2014-7822,CVE-2014-8134,CVE-2014-8160,CVE-2014-8173,CVE-2014-8559,CVE-2014-9419,CVE-2014-9420,CVE-2014-9529,CVE-2014-9584,CVE-2014-9585,CVE-2015-1593
Sources used:
openSUSE 13.1 (src):    cloop-2.639-11.19.1, crash-7.0.2-2.19.1, hdjmod-1.28-16.19.1, ipset-6.21.1-2.23.1, iscsitarget-1.4.20.3-13.19.1, kernel-docs-3.11.10-29.2, kernel-source-3.11.10-29.1, kernel-syms-3.11.10-29.1, ndiswrapper-1.58-19.1, pcfclock-0.44-258.19.1, vhba-kmp-20130607-2.20.1, virtualbox-4.2.28-2.28.1, xen-4.3.3_04-37.1, xtables-addons-2.3-2.19.1
Comment 11 Jiri Slaby 2015-05-20 07:15:03 UTC
Any updates with the updated kernel?
Comment 12 Miroslav Beneš 2015-05-20 07:56:47 UTC
It seems to be ok now. My current uptime is 36 days and I do not see the mentioned problems anymore. I think this can be closed as resolved/fixed. Thanks a lot.
Comment 13 Michal Hocko 2015-05-20 08:03:40 UTC
Thanks for the feedback. Closing...