Historically memcg overhead was high even if memcg was unused. This has
improved a lot but it still showed up in a profile summary as being a
problem.
/usr/src/linux-4.0-vanilla/mm/memcontrol.c 6.6441 395842
mem_cgroup_try_charge 2.950% 175781
__mem_cgroup_count_vm_event 1.431% 85239
mem_cgroup_page_lruvec 0.456% 27156
mem_cgroup_commit_charge 0.392% 23342
uncharge_list 0.323% 19256
mem_cgroup_update_lru_size 0.278% 16538
memcg_check_events 0.216% 12858
mem_cgroup_charge_statistics.isra.22 0.188% 11172
try_charge 0.150% 8928
commit_charge 0.141% 8388
get_mem_cgroup_from_mm 0.121% 7184
That is showing that 6.64% of system CPU cycles were in memcontrol.c and
dominated by mem_cgroup_try_charge. The annotation shows that the bulk
of the cost was checking PageSwapCache which is expected to be cache hot
but is very expensive. The problem appears to be that __SetPageUptodate
is called just before the check which is a write barrier. It is
required to make sure struct page and page data is written before the
PTE is updated and the data visible to userspace. memcg charging does
not require or need the barrier but gets unfairly hit with the cost so
this patch attempts the charging before the barrier. Aside from the
accidental cost to memcg there is the added benefit that the barrier is
avoided if the page cannot be charged. When applied the relevant
profile summary is as follows.
/usr/src/linux-4.0-chargefirst-v2r1/mm/memcontrol.c 3.7907 223277
__mem_cgroup_count_vm_event 1.143% 67312
mem_cgroup_page_lruvec 0.465% 27403
mem_cgroup_commit_charge 0.381% 22452
uncharge_list 0.332% 19543
mem_cgroup_update_lru_size 0.284% 16704
get_mem_cgroup_from_mm 0.271% 15952
mem_cgroup_try_charge 0.237% 13982
memcg_check_events 0.222% 13058
mem_cgroup_charge_statistics.isra.22 0.185% 10920
commit_charge 0.140% 8235
try_charge 0.131% 7716
That brings the overhead down to 3.79% and leaves the memcg fault
accounting to the root cgroup but it's an improvement. The difference
in headline performance of the page fault microbench is marginal as
memcg is such a small component of it.
pft faults
4.0.0 4.0.0
vanilla chargefirst
Hmean faults/cpu-1
1443258.1051 ( 0.00%)
1509075.7561 ( 4.56%)
Hmean faults/cpu-3
1340385.9270 ( 0.00%)
1339160.7113 ( -0.09%)
Hmean faults/cpu-5 875599.0222 ( 0.00%) 874174.1255 ( -0.16%)
Hmean faults/cpu-7 601146.6726 ( 0.00%) 601370.9977 ( 0.04%)
Hmean faults/cpu-8 510728.2754 ( 0.00%) 510598.8214 ( -0.03%)
Hmean faults/sec-1
1432084.7845 ( 0.00%)
1497935.5274 ( 4.60%)
Hmean faults/sec-3
3943818.1437 ( 0.00%)
3941920.1520 ( -0.05%)
Hmean faults/sec-5
3877573.5867 ( 0.00%)
3869385.7553 ( -0.21%)
Hmean faults/sec-7
3991832.0418 ( 0.00%)
3992181.4189 ( 0.01%)
Hmean faults/sec-8
3987189.8167 ( 0.00%)
3986452.2204 ( -0.02%)
It's only visible at single threaded. The overhead is there for higher
threads but other factors dominate.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
goto oom;
cow_user_page(new_page, old_page, address, vma);
}
- __SetPageUptodate(new_page);
if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg))
goto oom_free_new;
+ __SetPageUptodate(new_page);
+
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
/*
page = alloc_zeroed_user_highpage_movable(vma, address);
if (!page)
goto oom;
+
+ if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
+ goto oom_free_page;
+
/*
* The memory barrier inside __SetPageUptodate makes sure that
* preceeding stores to the page contents become visible before
*/
__SetPageUptodate(page);
- if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
- goto oom_free_page;
-
entry = mk_pte(page, vma->vm_page_prot);
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));