Currently the locking in blockdev_direct_IO is a mess, we have three
different locking types and very confusing checks for some of them. The
most complicated one is DIO_OWN_LOCKING for reads, which happens to not
actually be used.
This patch gets rid of the DIO_OWN_LOCKING - as mentioned above the read
case is unused anyway, and the write side is almost identical to
DIO_NO_LOCKING. The difference is that DIO_NO_LOCKING always sets the
create argument for the get_blocks callback to zero, but we can easily
move that to the actual get_blocks callbacks. There are four users of the
DIO_NO_LOCKING mode: gfs already ignores the create argument and thus is
fine with the new version, ocfs2 only errors out if create were ever set,
and we can remove this dead code now, the block device code only ever uses
create for an error message if we are fully beyond the device which can
never happen, and last but not least XFS will need the new behavour for
writes.
Now we can replace the lock_type variable with a flags one, where no flag
means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first
flag. Separate out the check for not allowing to fill holes into a
separate flag, although for now both flags always get set at the same
time.
Also revamp the documentation of the locking scheme to actually make
sense.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alex Elder <aelder@sgi.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <joel.becker@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dio: zero struct dio with kzalloc instead of manually
This patch uses kzalloc to zero all of struct dio rather than
manually trying to track which fields we rely on being zero. It
passed aio+dio stress testing and some bug regression testing on
ext3.
This patch was introduced by Linus in the conversation that lead up
to Badari's minimal fix to manually zero .map_bh.b_state in commit:
I was unable to measure a stable difference in the number of cpu
cycles spent in blockdev_direct_IO() when pushing aio+dio 256K reads
at ~340MB/s.
So the resulting intent of the patch isn't a performance gain but to
avoid exposing ourselves to the risk of finding another field like
.map_bh.b_state where we rely on zeroing but don't enforce it in the
code.
Zach surmised that zeroing out the page array was what caused most of
the problem, and suggested the approach taken in the attached patch for
resolving the issue. Intel re-tested with this patch and saw a 0.6%
performance gain (the original regression was 0.5%).
[akpm@linux-foundation.org: add comment] Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shaohua Li [Wed, 16 Dec 2009 00:47:47 +0000 (16:47 -0800)]
aio: remove unused field
Don't know the reason, but it appears ki_wait field of iocb never gets used.
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Zach Brown <zach.brown@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Wed, 16 Dec 2009 00:47:46 +0000 (16:47 -0800)]
FS-Cache: Avoid maybe-used-uninitialised warning on variable
Andrew Morton's compiler sees the following warning in FS-Cache:
fs/fscache/object-list.c: In function 'fscache_objlist_lookup':
fs/fscache/object-list.c:94: warning: 'obj' may be used uninitialized in this function
which my compiler doesn't. This is a false positive as obj can only be
used in the comparison against minobj if minobj has been set to something
other than NULL, but for that to happen, obj has to be first set to
something.
Deal with this by preclearing obj too.
Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Note, you can only do this before loading the crash kernel.
Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Neil Horman <nhorman@redhat.com> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nils Carlson [Wed, 16 Dec 2009 00:47:42 +0000 (16:47 -0800)]
edac: i5100 add 6 ranks per channel
Add support for 6 ranks per channel to the i5100 chipset. I have tested
the patch as far as possible with correctible errors and things appear
good. The DIMM mapping is correct for our board, but boards may differ.
Signed-off-by: Nils Carlson <nils.carlson@ludd.ltu.se> Acked-by: Arthur Jones <ajones@riverbed.com> Signed-off-by: Doug Thompson <dougthompson@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nils Carlson [Wed, 16 Dec 2009 00:47:42 +0000 (16:47 -0800)]
edac: i5100 add scrubbing
Addscrubbing to the i5100 chipset. The i5100 chipset only supports one
scrubbing rate, which is not constant but dependent on memory load. The
rate returned by this driver is an estimate based on some experimentation,
but is substantially closer to the truth than the speed supplied in the
documentation.
Also, scrubbing is done once, and then a done-bit is set. This means that
to accomplish continuous scrubbing a re-enabling mechanism must be used.
I have created the simplest possible such mechanism in the form of a
work-queue which will check every five minutes. This interval is quite
arbitrary but should be sufficient for all sizes of system memory.
Signed-off-by: Nils Carlson <nils.carlson@ludd.ltu.se> Signed-off-by: Doug Thompson <dougthompson@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pid: reduce code size by using a pointer to iterate over array
It decreases code size by 16 bytes on my gcc 4.4.1 on Core 2:
text data bss dec hex filename
4314 2216 8 6538 198a kernel/pid.o-BEFORE
4298 2216 8 6522 197a kernel/pid.o-AFTER
Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently all architectures but microblaze unconditionally define
USE_ELF_CORE_DUMP. The microblaze omission seems like an error to me, so
let's kill this ifdef and make sure we are the same everywhere.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: <linux-arch@vger.kernel.org> Cc: Michal Simek <michal.simek@petalogix.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Manfred Spraul [Wed, 16 Dec 2009 00:47:34 +0000 (16:47 -0800)]
ipc/sem.c: optimize single sops when semval is zero
If multiple simple decrements on the same semaphore are pending, then the
current code scans all decrement operations, even if the semaphore value
is already 0.
The patch optimizes that: if the semaphore value is 0, then there is no
need to scan the q->alter entries.
Note that this is a common case: It happens if 100 decrements by one are
pending and now an increment by one increases the semaphore value from 0
to 1. Without this patch, all 100 entries are scanned. With the patch,
only one entry is scanned, then woken up. Then the new rule triggers and
the scanning is aborted, without looking at the remaining 99 tasks.
With this patch, single sop increment/decrement by 1 are now O(1).
(same as with Nick's patch)
Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Manfred Spraul [Wed, 16 Dec 2009 00:47:33 +0000 (16:47 -0800)]
ipc/sem.c: optimize single semop operations
sysv sem has the concept of semaphore arrays that consist out of multiple
semaphores. Atomic operations that affect multiple semaphores are
supported.
The patch optimizes single semaphore operation calls that affect only one
semaphore: It's not necessary to scan all pending operations, it is
sufficient to scan the per-semaphore list.
The idea is from Nick Piggin version of an ipc sem improvement, the
implementation is different: The code tries to keep as much common code as
possible.
As the result, the patch is simpler, but optimizes fewer cases.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Manfred Spraul [Wed, 16 Dec 2009 00:47:32 +0000 (16:47 -0800)]
ipc/sem.c: add a per-semaphore pending list
Based on Nick's findings:
sysv sem has the concept of semaphore arrays that consist out of multiple
semaphores. Atomic operations that affect multiple semaphores are
supported.
The patch is the first step for optimizing simple, single semaphore
operations: In addition to the global list of all pending operations, a
2nd, per-semaphore list with the simple operations is added.
Note: this patch does not make sense by itself, the new list is used
nowhere.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Manfred Spraul [Wed, 16 Dec 2009 00:47:31 +0000 (16:47 -0800)]
ipc/sem.c: optimize if semops fail
Reduce the amount of scanning of the list of pending semaphore operations:
If try_atomic_semop failed, then no changes were applied. Thus no need to
restart.
Additionally, this patch correct an incorrect comment: It's possible to
wait for arbitrary semaphore values (do a dec by <x>, wait-for-zero, inc
by <x> in one atomic operation)
Both changes are from Nick Piggin, the patch is the result of a different
split of the individual changes.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Wed, 16 Dec 2009 00:47:30 +0000 (16:47 -0800)]
ipc/sem.c: sem preempt improve
The strange sysv semaphore wakeup scheme has a kind of busy-wait lock
involved, which could deadlock if preemption is enabled during the "lock".
It is an implementation detail (due to a spinlock being held) that this is
actually the case. However if "spinlocks" are made preemptible, or if the
sem lock is changed to a sleeping lock for example, then the wakeup would
become buggy. So this might be a bugfix for -rt kernels.
Imagine waker being preempted by wakee and never clearing IN_WAKEUP -- if
wakee has higher RT priority then there is a priority inversion deadlock.
Even if there is not a priority inversion to cause a deadlock, then there
is still time wasted spinning.
Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Wed, 16 Dec 2009 00:47:28 +0000 (16:47 -0800)]
ipc/sem.c: sem optimise undo list search
Around a month ago, there was some discussion about an improvement of the
sysv sem algorithm: Most (at least: some important) users only use simple
semaphore operations, therefore it's worthwile to optimize this use case.
This patch:
Move last looked up sem_undo struct to the head of the task's undo list.
Attempt to move common entries to the front of the list so search time is
reduced. This reduces lookup_undo on oprofile of problematic SAP workload
by 30% (see patch 4 for a description of SAP workload).
Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Serge E. Hallyn [Wed, 16 Dec 2009 00:47:27 +0000 (16:47 -0800)]
ipc ns: fix memory leak (idr)
We have apparently had a memory leak since 7ca7e564e049d8b350ec9d958ff25eaa24226352 "ipc: store ipcs into IDRs" in
2007. The idr of which 3 exist for each ipc namespace is never freed.
This patch simply frees them when the ipcns is freed. I don't believe any
idr_remove() are done from rcu (and could therefore be delayed until after
this idr_destroy()), so the patch should be safe. Some quick testing
showed no harm, and the memory leak fixed.
Caught by kmemleak.
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 16 Dec 2009 00:47:26 +0000 (16:47 -0800)]
signals: check ->group_stop_count after tracehook_get_signal()
Move the call to do_signal_stop() down, after tracehook call. This makes
->group_stop_count condition visible to tracers before do_signal_stop()
will participate in this group-stop.
Currently the patch has no effect, tracehook_get_signal() always returns 0.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 16 Dec 2009 00:47:24 +0000 (16:47 -0800)]
signals: send_signal: use si_fromuser() to detect from_ancestor_ns
Change send_signal() to use si_fromuser(). From now SEND_SIG_NOINFO
triggers the "from_ancestor_ns" check.
This fixes reparent_thread()->group_send_sig_info(pdeath_signal)
behaviour, before this patch send_signal() does not detect the
cross-namespace case when the child of the dying parent belongs to the
sub-namespace.
This patch can affect the behaviour of send_sig(), kill_pgrp() and
kill_pid() when the caller sends the signal to the sub-namespace with
"priv == 0" but surprisingly all callers seem to use them correctly,
including disassociate_ctty(on_exit).
Except: drivers/staging/comedi/drivers/addi-data/*.c incorrectly use
send_sig(priv => 0). But his is minor and should be fixed anyway.
Reported-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Reviewed-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 16 Dec 2009 00:47:22 +0000 (16:47 -0800)]
signals: SEND_SIG_NOINFO should be considered as SI_FROMUSER()
No changes in compiled code. The patch adds the new helper, si_fromuser()
and changes check_kill_permission() to use this helper.
The real effect of this patch is that from now we "officially" consider
SEND_SIG_NOINFO signal as "from user-space" signals. This is already true
if we look at the code which uses SEND_SIG_NOINFO, except __send_signal()
has another opinion - see the next patch.
The naming of these special SEND_SIG_XXX siginfo's is really bad
imho. From __send_signal()'s pov they mean
SEND_SIG_NOINFO from user
SEND_SIG_PRIV from kernel
SEND_SIG_FORCED no info
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Reviewed-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 16 Dec 2009 00:47:19 +0000 (16:47 -0800)]
ptrace: change tracehook_report_syscall_exit() to handle stepping
Suggested by Roland.
Change tracehook_report_syscall_exit() to look at step flag and send the
trap signal if needed.
This change affects ia64, microblaze, parisc, powerpc, sh. They pass
nonzero "step" argument to tracehook but since it was ignored the tracee
reports via ptrace_notify(), this is not right and not consistent.
- PTRACE_SETSIGINFO doesn't work
- if the tracer resumes the tracee with signr != 0 the new signal
is generated rather than delivering it
- If PT_TRACESYSGOOD is set the tracee reports the wrong exit_code
I don't have a powerpc machine, but I think this test-case should see the
difference:
Currently there is no way to synthesize a single-stepping trap in the
arch-independent manner. This patch adds the default helper which fills
siginfo_t, arch/ can can override it.
Architetures which implement user_enable_single_step() should add
user_single_step_siginfo() also.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: <linux-arch@vger.kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 16 Dec 2009 00:47:16 +0000 (16:47 -0800)]
ptrace: copy_process() should disable stepping
If the tracee calls fork() after PTRACE_SINGLESTEP, the forked child
starts with TIF_SINGLESTEP/X86_EFLAGS_TF bits copied from ptraced parent.
This is not right, especially when the new child is not auto-attaced: in
this case it is killed by SIGTRAP.
Change copy_process() to call user_disable_single_step(). Tested on x86.
ptrace_init_task() looks confusing, as if we always auto-attach when "bool
ptrace" argument is true, while in fact we attach only if current is
traced.
Make the code more explicit and kill now unused ptrace_link().
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memcg_tasklist was introduced at commit 7f4d454d(memcg: avoid deadlock
caused by race between oom and cpuset_attach) instead of cgroup_mutex to
fix a deadlock problem. The cgroup_mutex, which was removed by the
commit, in mem_cgroup_out_of_memory() was originally introduced at commit c7ba5c9e (Memory controller: OOM handling).
IIUC, the intention of this cgroup_mutex was to prevent task move during
select_bad_process() so that situations like below can be avoided.
Assume cgroup "foo" has exceeded its limit and is about to trigger oom.
1. Process A, which has been in cgroup "baa" and uses large memory, is just
moved to cgroup "foo". Process A can be the candidates for being killed.
2. Process B, which has been in cgroup "foo" and uses large memory, is just
moved from cgroup "foo". Process B can be excluded from the candidates for
being killed.
But these race window exists anyway even if we hold a lock, because
__mem_cgroup_try_charge() decides wether it should trigger oom or not
outside of the lock. So the original cgroup_mutex in
mem_cgroup_out_of_memory and thus current memcg_tasklist has no use. And
IMHO, those races are not so critical for users.
memcg: avoid oom-killing innocent task in case of use_hierarchy
task_in_mem_cgroup(), which is called by select_bad_process() to check
whether a task can be a candidate for being oom-killed from memcg's limit,
checks "curr->use_hierarchy"("curr" is the mem_cgroup the task belongs
to).
But this check return true(it's false positive) when:
<some path>/aa use_hierarchy == 0 <- hitting limit
<some path>/aa/00 use_hierarchy == 1 <- the task belongs to
This leads to killing an innocent task in aa/00. This patch is a fix for
this bug. And this patch also fixes the arg for
mem_cgroup_print_oom_info(). We should print information of mem_cgroup
which the task being killed, not current, belongs to.
mem_cgroup_move_parent() calls try_charge first and cancel_charge on
failure. IMHO, charge/uncharge(especially charge) is high cost operation,
so we should avoid it as far as possible.
This patch tries to delay try_charge in mem_cgroup_move_parent() by
re-ordering checks it does.
And this patch renames mem_cgroup_move_account() to
__mem_cgroup_move_account(), changes the return value of
__mem_cgroup_move_account() from int to void, and adds a new
wrapper(mem_cgroup_move_account()), which checks whether a @pc is valid
for moving account and calls __mem_cgroup_move_account().
This patch removes the last caller of trylock_page_cgroup(), so removes
its definition too.
memcg: make memcg's file mapped consistent with global VM
In global VM, FILE_MAPPED is used but memcg uses MAPPED_FILE. This makes
grep difficult. Replace memcg's MAPPED_FILE with FILE_MAPPED
And in global VM, mapped shared memory is accounted into FILE_MAPPED.
But memcg doesn't. fix it.
Note:
page_is_file_cache() just checks SwapBacked or not.
So, we need to check PageAnon.
This is a patch for coalescing access to res_counter at charging by percpu
caching. At charge, memcg charges 64pages and remember it in percpu
cache. Because it's cache, drain/flush if necessary.
This version uses public percpu area.
2 benefits for using public percpu area.
1. Sum of stocked charge in the system is limited to # of cpus
not to the number of memcg. This shows better synchonization.
2. drain code for flush/cpuhotplug is very easy (and quick)
The most important point of this patch is that we never touch res_counter
in fast path. The res_counter is system-wide shared counter which is modified
very frequently. We shouldn't touch it as far as we can for avoiding
false sharing.
On x86-64 8cpu server, I tested overheads of memcg at page fault by
running a program which does map/fault/unmap in a loop. Running
a task per a cpu by taskset and see sum of the number of page faults
in 60secs.
Changelog (since Oct/2):
- updated comments
- replaced get_cpu_var() with __get_cpu_var() if possible.
- removed mutex for system-wide drain. adds a counter instead of it.
- removed CONFIG_HOTPLUG_CPU
Changelog (old):
- rebased onto the latest mmotm
- moved charge size check before __GFP_WAIT check for avoiding unnecesary
- added asynchronous flush routine.
- fixed bugs pointed out by Nishimura-san.
[akpm@linux-foundation.org: tweak comments]
[nishimura@mxp.nes.nec.co.jp: don't do INIT_WORK() repeatedly against the same work_struct] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In massive parallel enviroment, res_counter can be a performance
bottleneck. One strong techinque to reduce lock contention is reducing
calls by coalescing some amount of calls into one.
Considering charge/uncharge chatacteristic,
- charge is done one by one via demand-paging.
- uncharge is done by
- in chunk at munmap, truncate, exit, execve...
- one by one via vmscan/paging.
It seems we have a chance to coalesce uncharges for improving scalability
at unmap/truncation.
This patch is a for coalescing uncharge. For avoiding scattering memcg's
structure to functions under /mm, this patch adds memcg batch uncharge
information to the task. A reason for per-task batching is for making use
of caller's context information. We do batched uncharge (deleyed
uncharge) when truncation/unmap occurs but do direct uncharge when
uncharge is called by memory reclaim (vmscan.c).
The degree of coalescing depends on callers
- at invalidate/trucate... pagevec size
- at unmap ....ZAP_BLOCK_SIZE
(memory itself will be freed in this degree.)
Then, we'll not coalescing too much.
On x86-64 8cpu server, I tested overheads of memcg at page fault by
running a program which does map/fault/unmap in a loop. Running
a task per a cpu by taskset and see sum of the number of page faults
in 60secs.
We can see some amounts of improvement.
(root cgroup doesn't affected by this patch)
Another patch for "charge" will follow this and above will be improved more.
Changelog(since 2009/10/02):
- renamed filed of memcg_batch (as pages to bytes, memsw to memsw_bytes)
- some clean up and commentary/description updates.
- added initialize code to copy_process(). (possible bug fix)
Changelog(old):
- fixed !CONFIG_MEM_CGROUP case.
- rebased onto the latest mmotm + softlimit fix patches.
- unified patch for callers
- added commetns.
- make ->do_batch as bool.
- removed css_get() at el. We don't need it.
memcg: fix memory.memsw.usage_in_bytes for root cgroup
A memory cgroup has a memory.memsw.usage_in_bytes file. It shows the sum
of the usage of pages and swapents in the cgroup. Presently the root
cgroup's memsw.usage_in_bytes shows the wrong value - the number of
swapents are not added.
Alexey Dobriyan [Wed, 16 Dec 2009 00:46:59 +0000 (16:46 -0800)]
proc: remove docbook and example
Example is outdated, it still uses old ->read_proc interfaces and "fb"
example is plain racy. There are better examples all over the tree.
Docbook itself says almost nothing about /proc and contain quite a number
of simply wrong facts, e.g. device nodes support. What it does is
describing at great length interface which are going to be removed.
There are Documentation/filesystems/seq_file.txt in exchange.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Erik Mouw <mouw@nl.linux.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexey Dobriyan [Wed, 16 Dec 2009 00:46:52 +0000 (16:46 -0800)]
reiserfs: remove /proc/fs/reiserfs/version
/proc/fs/reiserfs/version is on the way of removing ->read_proc interface.
It's empty however, so simply remove it instead of doing dummy
conversion. It's hard to see what information userspace can extract from
empty file.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 16 Dec 2009 00:46:49 +0000 (16:46 -0800)]
ext2: report metadata errors during fsync
When an IO error happens while writing metadata buffers, we should better
report it and call ext2_error since the filesystem is probably no longer
consistent. Sometimes such IO errors happen while flushing thread does
background writeback, the buffer gets later evicted from memory, and thus
the only trace of the error remains as AS_EIO bit set in blockdevice's
mapping. So we check this bit in ext2_fsync and report the error although
we cannot be really sure which buffer we failed to write.
Signed-off-by: Jan Kara <jack@suse.cz> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexey Dobriyan [Wed, 16 Dec 2009 00:46:47 +0000 (16:46 -0800)]
pnpbios: convert to seq_file
Convert code away from ->read_proc/->write_proc interfaces. Switch to
proc_create()/proc_create_data() which make addition of proc entries
reliable wrt NULL ->proc_fops, NULL ->data and so on.
Chaithrika U S [Wed, 16 Dec 2009 00:46:46 +0000 (16:46 -0800)]
da850/omap-l138: add callback to control LCD panel power
Add the platform specific callback to control LCD panel and backlight
power.
Signed-off-by: Chaithrika U S <chaithrika@ti.com> Acked-by: Kevin Hilman <khilman@deeprootsystems.com> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Krzysztof Helt [Wed, 16 Dec 2009 00:46:45 +0000 (16:46 -0800)]
intelfb: fix setting of active pipe with LVDS displays
The intelfb driver sets color map depending on currently active pipe.
However, if an LVDS display is attached (like in laptop) the active pipe
variable is never set. The default value is PIPE_A and can be wrong. Set
up the pipe variable during driver initialization after hardware state was
read.
Also, the detection of the active display (and hence the pipe) is wrong.
The pipes are assigned to so called planes. Both pipes are always enabled
on my laptop but only one plane is enabled (the plane A for the CRT or the
plane B for the LVDS). Change active pipe detection code to take into
account a status of the plane assigned to each pipe.
The problem is visible in the 8 bpp mode if colors above 15 are used. The
first 16 color entries are displayed correctly.
The graphics chip description is here (G45 vol. 3):
http://intellinuxgraphics.org/documentation.html
Signed-off-by: Krzysztof Helt <krzysztof.h1@wp.pl> Cc: Michal Suchanek <hramrach@centrum.cz> Cc: Dean Menezes <samanddeanus@yahoo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Harald Welte [Wed, 16 Dec 2009 00:46:44 +0000 (16:46 -0800)]
viafb: cosmetic cleanup of function integrated_lvds_enable()
A humble attempt to simplify the coding style to improve readability
Signed-off-by: Harald Welte <HaraldWelte@viatech.com> Signed-off-by: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: Joseph Chan <JosephChan@via.com.tw> Cc: Scott Fang <ScottFang@viatech.com.cn> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Harald Welte [Wed, 16 Dec 2009 00:46:42 +0000 (16:46 -0800)]
viafb: documentation update
We now support the VX855, and the VX800 is no longer unaccellerated.
viafb_video_dev was removed as it was useless.
Signed-off-by: Harald Welte <HaraldWelte@viatech.com> Signed-off-by: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: Joseph Chan <JosephChan@via.com.tw> Cc: Scott Fang <ScottFang@viatech.com.cn> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alan Cox [Wed, 16 Dec 2009 00:46:40 +0000 (16:46 -0800)]
matroxfb: fix problems with display stability
Regression caused in 2.6.23 and then despite repeated requests never fixed
or dealt with (Petr promised to sort it in 2008 but seems to have
forgotten).
Enough is enough - remove the problem line that was added. If it upsets
someone they've had two years to deal with it and at the very least it'll
rattle their cage and wake them up.
Signed-off-by: Alan Cox <alan@linux.intel.com> Reported-by: Damon <account@bugzilla.kernel.org.juxtaposition.net> Tested-by: Ruud van Melick <rvm1974@raketnet.nl> Cc: Petr Vandrovec <VANDROVE@vc.cvut.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Paul A. Clarke <pc@us.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chaithrika U S [Wed, 16 Dec 2009 00:46:39 +0000 (16:46 -0800)]
davinci: fb: add framebuffer blank operation
Implement frame buffer blank operation feature for DA8xx/OMAP-L1xx driver.
Signed-off-by: Chaithrika U S <chaithrika@ti.com> Cc: Kevin Hilman <khilman@deeprootsystems.com> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chaithrika U S [Wed, 16 Dec 2009 00:46:39 +0000 (16:46 -0800)]
davinci: fb : add suspend/resume suuport for DA8xx/OMAP-L1xx fb driver
Suspend/resume support DA8xx/OMAP-L1xx frame buffer driver. This feature
has been tested on DA850/OMAP-L138 EVM. For the purpose of testing, the
patch series[1] which adds suspend support for DA850/OMAP-L138 SoC was
applied.
[1] http://patchwork.kernel.org/patch/60260/
Signed-off-by: Chaithrika U S <chaithrika@ti.com> Cc: Kevin Hilman <khilman@deeprootsystems.com> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chaithrika U S [Wed, 16 Dec 2009 00:46:38 +0000 (16:46 -0800)]
davinci: fb: update the driver in preparation for addition of power management features
Add a helper function to enable raster. Also add one member in the
private data structure to track the current blank status, another function
pointer which takes in the platform specific callback function to control
panel power.
These updates will help in adding suspend/resume and frame buffer blank
operation features.
Signed-off-by: Chaithrika U S <chaithrika@ti.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vincent Sanders [Wed, 16 Dec 2009 00:46:35 +0000 (16:46 -0800)]
sm501: implement acceleration features
This patch provides the acceleration entry points for the SM501
framebuffer driver.
This patch provides the sync, copyarea and fillrect entry points, using
the SM501's 2D acceleration engine to perform the operations in-chip
rather than across the bus.
Signed-off-by: Ben Dooks <ben@simtec.co.uk> Signed-off-by: Simtec Linux Team <linux@simtec.co.uk> Signed-off-by: Vincent Sanders <vince@simtec.co.uk> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ben Dooks [Wed, 16 Dec 2009 00:46:34 +0000 (16:46 -0800)]
sm501: fix use of old <asm/io.h> instead of <linux/io.h>
Fix the old style use of <asm/io.h> by replacing it with <linux/io.h>.
Signed-off-by: Ben Dooks <ben@simtec.co.uk> Signed-off-by: Simtec Linux Team <linux@simtec.co.uk> Cc: Vincent Sanders <vince@simtec.co.uk> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ben Dooks [Wed, 16 Dec 2009 00:46:33 +0000 (16:46 -0800)]
sm501: fix missing uses of resource_size()
There are several places in the SM501 fb driver that could do with using
resource_size() to calculate the size of a resource.
Also fix a bug where request_mem_region() is being passed one too few
bytes when requesting the register memory region, which was causing the
following in /proc/iomem:
Signed-off-by: Ben Dooks <ben@simtec.co.uk> Signed-off-by: Simtec Linux Team <linux@simtec.co.uk> Cc: Vincent Sanders <vince@simtec.co.uk> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Krzysztof Helt [Wed, 16 Dec 2009 00:46:32 +0000 (16:46 -0800)]
i810fb: fix stack exploding
Alan Cox has found that the i810fb function "uses a whopping 2.5K of stack".
Signed-off-by: Krzysztof Helt <krzysztof.h1@wp.pl> Reported-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chaithrika U S [Wed, 16 Dec 2009 00:46:29 +0000 (16:46 -0800)]
davinci: fb: add cpufreq support
Add cpufreq support for DA8xx/OMAP-L1xx frame buffer driver
Signed-off-by: Chaithrika U S <chaithrika@ti.com> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Cc: Kevin Hilman <khilman@deeprootsystems.com> Cc: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chaithrika U S [Wed, 16 Dec 2009 00:46:29 +0000 (16:46 -0800)]
davinci: fb: calculate the clock divider from pixel clock info
The clock divider value can be calculated from the pixel clock value for
the panel. This gives more flexiblity to the driver to change the divider
value on the fly as in the case of cpufreq feature- support for which will
be added shortly.
Signed-off-by: Chaithrika U S <chaithrika@ti.com> Cc: Sudhakar Rajashekhara <sudhakar.raj@ti.com> Cc: Steve Chen <schen@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* x->fld
... when != \(x = E\|&x\)
* x == NULL
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Cc: Eric Miao <eric.y.miao@gmail.com> Cc: Daniel Mack <daniel@caiaq.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Krzysztof Helt [Wed, 16 Dec 2009 00:46:25 +0000 (16:46 -0800)]
fbdev: add palette register check to several drivers
Add check if palette register number is in correct range for few drivers
which miss it. The regno value comes indirectly from user space.
Two drivers has converted check from BUG_ON() macro to just return an
error (non-zero value).
Signed-off-by: Krzysztof Helt <krzysztof.h1@wp.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andy Shevchenko [Wed, 16 Dec 2009 00:46:24 +0000 (16:46 -0800)]
fbdev: drop custom atoi from drivers/video/modedb.c
Kernel has simple_strtol() implementation which could be used as atoi().
Signed-off-by: Andy Shevchenko <ext-andriy.shevchenko@nokia.com> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roel Kluin [Wed, 16 Dec 2009 00:46:23 +0000 (16:46 -0800)]
fbdev: TV_PALN bit set twice in sisfb_detect_VB_connect()
The TV_PALN bit was tested twice, replace one by TV_PALM.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Cc: Thomas Winischhofer <thomas@winischhofer.net> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fbdev: bfin-lq035q1-fb: new Blackfin Landscape LCD EZ-Extender driver
Framebuffer driver for the Landscape LCD EZ-Extender (ADZS-BFLLCD-EZEXT)
http://docs.blackfin.uclinux.org/doku.php?id=hw:cards:landscape_lcd_ez-extender
Signed-off-by: Michael Hennerich <michael.hennerich@analog.com> Signed-off-by: Bryan Wu <cooloney@kernel.org> Signed-off-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jani Nikula [Wed, 16 Dec 2009 00:46:20 +0000 (16:46 -0800)]
gpiolib: add support for changing value polarity in sysfs
Drivers may use gpiolib sysfs as part of their public user space
interface. The GPIO number and polarity might change from board to
board. The gpio_export_link() call can be used to hide the GPIO number
from user space. Add support for also hiding the GPIO line polarity
changes from user space.
Signed-off-by: Jani Nikula <ext-jani.1.nikula@nokia.com> Cc: David Brownell <david-b@pacbell.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Walleij [Wed, 16 Dec 2009 00:46:13 +0000 (16:46 -0800)]
rtc: remove __raw_* accessors from PL031 RTC
This switches __raw_[read|write]l() for plain [read|write]l in the PL031
RTC driver. The sister driver for PL030 use the simple accessors as most
PrimeCell drivers.
Piotr Ziecik [Wed, 16 Dec 2009 00:46:12 +0000 (16:46 -0800)]
rtc: add driver for BQ32000 I2C RTC
This patch adds basic support for Texas Instruments BQ32000 I2C RTC. Only
time reading/writing is implemented. Advanced features, such as trickle
charger and crystal calibration are not supported.
Signed-off-by: Piotr Ziecik <kosmo@semihalf.com> Signed-off-by: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mark A. Greer [Wed, 16 Dec 2009 00:46:11 +0000 (16:46 -0800)]
rtc: make rtc-omap driver ioremap its register space
The rtc-omap driver currently assumes that the rtc's registers are at a
fixed address and already mapped into virtual memory space. Remove those
assumptions so the same driver can be used for similar devices that reside
at different physical addresses (e.g., TI's DA8xx/OMAP-L13x SoC's).
Also allow the possibility for the timer and alarm interrupts to use the
same IRQ.
Signed-off-by: Mark A. Greer <mgreer@mvista.com> Acked-by: David Brownell <dbrownell@users.sourceforge.net> Acked-by: Kevin Hilman <khilman@deeprootsystems.com> Acked-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add setting and clearing of the "pending" flag of the RTC alarm. The
semantics follow the UEFI specification 2.2 available at
http://www.uefi.org/specs/, i.e., the "pending" flag is cleared by
disabling the alarm, but not by any other condition (such as the passing
of time, a successful wakeup, or setting of a new alarm.)
Signed-off-by: Werner Almesberger <werner@openmoko.org> Signed-off-by: Paul Fertser <fercerpav@gmail.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Paul Gortmaker <p_gortmaker@yahoo.com> Cc: Balaji Rao <balajirrao@openmoko.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
John Kacur [Wed, 16 Dec 2009 00:46:06 +0000 (16:46 -0800)]
efirtc: explicitly set llseek to no_llseek
Now that we've removed the BKL here, let's explicitly set llseek to
no_llseek since the default llseek is not used here.
The default_llseek function still contains the BKL. When we are auditing
code to see if we can remove the BKL, this is one of the hidden
considerations we need to take into account. i.e., is there
syncronization between code that has the BKL and llseek.
At the same time we remove the BKL it would be a good idea to do indicate
when no llseek function is required, so we don't have to revisit this code
again, when we are trying to determine if we can remove the BKL from the
default_llseek.
Signed-off-by: John Kacur <jkacur@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>