As scaling now takes place on all kind of cpu add/remove events a user
that configures values via proc should be able to configure if his set
values are still rescaled or kept whatever happens.
As the comments state that log2 was just a second guess that worked the
interface is not just designed for on/off, but to choose a scaling type.
Currently this allows none, log and linear, but more important it allwos
us to keep the interface even if someone has an even better idea how to
scale the values.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1259579808-11357-3-git-send-email-ehrhardt@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
sched: Fix missing sched tunable recalculation on cpu add/remove
Based on Peter Zijlstras patch suggestion this enables recalculation of
the scheduler tunables in response of a change in the number of cpus. It
also adds a max of eight cpus that are considered in that scaling.
Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1259579808-11357-2-git-send-email-ehrhardt@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Fri, 4 Dec 2009 08:59:02 +0000 (09:59 +0100)]
sched: Fix task priority bug
83f9ac removed a call to effective_prio() in wake_up_new_task(), which
leads to tasks running at MAX_PRIO.
This is caused by the idle thread being set to MAX_PRIO before forking
off init. O(1) used that to make sure idle was always preempted, CFS
uses check_preempt_curr_idle() for that so we can savely remove this bit
of legacy code.
Reported-by: Mike Galbraith <efault@gmx.de> Tested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1259754383.4003.610.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Thu, 3 Dec 2009 17:00:07 +0000 (18:00 +0100)]
sched: cgroup: Implement different treatment for idle shares
When setting the weight for a per-cpu task-group, we have to put in a
phantom weight when there is no work on that cpu, otherwise we'll not
service that cpu when new work gets placed there until we again update
the per-cpu weights.
We used to add these phantom weights to the total, so that the idle
per-cpu shares don't get inflated, this however causes the non-idle
parts to get deflated, causing unexpected weight distibutions.
Reverse this, so that the non-idle shares are correct but the idle
shares are inflated.
Jupyung Lee [Tue, 17 Nov 2009 09:51:40 +0000 (18:51 +0900)]
sched: Move update_curr() in check_preempt_wakeup() to avoid redundant call
If a RT task is woken up while a non-RT task is running,
check_preempt_wakeup() is called to check whether the new task can
preempt the old task. The function returns quickly without going deeper
because it is apparent that a RT task can always preempt a non-RT task.
In this situation, check_preempt_wakeup() always calls update_curr() to
update vruntime value of the currently running task. However, the
function call is unnecessary and redundant at that moment because (1) a
non-RT task can always be preempted by a RT task regardless of its
vruntime value, and (2) update_curr() will be called shortly when the
context switch between two occurs.
By moving update_curr() in check_preempt_wakeup(), we can avoid
redundant call to update_curr(), slightly reducing the time taken to
wake up RT tasks.
Signed-off-by: Jupyung Lee <jupyung@gmail.com>
[ Place update_curr() right before the wake_preempt_entity() call, which
is the only thing that relies on the updated vruntime ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1258451500-6714-1-git-send-email-jupyung@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Fri, 27 Nov 2009 16:32:46 +0000 (17:32 +0100)]
sched: Sanitize fork() handling
Currently we try to do task placement in wake_up_new_task() after we do
the load-balance pass in sched_fork(). This yields complicated semantics
in that we have to deal with tasks on different RQs and the
set_task_cpu() calls in copy_process() and sched_fork()
Rename ->task_new() to ->task_fork() and call it from sched_fork()
before the balancing, this gives the policy a clear point to place the
task.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Fri, 27 Nov 2009 13:12:25 +0000 (14:12 +0100)]
sched: Remove rq->clock coupling from set_task_cpu()
set_task_cpu() should be rq invariant and only touch task state, it
currently fails to do so, which opens up a few races, since not all
callers hold both rq->locks.
Remove the relyance on rq->clock, as any site calling set_task_cpu()
should also do a remote clock update, which should ensure the observed
time between these two cpus is monotonic, as per
kernel/sched_clock.c:sched_clock_remote().
Therefore we can simply remove the clock_offset bits and be happy.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Thomas Gleixner [Wed, 9 Dec 2009 08:32:03 +0000 (09:32 +0100)]
sched: Protect sched_rr_get_param() access to task->sched_class
sched_rr_get_param calls
task->sched_class->get_rr_interval(task) without protection
against a concurrent sched_setscheduler() call which modifies
task->sched_class.
Serialize the access with task_rq_lock(task) and hand the rq
pointer into get_rr_interval() as it's needed at least in the
sched_fair implementation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <alpine.LFD.2.00.0912090930120.3089@localhost.localdomain> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Wed, 25 Nov 2009 12:31:39 +0000 (13:31 +0100)]
sched: Fix balance vs hotplug race
Since (e761b77: cpu hotplug, sched: Introduce cpu_active_map and redo
sched domain managment) we have cpu_active_mask which is suppose to rule
scheduler migration and load-balancing, except it never (fully) did.
The particular problem being solved here is a crash in try_to_wake_up()
where select_task_rq() ends up selecting an offline cpu because
select_task_rq_fair() trusts the sched_domain tree to reflect the
current state of affairs, similarly select_task_rq_rt() trusts the
root_domain.
However, the sched_domains are updated from CPU_DEAD, which is after the
cpu is taken offline and after stop_machine is done. Therefore it can
race perfectly well with code assuming the domains are right.
Cure this by building the domains from cpu_active_mask on
CPU_DOWN_PREPARE.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit acc3f5d7cabbfd6cec71f0c1f9900621fa2d6ae7 ("cpumask:
Partition_sched_domains takes array of cpumask_var_t") changed
the function signature of generate_sched_domains() for the
CONFIG_SMP=y case, but forgot to update the corresponding
function for the CONFIG_SMP=n case, causing:
kernel/cpuset.c:2073: warning: passing argument 1 of 'generate_sched_domains' from incompatible pointer type
Linus Torvalds [Sat, 5 Dec 2009 23:33:27 +0000 (15:33 -0800)]
Merge branch 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Limit number of per cpu TSC sync messages
x86: dumpstack, 64-bit: Disable preemption when walking the IRQ/exception stacks
x86: dumpstack: Clean up the x86_stack_ids[][] initalization and other details
x86, cpu: mv display_cacheinfo -> cpu_detect_cache_sizes
x86: Suppress stack overrun message for init_task
x86: Fix cpu_devs[] initialization in early_cpu_init()
x86: Remove CPU cache size output for non-Intel too
x86: Minimise printk spew from per-vendor init code
x86: Remove the CPU cache size printk's
cpumask: Avoid cpumask_t in arch/x86/kernel/apic/nmi.c
x86: Make sure we also print a Code: line for show_regs()
Linus Torvalds [Sat, 5 Dec 2009 23:32:35 +0000 (15:32 -0800)]
Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, msr, cpumask: Use struct cpumask rather than the deprecated cpumask_t
x86, cpuid: Simplify the code in cpuid_open
x86, cpuid: Remove the bkl from cpuid_open()
x86, msr: Remove the bkl from msr_open()
x86: AMD Geode LX optimizations
x86, msr: Unify rdmsr_on_cpus/wrmsr_on_cpus
Linus Torvalds [Sat, 5 Dec 2009 23:32:18 +0000 (15:32 -0800)]
Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Fix a section mismatch in arch/x86/kernel/setup.c
x86: Fixup last users of irq_chip->typename
x86: Remove BKL from apm_32
x86: Remove BKL from microcode
x86: use kernel_stack_pointer() in kprobes.c
x86: use kernel_stack_pointer() in kgdb.c
x86: use kernel_stack_pointer() in dumpstack.c
x86: use kernel_stack_pointer() in process_32.c
Linus Torvalds [Sat, 5 Dec 2009 23:32:03 +0000 (15:32 -0800)]
Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
include/linux/compiler-gcc4.h: Fix build bug - gcc-4.0.2 doesn't understand __builtin_object_size
x86/alternatives: No need for alternatives-asm.h to re-invent stuff already in asm.h
x86/alternatives: Check replacementlen <= instrlen at build time
x86, 64-bit: Set data segments to null after switching to 64-bit mode
x86: Clean up the loadsegment() macro
x86: Optimize loadsegment()
x86: Add missing might_fault() checks to copy_{to,from}_user()
x86-64: __copy_from_user_inatomic() adjustments
x86: Remove unused thread_return label from switch_to()
x86, 64-bit: Fix bstep_iret jump
x86: Don't use the strict copy checks when branch profiling is in use
x86, 64-bit: Move K8 B step iret fixup to fault entry asm
x86: Generate cmpxchg build failures
x86: Add a Kconfig option to turn the copy_from_user warnings into errors
x86: Turn the copy_from_user check into an (optional) compile time warning
x86: Use __builtin_memset and __builtin_memcpy for memset/memcpy
x86: Use __builtin_object_size() to validate the buffer size for copy_from_user()
Linus Torvalds [Sat, 5 Dec 2009 23:31:25 +0000 (15:31 -0800)]
Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
x86, apic: Enable lapic nmi watchdog on AMD Family 11h
x86: Remove unnecessary mdelay() from cpu_disable_common()
x86, ioapic: Document another case when level irq is seen as an edge
x86, ioapic: Fix the EOI register detection mechanism
x86, io-apic: Move the effort of clearing remoteIRR explicitly before migrating the irq
x86: SGI UV: Map low MMR ranges
x86: apic: Print out SRAT table APIC id in hex
x86: Re-get cfg_new in case reuse/move irq_desc
x86: apic: Remove not needed #ifdef
x86: io-apic: IO-APIC MMIO should not fail on resource insertion
x86: Remove asm/apicnum.h
x86: apic: Do not use stacked physid_mask_t
x86, apic: Get rid of apicid_to_cpu_present assign on 64-bit
x86, ioapic: Use snrpintf while set names for IO-APIC resourses
x86, apic: Use PAGE_SIZE instead of numbers
x86: Remove local_irq_enable()/local_irq_disable() in fixup_irqs()
x86: Use EOI register in io-apic on intel platforms
x86: Force irq complete move during cpu offline
x86: Remove move_cleanup_count from irq_cfg
x86, intr-remap: Avoid irq_chip mask/unmask in fixup_irqs() for intr-remapping
...
Linus Torvalds [Sat, 5 Dec 2009 17:53:36 +0000 (09:53 -0800)]
Merge branch 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (40 commits)
tracing: Separate raw syscall from syscall tracer
ring-buffer-benchmark: Add parameters to set produce/consumer priorities
tracing, function tracer: Clean up strstrip() usage
ring-buffer benchmark: Run producer/consumer threads at nice +19
tracing: Remove the stale include/trace/power.h
tracing: Only print objcopy version warning once from recordmcount
tracing: Prevent build warning: 'ftrace_graph_buf' defined but not used
ring-buffer: Move access to commit_page up into function used
tracing: do not disable interrupts for trace_clock_local
ring-buffer: Add multiple iterations between benchmark timestamps
kprobes: Sanitize struct kretprobe_instance allocations
tracing: Fix to use __always_unused attribute
compiler: Introduce __always_unused
tracing: Exit with error if a weak function is used in recordmcount.pl
tracing: Move conditional into update_funcs() in recordmcount.pl
tracing: Add regex for weak functions in recordmcount.pl
tracing: Move mcount section search to front of loop in recordmcount.pl
tracing: Fix objcopy revision check in recordmcount.pl
tracing: Check absolute path of input file in recordmcount.pl
tracing: Correct the check for number of arguments in recordmcount.pl
...
Linus Torvalds [Sat, 5 Dec 2009 17:53:21 +0000 (09:53 -0800)]
Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: Fix trace_marker output
tracing: Fix event format export
tracing: Fix return value of tracing_stats_read()
Linus Torvalds [Sat, 5 Dec 2009 17:52:14 +0000 (09:52 -0800)]
Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
rcu: Make RCU's CPU-stall detector be default
rcu: Add expedited grace-period support for preemptible RCU
rcu: Enable fourth level of TREE_RCU hierarchy
rcu: Rename "quiet" functions
rcu: Re-arrange code to reduce #ifdef pain
rcu: Eliminate unneeded function wrapping
rcu: Fix grace-period-stall bug on large systems with CPU hotplug
rcu: Eliminate __rcu_pending() false positives
rcu: Further cleanups of use of lastcomp
rcu: Simplify association of forced quiescent states with grace periods
rcu: Accelerate callback processing on CPUs not detecting GP end
rcu: Mark init-time-only rcu_bootup_announce() as __init
rcu: Simplify association of quiescent states with grace periods
rcu: Rename dynticks_completed to completed_fqs
rcu: Enable synchronize_sched_expedited() fastpath
rcu: Remove inline from forward-referenced functions
rcu: Fix note_new_gpnum() uses of ->gpnum
rcu: Fix synchronization for rcu_process_gp_end() uses of ->completed counter
rcu: Prepare for synchronization fixes: clean up for non-NO_HZ handling of ->completed counter
rcu: Cleanup: balance rcu_irq_enter()/rcu_irq_exit() calls
...
Linus Torvalds [Sat, 5 Dec 2009 17:50:22 +0000 (09:50 -0800)]
Merge branch 'core-printk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-printk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
ratelimit: Make suppressed output messages more useful
printk: Remove ratelimit.h from kernel.h
ratelimit: Fix/allow use in atomic contexts
ratelimit: Use per ratelimit context locking
Linus Torvalds [Sat, 5 Dec 2009 17:49:59 +0000 (09:49 -0800)]
Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
mutex: Fix missing conditions to build mutex_spin_on_owner()
mutex: Better control mutex adaptive spinning config
locking, task_struct: Reduce size on TRACE_IRQFLAGS and 64bit
locking: Use __[SPIN|RW]_LOCK_UNLOCKED in [spin|rw]_lock_init()
locking: Remove unused prototype
locking: Reduce ifdefs in kernel/spinlock.c
locking: Make inlining decision Kconfig based
Linus Torvalds [Sat, 5 Dec 2009 17:49:07 +0000 (09:49 -0800)]
Merge branch 'core-iommu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-iommu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (63 commits)
x86, Calgary IOMMU quirk: Find nearest matching Calgary while walking up the PCI tree
x86/amd-iommu: Remove amd_iommu_pd_table
x86/amd-iommu: Move reset_iommu_command_buffer out of locked code
x86/amd-iommu: Cleanup DTE flushing code
x86/amd-iommu: Introduce iommu_flush_device() function
x86/amd-iommu: Cleanup attach/detach_device code
x86/amd-iommu: Keep devices per domain in a list
x86/amd-iommu: Add device bind reference counting
x86/amd-iommu: Use dev->arch->iommu to store iommu related information
x86/amd-iommu: Remove support for domain sharing
x86/amd-iommu: Rearrange dma_ops related functions
x86/amd-iommu: Move some pte allocation functions in the right section
x86/amd-iommu: Remove iommu parameter from dma_ops_domain_alloc
x86/amd-iommu: Use get_device_id and check_device where appropriate
x86/amd-iommu: Move find_protection_domain to helper functions
x86/amd-iommu: Simplify get_device_resources()
x86/amd-iommu: Let domain_for_device handle aliases
x86/amd-iommu: Remove iommu specific handling from dma_ops path
x86/amd-iommu: Remove iommu parameter from __(un)map_single
x86/amd-iommu: Make alloc_new_range aware of multiple IOMMUs
...
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (31 commits)
GFS2: Fix glock refcount issues
writeback: remove unused nonblocking and congestion checks (gfs2)
GFS2: drop rindex glock to refresh rindex list
GFS2: Tag all metadata with jid
GFS2: Locking order fix in gfs2_check_blk_state
GFS2: Remove dirent_first() function
GFS2: Display nobarrier option in /proc/mounts
GFS2: add barrier/nobarrier mount options
GFS2: remove division from new statfs code
GFS2: Improve statfs and quota usability
GFS2: Use dquot_send_warning()
VFS: Export dquot_send_warning
GFS2: Add set_xquota support
GFS2: Add get_xquota support
GFS2: Clean up gfs2_adjust_quota() and do_glock()
GFS2: Remove constant argument from qd_get()
GFS2: Remove constant argument from qdsb_get()
GFS2: Add proper error reporting to quota sync via sysfs
GFS2: Add get_xstate quota function
GFS2: Remove obsolete code in quota.c
...
Linus Torvalds [Sat, 5 Dec 2009 17:44:57 +0000 (09:44 -0800)]
Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (30 commits)
TOMOYO: Add recursive directory matching operator support.
remove CONFIG_SECURITY_FILE_CAPABILITIES compile option
SELinux: print denials for buggy kernel with unknown perms
Silence the existing API for capability version compatibility check.
LSM: Move security_path_chmod()/security_path_chown() to after mutex_lock().
SELinux: header generation may hit infinite loop
selinux: Fix warnings
security: report the module name to security_module_request
Config option to set a default LSM
sysctl: require CAP_SYS_RAWIO to set mmap_min_addr
tpm: autoload tpm_tis based on system PnP IDs
tpm_tis: TPM_STS_DATA_EXPECT workaround
define convenient securebits masks for prctl users (v2)
tpm: fix header for modular build
tomoyo: improve hash bucket dispersion
tpm add default function definitions
LSM: imbed ima calls in the security hooks
SELinux: add .gitignore files for dynamic classes
security: remove root_plug
SELinux: fix locking issue introduced with c6d3aaa4e35c71a3
...
David Daney [Sat, 5 Dec 2009 01:44:51 +0000 (17:44 -0800)]
x86: Convert BUG() to use unreachable()
Use the new unreachable() macro instead of for(;;);. When
allyesconfig is built with a GCC-4.5 snapshot on i686 the size of the
text segment is reduced by 3987 bytes (from 6827019 to 6823032).
Signed-off-by: David Daney <ddaney@caviumnetworks.com> Acked-by: "H. Peter Anvin" <hpa@zytor.com> CC: Thomas Gleixner <tglx@linutronix.de> CC: Ingo Molnar <mingo@redhat.com> CC: x86@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Daney [Sat, 5 Dec 2009 01:44:50 +0000 (17:44 -0800)]
Add support for GCC-4.5's __builtin_unreachable() to compiler.h (v2)
Starting with version 4.5, GCC has a new built-in function
__builtin_unreachable() that can be used in places like the kernel's
BUG() where inline assembly is used to transfer control flow. This
eliminated the need for an endless loop in these places.
The patch adds a new macro 'unreachable()' that will expand to either
__builtin_unreachable() or an endless loop depending on the compiler
version.
Change from v1: Simplify unreachable() for non-GCC 4.5 case.
This patch fixes some ref counting issues. Firstly by moving
the point at which we drop the ref count after a dlm lock
operation has completed we ensure that we never call
gfs2_glock_hold() on a lock with a zero ref count.
Secondly, by using atomic_dec_and_lock() in gfs2_glock_put()
we ensure that at no time will a glock with zero ref count
appear on the lru_list. That means that we can remove the
check for this in our shrinker (which was racy).
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Wu Fengguang [Wed, 18 Nov 2009 10:09:41 +0000 (18:09 +0800)]
writeback: remove unused nonblocking and congestion checks (gfs2)
No one is calling wb_writeback and write_cache_pages with
wbc.nonblocking=1 any more. And lumpy pageout will want to do
nonblocking writeback without the congestion wait.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When a gfs2 filesystem is grown, it needs to rebuild the rindex list to be able
to use the new space. gfs2 does this when the rindex is marked not uptodate,
which happens when the rindex glock is dropped. However, on a single node
setup, there is never any reason to drop the rindex glock, so gfs2 never
invalidates the the rindex. This patch makes gfs2 automatically drop the
rindex glock after filesystem grows, so it can refresh the rindex list.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There are two spare field in the header common to all GFS2
metadata. One is just the right size to fit a journal id
in it, and this patch updates the journal code so that each
time a metadata block is modified, we tag it with the journal
id of the node which is performing the modification.
The reason for this is that it should make it much easier to
debug issues which arise if we can tell which node was the
last to modify a particular metadata block.
Since the field is updated before the block is written into
the journal, each journal should only contain metadata which
is tagged with its own journal id. The one exception to this
is the journal header block, which might have a different node's
id in it, if that journal was recovered by another node in the
cluster.
Thus each journal will contain a record of which nodes recovered
it, via the journal header.
The other field in the metadata header could potentially be
used to hold information about what kind of operation was
performed, but for the time being we just zero it on each
transaction so that if we use it for that in future, we'll
know that the information (where it exists) is reliable.
I did consider using the other field to hold the journal
sequence number, however since in GFS2's journaling we write
the modified data into the journal and not the original
data, this gives no information as to what action caused the
modification, so I think we can probably come up with a better
use for those 64 bits in the future.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This function only had one caller left, and that caller only
called it for leaf blocks, hence one branch of the "if" was
never taken. In addition the call to get_left had already
verified the metadata type, so the function can be reduced
to a single line of code in its caller.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Currently gfs2 issues barrier unconditionally. There are various reasons
to disable them, be that just for testing or for stupid devices flushing
large battert backed caches. Add a nobarrier option that matches xfs and
btrfs for this. Also add a symmetric barrier option to turn it back on
at remount time.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
GFS2 now has three new mount options, statfs_quantum, quota_quantum and
statfs_percent. statfs_quantum and quota_quantum simply allow you to
set the tunables of the same name. Setting setting statfs_quantum to 0
will also turn on the statfs_slow tunable. statfs_percent accepts an
integer between 0 and 100. Numbers between 1 and 100 will cause GFS2 to
do any early sync when the local number of blocks free changes by at
least statfs_percent from the totoal number of blocks free. Setting
statfs_percent to 0 disables this.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This adds support to GFS2 to send quota warnings via netlink.
Also it removes a stray \r which was left over from when the
code used to print warnings on the console.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Sending a message to userspace in a generic format to warn
of events (e.g. quota exceeded) in the quota subsystem is
a generically useful feature. This patch makes some minor
changes to the send_message function from dquot.c renaming
it quota_send_message, moving it to quota.c and exporting it
for use by filesystems which do not use the dquot code.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This adds support for viewing the current GFS2 quota settings
via the XFS quota API. The setting of quotas will be addressed
in a later patch. Fields which are not supported here are left
set to zero.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Both of these functions contained confusing and in one case
duplicate code. This patch adds a new check in do_glock()
so that we report -ENOENT if we are asked to sync a quota
entry which doesn't exist. Due to the previous patch this is
now reported correctly to userspace.
Also there are a few new comments, and I hope that the code
is easier to understand now.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
These two functions are altered so that gfs2_quota_sync may
in future be called directly from the VFS. The GFS2 superblock
changes to a VFS super block and there is an addition of an int
argument which is currently ignored.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The other patches in this series have been building towards
being able to support cached ACLs like other filesystems. The
only real difference with GFS2 is that we have to invalidate
the cache when we drop a glock, but that is dealt with in earlier
patches.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
To prepare for support for caching of ACLs, this cleans up the GFS2
ACL support by pushing the xattr code back into xattr.c and changing
the acl_get function into one which only returns ACLs so that we
can drop the caching function into it shortly.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This code has been shamelessly stolen from XFS at the suggestion
of Christoph Hellwig. I've not added support for cached ACLs so
far... watch for that in a later patch, although this is designed
in such a way that they should be easy to add.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Christoph Hellwig <hch@infradead.org>
GFS2: Fix -o meta mounts for subsequent mounts (i.e. all but the first one)
We have a long term plan to use the "-o meta" flag to GFS2 mounts to
access the alternate root which is used to store metadata for a GFS2
filesystem. This will allow us to eventually remove support for the
gfs2meta filesystem type (which is in any case just a "front end" to
the gfs2 filesystem type with the meta/master root).
Currently the "-o meta" option is only taken into account on the
initial mount of the filesystem. Subsequent mounts of the same
filesystem (i.e. on the same device) result in basically the same
as bind mounting the root of the original mount.
This patch changes that by using what is more or less a copy
of get_sb_bdev() and extending it so that it will take into
account the alternate root in all cases. The main difference
is that we have to parse the mount options a bit earlier. We can
then use them to select the appropriate root towards the end of
the function.
In addition this also fixes a bug where it was possible (but certainly
not desirable) to set different ro/rw options for the meta root
when mounted via the gfs2meta fs compared with the original mount.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Alexander Viro <aviro@redhat.com>
mutex: Fix missing conditions to build mutex_spin_on_owner()
We don't need to build mutex_spin_on_owner() if we have
CONFIG_DEBUG_MUTEXES or CONFIG_HAVE_DEFAULT_NO_SPIN_MUTEXES as
it won't be used under such configs.
Use CONFIG_MUTEX_SPIN_ON_OWNER as it gathers all the necessary
checks before building it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <1259783357-8542-2-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org>
Darrick J. Wong [Wed, 2 Dec 2009 23:05:56 +0000 (15:05 -0800)]
x86, Calgary IOMMU quirk: Find nearest matching Calgary while walking up the PCI tree
On a multi-node x3950M2 system, there's a slight oddity in the
PCI device tree for all secondary nodes:
30:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
\-33:00.0 PCI bridge: IBM CalIOC2 PCI-E Root Port (rev 01)
\-34:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04)
...as compared to the primary node:
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
\-01:00.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
03:00.0 PCI bridge: IBM CalIOC2 PCI-E Root Port (rev 01)
\-04:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04)
In both nodes, the LSI RAID controller hangs off a CalIOC2
device, but on the secondary nodes, the BIOS hides the VGA
device and substitutes the device tree ending with the disk
controller.
It would seem that Calgary devices don't necessarily appear at
the top of the PCI tree, which means that the current code to
find the Calgary IOMMU that goes with a particular device is
buggy.
Rather than walk all the way to the top of the PCI
device tree and try to match bus number with Calgary descriptor,
the code needs to examine each parent of the particular device;
if it encounters a Calgary with a matching bus number, simply
use that.
Otherwise, we BUG() when the bus number of the Calgary doesn't
match the bus number of whatever's at the top of the device tree.
Extra note: This patch appears to work correctly for the x3950
that came before the x3950 M2.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Acked-by: Muli Ben-Yehuda <muli@il.ibm.com> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Joerg Roedel <joerg.roedel@amd.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Jon D. Mason <jdmason@kudzu.us> Cc: Corinna Schultz <coschult@us.ibm.com> Cc: <stable@kernel.org>
LKML-Reference: <20091202230556.GG10295@tux1.beaverton.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
The RCU_CPU_STALL_DETECTOR costs almost nothing and has located
some bugs that might otherwise have been difficult to track
down. Make it be default for the TREE RCU implementations.
The vmlinux size impact is limited (on 64-bit x86 defconfig):
rcu: Add expedited grace-period support for preemptible RCU
Implement an synchronize_rcu_expedited() for preemptible RCU
that actually is expedited. This uses
synchronize_sched_expedited() to force all threads currently
running in a preemptible-RCU read-side critical section onto the
appropriate ->blocked_tasks[] list, then takes a snapshot of all
of these lists and waits for them to drain.
Enable a fourth level of rcu_node hierarchy for TREE_RCU and
TREE_PREEMPT_RCU. This is for stress-testing and experiemental
purposes only, although in theory this would enable 16,777,216
CPUs on 64-bit systems, though only 1,048,576 CPUs on 32-bit
systems. Normal experimental use of this fourth level will
normally set CONFIG_RCU_FANOUT=2, requiring a 16-CPU system,
though the more adventurous (and more fortunate) experimenters
may wish to chose CONFIG_RCU_FANOUT=3 for 81-CPU systems or even
CONFIG_RCU_FANOUT=4 for 256-CPU systems.
The number of "quiet" functions has grown recently, and the
names are no longer very descriptive. The point of all of these
functions is to do some portion of the task of reporting a
quiescent state, so rename them accordingly:
o cpu_quiet() becomes rcu_report_qs_rdp(), which reports a
quiescent state to the per-CPU rcu_data structure. If this
turns out to be a new quiescent state for this grace period,
then rcu_report_qs_rnp() will be invoked to propagate the
quiescent state up the rcu_node hierarchy.
o cpu_quiet_msk() becomes rcu_report_qs_rnp(), which reports
a quiescent state for a given CPU (or possibly a set of CPUs)
up the rcu_node hierarchy.
o cpu_quiet_msk_finish() becomes rcu_report_qs_rsp(), which
reports a full set of quiescent states to the global rcu_state
structure.
o task_quiet() becomes rcu_report_unblock_qs_rnp(), which reports
a quiescent state due to a task exiting an RCU read-side critical
section that had previously blocked in that same critical section.
As indicated by the new name, this type of quiescent state is
reported up the rcu_node hierarchy (using rcu_report_qs_rnp()
to do so).
Maybe 4.1.0 doesn't too, but this fixed it for me.
Caused by:
4a31276: x86: Turn the copy_from_user check into an (optional) compile time warning 9f0cf4a: x86: Use __builtin_object_size() to validate the buffer size for copy_from_user()
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com>
LKML-Reference: <200910090724.n997OQl6013538@imap1.linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Julia Lawall [Sun, 9 Aug 2009 09:42:32 +0000 (11:42 +0200)]
VIDEO: Correct use of request_region/request_mem_region
request_region should be used with release_region, not request_mem_region.
Geert Uytterhoeven pointed out that in the case of drivers/video/gbefb.c,
the problem is actually the other way around; request_mem_region should be
used instead of request_region.
The semantic patch that finds/fixes this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@r1@
expression start;
@@
request_region(start,...)
@b1@
expression r1.start;
@@
request_mem_region(start,...)
@depends on !b1@
expression r1.start;
expression E;
@@
Helge Deller [Wed, 2 Dec 2009 23:29:15 +0000 (00:29 +0100)]
modules: don't export section names of empty sections via sysfs
On the parisc architecture we face for each and every loaded kernel module
this kernel "badness warning":
sysfs: cannot create duplicate filename '/module/ac97_bus/sections/.text'
Badness at fs/sysfs/dir.c:487
Reason for that is, that on parisc all kernel modules do have multiple
.text sections due to the usage of the -ffunction-sections compiler flag
which is needed to reach all jump targets on this platform.
An objdump on such a kernel module gives:
Sections:
Idx Name Size VMA LMA File off Algn
0 .note.gnu.build-id 00000024000000000000000000000034 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
1 .text 00000000000000000000000000000058 2**0
CONTENTS, ALLOC, LOAD, READONLY, CODE
2 .text.ac97_bus_match 0000001c000000000000000000000058 2**2
CONTENTS, ALLOC, LOAD, READONLY, CODE
3 .text 000000000000000000000000000000d4 2**0
CONTENTS, ALLOC, LOAD, READONLY, CODE
...
Since the .text sections are empty (size of 0 bytes) and won't be
loaded by the kernel module loader anyway, I don't see a reason
why such sections need to be listed under
/sys/module/<module_name>/sections/<section_name> either.
The attached patch does solve this issue by not exporting section
names which are empty.
This fixes bugzilla http://bugzilla.kernel.org/show_bug.cgi?id=14703