When performing write-behind we allocate pages to store the data
during write.
Previously we just keep a list of pages. Now we keep a list of
bi_vec which includes offset and size.
This means that the r1bio has complete information to create a new
bio which will be needed for retrying after write errors.
Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
md/raid1: avoid writing to known-bad blocks on known-bad drives.
If we have seen any write error on a drive, then don't write to
any known-bad blocks on that drive.
If necessary, we divide the write request up into pieces just
like we do for reads, so each piece is either all written or
all not written to any given drive.
Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
md: make it easier to wait for bad blocks to be acknowledged.
It is only safe to choose not to write to a bad block if that bad
block is safely recorded in metadata - i.e. if it has been
'acknowledged'.
If it hasn't we need to wait for the acknowledgement.
We support that using rdev->blocked wait and
md_wait_for_blocked_rdev by introducing a new device flag
'BlockedBadBlock'.
This flag is only advisory.
It is cleared whenever we acknowledge a bad block, so that a waiter
can re-check the particular bad blocks that it is interested it.
It should be set by a caller when they find they need to wait.
This (set after test) is inherently racy, but as
md_wait_for_blocked_rdev already has a timeout, losing the race will
have minimal impact.
When we clear "Blocked" was also clear "BlockedBadBlocks" incase it
was set incorrectly (see above race).
We also modify the way we manage 'Blocked' to fit better with the new
handling of 'BlockedBadBlocks' and to make it consistent between
externally managed and internally managed metadata. This requires
that each raidXd loop checks if the metadata needs to be written and
triggers a write (md_check_recovery) if needed. Otherwise a queued
write request might cause raidXd to wait for the metadata to write,
and only that thread can write it.
Before writing metadata, we set FaultRecorded for all devices that
are Faulty, then after writing the metadata we clear Blocked for any
device for which the Fault was certainly Recorded.
The 'faulty' device flag now appears in sysfs if the device is faulty
*or* it has unacknowledged bad blocks. So user-space which does not
understand bad blocks can continue to function correctly.
User space which does, should not assume a device is faulty until it
sees the 'faulty' flag, and then sees the list of unacknowledged bad
blocks is empty.
If a device has ever seen a write error, we will want to handle
known-bad-blocks differently.
So create an appropriate state flag and export it via sysfs.
Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
md/raid1: avoid reading known bad blocks during resync
When performing resync/etc, keep the size of the request
small enough that it doesn't overlap any known bad blocks.
Devices with badblocks at the start of the request are completely
excluded.
If there is nowhere to read from due to bad blocks, record
a bad block on each target device.
Now that we never read from known-bad-blocks we can allow devices with
known-bad-blocks into a RAID1.
Now that we have a bad block list, we should not read from those
blocks.
There are several main parts to this:
1/ read_balance needs to check for bad blocks, and return not only
the chosen device, but also how many good blocks are available
there.
2/ fix_read_error needs to avoid trying to read from bad blocks.
3/ read submission must be ready to issue multiple reads to
different devices as different bad blocks on different devices
could mean that a single large read cannot be served by any one
device, but can still be served by the array.
This requires keeping count of the number of outstanding requests
per bio. This count is stored in 'bi_phys_segments'
4/ retrying a read needs to also be ready to submit a smaller read
and queue another request for the rest.
This does not yet handle bad blocks when reading to perform resync,
recovery, or check.
'md_trim_bio' will also be used for RAID10, so put it in md.c and
export it.
Space must have been allocated when array was created.
A feature flag is set when the badblock list is non-empty, to
ensure old kernels don't load and trust the whole device.
We only update the on-disk badblocklist when it has changed.
If the badblocklist (or other metadata) is stored on a bad block, we
don't cope very well.
If metadata has no room for bad block, flag bad-blocks as disabled,
and do the same for 0.90 metadata.
md: don't allow arrays to contain devices with bad blocks.
As no personality understand bad block lists yet, we must
reject any device that is known to contain bad blocks.
As the personalities get taught, these tests can be removed.
This only applies to raid1/raid5/raid10.
For linear/raid0/multipath/faulty the whole concept of bad blocks
doesn't mean anything so there is no point adding the checks.
Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
md/bad-block-log: add sysfs interface for accessing bad-block-log.
This can show the log (providing it fits in one page) and
allows bad blocks to be 'acknowledged' meaning that they
have safely been recorded in metadata.
Clearing bad blocks is not allowed via sysfs (except for
code testing). A bad block can only be cleared when
a write to the block succeeds.
Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
When calling bioset_create we pass the size of the front_pad as
sizeof(mddev)
which looks suspicious as mddev is a pointer and so it looks like a
common mistake where
sizeof(*mddev)
was intended.
The size is actually correct as we want to store a pointer in the
front padding of the bios created by the bioset, so make the intent
more explicit by using
sizeof(mddev_t *)
If device-mapper creates a RAID1 array that includes devices to
be rebuilt, it will deref a NULL pointer when finished because
sysfs is not used by device-mapper instantiated RAID devices.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
While preparing to write a stripe we keep the parity block or blocks
locked (R5_LOCKED) - towards the end of schedule_reconstruction.
If the array is discovered to have failed before this write completes
we can leave those blocks LOCKED, and init_stripe will notice that a
free stripe still has a locked block and will complain.
So clear the R5_LOCKED flag in handle_failed_stripe, and demote the
'BUG' to a 'WARN_ON'.
Namhyung Kim [Wed, 27 Jul 2011 01:00:36 +0000 (11:00 +1000)]
md/raid10: move rdev->corrected_errors counting
Read errors are considered to corrected if write-back and re-read
cycle is finished without further problems. Thus moving the rdev->
corrected_errors counting after the re-reading looks more reasonable
IMHO.
Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
Namhyung Kim [Wed, 27 Jul 2011 01:00:36 +0000 (11:00 +1000)]
md/raid5: move rdev->corrected_errors counting
Read errors are considered to corrected if write-back and re-read
cycle is finished without further problems. Thus moving the rdev->
corrected_errors counting after the re-reading looks more reasonable
IMHO.
Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
Namhyung Kim [Wed, 27 Jul 2011 01:00:36 +0000 (11:00 +1000)]
md/raid1: move rdev->corrected_errors counting
Read errors are considered to corrected if write-back and re-read
cycle is finished without further problems. Thus moving the rdev->
corrected_errors counting after the re-reading looks more reasonable
IMHO. Also included a couple of whitespace fixes on sync_page_io().
Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
md/raid10: Improve decision on whether to fail a device with a read error.
Normally we would fail a device with a READ error. However if doing
so causes the array to fail, it is better to leave the device
in place and just return the read error to the caller.
The current test for decide if the array will fail is overly
simplistic.
We have a function 'enough' which can tell if the array is failed or
not, so use it to guide the decision.
md/raid10: Make use of new recovery_disabled handling
When we get a read error during recovery, RAID10 previously
arranged for the recovering device to appear to fail so that
the recovery stops and doesn't restart. This is misleading and wrong.
Instead, make use of the new recovery_disabled handling and mark
the target device and having recovery disabled.
Add appropriate checks in add_disk and remove_disk so that devices
are removed and not re-added when recovery is disabled.
If we hit a read error while recovering a mirror, we want to abort the
recovery without necessarily failing the disk - as having a disk this
a read error is better than not having an array at all.
Currently this is managed with a per-array flag "recovery_disabled"
and is only implemented for RAID1. For RAID10 we will need finer
grained control as we might want to disable recovery for individual
devices separately.
So push more of the decision making into the personality.
'recovery_disabled' is now a 'cookie' which is copied when the
personality want to disable recovery and is changed when a device is
added to the array as this is used as a trigger to 'try recovery
again'.
This will allow RAID10 to get the control that it needs.
Namhyung Kim [Wed, 27 Jul 2011 01:00:36 +0000 (11:00 +1000)]
md: remove ro check in md_check_recovery()
Commit c89a8eee6154 ("Allow faulty devices to be removed from a
readonly array.") added some work on ro array in the function,
but it couldn't be done since we didn't allow the ro array to be
handled from the beginning. Fix it.
Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
md/raid5: move some more common code into handle_stripe
The RAID6 version of this code is usable for RAID5 providing:
- we test "conf->max_degraded" rather than "2" as appropriate
- we make sure s->failed_num[1] is meaningful (and not '-1')
when s->failed > 1
The 'return 1' must become 'goto finish' in the new location.
Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
md/raid5: move more common code into handle_stripe
Apart from 'prexor' which can only be set for RAID5, and
'qd_idx' which can only be meaningful for RAID6, these two
chunks of code are nearly the same.
So combine them into one adding a test to call either
handle_parity_checks5 or handle_parity_checks6 as appropriate.
Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6
RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is
also allow 'read-modify-write'
Apart from this difference, handle_stripe_dirtying[56] are nearly
identical. So resolve these differences and create just one function.
Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Next patch will unite fetch_block5 and fetch_block6.
First I want to make the differences a little more clear.
For RAID6 if we are writing at all and there is a failed device, then
we need to load or compute every block so we can do a
reconstruct-write.
This case isn't needed for RAID5 - we will do a read-modify-write in
that case.
So make that test a separate test in fetch_block6 rather than merged
with two other tests.
Make a similar change in fetch_block5 so the one bit that is not
needed for RAID6 is clearly separate.
Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
md/raid5: Move code for finishing a reconstruction into handle_stripe.
Prior to commit ab69ae12ceef7 the code in handle_stripe5 and
handle_stripe6 to "Finish reconstruct operations initiated by the
expansion process" was identical.
That commit added an identical stanza of code to each function, but in
different places. That was careless.
The raid5 code was correct, so move that out into handle_stripe and
remove raid6 version.
Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
md/raid5: Remove stripe_head_state arg from handle_stripe_expansion.
This arg is only used to differentiate between RAID5 and RAID6 but
that is not needed. For RAID5, raid5_compute_sector will set qd_idx
to "~0" so j with certainly not equals qd_idx, so there is no need
for a guard on that condition.
So remove the guard and remove the arg from the declaration and
callers of handle_stripe_expansion.
Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
md/raid5: move stripe_head_state and more code into handle_stripe.
By defining the 'stripe_head_state' in 'handle_stripe', we can move
some common code out of handle_stripe[56]() and into handle_stripe.
The means that all accesses for stripe_head_state in handle_stripe[56]
need to be 's->' instead of 's.', but the compiler should inline
those functions and just use a direct stack reference, and future
patches while hoist most of this code up into handle_stripe()
so we will revert to "s.".
Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
'struct stripe_head_state' stores state about the 'current' stripe
that is passed around while handling the stripe.
For RAID6 there is an extension structure: r6_state, which is also
passed around.
There is no value in keeping these separate, so move the fields from
the latter into the former.
This means that all code now needs to treat s->failed_num as an small
array, but this is a small cost.
Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
sh->lock is now mainly used to ensure that two threads aren't running
in the locked part of handle_stripe[56] at the same time.
That can more neatly be achieved with an 'active' flag which we set
while running handle_stripe. If we find the flag is set, we simply
requeue the stripe for later by setting STRIPE_HANDLE.
For safety we take ->device_lock while examining the state of the
stripe and creating a summary in 'stripe_head_state / r6_state'.
This possibly isn't needed but as shared fields like ->toread,
->towrite are checked it is safer for now at least.
We leave the label after the old 'unlock' called "unlock" because it
will disappear in a few patches, so renaming seems pointless.
This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE
later, but that is not a problem.
Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
md/raid5: Protect some more code with ->device_lock.
Other places that change or follow dev->towrite and dev->written take
the device_lock as well as the sh->lock.
So it should really be held in these places too.
Also, doing so will allow sh->lock to be discarded.
with merged fixes by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
This is the start of a series of patches to remove sh->lock.
sync_request takes sh->lock before setting STRIPE_SYNCING to ensure
there is no race with testing it in handle_stripe[56].
Instead, use a new flag STRIPE_SYNC_REQUESTED and test it early
in handle_stripe[56] (after getting the same lock) and perform the
same set/clear operations if it was set.
Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Namhyung Kim [Mon, 18 Jul 2011 07:38:50 +0000 (17:38 +1000)]
md/raid5: use kmem_cache_zalloc()
Replace kmem_cache_alloc + memset(,0,) to kmem_cache_zalloc.
I think it's not harmful since @conf->slab_cache already knows
actual size of struct stripe_head.
Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
Namhyung Kim [Mon, 18 Jul 2011 07:38:49 +0000 (17:38 +1000)]
md/raid10: share pages between read and write bio's during recovery
When performing a recovery, only first 2 slots in r10_bio are in use,
for read and write respectively. However all of pages in the write bio
are never used and just replaced to read bio's when the read completes.
Get rid of those unused pages and share read pages properly.
Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
Namhyung Kim [Mon, 18 Jul 2011 07:38:47 +0000 (17:38 +1000)]
md/raid10: factor out common bio handling code
When normal-write and sync-read/write bio completes, we should
find out the disk number the bio belongs to. Factor those common
code out to a separate function.
Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
Namhyung Kim [Mon, 18 Jul 2011 07:38:43 +0000 (17:38 +1000)]
md/raid10: get rid of duplicated conditional expression
Variable 'first' is initialized to zero and updated to @rdev->raid_disk
only if it is greater than 0. Thus condition '>= first' always implies
'>= 0' so the latter is not needed.
Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
clocksource: convert ARM 32-bit up counting clocksources
broke the build for ixp4xx and made big endian operation impossible.
This commit restores the original behaviour.
Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Signed-off-by: Krzysztof Hałasa <khc@pm.waw.pl>
[ Thomas says that we might want to have generic BE accessor functions
to the MMIO clock source, but that hasn't happened yet, so in the
meantime this seems to be the short-term fix for the particular
problem - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Axel Lin [Sun, 10 Jul 2011 07:45:07 +0000 (15:45 +0800)]
gpio: wm831x: add a missing break in wm831x_gpio_dbg_show
Signed-off-by: Axel Lin <axel.lin@gmail.com> Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Merge branch 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/staging
* 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/staging:
hwmon: (adm1275) Fix coefficients per datasheet revision B
hwmon: (pmbus) Use long variables for register to data conversions
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
fix loop checks in d_materialise_unique()
Fix ->d_lock locking order in unlazy_walk()
Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu
* 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu:
rcu: Prevent RCU callbacks from executing before scheduler initialized
Peter Zijlstra [Mon, 11 Jul 2011 14:28:50 +0000 (16:28 +0200)]
sched: Fix 32bit race
Commit 3fe1698b7fe0 ("sched: Deal with non-atomic min_vruntime reads
on 32bit") forgot to initialize min_vruntime_copy which could lead to
an infinite while loop in task_waking_fair() under some circumstances
(early boot, lucky timing).
[ This bug was also reported by others that blamed it on the RCU
initialization problems ]
Reported-and-tested-by: Bruno Wolff III <bruno@wolff.to> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hwmon: (adm1275) Fix coefficients per datasheet revision B
Coefficients to convert chip register values to voltage/current have been
slightly changed in revision B of the chip datasheet. Update driver coefficients
to match the coefficients in the datasheet.
Signed-off-by: Guenter Roeck <guenter.roeck@ericsson.com> Acked-by: Jean Delvare <khali@linux-fr.org>
Al Viro [Wed, 13 Jul 2011 01:42:24 +0000 (21:42 -0400)]
fix loop checks in d_materialise_unique()
Both __d_unalias() and __d_materialise_dentry() need loop prevention.
Grab rename_lock in caller, check for loops there...
As a side benefit, we have dentry_lock_for_move() called only under
rename_lock, which seriously reduces deadlock potential of the
execrable "locking order" used for ->d_lock.
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
GFS2: Resolve inode eviction and ail list interaction bug
GFS2: Fix race during filesystem mount
GFS2: force a log flush when invalidating the rindex glock
GFS2: Resolve inode eviction and ail list interaction bug
This patch contains a few misc fixes which resolve a recently
reported issue. This patch has been a real team effort and has
received a lot of testing.
The first issue is that the ail lock needs to be held over a few
more operations. The lock thats added into gfs2_releasepage() may
possibly be a candidate for replacing with RCU at some future
point, but at this stage we've gone for the obvious fix.
The second issue is that gfs2_write_inode() can end up calling
a glock recursively when called from gfs2_evict_inode() via the
syncing code, so it needs a guard added.
The third issue is that we either need to not truncate the metadata
pages of inodes which have zero link count, but which we cannot
deallocate due to them still being in use by other nodes, or we need
to ensure that those pages have all made it through the journal and
ail lists first. This patch takes the former approach, but the
latter has also been tested and there is nothing to choose between
them performance-wise. So again, we could revise that decision
in the future.
Also, the inode eviction process is now better documented.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Tested-by: Bob Peterson <rpeterso@redhat.com> Tested-by: Abhijith Das <adas@redhat.com> Reported-by: Barry J. Marson <bmarson@redhat.com> Reported-by: David Teigland <teigland@redhat.com>
Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
SUNRPC: Fix use of static variable in rpcb_getport_async
NFSv4.1: update nfs4_fattr_bitmap_maxsz
SUNRPC: Fix a race between work-queue and rpc_killall_tasks
pnfs: write: Set mds_offset in the generic layer - it is needed by all LDs
Merge branch 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6
* 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
drm/radeon/kms/evergreen: emit SQ_LDS_RESOURCE_MGMT for blits
agp/intel: Fix typo in G4x_GMCH_SIZE_VT_2M
drm/radeon/kms: fix typo in read_disabled vbios code
drm/radeon/kms: use correct BUS_CNTL reg on rs600
drm/radeon/kms: fix backend map typo on juniper
drm/radeon/kms: fix regression in hotplug
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (21 commits)
slip: fix wrong SLIP6 ifdef-endif placing
natsemi: fix another dma-debug report
sctp: ABORT if receive, reassmbly, or reodering queue is not empty while closing socket
net: Fix default in docs for tcp_orphan_retries.
hso: fix a use after free condition
net/natsemi: Fix module parameter permissions
XFRM: Fix memory leak in xfrm_state_update
sctp: Enforce retransmission limit during shutdown
mac80211: fix TKIP replay vulnerability
mac80211: fix ie memory allocation for scheduled scans
ssb: fix init regression of hostmode PCI core
rtlwifi: rtl8192cu: Add new USB ID for Netgear WNA1000M
ath9k: Fix tx throughput drops for AR9003 chips with AES encryption
carl9170: add NEC WL300NU-AG usbid
cfg80211: fix deadlock with rfkill/sched_scan by adding new mutex
ath5k: fix incorrect use of drvdata in PCI suspend/resume code
ath5k: fix incorrect use of drvdata in sysfs code
Bluetooth: Fix memory leak under page timeouts
Bluetooth: Fix regression with incoming L2CAP connections
Bluetooth: Fix hidp disconnect deadlocks and lost wakeup
...
Philip Rakity [Thu, 7 Jul 2011 16:04:55 +0000 (09:04 -0700)]
mmc: core: Bus width testing needs to handle suspend/resume
On reading the ext_csd for the first time (in 1 bit mode), save the
ext_csd information needed for bus width compare.
On every pass we make re-reading the ext_csd, compare the data
against the saved ext_csd data.
This fixes a regression introduced in 3.0-rc1 by 08ee80cc397ac1a3
("mmc: core: eMMC bus width may not work on all platforms"), which
incorrectly assumed we would be re-reading the ext_csd at resume-
time.
Signed-off-by: Philip Rakity <prakity@marvell.com> Tested-by: Jaehoon Chung <jh80.chung@samsung.com> Signed-off-by: Chris Ball <cjb@laptop.org>
Paul E. McKenney [Sun, 10 Jul 2011 22:57:35 +0000 (15:57 -0700)]
rcu: Prevent RCU callbacks from executing before scheduler initialized
Under some rare but real combinations of configuration parameters, RCU
callbacks are posted during early boot that use kernel facilities that
are not yet initialized. Therefore, when these callbacks are invoked,
hard hangs and crashes ensue. This commit therefore prevents RCU
callbacks from being invoked until after the scheduler is fully up and
running, as in after multiple tasks have been spawned.
It might well turn out that a better approach is to identify the specific
RCU callbacks that are causing this problem, but that discussion will
wait until such time as someone really needs an RCU callback to be invoked
(as opposed to merely registered) during early boot.
Reported-by: julie Sullivan <kernelmail.jms@gmail.com> Reported-by: RKK <kulkarni.ravi4@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: julie Sullivan <kernelmail.jms@gmail.com> Tested-by: RKK <kulkarni.ravi4@gmail.com>
Ryusuke Konishi [Sun, 19 Jun 2011 07:56:29 +0000 (16:56 +0900)]
nilfs2: remove resize from unsupported features list
Resize feature was supported by the commit 4e33f9eab07e but it was not
reflected to the list of unsupported features in nilfs2.txt file.
This updates the list to fix discrepancy.
Chris Wilson [Tue, 12 Jul 2011 22:38:18 +0000 (23:38 +0100)]
agp/intel: Fix typo in G4x_GMCH_SIZE_VT_2M
Konstantin Belousov found an error in the define of G4x_GMCH_SIZE_VT_2M
relative to the GMCH specs, and confirmed that indeed one of his users
with a Q45 reports 0xb not 0xc for a 2/2MiB GATT.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Konstantin Belousov <kostikbel@gmail.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Dave Airlie <airlied@redhat.com>
Al Viro [Wed, 13 Jul 2011 01:40:23 +0000 (21:40 -0400)]
Fix ->d_lock locking order in unlazy_walk()
Make sure that child is still a child of parent before nested locking
of child->d_lock in unlazy_walk(); otherwise we are risking a violation
of locking order and deadlocks.
Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc/mm: Fix memory_block_size_bytes() for non-pseries
mm: Move definition of MIN_MEMORY_BLOCK_SIZE to a header
Merge branch 'drm-intel-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/keithp/linux-2.6
* 'drm-intel-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/keithp/linux-2.6:
drm/i915/ringbuffer: Idling requires waiting for the ring to be empty
Revert "drm/i915: enable rc6 by default"
drm/i915: Clean up i915_driver_load failure path
drm/i915: Enable GPU reset on Ivybridge.
drm/i915/dp: manage sink power state if possible
drm/i915/dp: consolidate AUX retry code
drm/i915/dp: remove DPMS mode tracking from DP
drm/i915/dp: try to read receiver capabilities 3 times when detecting
drm/i915/dp: read more receiver capability bits on hotplug
drm/i915/dp: use DP DPCD defines when looking at DPCD values
drm/i915/dp: retry link status read 3 times on failure
Ben Greear [Tue, 12 Jul 2011 17:27:55 +0000 (10:27 -0700)]
SUNRPC: Fix use of static variable in rpcb_getport_async
Because struct rpcbind_args *map was declared static, if two
threads entered this method at the same time, the values
assigned to map could be sent two two differen tasks.
This could cause all sorts of problems, include use-after-free
and double-free of memory.
Fix this by removing the static declaration so that the map
pointer is on the stack.
Signed-off-by: Ben Greear <greearb@candelatech.com> Cc: stable@kernel.org Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Chris Wilson [Tue, 12 Jul 2011 17:03:29 +0000 (18:03 +0100)]
drm/i915/ringbuffer: Idling requires waiting for the ring to be empty
...which is measured by the size and not the amount of space remaining.
Waiting upon size-8, did one of two things. In the common case with more
than 8 bytes available to write into the ring, it would return
immediately. Otherwise, it would timeout given the impossible condition
of waiting for more space than is available in the ring, leading to
warnings such as:
[drm:intel_cleanup_ring_buffer] *ERROR* failed to quiesce render ring
whilst cleaning up: -16
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Eric Anholt <eric@anholt.net> Signed-off-by: Keith Packard <keithp@keithp.com>
Keith Packard [Sun, 10 Jul 2011 20:12:17 +0000 (13:12 -0700)]
drm/i915: Clean up i915_driver_load failure path
i915_driver_load adds a write-combining MTRR region for the GTT
aperture to improve memory speeds through the aperture. If
i915_driver_load fails after this, it would not have cleaned up the
MTRR. This shouldn't cause any problems, except for consuming an MTRR
register. Still, it's best to clean up completely in the failure path,
which is easily done by calling mtrr_del if the mtrr was successfully
allocated.
i915_driver_load calls i915_gem_load which register
i915_gem_inactive_shrink. If i915_driver_load fails after calling
i915_gem_load, the shrinker will be left registered. When called, it
will access freed memory and crash. The fix is to unregister the shrinker in the
failure path using code duplicated from i915_driver_unload.
i915_driver_load also has some incorrect gotos in the error cleanup
paths:
* After failing to initialize the GTT (which cannot happen, btw,
intel_gtt_get returns a fixed (non-NULL) value), it tries to
free the uninitialized WC IO mapping. Fixed this by changing the
target from out_iomapfree to out_rmmap
Signed-off-by: Keith Packard <keithp@keithp.com> Tested-by: Lin Ming <ming.m.lin@intel.com>
hwmon: (pmbus) Use long variables for register to data conversions
Using integer variable types for register to data conversions can cause
overflows especially for power calculations, which are in microwatt.
Use long variables instead.
Michal Marek [Tue, 12 Jul 2011 09:54:48 +0000 (11:54 +0200)]
kbuild: Do not write to builddir in modules_install
Let depmod.sh create a temporary directory in /tmp instead of writing to
the build directory as root. The mktemp utility should be available on
any recent system (and there is already scripts/gen_initramfs_list.sh
relying on it).
Reported-by: Christian Kujau <lists@nerdbynature.de> Signed-off-by: Michal Marek <mmarek@suse.cz>
There is a potential race during filesystem mounting which has recently
been reported. It occurs when the userland gfs_controld is able to
process requests fast enough that it tries to use the sysfs interface
before the lock module is properly initialised. This is a pretty
unusual case as normally the lock module initialisation is very quick
compared with gfs_controld.
This patch adds an interruptible completion which is used to ensure that
userland will wait for the initialisation of the lock module to
complete.
There are other potential solutions to this problem, but this is the
quickest at this stage and has been tested both with and without
mount.gfs2 present in the system.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Reported-by: David Booher <dbooher@adams.net>
GFS2: force a log flush when invalidating the rindex glock
Right now, there is nothing that forces the log to get flushed when a node
drops its rindex glock so that another node can grow the filesystem. If the
log doesn't get flushed, GFS2 can corrupt the sd_log_le_rg list in the
following way.
A node puts an rgd on the list in rg_lo_add(), and then the rindex glock is
dropped so the other node can grow the filesystem. When the node reacquires the
rindex glock, that rgd gets deleted in clear_rgrpdi() before ever being
removed from the list by gfs2_log_flush().
This code simply forces a log flush when the rindex glock is invalidated,
solving the problem.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
powerpc/mm: Fix memory_block_size_bytes() for non-pseries
Just compiling pseries in the kernel causes it to override
memory_block_size_bytes() regardless of what is the runtime
platform.
This cleans up the implementation of that function, fixing
a bug or two while at it, so that it's harmless (and potentially
useful) for other platforms. Without this, bugs in that code
would trigger a WARN_ON() in drivers/base/memory.c when
booting some different platforms.
If/when we have another platform supporting memory hotplug we
might want to either move that out to a generic place or
make it a ppc_md. callback.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
mm: Move definition of MIN_MEMORY_BLOCK_SIZE to a header
The macro MIN_MEMORY_BLOCK_SIZE is currently defined twice in two .c
files, and I need it in a third one to fix a powerpc bug, so let's
first move it into a header
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Ingo Molnar <mingo@elte.hu>
Documentation/Changes: remove some really obsolete text
That file harkens back to the days of the big 2.4 -> 2.6 version jump,
and was based even then on older versions. Some of it is just obsolete,
and Jesper Juhl points out that it talks about kernel versions 2.6 and
should be updated to 3.0.
Remove some obsolete text, and re-phrase some other to not be 2.6-specific.
Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6:
[media] msp3400: fill in v4l2_tuner based on vt->type field
[media] tuner-core.c: don't change type field in g_tuner or g_frequency
[media] cx18/ivtv: fix g_tuner support
[media] tuner-core: power up tuner when called with s_power(1)
[media] v4l2-ioctl.c: check for valid tuner type in S_HW_FREQ_SEEK
[media] tuner-core: simplify the standard fixup
[media] tuner-core/v4l2-subdev: document that the type field has to be filled in
[media] v4l2-subdev.h: remove unused s_mode tuner op
[media] feature-removal-schedule: change in how radio device nodes are handled
[media] bttv: fix s_tuner for radio
[media] pvrusb2: fix g/s_tuner support
[media] v4l2-ioctl.c: prefill tuner type for g_frequency and g/s_tuner
[media] tuner-core: fix tuner_resume: use t->mode instead of t->type
[media] tuner-core: fix s_std and s_tuner