Alex Elder [Sat, 2 Mar 2013 00:00:16 +0000 (18:00 -0600)]
libceph: be explicit about message data representation
A ceph message has a data payload portion. The memory for that data
(either the source of data to send or the location to place data
that is received) is specified in several ways. The ceph_msg
structure includes fields for all of those ways, but this
mispresents the fact that not all of them are used at a time.
Specifically, the data in a message can be in:
- an array of pages
- a list of pages
- a list of Linux bios
- a second list of pages (the "trail")
(The two page lists are currently only ever used for outgoing data.)
Impose more structure on the ceph message, making the grouping of
some of these fields explicit. Shorten the name of the
"page_alignment" field.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Sat, 9 Mar 2013 02:58:59 +0000 (20:58 -0600)]
libceph: define and use ceph_tcp_recvpage()
Define a new function ceph_tcp_recvpage() that behaves in a way
comparable to ceph_tcp_sendpage().
Rearrange the code in both read_partial_message_pages() and
read_partial_message_bio() so they have matching structure,
(similar to what's in write_partial_msg_pages()), and use
this new function.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Thu, 7 Mar 2013 05:39:38 +0000 (23:39 -0600)]
libceph: small write_partial_msg_pages() refactor
Define local variables page_offset and length to represent the range
of bytes within a page that will be sent by ceph_tcp_sendpage() in
write_partial_msg_pages().
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Thu, 7 Mar 2013 05:39:39 +0000 (23:39 -0600)]
libceph: consolidate message prep code
In prepare_write_message_data(), various fields are initialized in
preparation for writing message data out. Meanwhile, in
read_partial_message(), there is essentially the same block of code,
operating on message variables associated with an incoming message.
Generalize prepare_write_message_data() so it works for both
incoming and outcoming messages, and use it in both spots. The
did_page_crc is not used for input (so it's harmless to initialize
it).
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Thu, 7 Mar 2013 05:39:38 +0000 (23:39 -0600)]
libceph: use local variables for message positions
There are several places where a message's out_msg_pos or in_msg_pos
field is used repeatedly within a function. Use a local pointer
variable for this purpose to unclutter the code.
This and the upcoming cleanup patches are related to:
http://tracker.ceph.com/issues/4403
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Thu, 7 Mar 2013 05:39:39 +0000 (23:39 -0600)]
libceph: don't clear bio_iter in prepare_write_message()
At one time it was necessary to clear a message's bio_iter field to
avoid a bad pointer dereference in write_partial_msg_pages().
That no longer seems to be the case. Here's why.
The message's bio fields represent (in this case) outgoing data.
Between where the bio_iter is made NULL in prepare_write_message()
and the call in that function to prepare_message_data(), the
bio fields are never used.
In prepare_message_data(), init-bio_iter() is called, and the result
of that overwrites the value in the message's bio_iter field.
Because it gets overwritten anyway, there is no need to set it to
NULL. So don't do it.
This resolves:
http://tracker.ceph.com/issues/4402
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Tue, 5 Mar 2013 00:29:06 +0000 (18:29 -0600)]
libceph: activate message data assignment checks
The mds client no longer tries to assign zero-length message data,
and the osd client no longer sets its data info more than once.
This allows us to activate assertions in the messenger to verify
these things never happen.
This resolves both of these:
http://tracker.ceph.com/issues/4263
http://tracker.ceph.com/issues/4284
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Alex Elder [Tue, 5 Mar 2013 00:29:06 +0000 (18:29 -0600)]
libceph: set response data fields earlier
When an incoming message is destined for the osd client, the
messenger calls the osd client's alloc_msg method. That function
looks up which request has the tid matching the incoming message,
and returns the request message that was preallocated to receive the
response. The response message is therefore known before the
request is even started.
Between the start of the request and the receipt of the response,
the request and its data fields will not change, so there's no
reason we need to hold off setting them. In fact it's preferable
to set them just once because it's more obvious that they're
unchanging.
So set up the fields describing where incoming data is to land in a
response message at the beginning of ceph_osdc_start_request().
Define a helper function that sets these fields, and use it to
set the fields for both outgoing data in the request message and
incoming data in the response.
This resolves:
http://tracker.ceph.com/issues/4284
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Thu, 7 Mar 2013 21:38:26 +0000 (15:38 -0600)]
libceph: record message data byte length
Record the number of bytes of data in a page array rather than the
number of pages in the array. It can be assumed that the page array
is of sufficient size to hold the number of bytes indicated (and
offset by the indicated alignment).
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Thu, 14 Feb 2013 18:16:43 +0000 (12:16 -0600)]
libceph: isolate other message data fields
Define ceph_msg_data_set_pagelist(), ceph_msg_data_set_bio(), and
ceph_msg_data_set_trail() to clearly abstract the assignment of the
remaining data-related fields in a ceph message structure. Use the
new functions in the osd client and mds client.
This partially resolves:
http://tracker.ceph.com/issues/4263
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Thu, 14 Feb 2013 18:16:43 +0000 (12:16 -0600)]
libceph: isolate message page field manipulation
Define a function ceph_msg_data_set_pages(), which more clearly
abstracts the assignment page-related fields for data in a ceph
message structure. Use this new function in the osd client and mds
client.
Ideally, these fields would never be set more than once (with
BUG_ON() calls to guarantee that). At the moment though the osd
client sets these every time it receives a message, and in the event
of a communication problem this can happen more than once. (This
will be resolved shortly, but setting up these helpers first makes
it all a bit easier to work with.)
Rearrange the field order in a ceph_msg structure to group those
that are used to define the possible data payloads.
This partially resolves:
http://tracker.ceph.com/issues/4263
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Thu, 7 Mar 2013 21:38:25 +0000 (15:38 -0600)]
libceph: record byte count not page count
Record the byte count for an osd request rather than the page count.
The number of pages can always be derived from the byte count (and
alignment/offset) but the reverse is not true.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Sat, 2 Mar 2013 00:00:16 +0000 (18:00 -0600)]
libceph: simplify new message initialization
Rather than explicitly initializing many fields to 0, NULL, or false
in a newly-allocated message, just use kzalloc() for allocating new
messages. This will become a much more convenient way of doing
things anyway for upcoming patches that abstract the data field.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Thu, 7 Mar 2013 05:39:38 +0000 (23:39 -0600)]
libceph: advance pagelist with list_rotate_left()
While processing an outgoing pagelist (either the data pagelist or
trail) in a ceph message, the messenger cycles through each of the
pages on the list. This is accomplished in out_msg_pos_next(), if
the end of the first page on the list is reached, the first page is
moved to the end of the list.
There is a list operation, list_rotate_left(), which performs
exactly this operation, and by using it, what's really going on
becomes more obvious.
So replace these two list_move_tail() calls with list_rotate_left().
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Sat, 9 Mar 2013 00:51:04 +0000 (18:51 -0600)]
libceph: define and use in_msg_pos_next()
Define a new function in_msg_pos_next() to match out_msg_pos_next(),
and use it in place of code at the end of read_partial_message_pages()
and read_partial_message_bio().
Note that the page number is incremented and offset reset under
slightly different conditions from before. The result is
equivalent, however, as explained below.
Each time an incoming message is going to arrive, we find out how
much room is left--not surpassing the current page--and provide that
as the number of bytes to receive. So the amount we'll use is the
lesser of: all that's left of the entire request; and all that's
left in the current page.
If we received exactly how many were requested, we either reached
the end of the request or the end of the page. In the first case,
we're done, in the second, we move onto the next page in the array.
In all cases but (possibly) on the last page, after adding the
number of bytes received, page_pos == PAGE_SIZE. On the last page,
it doesn't really matter whether we increment the page number and
reset the page position, because we're done and we won't come back
here again. The code previously skipped over that last case,
basically. The new code handles that case the same as the others,
incrementing and resetting.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Sat, 9 Mar 2013 00:51:03 +0000 (18:51 -0600)]
libceph: kill args in read_partial_message_bio()
There is only one caller for read_partial_message_bio(), and it
always passes &msg->bio_iter and &bio_seg as the second and third
arguments. Furthermore, the message in question is always the
connection's in_msg, and we can get that inside the called function.
So drop those two parameters and use their derived equivalents.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Tue, 5 Mar 2013 15:25:10 +0000 (09:25 -0600)]
libceph: clean up skipped message logic
In ceph_con_in_msg_alloc() it is possible for a connection's
alloc_msg method to indicate an incoming message should be skipped.
By default, read_partial_message() initializes the skip variable
to 0 before it gets provided to ceph_con_in_msg_alloc().
The osd client, mon client, and mds client each supply an alloc_msg
method. The mds client always assigns skip to be 0.
The other two leave the skip value of as-is, or assigns it to zero,
except:
- if no (osd or mon) request having the given tid is found, in
which case skip is set to 1 and NULL is returned; or
- in the osd client, if the data of the reply message is not
adequate to hold the message to be read, it assigns skip
value 1 and returns NULL.
So the returned message pointer will always be NULL if skip is ever
non-zero.
Clean up the logic a bit in ceph_con_in_msg_alloc() to make this
state of affairs more obvious. Add a comment explaining how a null
message pointer can mean either a message that should be skipped or
a problem allocating a message.
This resolves:
http://tracker.ceph.com/issues/4324
Alex Elder [Thu, 14 Feb 2013 18:16:43 +0000 (12:16 -0600)]
libceph: separate read and write data
An osd request defines information about where data to be read
should be placed as well as where data to write comes from.
Currently these are represented by common fields.
Keep information about data for writing separate from data to be
read by splitting these into data_in and data_out fields.
This is the key patch in this whole series, in that it actually
identifies which osd requests generate outgoing data and which
generate incoming data. It's less obvious (currently) that an osd
CALL op generates both outgoing and incoming data; that's the focus
of some upcoming work.
This resolves:
http://tracker.ceph.com/issues/4127
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Thu, 14 Feb 2013 18:16:43 +0000 (12:16 -0600)]
libceph: distinguish page and bio requests
An osd request uses either pages or a bio list for its data. Use a
union to record information about the two, and add a data type
tag to select between them.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Sat, 2 Mar 2013 00:00:15 +0000 (18:00 -0600)]
libceph: don't assign page info in ceph_osdc_new_request()
Currently ceph_osdc_new_request() assigns an osd request's
r_num_pages and r_alignment fields. The only thing it does
after that is call ceph_osdc_build_request(), and that doesn't
need those fields to be assigned.
Move the assignment of those fields out of ceph_osdc_new_request()
and into its caller. As a result, the page_align parameter is no
longer used, so get rid of it.
Note that in ceph_sync_write(), the value for req->r_num_pages had
already been calculated earlier (as num_pages, and fortunately
it was computed the same way). So don't bother recomputing it,
but because it's not needed earlier, move that calculation after the
call to ceph_osdc_new_request(). Hold off making the assignment to
r_alignment, doing it instead r_pages and r_num_pages are
getting set.
Similarly, in start_read(), nr_pages already holds the number of
pages in the array (and is calculated the same way), so there's no
need to recompute it. Move the assignment of the page alignment
down with the others there as well.
This and the next few patches are preparation work for:
http://tracker.ceph.com/issues/4127
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
(This is being reposted. The first one had a problem because it
erroneously added a similar change elsewhere; that change has been
dropped.)
The next patch in this series points out that the calculation for
the number of pages in an osd request is getting done twice. It
is not obvious, but the result of both calculations is identical.
This patch simplifies one of them--as a separate step--to make
it clear that the transformation in the next patch is valid.
In ceph_sync_write() there is some magic that computes page_align
for an osd request. But a little analysis shows it can be
simplified.
First, we have:
io_align = pos & ~PAGE_MASK;
which is used here:
page_align = (pos - io_align + buf_align) & ~PAGE_MASK;
Note (pos - io_align) simply rounds "pos" down to the nearest multiple
of the page size.
We also have:
buf_align = (unsigned long)data & ~PAGE_MASK;
Adding buf_align to that rounded-down "pos" value will stay within
the same page; the result will just be offset by the page offset for
the "data" pointer. The final mask therefore leaves just the value
of "buf_align".
One more simplification. Note that the result of calc_pages_for()
is invariant of which page the offset starts in--the only thing that
matters is the offset within the starting page. We will have
put the proper page offset to use into "page_align", so just use
that in calculating num_pages.
This resolves:
http://tracker.ceph.com/issues/4166
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Sat, 2 Mar 2013 00:00:15 +0000 (18:00 -0600)]
ceph: use calc_pages_for() in start_read()
There's a spot that computes the number of pages to allocate for a
page-aligned length by just shifting it. Use calc_pages_for()
instead, to be consistent with usage everywhere else. The result
is the same.
The reason for this is to make it clearer in an upcoming patch that
this calculation is duplicated.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Sat, 2 Mar 2013 00:00:14 +0000 (18:00 -0600)]
libceph: define mds_alloc_msg() method
The only user of the ceph messenger that doesn't define an alloc_msg
method is the mds client. Define one, such that it works just like
it did before, and simplify ceph_con_in_msg_alloc() by assuming the
alloc_msg method is always present.
This and the next patch resolve:
http://tracker.ceph.com/issues/4322
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Alex Elder [Sat, 2 Mar 2013 00:00:15 +0000 (18:00 -0600)]
libceph: rename ceph_calc_object_layout()
The purpose of ceph_calc_object_layout() is to fill in the pool
number and seed for a ceph_pg structure provided, based on a given
osd map and target object id.
Currently that function takes a file layout parameter, but the only
thing used out of that is its pool number.
Change the function so it takes a pool number rather than the full
file layout structure. Only update the ceph_pg if the pool is found
in the osd map. Get rid of few useless lines of code from the
function while there.
Since the function now very clearly just fills in the ceph_pg
structure it's provided, rename it ceph_calc_ceph_pg().
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Sat, 2 Mar 2013 00:00:15 +0000 (18:00 -0600)]
libceph: use (void *) for untyped data in osd ops
Two of the fields defining osd operations are defined using (char *)
while the data they represent are really untyped, not character
strings. Change them to have type (void *).
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Mon, 4 Mar 2013 17:08:29 +0000 (11:08 -0600)]
libceph: fix wrong opcode use in osd_req_encode_op()
The new cases added to osd_req_encode_op() caused a new sparse
error, which highlighted an existing problem that had been
overlooked since it was originally checked in. When an unsupported
opcode is found the destination rather than the source opcode was
being used in the error message. The two differ in their byte
order, and we want to be using the one in the source.
Alex Elder [Wed, 27 Feb 2013 16:26:25 +0000 (10:26 -0600)]
libceph: complete lingering requests only once
An osd request marked to linger will be re-submitted in the event
a connection to the target osd gets dropped. Currently, if there
is a callback function associated with a request it will be called
each time a request is submitted--which for lingering requests can
be more than once.
Change it so a request--including lingering ones--will get completed
(from the perspective of the user of the osd client) exactly once.
This resolves:
http://tracker.ceph.com/issues/3967
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Yan, Zheng [Fri, 1 Mar 2013 02:55:39 +0000 (10:55 +0800)]
ceph: don't early drop Fw cap
ceph_aio_write() has an optimization that marks CEPH_CAP_FILE_WR
cap dirty before data is copied to page cache and inode size is
updated. The optimization avoids slow cap revocation caused by
balance_dirty_pages(), but introduces inode size update race. If
ceph_check_caps() flushes the dirty cap before the inode size is
updated, MDS can miss the new inode size. So just remove the
optimization.
Yan, Zheng [Mon, 18 Feb 2013 08:38:14 +0000 (16:38 +0800)]
ceph: use I_COMPLETE inode flag instead of D_COMPLETE flag
commit c6ffe10015 moved the flag that tracks if the dcache contents
for a directory are complete to dentry. The problem is there are
lots of places that use ceph_dir_{set,clear,test}_complete() while
holding i_ceph_lock. but ceph_dir_{set,clear,test}_complete() may
sleep because they call dput().
This patch basically reverts that commit. For ceph_d_prune(), it's
called with both the dentry to prune and the parent dentry are
locked. So it's safe to access the parent dentry's d_inode and
clear I_COMPLETE flag.
Yan, Zheng [Wed, 27 Feb 2013 01:26:09 +0000 (09:26 +0800)]
ceph: set mds_want according to cap import message
MDS ignores cap update message if migrate_seq mismatch, so when
receiving a cap import message with higher migrate_seq, set mds_want
according to the cap import message.
Alex Elder [Thu, 14 Feb 2013 18:16:43 +0000 (12:16 -0600)]
libceph: set page alignment in start_request()
The page alignment field for a request is currently set in
ceph_osdc_build_request(). It's not needed at that point
nor do either of its callers need that value assigned at
any point before they call ceph_osdc_start_request().
So move that assignment into ceph_osdc_start_request().
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Mon, 25 Feb 2013 23:35:46 +0000 (17:35 -0600)]
libceph: distinguish page array and pagelist count
Use distinct fields for tracking the number of pages in a message's
page array and in a message's page list. Currently only one or the
other is used at a time, but that will be changing soon.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Sat, 16 Feb 2013 04:10:17 +0000 (22:10 -0600)]
libceph: don't pass request to calc_layout()
The only remaining reason to pass the osd request to calc_layout()
is to fill in its r_num_pages and r_page_alignment fields. Once it
fills those in, it doesn't do anything more with them.
We can therefore move those assignments into the caller, and get rid
of the "req" parameter entirely.
Note, however, that the only caller is ceph_osdc_new_request(),
and that immediately overwrites those fields with values based on
its passed-in page offset. So the assignment inside calc_layout()
was redundant anyway.
This resolves:
http://tracker.ceph.com/issues/4262
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Sat, 16 Feb 2013 04:10:17 +0000 (22:10 -0600)]
libceph: format target object name in caller
Move the formatting of the object name (oid) to use for an object
request into the caller of calc_layout(). This makes the "vino"
parameter no longer necessary, so get rid of it.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Sat, 16 Feb 2013 04:10:17 +0000 (22:10 -0600)]
libceph: make ceph_msg->bio_seg be unsigned
The bio_seg field is used by the ceph messenger in iterating through
a bio. It should never have a negative value, so make it an
unsigned. (I contemplated making it unsigned short to match the
struct bio definition, but it offered no benefit.)
Change variables used to hold bio_seg values to all be unsigned as
well. Change two variable names in init_bio_iter() to match the
convention used everywhere else.
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Alex Elder [Sat, 16 Feb 2013 04:10:17 +0000 (22:10 -0600)]
libceph: fix a osd request memory leak
If an invalid layout is provided to ceph_osdc_new_request(), its
call to calc_layout() might return an error. At that point in the
function we've already allocated an osd request structure, so we
need to free it (drop a reference) in the event such an error
occurs.
The only other value calc_layout() will return is 0, so make that
explicit in the successful case.
This resolves:
http://tracker.ceph.com/issues/4240
Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fix from Olof Johansson:
"A late-arriving fix for musb on OMAP4, resolving an issue where the
musb IP won't be clocked and thus not functional. Small in scope,
most of the lines changed is a longish comment."
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: OMAP4: hwmod data: make 'ocp2scp_usb_phy_phy_48m" as the main clock
I think we could just move the full vm_iomap_memory() function into
util.h or similar, but I didn't get any reply from anybody actually
using nommu even to this trivial patch, so I'm not going to touch it any
more than required.
Here's the fairly minimal stub to make the nommu case at least
potentially work. It doesn't seem like anybody cares, though.
Olof Johansson [Sat, 27 Apr 2013 00:35:13 +0000 (17:35 -0700)]
Merge tag 'omap-for-v3.9-rc6/fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into fixes
From Tony Lindgren:
One MUSB regression fix that I forgot to send earlier. Without
this MUSB no longer works on omap4 based devices.
* tag 'omap-for-v3.9-rc6/fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
ARM: OMAP4: hwmod data: make 'ocp2scp_usb_phy_phy_48m" as the main clock
Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
"Two driver fixes.
One avoids reading any file at a system with a cx25821 board
(fortunately, this is not a common device). The other one prevents
reading after a buffer with ISDB-T devices based on mb86a20s."
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
[media] cx25821: do not expose broken video output streams
[media] mb86a20s: Fix estimate_rate setting
Merge branch 'fixes-3.9-late' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull late parisc fixes from Helge Deller:
"I know it's *very* late in the 3.9 release cycle, but since there
aren't that many people testing the parisc linux kernel, a few (for
our port) critical issues just showed up a few days back for the first
time.
What's in it?
- add missing __ucmpdi2 symbol, which is required for btrfs on 32bit
kernel.
- change kunmap() macro to static inline function. This fixes a
debian/gcc-4.4 build error.
- add locking when doing PTE updates. This fixes random userspace
crashes.
- disable (optional) -mlong-calls compiler option for modules, else
modules can't be loaded at runtime.
- a smart patch by Will Deacon which fixes 64bit put_user() warnings
on 32bit kernel."
* 'fixes-3.9-late' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: use spin_lock_irqsave/spin_unlock_irqrestore for PTE updates
parisc: disable -mlong-calls compiler option for kernel modules
parisc: uaccess: fix compiler warnings caused by __put_user casting
parisc: Change kunmap macro to static inline function
parisc: Provide __ucmpdi2 to resolve undefined references in 32 bit builds.
Matt Fleming [Fri, 26 Apr 2013 09:10:55 +0000 (10:10 +0100)]
efivars: only check for duplicates on the registered list
variable_is_present() accesses '__efivars' directly, but when called via
gsmi_init() Michel reports observing the following crash,
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: variable_is_present+0x55/0x170
Call Trace:
register_efivars+0x106/0x370
gsmi_init+0x2ad/0x3da
do_one_initcall+0x3f/0x170
The reason for the crash is that '__efivars' hasn't been initialised nor
has it been registered with register_efivars() by the time the google
EFI SMI driver runs. The gsmi code uses its own struct efivars, and
therefore, a different variable list. Fix the above crash by passing
the registered struct efivars to variable_is_present(), so that we
traverse the correct list.
Reported-by: Michel Lespinasse <walken@google.com> Tested-by: Michel Lespinasse <walken@google.com> Cc: Mike Waychison <mikew@google.com> Cc: Matthew Garrett <matthew.garrett@nebula.com> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit b0de59b5733d ("TTY: do not update atime/mtime on read/write")
we removed timestamps from tty inodes to fix a security issue and waited
if something breaks. Well, 'w', the utility to find out logged users
and their inactivity time broke. It shows that users are inactive since
the time they logged in.
To revert to the old behaviour while still preventing attackers to
guess the password length, we update the timestamps in one-minute
intervals by this patch.
H. Peter Anvin [Thu, 25 Apr 2013 21:00:22 +0000 (14:00 -0700)]
Merge tag 'efi-urgent' into x86/urgent
* The EFI variable anti-bricking algorithm merged in -rc8 broke booting
on some Apple machines because they implement EFI spec 1.10, which
doesn't provide a QueryVariableInfo() runtime function and the logic
used to check for the existence of that function was insufficient.
Fix from Josh Boyer.
* The anti-bricking algorithm also introduced a compiler warning on
32-bit. Fix from Borislav Petkov.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
parisc: use spin_lock_irqsave/spin_unlock_irqrestore for PTE updates
User applications running on SMP kernels have long suffered from instability
and random segmentation faults. This patch improves the situation although
there is more work to be done.
One of the problems is the various routines in pgtable.h that update page table
entries use different locking mechanisms, or no lock at all (set_pte_at). This
change modifies the routines to all use the same lock pa_dbit_lock. This lock
is used for dirty bit updates in the interruption code. The patch also purges
the TLB entries associated with the PTE to ensure that inconsistent values are
not used after the page table entry is updated. The UP and SMP code are now
identical.
The change also includes a minor update to the purge_tlb_entries function in
cache.c to improve its efficiency.
Signed-off-by: John David Anglin <dave.anglin@bell.net> Cc: Helge Deller <deller@gmx.de> Signed-off-by: Helge Deller <deller@gmx.de>
parisc: disable -mlong-calls compiler option for kernel modules
CONFIG_MLONGCALLS was introduced in commit ec758f98328da3eb933a25dc7a2eed01ef44d849 to overcome linker issues when linking
huge linux kernels, e.g. with many modules linked in.
But in the kernel module loader there is no support yet for the new relocation
types, which is why modules built with -mlong-calls can't be loaded.
Furthermore, for modules long calls are not really necessary, since we already
use stub sections which resolve long distance calls.
So, let's just disable this compiler option when compiling kernel modules.
Will Deacon [Mon, 22 Apr 2013 12:53:43 +0000 (12:53 +0000)]
parisc: uaccess: fix compiler warnings caused by __put_user casting
When targetting 32-bit processors, __put_user emits a pair of stw
instructions for the 8-byte case. If the type of __val is a pointer, the
marshalling code casts it to the wider integer type of u64, resulting
in the following compiler warnings:
kernel/signal.c: In function 'copy_siginfo_to_user':
kernel/signal.c:2752:11: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
kernel/signal.c:2752:11: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
[...]
This patch fixes the warnings by removing the marshalling code and using
the correct output modifiers in the __put_{user,kernel}_asm64 macros
so that GCC will allocate the right registers without the need to
extract the two words explicitly.
parisc: Change kunmap macro to static inline function
Change kunmap macro to static inline function to fix build error
compiling drivers/base/dma-buf.c.
Without the change, the following error can occur:
CC drivers/base/dma-buf.o
drivers/base/dma-buf.c: In function 'dma_buf_kunmap':
drivers/base/dma-buf.c:427:46:
error: macro "kunmap" passed 3 arguments, but takes just 1
I believe parisc is the only arch to implement kunmap using a macro.
Signed-off-by: John David Anglin <dave.anglin@bell.net> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Signed-off-by: Helge Deller <deller@gmx.de>
parisc: Provide __ucmpdi2 to resolve undefined references in 32 bit builds.
The Debian experimental linux source package (3.8.5-1) build fails
with the following errors:
...
MODPOST 2016 modules
ERROR: "__ucmpdi2" [fs/btrfs/btrfs.ko] undefined!
ERROR: "__ucmpdi2" [drivers/md/dm-verity.ko] undefined!
The attached patch resolves this problem. It is based on the s390
implementation of ucmpdi2.c.
Signed-off-by: John David Anglin <dave.anglin@bell.net> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Signed-off-by: Helge Deller <deller@gmx.de>
Merge tag 'gpio-v3.9-lastminute' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull gpi fix from Linus Walleij:
"This is a last minute revert for the GPIO tree, as Mike Dunn noticed
breakage on some older PXA machines due to moving PXA GPIO initcalls
to the module_init initlevel"
* tag 'gpio-v3.9-lastminute' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
Revert "gpio: pxa: set initcall level to module init"
We need to check the runtime sys_table for the EFI version the firmware
specifies instead of just checking for a NULL QueryVariableInfo. Older
implementations of EFI don't have QueryVariableInfo but the runtime is
a smaller structure, so the pointer to it may be pointing off into garbage.
This is apparently the case with several Apple firmwares that support EFI
1.10, and the current check causes them to no longer boot. Fix based on
a suggestion from Matthew Garrett.
Signed-off-by: Josh Boyer <jwboyer@redhat.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com>
arch/x86/boot/compressed/eboot.c: In function ‘setup_efi_vars’:
arch/x86/boot/compressed/eboot.c:269:2: warning: passing argument 1 of ‘efi_call_phys’ makes pointer from integer without a cast [enabled by default]
In file included from arch/x86/boot/compressed/eboot.c:12:0:
/w/kernel/linux/arch/x86/include/asm/efi.h:8:33: note: expected ‘void *’ but argument is of type ‘long unsigned int’
after cc5a080c5d40 ("efi: Pass boot services variable info to runtime
code").
Reported-by: Paul Bolle <pebolle@tiscali.nl> Cc: Matthew Garrett <matthew.garrett@nebula.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Matt Fleming <matt.fleming@intel.com>
lmo commit c17a6554 (MIPS: page.h: Provide more readable definition for
PAGE_MASK) apparently breaks ioremap of 36-bit addresses on my Alchemy
systems (PCI and PCMCIA) The reason is that in arch/mips/mm/ioremap.c
line 157 (phys_addr &= PAGE_MASK) bits 32-35 are cut off. Seems the
new PAGE_MASK is explicitly 32bit, or one could make it signed instead
of unsigned long.
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Misc fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Fix offcore_rsp valid mask for SNB/IVB
perf: Treat attr.config as u64 in perf_swevent_init()
Merge branch 'x86-kdump-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull kdump fixes from Peter Anvin:
"The kexec/kdump people have found several problems with the support
for loading over 4 GiB that was introduced in this merge cycle. This
is partly due to a number of design problems inherent in the way the
various pieces of kdump fit together (it is pretty horrifically manual
in many places.)
After a *lot* of iterations this is the patchset that was agreed upon,
but of course it is now very late in the cycle. However, because it
changes both the syntax and semantics of the crashkernel option, it
would be desirable to avoid a stable release with the broken
interfaces."
I'm not happy with the timing, since originally the plan was to release
the final 3.9 tomorrow. But apparently I'm doing an -rc8 instead...
* 'x86-kdump-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kexec: use Crash kernel for Crash kernel low
x86, kdump: Change crashkernel_high/low= to crashkernel=,high/low
x86, kdump: Retore crashkernel= to allocate under 896M
x86, kdump: Set crashkernel_low automatically
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Peter Anvin:
"Three groups of fixes:
1. Make sure we don't execute the early microcode patching if family
< 6, since it would touch MSRs which don't exist on those
families, causing crashes.
2. The Xen partial emulation of HyperV can be dealt with more
gracefully than just disabling the driver.
3. More EFI variable space magic. In particular, variables hidden
from runtime code need to be taken into account too."
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, microcode: Verify the family before dispatching microcode patching
x86, hyperv: Handle Xen emulation of Hyper-V more gracefully
x86,efi: Implement efi_no_storage_paranoia parameter
efi: Export efi_query_variable_store() for efivars.ko
x86/Kconfig: Make EFI select UCS2_STRING
efi: Distinguish between "remaining space" and actually used space
efi: Pass boot services variable info to runtime code
Move utf16 functions to kernel core and rename
x86,efi: Check max_size only if it is non-zero.
x86, efivars: firmware bug workarounds should be in platform code
Merge branch 'fixes' of git://git.linaro.org/people/rmk/linux-arm
Pull ARM fixes from Russell King:
"A set of fixes from various people - Will Deacon gets a prize for
removing code this time around. The biggest fix in this lot is
sorting out the ARM740T mess. The rest are relatively small fixes."
* 'fixes' of git://git.linaro.org/people/rmk/linux-arm:
ARM: 7699/1: sched_clock: Add more notrace to prevent recursion
ARM: 7698/1: perf: fix group validation when using enable_on_exec
ARM: 7697/1: hw_breakpoint: do not use __cpuinitdata for dbg_cpu_pm_nb
ARM: 7696/1: Fix kexec by setting outer_cache.inv_all for Feroceon
ARM: 7694/1: ARM, TCM: initialize TCM in paging_init(), instead of setup_arch()
ARM: 7692/1: iop3xx: move IOP3XX_PERIPHERAL_VIRT_BASE
ARM: modules: don't export cpu_set_pte_ext when !MMU
ARM: mm: remove broken condition check for v4 flushing
ARM: mm: fix numerous hideous errors in proc-arm740.S
ARM: cache: remove ARMv3 support code
ARM: tlbflush: remove ARMv3 support
1) Fix race in sparc64 TLB shootdowns, we have to synchronize with the
sibling cpus completing if we are passing them a reference via
pointer to a data structure.
2) Fix cleaning of bitmaps in sparc32, from Akinobu Mita.
3) Fix various sparc header mistakes, some of which resulted in
userland build breakage. From Sam Ravnborg.
4) Kill ghost declarations and defines missed when several bits of code
got deleted recently.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc64: Fix race in TLB batch processing.
sparc: use asm-generic version of types.h
bbc_i2c: fix section mismatch warning
sparc: use generic headers
sparc:cleanup unused code in smp_32.h
sparc/iommu: fix typo s/265KB/256KB/
sparc/srmmu: clear trailing edge of bitmap properly
sparc:remove unused declaration smp_boot_cpus()
1) ax88796 does 64-bit divides which causes link errors on ARM, fix
from Arnd Bergmann.
2) Once an improper offload setting is detected on an SKB we don't rate
limit the log message so we can very easily live lock. From Ben
Greear.
3) Openvswitch cannot report vport configuration changes reliably
because it didn't preallocate the netlink notification message
before changing state. From Jesse Gross.
4) The effective UID/GID SCM credentials fix, from Linus.
5) When a user explicitly asks for wireless authentication, cfg80211
isn't told about the AP detachment leaving inconsistent state. Fix
from Johannes Berg.
6) Fix self-MAC checks in batman-adv on multi-mesh nodes, from Antonio
Quartulli.
7) Revert build_skb() change sin IGB driver, can result in memory
corruption. From Alexander Duyck.
8) Fix setting VLANs on virtual functions in IXGBE, from Greg Rose.
9) Fix TSO races in qlcnic driver, from Sritej Velaga.
10) In bnx2x the kernel driver and UNDI firmware can try to program the
chip at the same time, resulting in corruption. Add proper
synchronization. From Dmitry Kravkov.
11) Fix corruption of status block in firmware ram in bxn2x, from Ariel
Elior.
12) Fix load balancing hash regression of bonding driver in forwarding
configurations, from Eric Dumazet.
13) Fix TS ECR regression in TCP by calling tcp_replace_ts_recent() in
all the right spots, from Eric Dumazet.
14) Fix several bonding bugs having to do with address manintainence,
including not removing address when configuration operations
encounter errors, missed locking on the address lists, missing
refcounting on VLAN objects, etc. All from Nikolay Aleksandrov.
15) Add workarounds for firmware bugs in LTE qmi_wwan devices, wherein
the devices fail to add a proper ethernet header while on LTE
networks but otherwise properly do so on 2G and 3G ones. From Bjørn
Mork.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (38 commits)
net: fix incorrect credentials passing
net: rate-limit warn-bad-offload splats.
net: ax88796: avoid 64 bit arithmetic
qlge: Update version to 1.00.00.32.
qlge: Fix ethtool autoneg advertising.
qlge: Fix receive path to drop error frames
net: qmi_wwan: prevent duplicate mac address on link (firmware bug workaround)
net: qmi_wwan: fixup destination address (firmware bug workaround)
net: qmi_wwan: fixup missing ethernet header (firmware bug workaround)
bonding: in bond_mc_swap() bond's mc addr list is walked without lock
bonding: disable netpoll on enslave failure
bonding: primary_slave & curr_active_slave are not cleaned on enslave failure
bonding: vlans don't get deleted on enslave failure
bonding: mc addresses don't get deleted on enslave failure
pkt_sched: fix error return code in fw_change_attrs()
irda: small read past the end of array in debug code
tcp: call tcp_replace_ts_recent() from tcp_ack()
netfilter: xt_rpfilter: skip locally generated broadcast/multicast, too
netfilter: ipset: bitmap:ip,mac: fix listing with timeout
bonding: fix l23 and l34 load balancing in forwarding path
...
Commit 257b5358b32f ("scm: Capture the full credentials of the scm
sender") changed the credentials passing code to pass in the effective
uid/gid instead of the real uid/gid.
Obviously this doesn't matter most of the time (since normally they are
the same), but it results in differences for suid binaries when the wrong
uid/gid ends up being used.
This just undoes that (presumably unintentional) part of the commit.
Reported-by: Andy Lutomirski <luto@amacapital.net> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Serge E. Hallyn <serge@hallyn.com> Cc: David S. Miller <davem@davemloft.net> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
H. Peter Anvin [Sat, 20 Apr 2013 00:09:03 +0000 (17:09 -0700)]
Merge remote-tracking branch 'efi/urgent' into x86/urgent
Matt Fleming (1):
x86, efivars: firmware bug workarounds should be in platform
code
Matthew Garrett (3):
Move utf16 functions to kernel core and rename
efi: Pass boot services variable info to runtime code
efi: Distinguish between "remaining space" and actually used
space
Richard Weinberger (2):
x86,efi: Check max_size only if it is non-zero.
x86,efi: Implement efi_no_storage_paranoia parameter
Sergey Vlasov (2):
x86/Kconfig: Make EFI select UCS2_STRING
efi: Export efi_query_variable_store() for efivars.ko
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
H. Peter Anvin [Fri, 19 Apr 2013 23:36:03 +0000 (16:36 -0700)]
x86, microcode: Verify the family before dispatching microcode patching
For each CPU vendor that implements CPU microcode patching, there will
be a minimum family for which this is implemented. Verify this
minimum level of support.
This can be done in the dispatch function or early in the application
functions. Doing the latter turned out to be somewhat awkward because
of the ineviable split between the BSP and the AP paths, and rather
than pushing deep into the application functions, do this in
the dispatch function.
Ben Greear [Fri, 19 Apr 2013 10:45:52 +0000 (10:45 +0000)]
net: rate-limit warn-bad-offload splats.
If one does do something unfortunate and allow a
bad offload bug into the kernel, this the
skb_warn_bad_offload can effectively live-lock the
system, filling the logs with the same error over
and over.
Add rate limitation to this so that box remains otherwise
functional in this case.
Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When building ax88796 on an ARM platform with 64-bit resource_size_t,
we currently get
drivers/net/ethernet/8390/ax88796.c:875: undefined reference to `__aeabi_uldivmod'
because we do a division on the length of the MMIO resource.
Since we know that this resource is very short, using an
"unsigned long" instead of "resource_size_t" is entirely
sufficient, and avoids this link-time error.
Cc: Ben Dooks <ben-linux@fluff.org> Cc: netdev@vger.kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
o Fix the driver to drop error frames in the receive path
o Update error counter which was not getting incremented
Signed-off-by: Sritej Velaga <sritej.velaga@qlogic.com> Signed-off-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 19 Apr 2013 21:51:26 +0000 (17:51 -0400)]
Merge branch 'qmi_wwan'
Bjørn Mork says:
====================
This series adds workarounds for 3 different firmware bugs, each
preventing the affected devices from working at all. I therefore
humbly request that these fixes go to stable-3.8 (if still
maintained) and 3.9 (either via net if still possible, or via
stable if not).
All 3 workarounds are applied to all devices supported by the driver.
Adding quirks for specific devices was considered as an alternative,
but was rejected because we have too little information about the
exact distribution of the buggy firmwares. All we know is that the
same bug shows up in devices from at least 3 different, and presumably
independent, vendors.
The workarounds have instead been designed to automatically apply
when necessary, and to have as little impact as possible on unaffected
devices. The series has been tested on a number of devices both with
and without these bugs.
The series should apply cleanly to net/master, net-next/master and
stable/linux-3.8.y
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
net: qmi_wwan: prevent duplicate mac address on link (firmware bug workaround)
We normally trust and use the CDC functional descriptors provided by a
number of devices. But some of these will erroneously list the address
reserved for the device end of the link. Attempting to use this on
both the device and host side will naturally not work.
Work around this bug by ignoring the functional descriptor and assign a
random address instead in this case.
Signed-off-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>
The bogus address is always the same, and matches the address
suggested by many devices as a default address. It is likely a
hardcoded firmware default.
The circumstances where this bug has been observed indicates that
the trigger is related to timing or some other factor the host
cannot control. Repeating the exact same configuration sequence
that caused it to trigger once, will not necessarily cause it to
trigger the next time. Reproducing the bug is therefore difficult.
This opens up a possibility that the bug is more common than we can
confirm, because affected devices often will work properly again
after a reset. A procedure most users are likely to try out before
reporting a bug.
Unconditionally rewriting the destination address if the first digit
of the received packet is 0, is considered an acceptable compromise
since we already have to inspect this digit. The simplification will
cause unnecessary rewrites if the real address starts with 0, but this
is still better than adding additional tests for this particular case.
Signed-off-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>
A number of LTE devices from different vendors all suffer from the
same firmware bug: Most of the packets received from the device while
it is attached to a LTE network will not have an ethernet header. The
devices work as expected when attached to 2G or 3G networks, sending
an ethernet header with all packets.
This driver is not aware of which network the modem attached to, and
even if it were there are still some packet types which are always
received with the header intact.
All devices supported by this driver have severely limited
networking capabilities:
- can only transmit IPv4, IPv6 and possibly ARP
- can only support a single host hardware address at any time
- will only do point-to-point communcation with the host
Because of this, we are able to reliably identify any bogus raw IP
packets by simply looking at the 4 IP version bits. All we need to
do is to avoid 4 or 6 in the first digit of the mac address. This
workaround ensures this, and fix up the received packets as necessary.
Given the distribution of the bug, it is believed that the source is
the chipset vendor. The devices which are verified to be affected are:
Huawei E392u-12 (Qualcomm MDM9200)
Pantech UML290 (Qualcomm MDM9600)
Novatel USB551L (Qualcomm MDM9600)
Novatel E362 (Qualcomm MDM9600)
It is believed that the bug depend on firmware revision, which means
that possibly all devices based on the above mentioned chipset may be
affected if we consider all available firmware revisions.
The information about affected devices and versions is likely
incomplete. As the additional overhead for packets not needing this
fixup is very small, it is considered acceptable to apply the
workaround to all devices handled by this driver.
Reported-by: Dan Williams <dcbw@redhat.com> Signed-off-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>