git.proxmox.com Git - mirror

block: Convert bdrv_aio_discard() to byte-based

Another step towards byte-based interfaces everywhere. Replace
the sector-based bdrv_aio_discard() with a new byte-based
bdrv_aio_pdiscard(), which silently ignores any unaligned head
or tail. Driver callbacks will be converted in followup patches.

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-id: 1468624988-423-5-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

block: Switch BlockRequest to byte-based

BlockRequest is the internal struct used by bdrv_aio_*. At the
moment, all such calls were sector-based, but we will eventually
convert to byte-based; start by changing the internal variables
to be byte-based. No change to behavior, although the read and
write code can now go byte-based through more of the stack.

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-id: 1468624988-423-4-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

block: Convert bdrv_discard() to byte-based

Another step towards byte-based interfaces everywhere. Replace
the sector-based bdrv_discard() with a new byte-based
bdrv_pdiscard(), which silently ignores any unaligned head
or tail.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468624988-423-3-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

block: Convert bdrv_co_discard() to byte-based

Another step towards byte-based interfaces everywhere. Replace
the sector-based bdrv_co_discard() with a new byte-based
bdrv_co_pdiscard(), which silently ignores any unaligned head
or tail. Driver callbacks will be converted in followup patches.

By calculating the alignment outside of the loop, and clamping
the max discard to an aligned value, we can simplify the actions
done within the loop.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468624988-423-2-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

iscsi: Rely on block layer to break up large requests

Now that the block layer honors max_request, we don't need to
bother with an EINVAL on overlarge requests, but can instead
assert that requests are well-behaved.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1468607524-19021-7-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

nbd: Drop unused offset parameter

Now that NBD relies on the block layer to fragment things, we no
longer need to track an offset argument for which fragment of
a request we are actually servicing.

While at it, use true and false instead of 0 and 1 for a bool
parameter.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1468607524-19021-6-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

nbd: Rely on block layer to break up large requests

Now that the block layer will honor max_transfer, we can simplify
our code to rely on that guarantee.

The readv code can call directly into nbd-client, just as the
writev code has done since commit 52a4650.

Interestingly enough, while qemu-io 'w 0 40m' splits into a 32M
and 8M transaction, 'w -z 0 40m' splits into two 16M and an 8M,
because the block layer caps the bounce buffer for writing zeroes
at 16M. When we later introduce support for NBD_CMD_WRITE_ZEROES,
we can get a full 32M zero write (or larger, if the client and
server negotiate that write zeroes can use a larger size than
ordinary writes).

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1468607524-19021-5-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

block: Fragment writes to max transfer length

Drivers should be able to rely on the block layer honoring the
max transfer length, rather than needing to return -EINVAL
(iscsi) or manually fragment things (nbd). We already fragment
write zeroes at the block layer; this patch adds the fragmentation
for normal writes, after requests have been aligned (fragmenting
before alignment would lead to multiple unaligned requests, rather
than just the head and tail).

When fragmenting a large request where FUA was requested, but
where we know that FUA is implemented by flushing all requests
rather than the given request, then we can still get by with
only one flush. Note, however, that we need a followup patch
to the raw format driver to avoid a regression in the number of
flushes actually issued.

The return value was previously nebulous on success (sometimes
zero, sometimes the length written); since we never have a short
write, and since fragmenting may store yet another positive
value in 'ret', change the function to always return 0 on success,
matching what we do in bdrv_aligned_preadv().

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-id: 1468607524-19021-4-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

raw_bsd: Don't advertise flags not supported by protocol layer

The raw format layer supports all flags via passthrough - but
it only makes sense to pass through flags that the lower layer
actually supports.

The next patch gives stronger reasoning for why this is correct.
At the moment, the raw format layer ignores the max_transfer
limit of its protocol layer, and an attempt to do the qemu-io
'w -f 0 40m' to an NBD server that lacks FUA will pass the entire
40m request to the NBD driver, which then fragments the request
itself into a 32m write, 8m write, and flush. But once the block
layer starts honoring limits and fragmenting packets, the raw
driver will hand the NBD driver two separate requests; if both
requests have BDRV_REQ_FUA set, then this would result in a 32m
write, flush, 8m write, and second flush. By having the raw
layer no longer advertise FUA support when the protocol layer
lacks it, we are back to a single flush at the block layer for
the overall 40m request.

Note that 'w -f -z 0 40m' does not currently exhibit the same
problem, because there, the fragmentation does not occur until
at the NBD layer (the raw layer has .bdrv_co_pwrite_zeroes, and
the NBD layer doesn't advertise max_pwrite_zeroes to constrain
things at the raw layer) - but the problem is latent and we
would again have too many flushes without this patch once the
NBD layer implements support for the new NBD_CMD_WRITE_ZEROES
command, if it sets max_pwrite_zeroes to the same 32m limit as
recommended by the NBD protocol.

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468607524-19021-3-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

block: Fragment reads to max transfer length

Drivers should be able to rely on the block layer honoring the
max transfer length, rather than needing to return -EINVAL
(iscsi) or manually fragment things (nbd). This patch adds
the fragmentation in the block layer, after requests have been
aligned (fragmenting before alignment would lead to multiple
unaligned requests, rather than just the head and tail).

The return value was previously nebulous on success on whether
it was zero or the length read; and fragmenting may introduce
yet other non-zero values if we use the last length read. But
as at least some callers are sloppy and expect only zero on
success, it is easiest to just guarantee 0.

[Fix uninitialized ret local variable in bdrv_aligned_preadv().
--Stefan]

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-id: 1468607524-19021-2-git-send-email-eblake@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Merge remote-tracking branch 'remotes/pmaydell/tags/pull-target-arm-20160719' into staging

target-arm queue:
* fix two minor Coverity complaints

# gpg: Signature made Tue 19 Jul 2016 18:02:34 BST
# gpg:                using RSA key 0x3C2525ED14360CDE
# gpg: Good signature from "Peter Maydell <peter.maydell@linaro.org>"
# gpg:                 aka "Peter Maydell <pmaydell@gmail.com>"
# gpg:                 aka "Peter Maydell <pmaydell@chiark.greenend.org.uk>"
# Primary key fingerprint: E1A5 C593 CD41 9DE2 8E83  15CF 3C25 25ED 1436 0CDE

* remotes/pmaydell/tags/pull-target-arm-20160719:
  arm_gicv3: Add assert()s to tell Coverity that offsets are aligned
  target-arm: Fix unreachable code in gicv3_class_name()

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

Merge remote-tracking branch 'remotes/riku/tags/pull-linux-user-20160719-2' into staging

linux-user fixes before 2.7 freeze, fix commit message

# gpg: Signature made Tue 19 Jul 2016 14:18:54 BST
# gpg:                using RSA key 0xB44890DEDE3C9BC0
# gpg: Good signature from "Riku Voipio <riku.voipio@iki.fi>"
# gpg:                 aka "Riku Voipio <riku.voipio@linaro.org>"
# Primary key fingerprint: FF82 03C8 C391 98AE 0581  41EF B448 90DE DE3C 9BC0

* remotes/riku/tags/pull-linux-user-20160719-2:
  linux-user: AArch64 has sync_file_range, not sync_file_range2
  linux-user: Fix type for SIOCATMARK ioctl
  linux-user: define missing sparc syscalls
  linux-user: Fix terminal control ioctls
  linux-user: Add some new blk ioctls
  linux-user: Handle short lengths in host_to_target_sockaddr()
  linux-user: Forget about synchronous signal once it is delivered
  linux-user: Correct type for LOOP_GET_STATUS{,64} ioctls
  linux-user: Correct type for BLKSSZGET
  linux-user: Add loop control ioctls
  linux-user: Check sigsetsize argument to syscalls
  linux-user: add nested netlink types
  linux-user: convert sockaddr_ll from host to target
  linux-user: add fd_trans helper in do_recvfrom()
  linux-user: fix netlink memory corruption
  linux-user: fd_trans_*_data() returns the length

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

arm_gicv3: Add assert()s to tell Coverity that offsets are aligned

Coverity complains that the GICR_IPRIORITYR case in gicv3_readl()
can overflow an array, because it doesn't know that the offsets
passed to that function must be word aligned. Add some assert()s
which hopefully tell Coverity that this isn't possible.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1468261372-17508-1-git-send-email-peter.maydell@linaro.org

target-arm: Fix unreachable code in gicv3_class_name()

Coverity complains that the exit() in gicv3_class_name()
can be unreachable, because if TARGET_AARCH64 is defined
then all code paths return before reaching it. Move the
exit() up to the error_report() that it belongs with.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Shannon Zhao <shannon.zhao@linaro.org>
Message-id: 1468260552-8400-1-git-send-email-peter.maydell@linaro.org

disas: Fix ATTRIBUTE_UNUSED define clash with ALSA headers

disas/bfd.h defines ATTRIBUTE_UNUSED, but unfortunately the
ALSA system headers also define this macro, which means that
you can get a compilation failure if building with ALSA and
any files happen to include the alsa headers before bfd.h
rather than the other way around.

This is unfortunate namespace pollution by the ALSA headers but
we can work around it. Add an #ifndef guard to bfd.h and remove
the unnecessary extra definition in disas/arm.c to fix this.

Reported-by: BALATON Zoltan <balaton@eik.bme.hu>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468937076-21503-1-git-send-email-peter.maydell@linaro.org

Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into staging

* two old patches from prospective GSoC students
* i386 -kernel device tree support
* Coverity fix
* memory usage improvement from Peter
* checkpatch fix
* g_path_get_dirname cleanup
* caching of block status for iSCSI

# gpg: Signature made Tue 19 Jul 2016 07:43:41 BST
# gpg:                using RSA key 0xBFFBD25F78C7AE83
# gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>"
# gpg:                 aka "Paolo Bonzini <pbonzini@redhat.com>"
# Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4  E2F7 7E15 100C CD36 69B1
#      Subkey fingerprint: F133 3857 4B66 2389 866C  7682 BFFB D25F 78C7 AE83

* remotes/bonzini/tags/for-upstream:
  target-i386: Remove redundant HF_SOFTMMU_MASK
  block/iscsi: allow caching of the allocation map
  block/iscsi: fix rounding in iscsi_allocationmap_set
  Move README to markdown
  cpu-exec: Move down some declarations in cpu_exec()
  exec: avoid realloc in phys_map_node_reserve
  checkpatch: consider git extended headers valid patches
  megasas: remove useless check for cmd->frame
  compiler: never omit assertions if using a static analysis tool
  hw/i386: add device tree support
  Changed malloc to g_malloc, free to g_free in bsd-user/qemu.h
  use g_path_get_dirname instead of dirname

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

Merge remote-tracking branch 'remotes/stsquad/tags/pull-travis-20160718-1' into staging

Make IRC a little less noisy

# gpg: Signature made Mon 18 Jul 2016 16:42:57 BST
# gpg:                using RSA key 0xFBD0DB095A9E2A44
# gpg: Good signature from "Alex Bennée (Master Work Key) <alex.bennee@linaro.org>"
# Primary key fingerprint: 6685 AE99 E751 67BC AFC8  DF35 FBD0 DB09 5A9E 2A44

* remotes/stsquad/tags/pull-travis-20160718-1:
  .travis.yml: Disable IRC build status updates from forks

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

linux-user: AArch64 has sync_file_range, not sync_file_range2

The AArch64 Linux ABI syscall 84 is sync_file_range, not
sync_file_range2 (in the kernel it uses the asm-generic
headers and does not define __ARCH_WANT_SYNC_FILE_RANGE2).
Update our TARGET_NR_* definitions accordingly.

This fixes the sync_file_range syscall which otherwise
gets its arguments in the wrong order.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>

linux-user: Fix type for SIOCATMARK ioctl

The SIOCATMARK ioctl takes an argument which should be a
pointer to an integer where the kernel will write the result.
We were incorrectly declaring it as TYPE_NULL which would mean
it would always fail (with EFAULT) when it should succeed.
Correct the type.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>

linux-user: define missing sparc syscalls

NR_lookup_dcookie, NR_fadvise64, NR_fadvise64_64

Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>

linux-user: Fix terminal control ioctls

TIOCGPTN and related terminal control ioctls were not converted to the guest ioctl format on x86_64 targets. Convert these ioctls to enable terminal functionality on x86_64 guests.

Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>

Merge remote-tracking branch 'remotes/kraxel/tags/pull-vnc-20160719-1' into staging

vnc: bugfixes for -rc0

# gpg: Signature made Tue 19 Jul 2016 08:27:05 BST
# gpg:                using RSA key 0x4CB6D8EED3E87138
# gpg: Good signature from "Gerd Hoffmann (work) <kraxel@redhat.com>"
# gpg:                 aka "Gerd Hoffmann <gerd@kraxel.org>"
# gpg:                 aka "Gerd Hoffmann (private) <kraxel@gmail.com>"
# Primary key fingerprint: A032 8CFF B93A 17A7 9901  FE7D 4CB6 D8EE D3E8 7138

* remotes/kraxel/tags/pull-vnc-20160719-1:
  vnc-tight: fix regression with libxenstore
  vnc-enc-tight: fix off-by-one bug
  vnc: make sure we finish disconnect

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

Merge remote-tracking branch 'remotes/mcayland/tags/qemu-openbios-signed' into staging

Update OpenBIOS images

# gpg: Signature made Tue 19 Jul 2016 07:42:30 BST
# gpg:                using RSA key 0x5BC2C56FAE0F321F
# gpg: Good signature from "Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>"
# Primary key fingerprint: CC62 1AB9 8E82 200D 915C  C9C4 5BC2 C56F AE0F 321F

* remotes/mcayland/tags/qemu-openbios-signed:
  Update OpenBIOS images to e79bca6 built from submodule.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

linux-user: Add some new blk ioctls

Add some new blk ioctls (these are 0x12,119 through
to 0x12,127). Several of these are used by mke2fs; this silences
the warnings:

mke2fs 1.42.12 (29-Aug-2014)
Unsupported ioctl: cmd=0x127b
Unsupported ioctl: cmd=0x127a
warning: Unable to get device geometry for /dev/loop5
Unsupported ioctl: cmd=0x127c
Unsupported ioctl: cmd=0x127c
Unsupported ioctl: cmd=0x1277

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>

linux-user: Handle short lengths in host_to_target_sockaddr()

If userspace specifies a short buffer for a target sockaddr,
the kernel will only copy in as much as it has space for
(or none at all if the length is zero) -- see the kernel
move_addr_to_user() function. Mimic this in QEMU's
host_to_target_sockaddr() routine.

In particular, this fixes a segfault running the LTP
recvfrom01 test, where the guest makes a recvfrom()
call with a bad buffer pointer and other parameters which
cause the kernel to set the addrlen to zero; because we
did not skip the attempt to swap the sa_family field we
segfaulted on the bad address.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>

linux-user: Forget about synchronous signal once it is delivered

Commit 655ed67c2a248cf which switched synchronous signals to
benig recorded in ts->sync_signal rather than in a queue
with every other signal had a bug: we failed to clear
the flag indicating that a synchronous signal was pending
when we delivered it. This meant that we would take the signal
again and again every time the guest made a syscall.
(This is a bug introduced in my refactoring of Timothy Baldwin's
original code.)

Fix this by passing in the struct emulated_sigtable* to
handle_pending_signal(), so that we clear the pending flag
in the ts->sync_signal struct when handling a synchronous signal.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>

linux-user: Correct type for LOOP_GET_STATUS{,64} ioctls

The LOOP_GET_STATUS and LOOP_GET_STATUS64 ioctls were incorrectly
defined as IOC_W rather than IOC_R, which meant we weren't
correctly copying the information back from the kernel to the guest.
The loop_info64 structure definition was also missing a member
and using the wrong type for several 32-bit fields.

In particular, this meant that "kpartx -d image.img" didn't work
and "losetup -a" behaved strangely. Correct the ioctl type definitions.

Reported-by: Chanho Park <chanho61.park@samsung.com>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>

linux-user: Correct type for BLKSSZGET

The BLKSSZGET ioctl takes an argument which is a pointer to an int.
We were incorrectly declaring it to take a pointer to a long, which
meant that we would incorrectly write to memory which we should not
if the guest is a 64-bit architecture.

In particular, kpartx uses this ioctl to write to an int on the
stack, which tends to result in it crashing immediately.

Reported-by: Chanho Park <chanho61.park@samsung.com>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>

linux-user: Add loop control ioctls

Add support for the /dev/loop-control ioctls:
LOOP_CTL_ADD
LOOP_CTL_REMOVE
LOOP_CTL_GET_FREE

[RV: fixed to apply to new header guards]
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>

linux-user: Check sigsetsize argument to syscalls

Many syscalls which take a sigset_t argument also take an argument
giving the size of the sigset_t. The kernel insists that this
matches its idea of the type size and fails EINVAL if it is not.
Implement this logic in QEMU. (This mostly just means some LTP test
cases which check error cases now pass.)

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>
Reviewed-by: Laurent Vivier <laurent@vivier.eu>

linux-user: add nested netlink types

Nested types are used by the kernel to send link information and
protocol properties.

We can see following errors with "ip link show":

Unimplemented nested type 26
Unimplemented nested type 26
Unimplemented nested type 18
Unimplemented nested type 26
Unimplemented nested type 18
Unimplemented nested type 26

This patch implements nested types 18 (IFLA_LINKINFO) and
26 (IFLA_AF_SPEC).

Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>

linux-user: convert sockaddr_ll from host to target

As we convert sockaddr for AF_PACKET family for sendto() (target to
host) we need also to convert this for getsockname() (host to target).

arping uses getsockname() to get the the interface address and uses
this address with sendto().

Tested with:

    /sbin/arping -D -q -c2 -I eno1 192.168.122.88

...
getsockname(3, {sa_family=AF_PACKET, proto=0x806, if2,
pkttype=PACKET_HOST, addr(6)={1, 10c37b6b9a76}, [18]) = 0
...
sendto(3, "..." 28, 0,
       {sa_family=AF_PACKET, proto=0x806, if2, pkttype=PACKET_HOST,
       addr(6)={1, ffffffffffff}, 20) = 28
...

Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>

linux-user: add fd_trans helper in do_recvfrom()

Fix passwd using netlink audit.

Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>

linux-user: fix netlink memory corruption

Netlink is byte-swapping data in the guest memory (it's bad).

It's ok when the data come from the host as they are generated by the
host.

But it doesn't work when data come from the guest: the guest can
try to reuse these data whereas they have been byte-swapped.

This is what happens in glibc:

glibc generates a sequence number in nlh.nlmsg_seq and calls
sendto() with this nlh. In sendto(), we byte-swap nlmsg.seq.

Later, after the recvmsg(), glibc compares nlh.nlmsg_seq with
sequence number given in return, and of course it fails (hangs),
because nlh.nlmsg_seq is not valid anymore.

The involved code in glibc is:

sysdeps/unix/sysv/linux/check_pf.c:make_request()
...
  req.nlh.nlmsg_seq = time (NULL);
...
  if (TEMP_FAILURE_RETRY (__sendto (fd, (void *) &req, sizeof (req), 0,
                                    (struct sockaddr *) &nladdr,
                                    sizeof (nladdr))) < 0)
<here req.nlh.nlmsg_seq has been byte-swapped>
...
  do
    {
...
      ssize_t read_len = TEMP_FAILURE_RETRY (__recvmsg (fd, &msg, 0));
...
      struct nlmsghdr *nlmh;
      for (nlmh = (struct nlmsghdr *) buf;
           NLMSG_OK (nlmh, (size_t) read_len);
           nlmh = (struct nlmsghdr *) NLMSG_NEXT (nlmh, read_len))
        {
<we compare nlmh->nlmsg_seq with corrupted req.nlh.nlmsg_seq>
          if (nladdr.nl_pid != 0 || (pid_t) nlmh->nlmsg_pid != pid
              || nlmh->nlmsg_seq != req.nlh.nlmsg_seq)
            continue;
...
          else if (nlmh->nlmsg_type == NLMSG_DONE)
            /* We found the end, leave the loop.  */
            done = true;
        }
    }
  while (! done);

As we have a continue on "nlmh->nlmsg_seq != req.nlh.nlmsg_seq",
"done" cannot be set to "true" and we have an infinite loop.

It's why commands like "apt-get update" or "dnf update hangs".

Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>

linux-user: fd_trans_*_data() returns the length

fd_trans_target_to_host_data() and fd_trans_host_to_target_data() must
return the length of processed data.

Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>

Merge remote-tracking branch 'remotes/jasowang/tags/net-pull-request' into staging

# gpg: Signature made Tue 19 Jul 2016 03:33:40 BST
# gpg:                using RSA key 0xEF04965B398D6211
# gpg: Good signature from "Jason Wang (Jason Wang on RedHat) <jasowang@redhat.com>"
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg:          It is not certain that the signature belongs to the owner.
# Primary key fingerprint: 215D 46F4 8246 689E C77F  3562 EF04 965B 398D 6211

* remotes/jasowang/tags/net-pull-request:
  e1000e: fix building without CONFIG_VMXNET3_PCI
  MAINTAINERS: release Scott from being a rocker maintainer
  tap: fix memory leak on failure to create a multiqueue tap device
  net: fix incorrect argument to iov_to_buf
  net: fix incorrect access to pointer
  e1000e: fix incorrect access to pointer

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

Merge remote-tracking branch 'remotes/jnsnow/tags/ide-pull-request' into staging

# gpg: Signature made Mon 18 Jul 2016 23:53:15 BST
# gpg:                using RSA key 0x7DEF8106AAFC390E
# gpg: Good signature from "John Snow (John Huston) <jsnow@redhat.com>"
# Primary key fingerprint: FAEB 9711 A12C F475 812F  18F2 88A9 064D 1835 61EB
#      Subkey fingerprint: F9B7 ABDB BCAC DF95 BE76  CBD0 7DEF 8106 AAFC 390E

* remotes/jnsnow/tags/ide-pull-request:
  block: ignore flush requests when storage is clean
  tests: in IDE and AHCI tests perform DMA write before flushing
  ide: set retry_unit for PIO and FLUSH requests
  ide: refactor retry_unit set and clear into separate function

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

Merge remote-tracking branch 'remotes/stefanha/tags/tracing-pull-request' into staging

# gpg: Signature made Mon 18 Jul 2016 22:59:55 BST
# gpg:                using RSA key 0x9CA4ABB381AB73C8
# gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>"
# gpg:                 aka "Stefan Hajnoczi <stefanha@gmail.com>"
# Primary key fingerprint: 8695 A8BF D3F9 7CDA AC35  775A 9CA4 ABB3 81AB 73C8

* remotes/stefanha/tags/tracing-pull-request:
  trace: Add QAPI/QMP interfaces to query and control per-vCPU tracing state
  trace: Allow event name pattern in "info trace-events"
  trace: Conditionally trace events based on their per-vCPU state
  trace: Add per-vCPU tracing states for events with the 'vcpu' property
  trace: Cosmetic changes on fast-path tracing
  disas: Remove unused macro '_'
  trace: Identify events with the 'vcpu' property
  trace: [bsd-user] Commandline arguments to control tracing
  trace: [linux-user] Commandline arguments to control tracing

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

Merge remote-tracking branch 'remotes/awilliam/tags/vfio-update-20160718.0' into staging

VFIO update 2016-07-18

One fix for 2.7-rc0 which hides the ARI extended capability, fixing
multifunction support in PCIe configurations where the assigned device
function topology does not match the host (Alex Williamson)

# gpg: Signature made Mon 18 Jul 2016 18:02:27 BST
# gpg:                using RSA key 0x239B9B6E3BB08B22
# gpg: Good signature from "Alex Williamson <alex.williamson@redhat.com>"
# gpg:                 aka "Alex Williamson <alex@shazbot.org>"
# gpg:                 aka "Alex Williamson <alwillia@redhat.com>"
# gpg:                 aka "Alex Williamson <alex.l.williamson@gmail.com>"
# Primary key fingerprint: 42F6 C04E 540B D1A9 9E7B  8A90 239B 9B6E 3BB0 8B22

* remotes/awilliam/tags/vfio-update-20160718.0:
  vfio/pci: Hide ARI capability

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

Update OpenBIOS images to e79bca6 built from submodule.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>

target-i386: Remove redundant HF_SOFTMMU_MASK

'HF_SOFTMMU_MASK' is only set when 'CONFIG_SOFTMMU' is defined. So
there's no need in this flag: test 'CONFIG_SOFTMMU' instead.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20160715175852.30749-6-sergey.fedorov@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

block/iscsi: allow caching of the allocation map

until now the allocation map was used only as a hint if a cluster
is allocated or not. If a block was not allocated (or Qemu had
no info about the allocation status) a get_block_status call was
issued to check the allocation status and possibly avoid
a subsequent read of unallocated sectors. If a block known to be
allocated the get_block_status call was omitted. In the other case
a get_block_status call was issued before every read to avoid
the necessity for a consistent allocation map. To avoid the
potential overhead of calling get_block_status for each and
every read request this took only place for the bigger requests.

This patch enhances this mechanism to cache the allocation
status and avoid calling get_block_status for blocks where
the allocation status has been queried before. This allows
for bypassing the read request even for smaller requests and
additionally omits calling get_block_status for known to be
unallocated blocks.

Signed-off-by: Peter Lieven <pl@kamp.de>
Message-Id: <1468831940-15556-3-git-send-email-pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

block/iscsi: fix rounding in iscsi_allocationmap_set

when setting clusters as alloacted the boundaries have
to be expanded. As Paolo pointed out the calculation of
the number of clusters is wrong:

Suppose cluster_sectors is 2, sector_num = 1, nb_sectors = 6:

In the "mark allocated" case, you want to set 0..8, i.e.
cluster_num=0, nb_clusters=4.

   0--.--2--.--4--.--6--.--8
   <--|_________________|-->  (<--> = expanded)

Instead you are setting nb_clusters=3, so that 6..8 is not marked.

   0--.--2--.--4--.--6--.--8
   <--|______________|!!!     (! = wrong)

Cc: qemu-stable@nongnu.org
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Lieven <pl@kamp.de>
Message-Id: <1468831940-15556-2-git-send-email-pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Move README to markdown

Move the README file to markdown so that it makes the github page look
prettier. I know that github repo is a mirror and not the official
repo, but I think it doesn't hurt to have it in markdown format.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Message-Id: <20160715043111.29007-1-bobby.prani@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

block: ignore flush requests when storage is clean

Some guests (win2008 server for example) do a lot of unnecessary
flushing when underlying media has not changed. This adds additional
overhead on host when calling fsync/fdatasync.

This change introduces a write generation scheme in BlockDriverState.
Current write generation is checked against last flushed generation to
avoid unnessesary flushes.

The problem with excessive flushing was found by a performance test
which does parallel directory tree creation (from 2 processes).
Results improved from 0.424 loops/sec to 0.432 loops/sec.
Each loop creates 10^3 directories with 10 files in each.

This affected some blkdebug testcases that were expecting error logs from
failure-injected flushes which are now skipped entirely
(tests 026 071 089).

This also affects the performance of block jobs and thus BLOCK_JOB_READY
events for driver-mirror and active block-commit commands now arrives
faster, before QMP send successfully returns to caller (tests 141 144).

Signed-off-by: Evgeny Yakovlev <eyakovlev@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1468870792-7411-5-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: John Snow <jsnow@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>

tests: in IDE and AHCI tests perform DMA write before flushing

Due to changes in flush behaviour clean disks stopped generating
flush_to_disk events and IDE and AHCI tests that test flush commands
started to fail.

This change adds additional DMA writes to affected tests before sending
flush commands so that bdrv_flush actually generates flush_to_disk event.

Signed-off-by: Evgeny Yakovlev <eyakovlev@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1468870792-7411-4-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: John Snow <jsnow@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>

ide: set retry_unit for PIO and FLUSH requests

The following sequence of tests discovered a problem in IDE emulation:
1. Send DMA write to IDE device 0
2. Send CMD_FLUSH_CACHE to same IDE device which will be failed by block
layer using blkdebug script in tests/ide-test:test_retry_flush

When doing DMA request ide/core.c will set s->retry_unit to s->unit in
ide_start_dma. When dma completes ide_set_inactive sets retry_unit to -1.
After that ide_flush_cache runs and fails thanks to blkdebug.
ide_flush_cb calls ide_handle_rw_error which asserts that s->retry_unit
== s->unit. But s->retry_unit is still -1 after previous DMA completion
and flush does not use anything related to retry.

This patch restricts retry unit assertion only to ops that actually use
retry logic.

Signed-off-by: Evgeny Yakovlev <eyakovlev@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1468870792-7411-3-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: John Snow <jsnow@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>

ide: refactor retry_unit set and clear into separate function

Code to set and clear state associated with retry in moved into
ide_set_retry and ide_clear_retry to make adding retry setups easier.

Signed-off-by: Evgeny Yakovlev <eyakovlev@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1468870792-7411-2-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <famz@redhat.com>
CC: John Snow <jsnow@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>

trace: Add QAPI/QMP interfaces to query and control per-vCPU tracing state

Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

trace: Allow event name pattern in "info trace-events"

Homogenizes the command capabilities with QMP.

Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

trace: Conditionally trace events based on their per-vCPU state

Events with the 'vcpu' property are conditionally emitted according to
their per-vCPU state. Other events are emitted normally based on their
global tracing state.

Note that the per-vCPU condition check applies to all tracing backends.

Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

trace: Add per-vCPU tracing states for events with the 'vcpu' property

Each vCPU gets a 'trace_dstate' bitmap to control the per-vCPU dynamic
tracing state of events with the 'vcpu' property.

Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

trace: Cosmetic changes on fast-path tracing

Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

disas: Remove unused macro '_'

Eliminates a future compilation error when UI code includes the tracing
headers (indirectly pulling "disas/bfd.h" through "qom/cpu.h") and
GLib's i18n '_' macro.

Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

trace: Identify events with the 'vcpu' property

A new event attribute 'cpu_id' is added to have a separate ID
space ('TRACE_VCPU_*') for all events with the 'vcpu' property.

These are later used to identify which events are enabled on each vCPU.

Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

trace: [bsd-user] Commandline arguments to control tracing

[Changed const char *trace_file to char *trace_file since it's a
heap-allocated string that needs to be freed. This type is also
returned by trace_opt_parse() and used in vl.c.

Also fixed coding style on for(;;) and else statement as suggested by
Eric Blake <eblake@redhat.com> since the patch modifies these lines or
close enough.
--Stefan]

Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
Message-id: 146860252322.30668.18276041739086338328.stgit@fimbulvetr.bsc.es
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

trace: [linux-user] Commandline arguments to control tracing

[Changed const char *trace_file to char *trace_file since it's a
heap-allocated string that needs to be freed. This type is also
returned by trace_opt_parse() and used in vl.c.
--Stefan]

Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
Message-id: 146860251784.30668.17339867835129075077.stgit@fimbulvetr.bsc.es
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' into staging

# gpg: Signature made Mon 18 Jul 2016 17:58:27 BST
# gpg:                using RSA key 0x9CA4ABB381AB73C8
# gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>"
# gpg:                 aka "Stefan Hajnoczi <stefanha@gmail.com>"
# Primary key fingerprint: 8695 A8BF D3F9 7CDA AC35  775A 9CA4 ABB3 81AB 73C8

* remotes/stefanha/tags/block-pull-request:
  MAINTAINERS: Add include/block/aio.h to block I/O path section
  virtio-blk: dataplane cleanup
  checkpatch: consider git extended headers valid patches
  aio-posix: remove useless parameter
  linux-aio: prevent submitting more than MAX_EVENTS
  aio_ctx_check: follow CODING_STYLE
  linux-aio: share one LinuxAioState within an AioContext
  spec/parallels: fix a mistake

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

vfio/pci: Hide ARI capability

QEMU supports ARI on downstream ports and assigned devices may support
ARI in their extended capabilities.  The endpoint ARI capability
specifies the next function, such that the OS doesn't need to walk
each possible function, however this next function is relative to the
host, not the guest.  This leads to device discovery issues when we
combine separate functions into virtual multi-function packages in a
guest.  For example, SR-IOV VFs are not enumerated by simply probing
the function address space, therefore the ARI next-function field is
zero.  When we combine multiple VFs together as a multi-function
device in the guest, the guest OS identifies ARI is enabled, relies on
this next-function field, and stops looking for additional function
after the first is found.

Long term we should expose the ARI capability to the guest to enable
configurations with more than 8 functions per slot, but this requires
additional QEMU PCI infrastructure to manage the next-function field
for multiple, otherwise independent devices.  In the short term,
hiding this capability allows equivalent functionality to what we
currently have on non-express chipsets.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>

.travis.yml: Disable IRC build status updates from forks

We want the travis build bot to post notifications on IRC only for the
master qemu repository and not the various forks/branches of
others. Currently there is no direct option to restrict the updates to
one repository. This is being worked upon by the developers and
tracked in https://github.com/travis-ci/travis-ci/issues/1094.

Until such time, we can use the workaround as posted in
ref. https://github.com/facebook/flow/pull/1822.

This basically creates an ecrypted string which decrypts to qemu IRC
channel only on "qemu/qemu" repo and not on the forks. This enables
the build bot to notify the IRC only for the main repo.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>

MAINTAINERS: Add include/block/aio.h to block I/O path section

This file is actually the header for async.c and aio-*.c., so add it to
the same section.

Suggested-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 1468826387-10473-1-git-send-email-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

virtio-blk: dataplane cleanup

No need duplicate the judgment, there is one in function entry.

Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Max Reitz <mreitz@redhat.com>
Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1468814749-14510-1-git-send-email-caoj.fnst@cn.fujitsu.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

checkpatch: consider git extended headers valid patches

Renames look like this with git-diff(1) when diff.renames = true is set:

  diff --git a/a b/b
  similarity index 100%
  rename from a
  rename to b

This raises the "Does not appear to be a unified-diff format patch"
error because checkpatch.pl only considers a diff valid if it contains
at least one "@@" hunk.

This patch accepts renames and copies too so that checkpatch.pl exits
successfully when a diff only renames/copies files.  The git diff
extended header format is described on the git-diff(1) man page.

Reported-by: Colin Lord <clord@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1468576014-28788-1-git-send-email-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

aio-posix: remove useless parameter

Parameter **errp of aio_context_setup() is useless, remove it
and clean up the related code.

Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Fam Zheng <famz@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1468578524-23433-1-git-send-email-caoj.fnst@cn.fujitsu.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

linux-aio: prevent submitting more than MAX_EVENTS

Invoking io_setup(MAX_EVENTS) we ask kernel to create ring buffer for us
with specified number of events.  But kernel ring buffer allocation logic
is a bit tricky (ring buffer is page size aligned + some percpu allocation
are required) so eventually more than requested events number is allocated.

From a userspace side we have to follow the convention and should not try
to io_submit() more or logic, which consumes completed events, should be
changed accordingly.  The pitfall is in the following sequence:

    MAX_EVENTS = 128
    io_setup(MAX_EVENTS)

    io_submit(MAX_EVENTS)
    io_submit(MAX_EVENTS)

    /* now 256 events are in-flight */

    io_getevents(MAX_EVENTS) = 128

    /* we can handle only 128 events at once, to be sure
     * that nothing is pended the io_getevents(MAX_EVENTS)
     * call must be invoked once more or hang will happen. */

To prevent the hang or reiteration of io_getevents() call this patch
restricts the number of in-flights, which is now limited to MAX_EVENTS.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1468415004-31755-1-git-send-email-roman.penyaev@profitbricks.com
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: qemu-devel@nongnu.org
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

aio_ctx_check: follow CODING_STYLE

replace tab with spaces

Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Message-id: 1468501843-14927-1-git-send-email-caoj.fnst@cn.fujitsu.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

linux-aio: share one LinuxAioState within an AioContext

This has better performance because it executes fewer system calls
and does not use a bottom half per disk.

Originally proposed by Ming Lei.

[Changed #include "raw-aio.h" to "block/raw-aio.h" in win32-aio.c to fix
build error as reported by Peter Maydell <peter.maydell@linaro.org>.
--Stefan]

Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1467650000-51385-1-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
squash! linux-aio: share one LinuxAioState within an AioContext

spec/parallels: fix a mistake

We have only one flag for now - Empty Image flag. The patch fixes unused
bits specification and marks bit 1 as usused.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Merge remote-tracking branch 'remotes/dgibson/tags/ppc-for-2.7-20160718' into staging

ppc patch queue 2016-07-18

Here's what ought to be the final ppc pull request before the 2.7 hard
freeze.  This set contains a rework of the DBDMA device for Mac
platforms, and some assorted cleanups and bugfixes.

# gpg: Signature made Mon 18 Jul 2016 05:35:27 BST
# gpg:                using RSA key 0x6C38CACA20D9B392
# gpg: Good signature from "David Gibson <david@gibson.dropbear.id.au>"
# gpg:                 aka "David Gibson (Red Hat) <dgibson@redhat.com>"
# gpg:                 aka "David Gibson (ozlabs.org) <dgibson@ozlabs.org>"
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg:          It is not certain that the signature belongs to the owner.
# Primary key fingerprint: 75F4 6586 AE61 A66C C44E  87DC 6C38 CACA 20D9 B392

* remotes/dgibson/tags/ppc-for-2.7-20160718:
  ppc: Yet another fix for the huge page support detection mechanism
  target-ppc: fix left shift overflow in hpte_page_shift
  ppc/mmu-hash64: Remove duplicated #include statement
  ppc: abort if compat property contains an unknown value
  spapr: Ensure CPU cores are added contiguously and removed in LIFO order
  vfio/spapr: Remove stale ioctl() call
  ppc: Fix support for odd MSR combinations
  dbdma: reset io->processing flag for unassigned DBDMA channel rw accesses
  dbdma: set FLUSH bit upon reception of flush command for unassigned DBDMA channels
  dbdma: fix load_word/store_word value endianness
  dbdma: fix endian of DBDMA_CMDPTR_LO during branch
  dbdma: add per-channel debugging enabled via DEBUG_DBDMA_CHANMASK
  dbdma: always define DBDMA_DPRINTF and enable debug with DEBUG_DBDMA
  spapr: fix core unplug crash

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

e1000e: fix building without CONFIG_VMXNET3_PCI

e1000e needs net_tx_pkt.o and net_rx_pkt.o too.

Cc: Dmitry Fleytman <dmitry.fleytman@ravellosystems.com>
Cc: Leonid Bloch <leonid.bloch@ravellosystems.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

MAINTAINERS: release Scott from being a rocker maintainer

As requested by Scott, removing him.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

tap: fix memory leak on failure to create a multiqueue tap device

Reported by Coverity.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

net: fix incorrect argument to iov_to_buf

Coverity reports a "suspicious sizeof" which is indeed wrong.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

net: fix incorrect access to pointer

This is not dereferencing the pointer, and instead checking only
the value of the pointer.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

e1000e: fix incorrect access to pointer

This is not dereferencing the pointer, and instead checking only
the value of the pointer.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>

ppc: Yet another fix for the huge page support detection mechanism

Commit 86b50f2e1bef ("Disable huge page support if it is not available
for main RAM") already made sure that huge page support is not announced
to the guest if the normal RAM of non-NUMA configurations is not backed
by a huge page filesystem. However, there is one more case that can go
wrong: NUMA is enabled, but the RAM of the NUMA nodes are not configured
with huge page support (and only the memory of a DIMM is configured with
it). When QEMU is started with the following command line for example,
the Linux guest currently crashes because it is trying to use huge pages
on a memory region that does not support huge pages:

qemu-system-ppc64 -enable-kvm ... -m 1G,slots=4,maxmem=32G -object \
   memory-backend-file,policy=default,mem-path=/hugepages,size=1G,id=mem-mem1 \
   -device pc-dimm,id=dimm-mem1,memdev=mem-mem1 -smp 2 \
   -numa node,nodeid=0 -numa node,nodeid=1

To fix this issue, we've got to make sure to disable huge page support,
too, when there is a NUMA node that is not using a memory backend with
huge page support.

Fixes: 86b50f2e1befc33407bdfeb6f45f7b0d2439a740
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

target-ppc: fix left shift overflow in hpte_page_shift

ps->pte_enc is a 32-bit value, which is shifted left and then compared
to a 64-bit value. It needs a cast before the shift.

Reported by Coverity.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

ppc/mmu-hash64: Remove duplicated #include statement

No need to include error-report.h twice here.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

ppc: abort if compat property contains an unknown value

It is not possible to set the compat property to an unknown value with
powerpc_set_compat(). Something must have gone terribly wrong in QEMU,
if we detect an "Internal error" in powerpc_get_compat(). Let's abort then.

This patch also drops the "max_compat ? *max_compat : -1" construct. It is
useless since max_compat is dereferenced a few lines above.

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

spapr: Ensure CPU cores are added contiguously and removed in LIFO order

If CPU core addition or removal is allowed in random order leading to
holes in the core id range (and hence in the cpu_index range), migration
can fail as migration with holes in cpu_index range isn't yet handled
correctly.

Prevent this situation by enforcing the addition in contiguous order
and removal in LIFO order so that we never end up with holes in
cpu_index range.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

vfio/spapr: Remove stale ioctl() call

This ioctl() call to VFIO_IOMMU_SPAPR_TCE_REMOVE was left over from an
earlier version of the code and has since been folded into
vfio_spapr_remove_window().

It wasn't caught because although the argument structure has been removed,
the libc function remove() means this didn't trigger a compile failure.
The ioctl() was also almost certain to fail silently and harmlessly with
the bogus argument, so this wasn't caught in testing.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>

ppc: Fix support for odd MSR combinations

MacOS uses an architecturally illegal MSR combination that
seems nonetheless supported by 32-bit processors, which is
to have MSR[PR]=1 and one or more of MSR[DR/IR/EE]=0.

This adds support for it. To work properly we need to also
properly include support for PR=1,{I,D}R=0 to the MMU index
used by the qemu TLB.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Tested-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

dbdma: reset io->processing flag for unassigned DBDMA channel rw accesses

Otherwise MacOS 9 hangs upon shutdown.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

dbdma: set FLUSH bit upon reception of flush command for unassigned DBDMA channels

This fixes MacOS 9 whereby it continually flushes and polls the status bits
until they are set to indicate a successful flush.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

dbdma: fix load_word/store_word value endianness

The values to read/write to/from physical memory are copied directly to the
physical address with no endian swapping required.

Also add some extra information to debugging output while we are here.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

dbdma: fix endian of DBDMA_CMDPTR_LO during branch

The current DBDMA command is stored in little-endian format, so make sure
we convert it to match our CPU when updating the DBDMA_CMDPTR_LO register.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

dbdma: add per-channel debugging enabled via DEBUG_DBDMA_CHANMASK

By default large amounts of DBDMA debugging are produced when often it is just
1 or 2 channels that are of interest. Introduce DEBUG_DBDMA_CHANMASK to allow
the developer to select the channels of interest at compile time, and then
further add the extra channel information to each debug statement where
possible.

Also clearly mark the start/end of DBDMA_run_bh to allow tracking the bottom
half execution.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

dbdma: always define DBDMA_DPRINTF and enable debug with DEBUG_DBDMA

Enabling DBDMA_DPRINTF unconditionally ensures that any errors in debug
statements are picked up immediately.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

spapr: fix core unplug crash

If the host has 8 threads/core and the guest is started with:

-smp cores=1,threads=4,maxcpus=12

It is possible to crash QEMU by doing:

(qemu) device_add host-spapr-cpu-core,core-id=16,id=foo
(qemu) device_del foo
Segmentation fault

This happens because spapr_core_unplug() assumes cpu_dt_id == core_id.
As long as cpu_dt_id is derived from the non-table cpu_index, this is
only true when you plug cores with contiguous ids.

It is safer to be consistent: the DR connector was created with an
index that is immediately written to cc->core_id, and spapr_core_plug()
also relies on cc->core_id.

Let's use it also in spapr_core_unplug().

Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

cpu-exec: Move down some declarations in cpu_exec()

This will fix a compiler warning with -Wclobbered:

http://lists.nongnu.org/archive/html/qemu-devel/2016-07/msg03347.html

Reported-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
Message-Id: <20160715193123.28113-1-sergey.fedorov@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

exec: avoid realloc in phys_map_node_reserve

this is the first step in reducing the brk heap fragmentation
created by the map->nodes memory allocation. Since the introduction
of RCU the freeing of the PhysPageMaps is delayed so that sometimes
several hundred are allocated at the same time.

Even worse the memory for map->nodes is allocated and shortly
afterwards reallocated. Since the number of nodes it grows
to in the end is the same for all PhysPageMaps remember this value
and at least avoid the reallocation.

The large number of simultaneous allocations (about 450 x 70kB in
my configuration) has to be addressed later.

Signed-off-by: Peter Lieven <pl@kamp.de>
Message-Id: <1468577030-21097-1-git-send-email-pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

checkpatch: consider git extended headers valid patches

Renames look like this with git-diff(1) when diff.renames = true is set:

  diff --git a/a b/b
  similarity index 100%
  rename from a
  rename to b

This raises the "Does not appear to be a unified-diff format patch"
error because checkpatch.pl only considers a diff valid if it contains
at least one "@@" hunk.

This patch accepts renames and copies too so that checkpatch.pl exits
successfully when a diff only renames/copies files.  The git diff
extended header format is described on the git-diff(1) man page.

Reported-by: Colin Lord <clord@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <1468576014-28788-1-git-send-email-stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

megasas: remove useless check for cmd->frame

megasas_enqueue_frame always returns with non-NULL cmd->frame.
Remove the "else" part as it is dead code.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

compiler: never omit assertions if using a static analysis tool

Assertions help both Coverity and the clang static analyzer avoid
false positives, but on the other hand both are confused when
the condition is compiled as (void)(x != FOO). Always expand
assertion macros when using Coverity or clang, through a new
QEMU_STATIC_ANALYSIS preprocessor symbol.

This fixes a couple false positives in TCG.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

hw/i386: add device tree support

With "-dtb" on command-line:
- append the device tree blob to the kernel image;
- pass the blob's pointer to the kernel through setup_data, as
requested by upstream kernel commit da6b737b9ab7 ("x86: Add
device tree support").

The device tree blob is passed as-is to the guest; none of its
fields is modified nor updated. This is not an issue; the kernel
commit above uses the device tree only as an extension to the
traditional kernel configuration.

To: "Michael S. Tsirkin" <mst@redhat.com>
To: Paolo Bonzini <pbonzini@redhat.com>
To: Richard Henderson <rth@twiddle.net>
To: Eduardo Habkost <ehabkost@redhat.com>
Cc: qemu-devel@nongnu.org
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Antonio Borneo <borneo.antonio@gmail.com>
Message-Id: <1459973054-2777-1-git-send-email-borneo.antonio@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Changed malloc to g_malloc, free to g_free in bsd-user/qemu.h

Signed-off-by: Md Haris Iqbal <haris.phnx@gmail.com>
Message-Id: <1459861743-4514-1-git-send-email-haris.phnx@gmail.com>
Reviewed-by: Sean Bruno <sbruno@freebsd.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

use g_path_get_dirname instead of dirname

Use g_path_get_basename to get the directory components of
a file name, and free its return when no longer needed.

Signed-off-by: Wei Jiangang <weijg.fnst@cn.fujitsu.com>
Message-Id: <1459997185-15669-3-git-send-email-weijg.fnst@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Merge remote-tracking branch 'remotes/mcayland/tags/qemu-openbios-signed' into staging

Update OpenBIOS images

# gpg: Signature made Fri 15 Jul 2016 15:22:36 BST
# gpg:                using RSA key 0x5BC2C56FAE0F321F
# gpg: Good signature from "Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>"
# Primary key fingerprint: CC62 1AB9 8E82 200D 915C  C9C4 5BC2 C56F AE0F 321F

* remotes/mcayland/tags/qemu-openbios-signed:
  Update OpenBIOS images to b747b6a built from submodule.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

Update OpenBIOS images to b747b6a built from submodule.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>

vnc-tight: fix regression with libxenstore

commit 095497ff added thread local storage for the color counting
palette. Unfortunately, a VncPalette is about 7kB on a x86_64 system.
This memory is reserved from the stack of every thread and it
exhausted the stack space of a libxenstore thread.

Fix this by allocating memory only for the VNC encoding thread.

Fixes: 095497ffc66b7f031ff2a17f1e50f5cb105ce588
Reported-by: Juergen Gross <jgross@suse.com>
Tested-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Peter Lieven <pl@kamp.de>
Message-id: 1468575911-20656-1-git-send-email-pl@kamp.de
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>