git.proxmox.com Git - mirror

Uninitialized variable spa_autoreplace used

Caught by ztest and valgrind.

Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2259

Use ddi_time_after and friends to compare time

Also, make sure we use clock_t for ddi_get_lbolt to prevent type conversion
from screwing things.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2142

Make zimport.sh bash dependency explicit

Unfortunately, the zimport.sh test script really does depend on
bash. Moving to /bin/sh should be possible once the shared
infrastructure scripts it depends on is made portable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Linux 3.14 compat: rq_for_each_segment in dmu_req_copy

rq_for_each_segment changed from taking bio_vec * to taking bio_vec.
We provide rq_for_each_segment4 which takes both.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124

Revert "Fix zvol+btrfs hang"

After the dmu_req_copy change, bi_io_vecs are not touched, so this is no
longer needed.

This reverts commit e26ade5101ba1d8e8350ff1270bfca4258e1ffe3.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124

Refactor dmu_req_copy for immutable biovec changes

Originally, dmu_req_copy modifies bv_len and bv_offset in bio_vec so that it
can continue in subsequent passes. However, after the immutable biovec changes
in Linux 3.14, this is not allowed. So instead, we just tell dmu_req_copy how
many bytes are already copied and it will skip to the right spot accordingly.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124

Linux 3.14 compat: Immutable biovec changes in vdev_disk.c

bi_sector, bi_size and bi_idx are moved from bio to bio->bi_iter.
This patch creates BIO_BI_*(bio) macros to hide the differences.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124

Linux 3.14 compat: posix_acl_{create,chmod}

posix_acl_{create,chmod} is changed to __posix_acl_{create_chmod}

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124

Improve zfs.sh error messages

Ensure an error message is logged when the 'zfs.sh' script fails
to either load a module or if udev fails to create the /dev/zfs
device. Error messages for missing KERNEL_MODULES are suppressed
because that functionality may just be built-in to the kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Replace zed_file_create_dirs() with mkdirp()

When processing directory components starting from the root dir,
zed_file_create_dirs() contained a bug in checking the return value of
mkdir().  A typo was made, and the test for (mkdir_errno != EEXIST) was
erroneously written as (mkdir_errno == EEXIST).  If some of the leading
directory components already existed, this bug would cause the routine
to exit before creating the remaining directory components.

Instead of fixing the above mkdir_errno test, this commit replaces
zed_file_create_dirs() with mkdirp().  This cleanup was already
planned, and zed_file_create_dirs() only existed because I didn't
realize mkdirp() was already in tree at the time.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2248

Set errno for mkdirp() called with NULL path ptr

If mkdirp() is called with a NULL ptr for the path arg, it will return
-1 with errno unchanged.  This is unexpected since on error it should
return -1 and set errno to one of the error values listed for mkdir(2).

This commit sets errno = ENOENT for this NULL ptr case.  This is in
accordance with the errors specified by mkdir(2):

  ENOENT
  A component of the path prefix does not exist or is a null pathname.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2248

Dynamically create loop devices

Several of the in-tree regression tests depend on the availability
of loop devices.  If for some reason no loop devices are available
the tests will fail.

Normally this isn't an issue because most Linux distributions create
8 loop devices by default.  This is enough for our purposes.  However,
recent Fedora releases have only been creating a single loop device
and this leads to failures.  Alternately, if something else of the
system is using the loop devices we may see failures.

The fix for this is to update the support scripts to dynamically
create loop devices as needed.  The scripts need only create a node
under /dev/ and the loop driver with create the minor.  This behavior
has been supported by the loop driver for ages.

Additionally this patch updates cleanup_loop_devices() to cleanup
loop devices which have already had their file store deleted.  This
helps prevent stale loop devices from accumulating on the system due
to test failures.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes #2249

Improve partition detection on lesser used devices

The format strings in efi_get_info() are intended to extract both the
main device and partition number. However, this is only done correctly
for hd, sd and vd devices. The format strings for ram, dm-, md and loop
devices misparse the input. This causes the partition device to be
incorrectly labelled as the main device with the partition being
labelled 0.

Reported-by: ilovezfs <ilovezfs@icloud.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2175

Allow specifying '-o <opts>' in defaults/init script.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2103

Support using overlay mounts in defaults/init script.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2103

Fix for re-reading /etc/mtab.

This is a continuation of fb5c53ea65b75c67c23f90ebbbb1134a5bb6c140:

    When /etc/mtab is updated on Linux it's done atomically with
    rename(2).  A new mtab is written, the existing mtab is unlinked,
    and the new mtab is renamed to /etc/mtab.  This means that we
    must close the old file and open the new file to get the updated
    contents.  Using rewind(3) will just move the file pointer back
    to the start of the file, freopen(3) will close and open the file.

In this commit, a few more rewind(3) calls were replaced with freopen(3)
to allow updated mtab entries to be picked up immediately.

Signed-off-by: John M. Layman <jml@frijid.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2215
Issue #1611

Fix locking order in zfs_zget()

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix deadlock in zfs_zget()

zfsonlinux/zfs#180 occurred because of a race between inode eviction and
zfs_zget(). zfsonlinux/zfs@36df284 tried to address it by making a call
to the VFS to learn whether an inode is being evicted.  If it was being
evicted the operation was retried after dropping and reacquiring the
relevant resources.  Unfortunately, this introduced another deadlock.

  INFO: task kworker/u24:6:891 blocked for more than 120 seconds.
        Tainted: P           O 3.13.6 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  kworker/u24:6   D ffff88107fcd2e80     0   891      2 0x00000000
  Workqueue: writeback bdi_writeback_workfn (flush-zfs-5)
   ffff8810370ff950 0000000000000002 ffff88103853d940 0000000000012e80
   ffff8810370fffd8 0000000000012e80 ffff88103853d940 ffff880f5c8be098
   ffff88107ffb6950 ffff8810370ff980 ffff88103a9a5b78 0000000000000000
  Call Trace:
   [<ffffffff813dd1d4>] schedule+0x24/0x70
   [<ffffffff8115fc09>] __wait_on_freeing_inode+0x99/0xc0
   [<ffffffff8115fdd8>] find_inode_fast+0x78/0xb0
   [<ffffffff811608c5>] ilookup+0x65/0xd0
   [<ffffffffa035c5ab>] zfs_zget+0xdb/0x260 [zfs]
   [<ffffffffa03589d6>] zfs_get_data+0x46/0x340 [zfs]
   [<ffffffffa035fee1>] zil_add_block+0xa31/0xc00 [zfs]
   [<ffffffffa0360642>] zil_commit+0x12/0x20 [zfs]
   [<ffffffffa036a6e4>] zpl_putpage+0x174/0x840 [zfs]
   [<ffffffff811071ec>] do_writepages+0x1c/0x40
   [<ffffffff8116df2b>] __writeback_single_inode+0x3b/0x2b0
   [<ffffffff8116ecf7>] writeback_sb_inodes+0x247/0x420
   [<ffffffff8116f5f3>] wb_writeback+0xe3/0x320
   [<ffffffff81170b8e>] bdi_writeback_workfn+0xfe/0x490
   [<ffffffff8106072c>] process_one_work+0x16c/0x490
   [<ffffffff810613f3>] worker_thread+0x113/0x390
   [<ffffffff81066edf>] kthread+0xdf/0x100

This patch implements the original fix in a slightly different manner in
order to avoid both deadlocks.  Instead of relying on a call to ilookup()
which can block in __wait_on_freeing_inode() the return value from igrab()
is used.  This gives us the information that ilookup() provided without
the risk of a deadlock.

Alternately, this race could be closed by registering an sops->drop_inode()
callback.  The callback would need to detect the active SA hold thereby
informing the VFS that this inode should not be evicted.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #180

Revert "Fixed a use-after-free bug in zfs_zget()."

This reverts commit 36df284366caa77cb40083d2e6bcce02274e2f05.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Merge branch 'zed-initial'

zed monitors ZFS events. When a zevent is posted, zed will run any
scripts that have been enabled for the corresponding zevent class.
Multiple scripts may be invoked for a given zevent. The zevent nvpairs
are passed to the scripts as environment variables. Refer to the zed(8)
manpage for details.

Events are processed synchronously by the single thread, and there is
no maximum timeout for script execution. Consequently, a misbehaving
script can delay (or forever block) the processing of subsequent
zevents. Plans are to address this in future commits.

An EID (Event IDentifier) has been added to each event to uniquely
identify it throughout the lifetime of the loaded ZFS kernel module;
it is a monotonically increasing integer that resets to 1 each time
the module is loaded.

Initial scripts have been developed to log zevents to syslog,
automatically rebuild to a hot spare device, and send email in
response to checksum / data / io / resilver.finish / scrub.finish
zevents. To enable email notifications, configure ZED_EMAIL in zed.rc
(which is serving as a config file of sorts until a proper
configuration file is implemented). To enable hot sparing, uncomment
ZED_SPARE_ON_IO_ERRORS and ZED_SPARE_ON_CHECKSUM_ERRORS in zed.rc;
note that the autoexpand property is not yet supported.

zed is a work-in-progress.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2

Replace check for _POSIX_MEMLOCK w/ HAVE_MLOCKALL

zed supports a '-M' cmdline opt to lock all pages in memory via
mlockall().  The _POSIX_MEMLOCK define is checked to determine whether
this function is supported.  The current test assumes mlockall()
is supported if _POSIX_MEMLOCK is non-zero.  However, this test is
insufficient according to mlock(2) and sysconf(3).  If _POSIX_MEMLOCK
is -1, mlockall() is not supported; but if _POSIX_MEMLOCK is 0,
availability must be checked at runtime.

This commit adds an autoconf check for mlockall() to user.m4.  The zed
code block for mlockall() is now guarded with a test for HAVE_MLOCKALL.
If defined, mlockall() will be called and its runtime availability
checked via its return value.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2

Add automatic hot spare functionality

When a vdev starts getting I/O or checksum errors it is now
possible to automatically rebuild to a hot spare device.

To cleanly support this functionality in a shell script some
additional information was added to all zevent ereports which
include a vdev.  This covers both io and checksum zevents but
may be used but other scripts.

In the Illumos FMA solution the same information is required
but it is retrieved through the libzfs library interface.
Specifically the following members were added:

  vdev_spare_paths  - List of vdev paths for all hot spares.
  vdev_spare_guids  - List of vdev guids for all hot spares.
  vdev_read_errors  - Read errors for the problematic vdev
  vdev_write_errors - Write errors for the problematic vdev
  vdev_cksum_errors - Checksum errors for the problematic vdev.

By default the required hot spare scripts are installed but this
functionality is disabled.  To enable hot sparing uncomment the
ZED_SPARE_ON_IO_ERRORS and ZED_SPARE_ON_CHECKSUM_ERRORS in the
/etc/zfs/zed.d/zed.rc configuration file.

These scripts do no add support for the autoexpand property. At
a minimum this requires adding a new udev rule to detect when
a new device is added to the system.  It also requires that the
autoexpand policy be ported from Illumos, see:

  https://github.com/illumos/illumos-gate/blob/master/usr/src/cmd/syseventd/modules/zfs_mod/zfs_mod.c

Support for detecting the correct name of a vdev when it's not
a whole disk was added by Turbo Fredriksson.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Issue #2

Add missing DATA_TYPE_STRING_ARRAY output

This functionality has always been missing. But until now there
were no zevents which included an array of strings so it wasn't
missed. However, that's now changed so to ensure this information
is output correctly by 'zpool events -v' the DATA_TYPE_STRING_ARRAY
has been implemented.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2

Make command line guid parsing more tolerant

Several of the zfs utilities allow you to pass a vdev's guid rather
than the device name.  However, the utilities are not consistent in
how they parse that guid.  For example, 'zinject' expects the guid
to be passed as a hex value while 'zpool replace' wants it as a
decimal.  The user is forced to just know what format to use.

This patch improve things by making the parsing more tolerant.
When strtol(3) is called using 0 for the base, rather than say
10 or 16, it will then accept hex, decimal, or octal input based
on the prefix.  From the man page.

    If base is zero or 16, the string may then include a "0x"
    prefix, and  the number  will  be read in base 16; otherwise,
    a zero base is taken as 10 (decimal) unless the next character
    is '0', in which case it  is  taken as 8 (octal).

NOTE: There may be additional conversions not caught be this patch.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2

Add systemd unit file for zed

This commit adds a systemd unit file for zed.service and integrates
it into the zfs.target from commit 881f45c.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2108
Issue #2

Initial implementation of zed (ZFS Event Daemon)

zed monitors ZFS events.  When a zevent is posted, zed will run any
scripts that have been enabled for the corresponding zevent class.
Multiple scripts may be invoked for a given zevent.  The zevent
nvpairs are passed to the scripts as environment variables.

Events are processed synchronously by the single thread, and there is
no maximum timeout for script execution.  Consequently, a misbehaving
script can delay (or forever block) the processing of subsequent
zevents.  Plans are to address this in future commits.

Initial scripts have been developed to log events to syslog
and send email in response to checksum/data/io errors and
resilver.finish/scrub.finish events.  By default, email will only
be sent if the ZED_EMAIL variable is configured in zed.rc (which is
serving as a config file of sorts until a proper configuration file
is implemented).

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2

Replace zpool_events_next() "block" parm w/ "flags"

zpool_events_next() can be called in blocking mode by specifying a
non-zero value for the "block" parameter.  However, the design of
the ZFS Event Daemon (zed) requires additional functionality from
zpool_events_next().  Instead of adding additional arguments to the
function, it makes more sense to use flags that can be bitwise-or'd
together.

This commit replaces the zpool_events_next() int "block" parameter with
an unsigned bitwise "flags" parameter.  It also defines ZEVENT_NONE
to specify the default behavior.  Since non-blocking mode can be
specified with the existing ZEVENT_NONBLOCK flag, the default behavior
becomes blocking mode.  This, in effect, inverts the previous use
of the "block" parameter.  Existing callers of zpool_events_next()
have been modified to check for the ZEVENT_NONBLOCK flag.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2

Add defs for makefile installation dir vars

Add macro definitions to AM_CPPFLAGS to propagate makefile installation
directory variables for libexecdir, runstatedir, sbindir, and
sysconfdir.

https://www.gnu.org/software/autoconf/manual/autoconf-2.69/html_node/Installation-Directory-Variables.html

  A corollary is that you should not use these variables except
  in makefiles. For instance, instead of trying to evaluate
  datadir in configure and hard-coding it in makefiles using e.g.,
  'AC_DEFINE_UNQUOTED([DATADIR], ["$datadir"], [Data directory.])',
  you should add -DDATADIR='$(datadir)' to your makefile's definition
  of CPPFLAGS (AM_CPPFLAGS if you are also using Automake).

The runstatedir directory is for "installing data files which the
programs modify while they run, that pertain to one specific machine,
and which need not persist longer than the execution of the program".

https://www.gnu.org/prep/standards/html_node/Directory-Variables.html

It will be defined by autoconf 2.70 or later, and default to
"$(localstatedir)/run".

http://git.savannah.gnu.org/gitweb/?p=autoconf.git;a=commit;h=a197431414088a417b407b9b20583b2e8f7363bd

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2

Clarify zpool_events_next() comment

Due to the very poorly chosen argument name 'cleanup_fd' it was
completely unclear that this file descriptor is used to track the
current cursor location.  When the file descriptor is created by
opening ZFS_DEV a private cursor is created in the kernel for the
returned file descriptor.  Subsequent calls to zpool_events_next()
and zpool_events_seek() then require the file descriptor as an
argument to reposition the cursor.  When the file descriptor is
closed the kernel state tracking the cursor is destroyed.

This patch contains no functional change, it just changes a
few variable names and clarifies the documentation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2

Add zpool_events_seek() functionality

The ZFS_IOC_EVENTS_SEEK ioctl was added to allow user space callers
to seek around the zevent file descriptor by EID. When a specific
EID is passed and it exists the cursor will be positioned there.
If the EID is no longer cached by the kernel ENOENT is returned.
The caller may also pass ZEVENT_SEEK_START or ZEVENT_SEEK_END to seek
to those respective locations.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2

Add a unique "eid" value to all zevents

Tagging each zevent with a unique monotonically increasing EID
(Event IDentifier) provides the required infrastructure for a user
space daemon to reliably process zevents. By writing the EID to
persistent storage the daemon can safely resume where it left off
in the event stream when it's restarted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2

Remount datasets for "zfs inherit".

Changing properties with "zfs inherit" should cause the datasets
to be remounted. This ensures that the modified property values
will be propagated in to the filesystem namespace where they can
be enforced. This change is modeled after an identical fix made
to zfs_prop_set().

Signed-off-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2201

Linux 3.13 compat: Handle __must_check bdi_setup_and_register

torvalds/linux@8077c0d983ab276ec5f2700df56a64d671781905 added a
__must_check to the bdi_setup_and_register(), which caused our autotools
check to break. zfsonlinux/zfs@729210564a5325e190fc4fba22bf17bacf957ace
was intended to correct that, but it depended on -Wno-unused-result,
which is unrecognized in older GCC versions. That commit has been
reverted in favor of a solution that does not require
-Wno-unused-result.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2102
Closes #2135

Revert "Properly ignore bdi_setup_and_register return value"

Older GCC versions do not obey -Wno-unused-result. This reverts commit
729210564a5325e190fc4fba22bf17bacf957ace in favor of a solution that
does not require -Wno-unused-result.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1906

Illumos #4089 NULL pointer dereference in arc_read()

4089 NULL pointer dereference in arc_read()

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
https://www.illumos.org/issues/4089
illumos/illumos-gate@57815f6b95a743697e148327725b7f568e75e6ea

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2171
Issue #2165
Closes #2198

Illumos #3120 zinject hangs in zfsdev_ioctl() due to uninitialized zc

3120 zinject hangs in zfsdev_ioctl() due to uninitialized zc

Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
https://www.illumos.org/issues/3120
illumos/illumos-gate@f4c46b1eda9212fd32ba197043d52239ef5c0a7f

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2152

config: compile test rather than run test

When testing compiler flags, we only need to do compile test. Otherwise,
configure will fail with "configure: error: cannot run test program while
cross compiling" when cross compiling.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2191

Implement -t option to zpool import for temporary pool names

Originally, users had to handle spa namespace collisions by either
exporting the already imported pool or by specifying a new name for the
pool with a conflicting name. In the case of root pools from virtual
guests, neither approach to collision resolution is reasonable. This is
addressed by extending the new name syntax with a -t option to specify
that the new name is temporary. When specified, this sets an internal
flag that is passed into the kernel to tell it that all label updates
should refer to the name used in the original label. Consequently, the
original pool name will be retained on export.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2189

Assert alignment in umem_alloc_aligned

Valgrind suggests that the address we are returning is not properly
aligned, so lets add an assertion.

==87740==  Address 0x1012a22a is 554 bytes inside a block of size 4,096
alloc'd
==87740==    at 0x4C2BBA0: memalign (in
/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==87740==    by 0x4C2BCC7: posix_memalign (in
/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==87740==    by 0x52FA845: zio_buf_alloc (umem.h:101)
==87740==    by 0x52F6226: zil_alloc_lwb (zil.c:463)
==87740==    by 0x52F8559: zil_commit (zil.c:566)
==87740==    by 0x40611D: ztest_freeze (ztest.c:5909)
==87740==    by 0x4066A7: ztest_init (ztest.c:6048)
==87740==    by 0x407AF4: main (ztest.c:6226)

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2174

sighandler() should take 2 arguments

Stopping arcstat.py with ^C always ends up with error:
TypeError: sighandler() takes no arguments (2 given)

Since no special signal handling was done in sighandler(),
it's simpler to just set SIGINT handler to SIG_DFL, which
terminates the script.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes #2179

Revert "sighandler() should take 2 arguments"

This reverts commit 0bb89b6c594259829556f6dea5a89e722f214fd3 in
favor of a cleaner implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2182

Fix regression introduced in port of Illumos #3744

Remove the redundant call to zfs_unmount_snap() which was being
done after char array was freed,

This fixes an upstream regression that was introduced in commit
zfsonlinux/zfs@d09f25dc66774959499a89bf3680d09c6e541ce8, which
ported the Illumos 3744 changes.

Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #2156

Remove unused option -r from 'zpool import'

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2178

Switch ztest mmap(2) ASSERTs to VERIFYs

This is just a small bit of cleanup to ensure ztest fails early
on systems where mmap(2) is not functioning. For the automated
testing which is the primary consumer of ztest there is no
functional change because debugging is always enabled.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2177

Free props in ztest_init()

Valgrind complained about this and it's absolutely right. The
props nvlist was not being freed in ztest_init.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #2174

sighandler() should take 2 arguments

Stopping arcstat.py with ^C always ends up with error:

TypeError: sighandler() takes no arguments (2 given)

This patch corrects the error by updating the signal handler
to take the correct number of arguments.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2182

Make arcstat.py default to one header per screen

Today arcstat.py prints one header every hdr_intr (20 by default)
lines. It would be more consistent with out utilities like vmstat
if hdr_intr defaulted to terminal window size, i.e. one header
per screenful of outputs.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2183

Fix NAME section of manpages zhack and fsck.zfs.

In Debian GNU/Linux a program called 'linitian' is used to make sure
that packages conforms to the Debian GNU/Linux packaging guide lines.
This fixes the problem reported as:

W: zfsutils: manpage-has-bad-whatis-entry usr/share/man/man1/zhack.1.gz
W: zfsutils: manpage-has-bad-whatis-entry usr/share/man/man8/fsck.zfs.8.gz

Not something that ZoL needs to addhere to, but every other man page
have their NAME section in a special way - why not these two as well?

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2161

Illumos #4088 use after free in arc_release()

4088 use after free in arc_release()

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
https://www.illumos.org/issues/4088
illumos/illumos-gate@ccc22e130479b5bd7c0002267fee1e0602d3f772

From the illumos issue:

A race-induced use after free occurs in arc_release() where the
ARC header is used outside the critical section protected by the
hash_lock.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #2162

Use KM_PUSHPAGE in spa_add() for spa_label_features.

The spa_label_features nvlist is used in the sync context during pool
version upgrade.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2168

Export symbols dsl_sync_task{_nowait}

These are needed by consumers (i.e. Lustre) who wish to perform a
callback in the syncing context. Both a blocking and non-blocking
version are available to callers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Improve reporting of tx assignment wait times

Some callers of dmu_tx_assign() use the TXG_NOWAIT flag and call
dmu_tx_wait() themselves before retrying if the assignment fails.
The wait times for such callers are not accounted for in the
dmu_tx_assign kstat histogram, because the histogram only records
time spent in dmu_tx_assign().  This change moves the histogram
update to dmu_tx_wait() to properly account for all time spent there.

One downside of this approach is that it is possible to call
dmu_tx_wait() multiple times before successfully assigning a
transaction, in which case the cumulative wait time would not be
recorded.  However, this case should not often arise in practice,
because most callers currently use one of these forms:

  dmu_tx_assign(tx, TXG_WAIT);
  dmu_tx_assign(tx, waited ?  TXG_WAITED : TXG_NOWAIT);

The first form should make just one call to dmu_tx_delay() inside of
dmu_tx_assign(). The second form retries with TXG_WAITED if the first
assignment fails and incurs a delay, in which case no further waiting
is performed.  Therefore transaction delays normally occur in one
call to dmu_tx_wait() so the histogram should be fairly accurate.

Another possible downside of this approach is that the histogram will
no longer record overhead outside of dmu_tx_wait() such as in
dmu_tx_try_assign(). While I'm not aware of any reason for concern on
this point, it is conceivable that lock contention, long list
traversal, etc. could cause assignment delays that would not be
reflected in the histogram.  Therefore the histogram should strictly
be used for visibility in to the normal delay mechanisms and not as a
profiling tool for code performance.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1915

replace nreserved with ndirty in txgs kstat

The nreserved column in the txgs kstat file always contains 0
following the write throttle restructuring of commit
e8b96c6007bf97cdf34869c1ffbd0ce753873a3d.

Prior to that commit, the nreserved column showed the number of bytes
temporarily reserved in the pool by a transaction group at sync time.
The new write throttle did away with temporary reservations and uses
the amount of dirty data instead.  To approximate the old output of
the txgs kstat, the number of dirty bytes per-txg was passed in as
the nreserved value to spa_txg_history_set_io().  This approach did
not work as intended, because the per-txg dirty value is decremented
as data is written out to disk, so it is zero by the time we call
spa_txg_history_set_io().  To fix this, save the number of dirty
bytes before calling spa_sync(), and pass this value in to
spa_txg_history_set_io().

Also, since the name "nreserved" is now a misnomer, the column
heading is now labeled "ndirty".

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1696

dmu_tx kstat cleanup

A few counters in the dmu_tx kstats are obsolete or no longer
bumped properly.

- The sync task restructuring commit
  13fe019870c8779bf2f5b3ff731b512cf89133ef removed the code
  that bumpted dmu_tx_quota. The counter is now bumped in two
  cases, instead of just the one case as before (after the result
  of dsl_dataset_check_quota call). The second case is where
  we check the requested reservation against the actual pool size,
  as this is an implicit quota of sorts.

- The write throttle restructuring commit
  e8b96c6007bf97cdf34869c1ffbd0ce753873a3d makes dmu_tx_how and
  dmu_tx_inflight obsolete, so they are removed.

Signed-off-by: Kohsuke Kawaguchi <kk@kohsuke.org>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1914

Invalidate Linux buffer cache on vdevs upon each flush

Userland tools such as blkid, grub2-probe and zdb will go through the
buffer cache. However, ZFS uses on submit_bio() to bypass the buffer
cache when performing IO operations on vdevs for efficiency purposes.
This permits the on-disk state and buffer cache to fall out of
synchronization. That causes seemingly random failures when tools
reading stale metadata from the buffer cache try to access references to
data that is no longer there.

A particularly bad failure this causes involves grub2-probe, which is
used by grub2-mkconfig. Ordinarily, a rootfs might be called
rpool/ROOT/gentoo. However, when a failure occurs in grub2-probe,
grub2-mkconfig will generate a configuration file containing
/ROOT/gentoo, which omits the pool name and causes a boot failure.

This is avoidable by calling invalidate_bdev() on each flush, which is a
simple way to ensure that all non-dirty pages are wiped. Since userland
tools rarely access vdevs directly, this should be a fancy noop >99.999%
of the time and have little impact on IO. We could have tried a finer
grained approach for the rare instances in which the vdevs are accessed
frequently by userland. However, that would require consideration of
corner cases and it is not worth the effort.

Memory-wise, it would have been better to use a Linux kernel API hook to
disable the buffer cache on such devices, but it provides us no way of
doing that, so we opt for this approach instead. We should revisit that
idea in the future when higher priority issues have been tackled.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2150

Inform OpenRC that ZFS uses mtab

p_l in #zfsonlinux reported that he had issues mounting filesystems that
were resolved by adding rc_need="mtab" to /etc/init.d/zfs. Closer
inspection revealed that we do have a race, but it is not clear how this
race caused mounting to fail. What is clear is that this race should be
fixed, so lets add the proper `use mtab` line to handle it.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2148

Illumos #4574 get_clones_stat does not call zap_count in non-debug kernel

4574 get_clones_stat does not call zap_count in non-debug kernel

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Marcel Telka <marcel@telka.sk>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
https://www.illumos.org/issues/4574
illumos/illumos-gate@03d1795fa6f720eafbee821ad37f4343c391cfe4

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2147

Fix zap_lookup() in feature_is_supported().

The length (number of integers) argument passed to zap_lookup was wrong;
likely as a result of performing stack-reduction on the function.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2141

Include l2asize in arcstat

For consistency with upstream pull in the l2asize update after
reworking it from Perl to Python.

References:
https://github.com/mharsch/arcstat/pull/11
https://github.com/mharsch/arcstat/pull/12

Signed-off-by: cburroughs <chris.burroughs@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2122

Remove recursion from dsl_dir_willuse_space()

Remove recursion from dsl_dir_willuse_space() to reduce stack usage.
Issues with stack overflow were observed in zfs recv of zvols,
likelihood of an overflow is proportional to the depth of the dataset
as dsl_dir_willuse_space() recurses to parent datasets.

Signed-off-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2069

Merge branch 'arc-changes'

This stack of patches has been empirically shown to drastically improve
the hit rate of the ARC for certain workloads. As a result, fewer reads
to disk are required, which is generally a good thing and can
drastically improve performance if the workload is disk limited.

For the impatient, I'll summarize the results of the tests performed:

    * Test 1 - Creating many empty directories. This test saw 99.9%
               fewer reads and 12.8% more inodes created when running
               *with* these changes.

    * Test 2 - Creating many empty files. This test saw 4% fewer reads
               and 0% more inodes created when running *with* these
               changes.

    * Test 3 - Creating many 4 KiB files. This test saw 96.7% fewer
               reads and 4.9% more inodes created when running *with*
               these changes.

    * Test 4 - Creating many 4096 KiB files. This test saw 99.4% fewer
               reads and 0% more inodes created (but took 6.9% fewer
               seconds to complete) when running *with* these changes.

    * Test 5 - Rsync'ing a dataset with many empty directories. This
               test saw 36.2% fewer reads and 66.2% more inodes created
               when running *with* these changes.

    * Test 6 - Rsync'ing a dataset with many empty files. This test saw
               30.9% fewer reads and 0% more inodes created (but took
               24.3% fewer seconds to complete) when running *with*
               these changes.

    * Test 7 - Rsync'ing a dataset with many 4 KiB files. This test saw
               30.8% fewer reads and 173.3% more inodes created when
               running *with* these changes.

For the patient, the following consists of more a more detailed
description of the tests performed and the results gathered.

All the tests were run using identical machines, each with a pool
consisting of 5 mirror pairs with 2TB 7200 RPM disks. Each test was run
twice, once *without* this set of patches and again *with* this set of
patches to highlight the performance changes introduced. The first four
workloads tested were:

    ** NOTE: None of these tests were run to completion. They ran for a
             set amount of time and then were terminated or hit ENOSPC.

    1. Creating many empty directories:

       * fdtree -d 10 -l 8 -s 0 -f 0 -C
         -> 111,111,111 Directories
         ->           0 Files
         ->           0 KiB File Data

    2. Creating many empty files:

       * fdtree -d 10 -l 5 -s 0 -f 10000 -C
         ->       111,111 Directories
         -> 1,111,110,000 Files
         ->             0 KiB File Data

    3. Creating many 4 KiB files:

       * fdtree -d 10 -l 5 -s 1 -f 10000 -C
         ->       111,111 Directories
         -> 1,111,110,000 Files
         -> 4,444,440,000 KiB File Data

    4. Creating many 4096 KiB files:

       * fdtree -d 10 -l 5 -s 1024 -f 10000 -C
         ->           111,111 Directories
         ->     1,111,110,000 Files
         -> 4,551,106,560,000 KiB File Data

Results for these first four tests are below:

                  | Time (s) |   inodes |  reads |    writes |
                --+----------+----------+--------+-----------+
    Test 1 Before |    65069 | 37845363 | 831975 |   3214646 |
    Test 1 After  |    65069 | 42703608 |    778 |   3327674 |
                --+----------+----------+--------+-----------+
    Test 2 Before |    65073 | 54257583 | 208647 |   2413056 |
    Test 2 After  |    65069 | 54255782 | 200038 |   2533759 |
                --+----------+----------+--------+-----------+
    Test 3 Before |    65068 | 49857744 | 487130 |   5533348 |
    Test 3 After  |    65071 | 52294311 |  16078 |   5648354 |
                --+----------+----------+--------+-----------+
    Test 4 Before |    34854 |  2448329 | 385870 | 162116572 |
    Test 4 After  |    32419 |  2448329 |   2339 | 162175706 |
                --+----------+----------+--------+-----------+

    * "Time (s)" - The run time of the test in seconds
    * "inodes"   - The number of inodes created by the test
    * "reads"    - The number of reads performed by the test
    * "writes"   - The number of writes performed by the test

As you can see from the table above, running with this patch stack
*significantly* reduced the number of reads performed in 3 out of the 4
tests (due to an improved ARC hit rate).

In addition to the tests described above, which specifically targeted
creates only, three other workloads were tested. These additional tests
were targeting rsync performance against the datasets created in the
previous tests. A brief description of the workloads and results for
these tests are below:

    ** NOTE: Aside from (6), these tests didn't run to completion. They
             ran for a set amount of time and then were terminated.

    5. Rsync the dataset created in Test 1 to a new dataset:

       * rsync -a /tank/test-1 /tank/test-5

    6. Rsync the dataset created in Test 2 to a new dataset:

       * rsync -a /tank/test-2 /tank/test-6

    7. Rsync the dataset created in Test 3 to a new dataset:

       * rsync -a /tank/test-3 /tank/test-7

Results for Test 5, 6, and 7 are below:

                  | Time (s) |   inodes |    reads |  writes |
                --+----------+----------+----------+---------+
    Test 5 Before |    93041 | 17921014 | 47632823 | 4094848 |
    Test 5 After  |    93029 | 29785847 | 30376206 | 4484459 |
                --+----------+----------+----------+---------+
    Test 6 Before |    15290 | 54264474 |  6018331 |  733087 |
    Test 6 After  |    11573 | 54260826 |  4155661 |  617285 |
                --+----------+----------+----------+---------+
    Test 7 Before |    93057 | 10093749 | 41561635 | 3659098 |
    Test 7 After  |    93045 | 27587043 | 28773151 | 5612234 |
                --+----------+----------+----------+---------+

    * "Time (s)" - The run time of the test in seconds
    * "inodes"   - The number of inodes created by the test
    * "reads"    - The number of reads performed by the test
    * "writes"   - The number of writes performed by the test

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2110

Set "arc_meta_limit" to 3/4 arc_c_max by default

Unfortunately, this change is an cheap attempt to work around a
pathological workload for the ARC. A "real" solution still needs to be
fleshed out, so this patch is intended to alleviate the situation in the
meantime. Let me try and describe the problem..

Data buffers residing in the dbuf hash table (dbuf cache) will keep a
hold on their respective dnode, this dnode will in turn keep a hold on
its backing dbuf (the physical block of the dnode object backing it).
Since the dnode has a hold on its backing dbuf, the arc buffer for this
dbuf is unevictable. What this essentially boils down to, "data" buffers
have the potential to pin "metadata" in the arc (as a result of these
dnode object buffers being unevictable).

This scenario becomes a real problem when the workload consists of many
small files (e.g. creating millions of 4K files). With this workload,
the arc's "arc_meta_used" space get filled up with buffers for any
resident directories as well as buffers for the objset's dnode object.
Once the "arc_meta_limit" is reached, the directory buffers will be
evicted and only the unevictable dnode object buffers will reside. If
the workload is simply creating new small files, these dnode object
buffers will never even be needed again, whereas the directory buffers
will be used constantly until the creates move to a new directory.

If "arc_c" and "arc_meta_limit" are sized appropriately, this
situation wont occur. This is because as the data buffers accumulate,
"arc_size" will eventually approach "arc_c" (before "arc_meta_used"
reaches "arc_meta_limit"); at that point the data buffers will be
evicted, which releases the hold on the dnode, which releases the hold
on the dnode object's dbuf, which allows that buffer to be evicted from
the arc in preference to more "useful" metadata.

So, to side step the issue, we simply need to ensure "arc_size" reaches
"arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to
pick a proper limit, we have to do some math.

To make things a little easier to follow, it is assumed that there will
only be a single data buffer per file (which is probably always the case
for "small" files anyways).

Based on the current internals of the arc, if N files residing in the
dbuf cache all pin a single dnode buffer (i.e. their dnodes all share
the same physical dnode object block), then the following amount of
"arc_meta_used" space will be consumed:

    - 16K for the dnode object's block - [        16384 bytes]
    - N * sizeof(dnode_t) -------------- [      N * 928 bytes]
    - (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) *  72 bytes]
    - (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes]
    - (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes]

To simplify, these N files will pin the following amount of
"arc_meta_used" space as unevictable:

    Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280)
    Pinned "arc_meta_used" bytes = 17000 + N * 1544

This pinned space is regardless of the size of the files, and is only
dependent on the number of pinned dnodes sharing a physical block
(i.e. N). For example, 32 512b files sharing a single dnode object
block would consume the same "arc_meta_used" space as 32 4K files
sharing a single dnode object block.

Now, given a files size of S, we can determine the total amount of
space that will be consumed in the arc:

    Total = 17000 + N * 1544 + S * N
            ^^^^^^^^^^^^^^^^   ^^^^^
                metadata        data

So, given these formulas, we can generate a table which states the ratio
of pinned metadata to total arc (meta + data) using different values of
N (number of pinned dnodes per pinned physical dnode block) and S (size
of the file).

                                  File Sizes (S)
       |    512   |   1024   |   2048   |   4096   |   8192   |   16384  |
    ---+----------+----------+----------+----------+----------+----------+
     1 | 0.973132 | 0.947670 | 0.900544 | 0.819081 | 0.693597 | 0.530921 |
     2 | 0.951497 | 0.907481 | 0.830632 | 0.710325 | 0.550779 | 0.380051 |
N   4 | 0.918807 | 0.849809 | 0.738842 | 0.585844 | 0.414271 | 0.261250 |
     8 | 0.877541 | 0.781803 | 0.641770 | 0.472505 | 0.309333 | 0.182965 |
    16 | 0.835819 | 0.717945 | 0.559996 | 0.388885 | 0.241376 | 0.137253 |
    32 | 0.802106 | 0.669597 | 0.503304 | 0.336277 | 0.202123 | 0.112423 |

As you can see, if we wanted to support the absolute worst case of 1
dnode per physical dnode block and 512b files, we would have to set the
"arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At
that point, it essentially defeats the purpose of having an
"arc_meta_limit" at all.

This patch changes the default value of "arc_meta_limit" to be 75% of
"arc_c_max", which should be good enough for "most" workloads (I think).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Split "data_size" into "meta" and "data"

Previously, the "data_size" field in the arcstats kstat contained the
amount of cached "metadata" and "data" in the ARC. The problem is this
then made it difficult to extract out just the "metadata" size, or just
the "data" size.

To make it easier to distinguish the two values, "data_size" has been
modified to count only buffers of type ARC_BUFC_DATA, and "meta_size"
was added to count only buffers of type ARC_BUFC_METADATA. If one wants
the old "data_size" value, simply sum the new "data_size" and
"meta_size" values.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Prioritize "metadata" in arc_get_data_buf

When the arc is at it's size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from over stepping
it's bounds (i.e. keep it below the size limitation placed on it).

This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.

For example, consider the following scenario:

    * the size of the arc is capped at 10G
    * the meta_limit is capped at 4G
    * 9G of the arc contains "data"
    * 1G of the arc contains "metadata"

Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.

To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer; unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Remove "arc_meta_used" from arc_adjust calculation

Using "arc_meta_used" to determine if the arc's mru list is over it's
target value of "arc_p" doesn't seem correct. The size of the mru list
and the value of "arc_meta_used", although related, are completely
independent. Buffers contained in "arc_meta_used" may not even be
contained in the arc's mru list. As such, this patch removes
"arc_meta_used" from the calculation in arc_adjust.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Prune metadata from ghost lists in arc_adjust_meta

To maintain a strict limit on the metadata contained in the arc, while
preventing the arc buffer headers from completely consuming the
"arc_meta_used" space, we need to evict metadata buffers from the arc's
ghost lists along with the regular lists.

This change modifies arc_adjust_meta such that it more closely models
the adjustments made in arc_adjust. "arc_meta_used" is used similarly to
"arc_size", and "arc_meta_limit" is used similarly to "arc_c".

Testing metadata intensive workloads (e.g. creating, copying, and
removing millions of small files and/or directories) has shown this
change to make a dramatic improvement to the hit rate maintained in the
arc. While I think there is still room for improvement, this is a big
step in the right direction.

In addition, zpl_free_cached_objects was made into a no-op as I'm not
yet sure how to properly implement that function.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Revert "Return -1 from arc_shrinker_func()"

This reverts commit c11a12bc3b2e5ee9a6bd74e26f1a396b6025fbd4.

Out of memory events were fixed by reverting this patch.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Disable arc_p adapt dampener by default

It's unclear why adjustments to arc_p need to be dampened as they are in
arc_adjust. With that said, it's removal significantly improves the arc's
ability to "warm up" to a given workload. Thus, I'm disabling by default
until its usefulness is better understood.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Allow "arc_p" to drop to zero or grow to "arc_c"

Setting a limit on the minimum value of "arc_p" has been shown to have
detrimental effects on the arc hit rate for certain "metadata" intensive
workloads. Specifically, this has been exhibited with a workload that
constantly dirties new "metadata" but also frequently touches a "small"
amount of mfu data (e.g. mkdir's).

What is seen is that the new anon data throttles the mfu list to a
negligible size (because arc_p > anon + mru in arc_get_data_buf), even
though the mfu ghost list receives a constant stream of hits. To remedy
this, arc_p is now allowed to drop to zero if the algorithm deems it
necessary.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Disable aggressive arc_p growth by default

For specific workloads consisting mainly of mfu data and new anon data
buffers, the aggressive growth of arc_p found in the arc_get_data_buf()
function can have detrimental effects on the mfu list size and ghost
list hit rate.

Running a workload consisting of two processes:

    * Process 1 is creating many small files
    * Process 2 is tar'ing a directory consisting of many small files

I've seen arc_p and the mru grow to their maximum size, while the mru
ghost list receives 100K times fewer hits than the mfu ghost list.

Ideally, as the mfu ghost list receives hits, arc_p should be driven
down and the size of the mfu should increase. Given the specific
workload I was testing with, the mfu list size should grow to a point
where almost no mfu ghost list hits would occur. Unfortunately, this
does not happen because the newly dirtied anon buffers constancy drive
arc_p to its maximum value and keep it there (effectively prioritizing
the mru list and starving the mfu list down to a negligible size).

The logic to increment arc_p from within the arc_get_data_buf() function
was introduced many years ago in this upstream commit:

    commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc
    Author: maybee <none@none>
    Date:   Wed Dec 20 15:46:12 2006 -0800

        6505658 target MRU size (arc.p) needs to be adjusted more aggressively

and since I don't fully understand the motivation for the change, I am
reluctant to completely remove it.

As a way to test out how it's removal might affect performance, I've
disabled that code by default, but left it tunable via a module option.
Thus, if its removal is found to be grossly detrimental for certain
workloads, it can be re-enabled on the fly, without a code change.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Adjust arc_p based on "bytes" in arc_shrink

In an attempt to prevent arc_c from collapsing "too fast", the
arc_shrink() function was updated to take a "bytes" parameter by this
change:

    commit 302f753f1657c05a4287226eeda1f53ae431b8a7
    Author: Brian Behlendorf <behlendorf1@llnl.gov>
    Date:   Tue Mar 13 14:29:16 2012 -0700

        Integrate ARC more tightly with Linux

Unfortunately, that change failed to make a similar change to the way
that arc_p was updated. So, there still exists the possibility for arc_p
to collapse to near 0 when the kernel start calling the arc's shrinkers.

This change attempts to fix this, by decrementing arc_p by the "bytes"
parameter in the same way that arc_c is updated.

In addition, a minimum value of arc_p is attempted to be maintained,
similar to the way a minimum arc_p value is maintained in arc_adapt().

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Set zfs_arc_min to 4MB

Decrease the mimimum ARC size from 1/32 of total system memory
(or 64MB) to a much smaller 4MB.

1) Large systems with over a 1TB of memory are being deployed
   and reserving 1/32 of this memory (32GB) as the mimimum
   requirement is overkill.

2) Tiny systems like the raspberry pi may only have 256MB of
   memory in which case 64MB is far too large.

The ARC should be reclaimable if the VFS determines it needs
the memory for some other purpose.  If you want to ensure the
ARC is never completely reclaimed due to memory pressure you
may still set a larger value with zfs_arc_min.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Issue #2110

Merge branch 'issue-2094'

This patch stack was designed to accomplish the following:

* Cleanly address the pool incompatibility which was accidentally
  introduced post-0.6.2 but pre-0.6.3.  This required adding
  infrastructure which can be used from now on to notify system
  administrators of errata which affect their pool.

* Add additional automated regression testing to ensure this type
  of compatibility issue is caught prior to the change being merged.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2094

Add erratum for issue #2094

ZoL commit 1421c89 unintentionally changed the disk format in a forward-
compatible, but not backward compatible way. This was accomplished by
adding an entry to zbookmark_t, which is included in a couple of
on-disk structures. That lead to the creation of pools with incorrect
dsl_scan_phys_t objects that could only be imported by versions of ZoL
containing that commit.  Such pools cannot be imported by other versions
of ZFS or past versions of ZoL.

The additional field has been removed by the previous commit.  However,
affected pools must be imported and scrubbed using a version of ZoL with
this commit applied.  This will return the pools to a state in which they
may be imported by other implementations.

The 'zpool import' or 'zpool status' command can be used to determine if
a pool is impacted.  A message similar to one of the following means your
pool must be scrubbed to restore compatibility.

$ zpool import
   pool: zol-0.6.2-173
     id: 1165955789558693437
  state: ONLINE
status: Errata #1 detected.
action: The pool can be imported using its name or numeric identifier,
         however there is a compatibility issue which should be corrected
         by running 'zpool scrub'
    see: http://zfsonlinux.org/msg/ZFS-8000-ER
config:
...

$ zpool status
  pool: zol-0.6.2-173
state: ONLINE
  scan: pool compatibility issue detected.
   see: https://github.com/zfsonlinux/zfs/issues/2094
action: To correct the issue run 'zpool scrub'.
config:
...

If there was an async destroy in progress 'zpool import' will prevent
the pool from being imported.  Further advice on how to proceed will be
provided by the error message as follows.

$ zpool import
   pool: zol-0.6.2-173
     id: 1165955789558693437
  state: ONLINE
status: Errata #2 detected.
action: The pool can not be imported with this version of ZFS due to an
         active asynchronous destroy.  Revert to an earlier version and
         allow the destroy to complete before updating.
         see: http://zfsonlinux.org/msg/ZFS-8000-ER
config:
...

Pools affected by the damaged dsl_scan_phys_t can be detected prior to
an upgrade by running the following command as root:

zdb -dddd poolname 1 | grep -P '^\t\tscan = ' | sed -e 's;scan = ;;' | wc -w

Note that `poolname` must be replaced with the name of the pool you wish
to check. A value of 25 indicates the dsl_scan_phys_t has been damaged.
A value of 24 indicates that the dsl_scan_phys_t is normal. A value of 0
indicates that there has never been a scrub run on the pool.

The regression caused by the change to zbookmark_t never made it into a
tagged release, Gentoo backports, Ubuntu, Debian, Fedora, or EPEL
stable respositorys.  Only those using the HEAD version directly from
Github after the 0.6.2 but before the 0.6.3 tag are affected.

This patch does have one limitation that should be mentioned.  It will not
detect errata #2 on a pool unless errata #1 is also present.  It expected
this will not be a significant problem because pools impacted by errata #2
have a high probably of being impacted by errata #1.

End users can ensure they do no hit this unlikely case by waiting for all
asynchronous destroy operations to complete before updating ZoL.  The
presence of any background destroys on any imported pools can be checked
by running `zpool get freeing` as root.  This will display a non-zero
value for any pool with an active asynchronous destroy.

Lastly, it is expected that no user data has been lost as a result of
this erratum.

Original-patch-by: Tim Chase <tim@chase2k.com>
Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2094

Add generic errata infrastructure

From time to time it may be necessary to inform the pool administrator
about an errata which impacts their pool.  These errata will by shown
to the administrator through the 'zpool status' and 'zpool import'
output as appropriate.  The errata must clearly describe the issue
detected, how the pool is impacted, and what action should be taken
to resolve the situation.  Additional information for each errata will
be provided at http://zfsonlinux.org/msg/ZFS-8000-ER.

To accomplish the above this patch adds the required infrastructure to
allow the kernel modules to notify the utilities that an errata has
been detected.  This is done through the ZPOOL_CONFIG_ERRATA uint64_t
which has been added to the pool configuration nvlist.

To add a new errata the following changes must be made:

* A new errata identifier must be assigned by adding a new enum value
  to the zpool_errata_t type.  New enums must be added to the end to
  preserve the existing ordering.

* Code must be added to detect the issue.  This does not strictly
  need to be done at pool import time but doing so will make the
  errata visible in 'zpool import' as well as 'zpool status'.  Once
  detected the spa->spa_errata member should be set to the new enum.

* If possible code should be added to clear the spa->spa_errata member
  once the errata has been resolved.

* The show_import() and status_callback() functions must be updated
  to include an informational message describing the errata.  This
  should include an action message describing what an administrator
  should do to address the errata.

* The documentation at http://zfsonlinux.org/msg/ZFS-8000-ER must be
  updated to describe the errata.  This space can be used to provide
  as much additional information as needed to fully describe the errata.
  A link to this documentation will be automatically generated in the
  output of 'zpool import' and 'zpool status'.

Original-idea-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.or
Issue #2094

Use expected zpool_status_t type

Both the zpool_import_status() and zpool_get_status() functions
return the zpool_status_t enum. This explicit type should be used
rather than the more generic int type.

This patch makes no functional change and should only be considered
code cleanup. It happens to have been done in the context of #2094
because that's when I noticed this issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.or
Issue #2094

Revert changes to zbookmark_t

Commit 1421c89142376bfd41e4de22ed7c7846b9e41f95 added a field to
zbookmark_t that unintentinoally caused a disk format change. This
negatively affected backward compatibility and platform portability.
Therefore, this field is being removed.

The function that field permitted is left unimplemented until a later
patch that will reimplement the field in a way that does not affect the
disk format.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2094

Add zimport.sh compatibility test script

Verify that an assortment of known good reference pools can be imported
using different versions of the ZoL code.

By default references pools for the major ZFS implementation will be
checked against the most recent ZoL tags and the master development branch.
Alternate tags or branches may be verified with the '-s <src-tag> option.
Passing the keyword "installed" will instruct the script to test whatever
version is installed.

Preferentially a reference pool is used for all tests.  However, if one
does not exist and the pool-tag matches one of the src-tags then a new
reference pool will be created using binaries from that source build.
This is particularly useful when you need to test your changes before
opening a pull request.

New reference pools may be added by placing a bzip2 compressed tarball
of the pool in the scripts/zpool-example directory and then passing
the -p <pool-tag> option.  To increase the test coverage reference pools
should be collected for all the major ZFS implementations.  Having these
pools easily available is also helpful to the developers.

Care should be taken to run these tests with a kernel supported by all
the listed tags.  Otherwise build failure will cause false positives.

EXAMPLES:

The following example will verify the zfs-0.6.2 tag, the master branch,
and the installed zfs version can correctly import the listed pools.
Note there is no reference pool available for master and installed but
because binaries are available one is automatically constructed.  The
working directory is also preserved between runs (-k) preventing the
need to rebuild from source for multiple runs.

zimport.sh -k -f /var/tmp/zimport \
     -s "zfs-0.6.1 zfs-0.6.2 master installed" \
     -p "all master installed"

--------------------- ZFS on Linux Source Versions --------------
                zfs-0.6.1       zfs-0.6.2       master          0.6.2-180
-----------------------------------------------------------------
Clone SPL       Skip Skip Skip Skip
Clone ZFS       Skip Skip Skip Skip
Build SPL       Skip Skip Skip Skip
Build ZFS       Skip Skip Skip Skip
-----------------------------------------------------------------
zevo-1.1.1      Pass Pass Pass Pass
zol-0.6.1       Pass Pass Pass Pass
zol-0.6.2-173   Fail Fail Pass Pass
zol-0.6.2       Pass Pass Pass Pass
master          Fail Fail Pass Pass
installed       Pass Pass Pass Pass

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #2094

Propagate errors when registering "relatime" property callback.

Various errors can occur when registering property callbacks. As the
author's comments indicate, the code is very paranoid about preserving
the first-seen error when registering callbacks. This patch causes an
error seen while registering the "relatime" callback to not clobber a
previously-seen error.

Reported-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2117

Add default FILEDIR path to zpool-config scripts

Allow the caller of the zpool-create.sh script to override
the default path where file vdevs are created. This allows
for greated flexibilty when scripting.

Additionally, update the default path from /tmp/ to /var/tmp/
because these days /tmp/ is likely a ramdisk. Even though
these files are sparse they may grow large in which case they
should be backed by a physical device.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2120

Add explicit Conflicts for zfs-fuse packages

zfs-fuse provides the same commands and man page names as ZoL.
Changing the names on either side would make each incompatible with
all existing documentation about ZFS. Providing bit identical files
is not possible due to differing codebases.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1866

Fix zconfig.sh test 9

Commit ba6a240 adjusted the behavior of 'zfs create -V'. The
caller is no longer guaranteed that udev will have finished
creating the /dev/ entries by the time to command exits. It
is therefore required that we explicitly block waiting for
udev to settle for this test to run reliably.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add systemd unit files for ZFS startup

This adds systemd unit files replacing the functionality offered by
the SysV init script found in etc/init.d.

It has been developed and tested on Fedora 19, Fedora 20
and openSuSE 13.1.

Four unit files and one target are offered.

zfs-import-cache.service:
    Import pools from /etc/zfs/zpool.cache. This unit will wait for
    udev to settle.
zfs-import-scan.service:
    Import pools by scanning /dev/disk/by-id for zvols. This unit will
    only run if /etc/zfs/zpool.cache is not present. This unit will wait
    for udev to settle
zfs-mount.service:
    Mount ZFS native filesystems. It contains a dependency to be loaded
    before local-fs.target.
zfs-share.service:
    Share NFS/SMB filesystems. This unit contains a dependency that
    will cause it to be restarted whenever the smb or nfs-server unit
    is restarted, restoring the shares added.
zfs.target:
    This target pulls in the other units in order to start ZFS. It's
    the only unit that can be enabled/disabled, all other services
    are static and pulled in by dependencies. It will honour zfs=off
    and zfs=no options on the kernel command line.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2108

Fix corrupted l2_asize in arcstats

Commit e0b0ca9 accidentally corrupted the l2_asize displayed in
arcstats. This was caused by changing the l2arc_buf_hdr.b_asize
member from an int to uint32_t type. There are places in the
code where this field is cast to a uint64_t resulting in the
b_hits member being treated as part of b_asize.

To resolve the issue the type has been changed to a uint64_t,
and the b_hits member is placed after the enum to prevent the
size of the structure from increasing.

This is a good example of exactly why it's a bad idea to use
ambiguous types (int) in these structures.

Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1990

4188 assertion failed in dmu_tx_hold_free(): dn_datablkshift != 0

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

Refences:
https://www.illumos.org/issues/4188
illumos/illumos-gate@bb411a08b05466bfe0c7095b6373bbc1587e259a

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2091

Illumos 4504 traverse_visitbp: visit group before user

4504 traverse_visitbp: visit DMU_GROUPUSED_OBJECT before DMU_USERUSED_OBJECT

Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>

References:
  https://illumos.org/issues/4504
  http://code.delphix.com/illumos-4504
  http://svnweb.freebsd.org/base?view=revision&revision=260812

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #2079

Implement relatime.

Add the "relatime" property. When set to "on", a file's atime will only
be updated if the existing atime at least a day old or if the existing
ctime or mtime has been updated since the last access. This behavior
is compatible with the Linux "relatime" mount option.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2064
Closes #1917

Fix error message in zpios

The chunksize must always be strictly smaller than the regionsize.

Signed-off-by: Andrew Uselton <andrew.c.uselton@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2072

Call gethrtime() only once per new txg creation

When transitioning current open TXG into QUIESCE state and opening
a new one txg_quiesce() calls gethrtime():
  - to mark the birth time of the new TXG
  - to record the SPA txg history kstat
  - implicitely inside spa_txg_history_add()

These timestamps are practically the same, so that the first one
can be used instead of the other two.  The only visible difference
is that inside spa_txg_history_add() the time spent in kmem_zalloc()
will be counted towards the opened TXG.

Since at this point the new TXG already exists (tx->tx_open_txg
has been already incremented) it is actually a correct accounting.

In any case this extra work is only happening when spa_txg_history
kstat is activated (i.e. zfs_txg_history > 0) and doesn't affect
the normal processing in any way.

Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Issue #2075

Add additional state TXG_STATE_WAIT_FOR_SYNC for txg.

In several cases when digging into kstats we can found two txgs
in SYNC state, e.g.

txg     birth            state  nreserved  nread      nwritten ...
985452  258127184872561  C      0          373948416  2376272384 ...
985453  258129016180616  C      0          378173440  28793344 ...
985454  258129016271523  S      0          0          0 ...
985455  258130864245986  S      0          0          0 ...
985456  258130867458851  O      0          0          0 ...

However only first txg (985454) is really syncing at this moment.
The other one (985455) marked as SYNCED is actually in a post-QUIESCED
state and waiting to start sync.   So, the new TXG_STATE_WAIT_FOR_SYNC
state between TXG_STATE_QUIESCED and TXG_STATE_SYNCED was added to
reveal this situation.

txg     birth            state  nreserved  nread      nwritten ...
1086896 235261068743969  C      0          163577856  8437248 ...
1086897 235262870830801  C      0          280625152  822594048 ...
1086898 235264172219064  S      0          0          0 ...
1086899 235264936134407  W      0          0          0 ...
1086900 235264936296156  O      0          0          0 ...

Signed-off-by: Igor Lvovsky <ilvovsky@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2075

Use enum type(zfetch_dirn_t) instead

Fix code with zfetch_dirn_t, which is more readable and clear.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2068

Allow chown/chgrp when no ACL SAs exist.

From the comment in the commit:

Some ZFS implementations (ZEVO) create neither a ZNODE_ACL nor a DACL_ACES
SA in which case ENOENT is returned from zfs_acl_node_read() when the
SA can't be located. Allow chown/chgrp to succeed in these cases rather
than returning an error that makes no sense in the context of the caller.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfs-osx/zfs#86
Closes #1911
Closes #2029

vdev_file_io_start() to use taskq_dispatch(TQ_PUSHPAGE)

The vdev_file_io_start() function may be processing a zio that the
txg_sync thread is waiting on. In this case it is not safe to perform
memory allocations that may generate new I/O since this could cause a
deadlock. To avoid this, call taskq_dispatch() with TQ_PUSHPAGE
instead of TQ_SLEEP.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1928

Remove incorrect use of EXTRA_DIST for man pages

Setting the 'dist_' prefix is the correct way to instruct Automake
to include these files in the distribution. The EXTRA_DIST variable
is reserved for files which are not covered by the automatic rules.

http://www.gnu.org/software/automake/manual/automake.html#Basics

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

vdev_id: support per-channel slot mappings

The vdev_id udev helper currently applies slot renumbering rules to
every channel (JBOD) in the system.  This is too inflexible for systems
with non-homogeneous storage topologies.  The "slot" keyword now takes
an optional third parameter which names a channel to which the mapping
will apply.  If the third parameter is omitted then the rule applies to
all channels.  The first-specified rule that can match a slot takes
precedence.  Therefore a channel-specific rule for a given slot should
generally appear before a generic rule for the same slot number.  In
this way a custom slot mapping can be applied to a particular channel
and a default mapping applied to the rest.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2056

Use long holds in zvol_set_volsize()

Under Linux the zvol_set_volsize() function was originally written
to use dmu_objset_hold()/dmu_objset_rele(). Subsequently, the
dmu_objset_own()/dmu_objset_disown() interfaces were added but
the ZVOL code wasn't updated to take advantage of them.

This was never an issue but after the dsl_pool_config changes
the code now takes the config lock twice. The cleanest solution
is to shift to using dmu_objset_own() which takes a long hold
on the dataset and does not hold the dsl pool lock.

This patch also slightly restructures the existing code such
that it more closely resembles the upstream Illumos code.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2039

Disable GCCs aggressive loop optimization

GCC >+ 4.8's aggressive loop optimization breaks some of the iterators
over the dn_blkptr[] pseudo-array in dnode_phys. Since dn_blkptr[] is
defined as a single-element array, GCC believes an iterator can only
access index 0 and will unroll the loop into a single iteration.

One way to resolve the issue would be to cast the array to a pointer
and fix all the iterators that might break.  The only loop where it
is known to cause a problem is this loop in dmu_objset_write_ready():

    for (i = 0; i < dnp->dn_nblkptr; i++)
            bp->blk_fill += dnp->dn_blkptr[i].blk_fill;

In the common case where dn_nblkptr is 3, the loop is only executed a
single time and "i" is equal to 1 following the loop.

The specific breakage caused by this problem is that the blk_fill of
root block pointers wouldn't be set properly when more than one blkptr
is in use (when no indrect blocks are needed).

The simple reproducing sequence is:

zpool create tank /tank.img
zdb -ddddd tank 0

Notice that "fill=31", however, there are two L0 indirect blocks with
"F=31" and "F=5". The fill count should be 36 rather than 31. This
problem causes an assert to be hit in a simple "zdb tank" when built
with --enable-debug.

However, this approach was not taken because we need to be absolutely
sure we catch all instances of this unwanted optimization.  Therefore,
the build system has been updated to detect if GCC supports the
aggressive loop optimization.  If it does the optimization will be
explicitly disabled using the -fno-aggressive-loop-optimization option.

Original-fix-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2010
Closes #2051

Properly link zpool command to libblkid

31fc19399e597e3391f19f1392ab120f1de0d5f2 incorrectly removed $(LIBBLKID)
from cmd/zpool/Makefile.am. This meant that the toolchain was not given
-lblkid, which resulted in the following build failure on Ubuntu 13.10:

/usr/bin/ld: zpool_vdev.o: undefined reference to symbol
'blkid_put_cache@@BLKID_1.0'
/lib/x86_64-linux-gnu/libblkid.so.1: error adding symbols: DSO missing
from command line
collect2: error: ld returned 1 exit status

That commit reworked various Makefile.am to follow best practices, so we
reintroduce $(LIBBLKID) in a manner consistent with that, rather than
explicitly reverting the change.

Reproduction of this issue was done on a Gentoo Linux system by
executing the following commands:

zfs create -o mountpoint=/mnt/ubuntu-13.10 rpool/ROOT/ubuntu-13.10
debootstrap --variant=buildd --arch amd64 saucy /mnt/ubuntu-13.10 http://archive.ubuntu.com/ubuntu/
mount -o bind /dev /mnt/ubuntu-13.10/dev/
mount -o bind /proc/ /mnt/ubuntu-13.10/proc/
mount -o bind /sys/ /mnt/ubuntu-13.10/sys/
cp /etc/resolv.conf /mnt/ubuntu-13.10/etc/
(cd /mnt/ubuntu-13.10/root/ && git clone git://github.com/zfsonlinux/zfs.git)
chroot /mnt/ubuntu-13.10/
apt-get install git autoconf libtool zlib1g-dev uuid-dev libblkid-dev
\#apt-get install alien fakeroot vim
cd /root/zfs
./autogen.sh
./configure --with-config=user --prefix=/usr
make

That will create a Ubuntu 13.10 chroot, fetch the sources and build
test. At this point, cmd/zpool/Makefile.am was modified and the
following commands were run to verify that the build issue was resolved:

git clean -xdf
./autogen.sh
./configure --with-config=user --prefix=/usr
make

Although it is not shown here, the absence of libblkid-dev enables ZFS
to build successfully without the patch. This could explain how this
escaped detection until recently. A test without libblkid-dev was done
to verify that the patch did not cause a regression in the absence of
libblkid:

apt-get remove libblkid-dev
git clean -xdf
./autogen.sh
./configure --with-config=user --prefix=/usr
make

Additionally, the commands themselves were tested against my live system
from within the chroot to ensure basic functionality. My live system had
corresponding kernel modules already installed and basic commands such
as `zpool list` and `zfs list` worked without incident. Lastly, this
patch was also build tested on Gentoo Linux, where it caused no
problems.

At time of writing, these steps can be used to reproduce these results
on any modern Linux system that has debootstrap installed. On Gentoo,
installing debootstrap can be done with `emerge dev-util/debootstrap`.
The current ZFSOnLinux HEAD revision as of writing is
fd23720ae14dca926800ae70e6c8f4b4f82efc08. Once this is fixed in HEAD,
either that revision or another before this fix and after
31fc19399e597e3391f19f1392ab120f1de0d5f2 will be needed to reproduce
this issue.

Lastly, it remains to be seen why the toolchains on the systems
performing regression tests did not catch this. This is not a
ZFS-specific issue, but it is something that we will want to explore in
the future.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2038

Prevent duplicate mnttab cache entries

Under Linux its possible to mount the same filesystem multiple
times in the namespace.  This can be done either with bind mounts
or simply with multiple mount points.  Unfortunately, the mnttab
cache code is implemented using an AVL tree which does not support
duplicate entries.  To avoid this issue this patch updates the
code to check for a duplicate entry before adding a new one.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Michael Martin <mgmartin.mgm@gmail.com>
Closes #2041

Drain iput taskq outside z_teardown_lock

It's unsafe to drain the iput taskq while holding the z_teardown_lock
as a writer. This is because when the last reference on an inode is
dropped it may still have pages which need to be written to disk.
This will be done through zpl_writepages which will acquire the
z_teardown_lock as a reader in ZFS_ENTER. Therefore, if we're
holding the lock as a writer in zfs_sb_teardown the unmount will
deadlock.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #1988