git.proxmox.com Git - mirror

Fix DKMS package upgrade and packager

Running 'yum upgrade zfs-dkms' package could appear to work properly
and still leave you with no zfs modules installed.  This will occur
when only the zfs release, and not the version, are incremented.
This may be the case for a fast moving zfs-testing repository.

During the upgrade process DKMS will realize that zfs-x.y.z is already
installed and remove it.  DKMS then correctly builds the new modules
for zfs-x.y.z.  However, as a final step when the old zfs-x.y.z-r is
removed the %preun script runs and removes the newly build modules.
To handle this case the %preun script has been updated to only run
when the installed version exactly matches the full spec file version.

This change also updated ChangeLog section based on the DKMS
reference spec file.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Restrict release number to META version

When creating packages in a git repository the release number
can be automatically set by 'git describe'. This normally works
well but if your repository has newer tags which match the form
NAME-VERSION* the release may be incorrectly calculated. To
prevent this the match patten has been restricted to the contents
of the META file, NAME-VERSION.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Use default slab types

We should not override the default memory type of the kmem cache.  This
was done previously to force certain objects which were slightly over
object size limit cut off in to KMC_KMEM caches for better performance.

The zfsonlinux/spl#356 patch slightly increases the default cut off
from 511 bytes 1024 bytes for x86_64.  This means there is long longer
a need to override the default for the caches.  And since the default
values are now being used the new spl_kmem_cache_slab_limit and
spl_kmem_cache_kmem_limit tunables will apply to all kmem caches.

The following is a list of caches that will be impacted:

                  | object size   | forced type   | default type
----------------- | ------------- | ------------- | --------------
dnode_t           | 936 bytes     | KMC_KMEM      | KMC_KMEM
zio_cache         | 1104 bytes    | *KMC_KMEM     | *KMC_VMEM
zio_link_cache    | 48 bytes      | KMC_KMEM      | KMC_KMEM
zio_vdev_cache    | 131088 bytes  | KMC_VMEM      | KMC_VMEM
zio_buf_512       | 512 bytes     | KMC_KMEM      | KMC_KMEM
zio_data_buf_512  | 512 bytes     | KMC_KMEM      | KMC_KMEM
zio_buf_1024      | 1024 bytes    | KMC_KMEM      | KMC_KMEM
zio_data_buf_1024 | 1024 bytes    | +KMC_VMEM     | +KMC_KMEM

* Cache memory type will change from KMC_KMEM to KMC_VMEM.
+ Cache memory type will change from KMC_VMEM to KMC_KMEM.

This patch removes another slight point of divergence between ZoL
and Illumos.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes #2337

Omit compiler warning by sticking to RAII

Resolve gcc 4.9.0 20140507 warnings about uninitialized 'ptr' when
using -Wmaybe-uninitialized. The first two cases appears appear
to be legitimate but not the second two. In general this is a
good practice so they are all initialized.

Signed-off-by: Marcel Huber <marcelhuberfoo@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2345

Added INTEL SSD 530 Series

INTEL SSD 530 Series... SSDSC2BW24

Signed-off-by: John Albietz <inthecloud247@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2184

Honor zfs_nocacheflush for file vdevs

For consistency with disk vdevs honor the zfs_nocacheflush tunable.
This setting is available primarily for debugging and performance
analysis.

Signed-off-by: HC <mmttdebbcc@yahoo.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2336

Calculate header size correctly in sa_find_sizes()

In the case where a variable-sized SA overlaps the spill block pointer and
a new variable-sized SA is being added, the header size was improperly
calculated to include the to-be-moved SA. This problem could be
reproduced when xattr=sa enabled as follows:

ln -s $(perl -e 'print "x" x 120') blah
setfattr -n security.selinux -v blahblah -h blah

The symlink is large enough to interfere with the spill block pointer and
has a typical SA registration as follows (shown in modified "zdb -dddd"
<SA attr layout obj> format):

[ ... ZPL_DACL_COUNT ZPL_DACL_ACES ZPL_SYMLINK ]

Adding the SA xattr will attempt to extend the registration to:

[ ... ZPL_DACL_COUNT ZPL_DACL_ACES ZPL_SYMLINK ZPL_DXATTR ]

but since the ZPL_SYMLINK SA interferes with the spill block pointer, it
must also be moved to the spill block which will have a registration of:

[ ZPL_SYMLINK ZPL_DXATTR ]

This commit updates extra_hdrsize when this condition occurs, allowing
hdrsize to be subsequently decreased appropriately.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #2214
Issue #2228
Issue #2316
Issue #2343

Allow for lock-free reading zfsdev_state_list.

Restructure the zfsdev_state_list to allow for lock-free reading by
converting to a simple singly-linked list from which items are never
deleted and over which only forward iterations are performed. It depends
on, among other things, the atomicity of accessing the zs_minor integer
and zs_next pointer.

This fixes a lock inversion in which the zfsdev_state_lock is used by
both the sync task (txg_sync) and indirectly by any user program which
uses /dev/zfs; the zfsdev_release method uses the same lock and then
blocks on the sync task.

The most typical failure scenerio occurs when the sync task is cleaning
up a user hold while various concurrent "zfs" commands are in progress.

Neither Illumos nor Solaris are affected by this issue because they use
DDI interface which provides lock-free reading of device state via the
ddi_get_soft_state() function.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2301

Handle ZPOOL_STATUS_HOSTID_MISMATCH in zpool status

Verbatim imports can cause hostid mismatches, but things otherwise work. `zpool
status` does not handle this and will fail when assertions are enabled:

```
zpool: ../../cmd/zpool/zpool_main.c:4418: status_callback: Assertion `reason == ZPOOL_STATUS_OK' failed.

Program received signal SIGABRT, Aborted.
```

Lets instead add a case to display an informative message such as this:

```
  pool: rpool
state: ONLINE
status: Mismatch between pool hostid and system hostid on imported pool.
        This pool was previously imported into a system with a different hostid,
        and then was verbatim imported into this system.
action: Export this pool on all systems on which it is imported.
        Then import it to correct the mismatch.
   see: http://zfsonlinux.org/msg/ZFS-8000-EY
  scan: scrub repaired 0 in 0h8m with 0 errors on Thu Apr 17 19:43:57 2014
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          sda       ONLINE       0     0     0

errors: No known data errors
```

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2342

Use a dedicated taskq for vdev_file

Originally, vdev_file used system_taskq. This would cause a deadlock,
especially on system with few CPUs. The reason is that the prefetcher
threads, which are on system_taskq, will sometimes be blocked waiting
for I/O to finish. If the prefetcher threads consume all the tasks in
system_taskq, the I/O cannot be served and thus results in a deadlock.

We fix this by creating a dedicated vdev_file_taskq for vdev_file I/O.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2270

Handle vdev_lookup_top() failure in dva_get_dsize_sync()

The dva_get_dsize_sync() function incorrectly assumes that the call
to vdev_lookup_top() cannot fail.  However, the NULL dereference at
clearly shows that under certain circumstances it is possible.  Note
that offset 0x570 (1376) maps as expected to vd->vdev_deflate_ratio.

  BUG: unable to handle kernel NULL pointer dereference at 00000570

  crash> struct -o vdev
  struct vdev {
       [0] uint64_t vdev_id;
       ... ...
    [1376] uint64_t vdev_deflate_ratio;

Given that this can happen this patch add the required error handling.
In the case where vdev_lookup_top() fails assume that no deflation
will occur for the DVA and use the asize.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Alexey Zhuravlev <alexey.zhuravlev@intel.com>
Closes #1707
Closes #1987
Closes #1891

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Check the dataset type more rigorously when fetching properties.

When fetching property values of snapshots, a check against the head
dataset type must be performed.  Previously, this additional check was
performed only when fetching "version", "normalize", "utf8only" or "case".

This caused the ZPL properties "acltype", "exec", "devices", "nbmand",
"setuid" and "xattr" to be erroneously displayed with meaningless values
for snapshots of volumes.  It also did not allow for the display of
"volsize" of a snapshot of a volume.

This patch adds the headcheck flag paramater to zfs_prop_valid_for_type()
and zprop_valid_for_type() to indicate the check is being done
against a head dataset's type in order that properties valid only for
snapshots are handled correctly.  This allows the the head check in
get_numeric_property() to be performed when fetching a property for
a snapshot.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2265

Fix style

A minor style issue was accidentally introduced by aa7d06a.
This change resolves that style problem.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Illumos #4101 finer-grained control of metaslab_debug

Today the metaslab_debug logic performs two tasks:

- load all metaslabs on import/open
- don't unload metaslabs at the end of spa_sync

This change provides knobs for each of these independently.

References:
https://illumos.org/issues/4101
https://github.com/illumos/illumos-gate/commit/0713e23

Notes:

1) This is a small piece of the metaslab improvement patch from
Illumos. It was worth bringing over before the rest, since it's
low risk and it can be useful on fragmented pools (e.g. Lustre
MDTs). metaslab_debug_unload would give the performance benefit
of the old metaslab_debug option without causing unwanted delay
during pool import.

Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2227

Treat spill block dbufs as meta data

When the system attributes (SAs) for an object exceed what can
can be stored in the bonus area of a dnode a spill block is
allocated.  These spill blocks are currently considered data
blocks.  However, they should be accounted for as meta data
because they are effectively an extension of the dnode.

While this may seem like a minor accounting issue it has broader
implications.  The key thing to be aware of is that each spill
block will hold a reference on its parent dnode.  The dnode in
turn holds a reference on its dbuf in the dnode object.  This
means that a single 512 byte data buffer for a spill block can
pin over 16k of meta data.  This is analogous to the small file
situation described in 2b13331 where a relatively small number
of data buffer can cause the ARC to exceed the meta limit.

However, unlike the small file case a spill block can legitimately
be considered meta data.  By changing the spill block to meta data
they will now be dropped from the cache when the meta limit is
reached.  This then allows the dnodes and dbufs which the spill
block was pinning to be released.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes #2294

Remove SELinux enforcing check from init scripts

The default SELinux policy for RHEL and Fedora has been updated
to include ZFS in the list of filesystems which support xattrs.
Therefore, there's no longer a need to detect this in the init
scripts.

References:
https://bugzilla.redhat.com/show_bug.cgi?id=811532
https://bugzilla.redhat.com/show_bug.cgi?id=816543

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2166

ztest: Switch to LWP rwlock interface

ztest is intended to subject the ZFS code in userland to stress that it
should be able to withstand. Any failures that occur when running it are
failures that likely would occur inside the kernel. However, being in
userland, it is much easier to debug them. In practice, this prevents
a large number of problems from reaching production code.

A design decision was made by the original authors of ztest to make a
distinction between userland locking primitives and kernel locking
primitives. The ztest code itself calls userland locking primitives
while the kernel code being run in userland will call emulated kernel
locking primitives that wrap the userland locking primitives.

When ztest was first ported to Linux, a decision was made to use the
emulated kernel interfaces everywhere. In effect, the userland
rw_rdlock()/rw_wrlock() became the kernel rw_enter() and and the userland
rw_unlock() became the kernel rw_exit(). This caused a regression
because of an assertion in rw_enter() to catch recursive locking. That
is permitted in userland, but not in the kernel. Consequently, the ztest
code itself does recursive read locking. The use of the emulated kernel
interfaces consequently caused the following failure:

ztest: ../../lib/libzpool/kernel.c:384: Assertion `rwlp->rw_owner !=
zk_thread_current() (0x1c87150 != 0x1c87150)' failed.

That occurs because ztest_dmu_objset_create_destroy() will take a read
lock and call ztest_dmu_object_alloc_free(). That will call ztest_io(),
which will take a readlock only when asked to do ZTEST_IO_REWRITE. This
triggered the assertion.

The pthreads rwlock interface was based on the LWP rwlock interface
implemented in Illumos libc. Luckily enough, the subset used by ztest is
almost identical, so we can solve this problem by switching to the LWP
thread rwlock interface in ztest. This eliminates a point of divergence
with Illumos and should make code sharing slightly easier.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1970

libspl: Implement LWP rwlock interface

This implements a subset of the LWP rwlock interface by wrapping the
equivalent POSIX thread interface. It is a superset of the features
needed by ztest.

The missing bits are {,_}rw_read_held() and {,_}rw_write_held().

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1970

Fix libblkid ZFS detection when making new pools

zfsonlinux/zfs@1db7b9be75a225cedb3b7a60028ca5695e5b8346 should have
fixed this, but this particular string was overlooked.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2288

dmu_tx_assign() should not return ENOMEM

As described in the comment above dmu_tx_assign() this function must
only fail if the pool is out of space.  If for some other reason the
TX cannot be assigned (such as memory pressure) ERESTART must be
returned.  Alternately, EAGAIN could be returned to inject a delay
but that isn't required because the caller will block on the condition
variable waiting for the next TXG.

/*
* Assign tx to a transaction group.  txg_how can be one of:
*
* (1)  TXG_WAIT.  If the current open txg is full, waits until there's
*      a new one.  This should be used when you're not holding locks.
*      It will only fail if we're truly out of space (or over quota).
* ...
*/

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #2287

Implement File Attribute Support

We add support for lsattr and chattr to resolve a regression caused
by 88c283952f0bfeab54612f9ce666601d83c4244f that broke Python's
xattr.list(). That changet broke Gentoo Portage's FEATURES=xattr,
which depended on Python's xattr.list().

Only attributes common to both Solaris and Linux are supported. These
are 'a', 'd' and 'i' in Linux's lsattr and chattr commands. File
attributes exclusive to Solaris are present in the ZFS code, but cannot
be accessed or modified through this method. That was the case prior to
this patch. The resolution of issue zfsonlinux/zfs#229 should implement
some method to permit access and modification of Solaris-specific
attributes.

References:
https://bugs.gentoo.org/show_bug.cgi?id=483516

Original-patch-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1691

Refactor inode_owner_or_capable() autotools check

We need inode_owner_or_capable() for ZFS file attributes in addition to
xattrs, so it should go into its own file. This moves it into its own
file and changes it to be more comprehensive. It will now fail if no
known good API is detected.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1691

Fill in mountpoint buffer before using it in errors

zfs_is_mountable() fills in the mountpoint buffer, so, as in
upstream, it needs to have been called before the mountpoint
buffer can be used in error messages.

In particular,

return (zfs_error_fmt(hdl, EZFS_MOUNTFAILED,
dgettext(TEXT_DOMAIN, "cannot mount '%s'"),
mountpoint));

should not come before the call to zfs_is_mountable().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: ilovezfs <ilovezfs@icloud.com>
Closes #2284

Add assertion to catch 0-count page

Some network related block device uses tcp_sendpage, which doesn't
behave well when using 0-count page. Add assertion to catch them.

This has a runtime dependency on:
zfsonlinux/spl@ae16ed9 Fix crash when using ZFS on Ceph rbd

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2277

Add support for aarch64 (ARMv8)

Using the ARM reference simulation (fast model foundation v8) I
cross compiled spl and zfs, to confirm it works on ARMv8 (64 bit
arm architecture, called aarch64 in Linux).

As it is based on previous ARM porting, the resulting patch is
disappointingly small, there was very little to do. The code fixes
the compile issues and has light testing done.

Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2260

Fix LZ4 endianness autodetection

Endianness detection in LZ4 is broken in user-space builds.  This
bug corrupts compressed data and manifests itself in several ztest
failures.  When LZ4 was originally ported to Illumos ZFS, the proper
checks for Linux were stripped out. The Linux port then inherited
the remaining detection code that works on Illumos but not on Linux.

The current LZ4 endianness check misuses the condition
defined(__BIG_ENDIAN) to indicate a big-endian system.  On Linux
__BIG_ENDIAN is defined uncondtionally in the user-space header
/usr/include/endian.h, regardless of the endianness of the system.
The kernel does not use this header, so only user-space builds are
affected.

While we could fix this by restoring the upstream LZ4 endianness
detection code, reliable checks already exist in
libspl/include/sys/isa_defs.h. This change uses the libspl results
to replace the word-size and endianness checks in LZ4, simplifying
the code and reducing duplication.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #1963
Fixes #1964
Fixes #1965

Fix zfsdev_ioctl() kmem leak warning

Due to an asymmetry in the kmem accounting a memory leak was being
reported when it was only an accounting issue.  All memory allocated
with kmem_alloc() must be released with kmem_free() or it will not
be properly accounted for.

In this case the code used strfree() to release the memory allocated
by kmem_alloc().  Presumably this was done because the size of the
memory region wasn't available when the memory needed to be freed.

To resolve this issue the code has been updated to use strdup() instead
of kmem_alloc() to allocate the memory.  Like strfree(), strdup() is
not integrated with the memory accounting.  This means we can use
strfree() to release it like Illumos.

  SPL: kmem leaked 10/4368729 bytes
  address          size  data             func:line
  ffff880067e9aa40 10    ZZZZZZZZZZ       zfsdev_ioctl:5655

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #2262

Various zimport.sh fixes

1) $SPLSRC and $SRCDIR should be changed to $SRC_DIR.  These are
   vestiges of an earlier version of the script and were missed when
   it was updated.  Additionally ensure the directory is created.

2) The 'fail' function should take an integer argument for the
   error code to return.  Otherwise 0 (success) will be mistakenly
   returned and errors will we incorrectly suppressed.  The error
   code should be meaningful enough to determine where the script
   failed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Report atime and relatime as the property's actual value.

Neither atime nor relatime should be considered to be "temporary mount
point properties". Their semantics are enforced completely within ZFS
and also they're (correctly) not documented as being temporary mount
point properties.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2257

Uninitialized variable spa_autoreplace used

Caught by ztest and valgrind.

Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2259

Use ddi_time_after and friends to compare time

Also, make sure we use clock_t for ddi_get_lbolt to prevent type conversion
from screwing things.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2142

Make zimport.sh bash dependency explicit

Unfortunately, the zimport.sh test script really does depend on
bash. Moving to /bin/sh should be possible once the shared
infrastructure scripts it depends on is made portable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Linux 3.14 compat: rq_for_each_segment in dmu_req_copy

rq_for_each_segment changed from taking bio_vec * to taking bio_vec.
We provide rq_for_each_segment4 which takes both.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124

Revert "Fix zvol+btrfs hang"

After the dmu_req_copy change, bi_io_vecs are not touched, so this is no
longer needed.

This reverts commit e26ade5101ba1d8e8350ff1270bfca4258e1ffe3.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124

Refactor dmu_req_copy for immutable biovec changes

Originally, dmu_req_copy modifies bv_len and bv_offset in bio_vec so that it
can continue in subsequent passes. However, after the immutable biovec changes
in Linux 3.14, this is not allowed. So instead, we just tell dmu_req_copy how
many bytes are already copied and it will skip to the right spot accordingly.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124

Linux 3.14 compat: Immutable biovec changes in vdev_disk.c

bi_sector, bi_size and bi_idx are moved from bio to bio->bi_iter.
This patch creates BIO_BI_*(bio) macros to hide the differences.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124

Linux 3.14 compat: posix_acl_{create,chmod}

posix_acl_{create,chmod} is changed to __posix_acl_{create_chmod}

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124

Improve zfs.sh error messages

Ensure an error message is logged when the 'zfs.sh' script fails
to either load a module or if udev fails to create the /dev/zfs
device. Error messages for missing KERNEL_MODULES are suppressed
because that functionality may just be built-in to the kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Replace zed_file_create_dirs() with mkdirp()

When processing directory components starting from the root dir,
zed_file_create_dirs() contained a bug in checking the return value of
mkdir().  A typo was made, and the test for (mkdir_errno != EEXIST) was
erroneously written as (mkdir_errno == EEXIST).  If some of the leading
directory components already existed, this bug would cause the routine
to exit before creating the remaining directory components.

Instead of fixing the above mkdir_errno test, this commit replaces
zed_file_create_dirs() with mkdirp().  This cleanup was already
planned, and zed_file_create_dirs() only existed because I didn't
realize mkdirp() was already in tree at the time.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2248

Set errno for mkdirp() called with NULL path ptr

If mkdirp() is called with a NULL ptr for the path arg, it will return
-1 with errno unchanged.  This is unexpected since on error it should
return -1 and set errno to one of the error values listed for mkdir(2).

This commit sets errno = ENOENT for this NULL ptr case.  This is in
accordance with the errors specified by mkdir(2):

  ENOENT
  A component of the path prefix does not exist or is a null pathname.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2248

Dynamically create loop devices

Several of the in-tree regression tests depend on the availability
of loop devices.  If for some reason no loop devices are available
the tests will fail.

Normally this isn't an issue because most Linux distributions create
8 loop devices by default.  This is enough for our purposes.  However,
recent Fedora releases have only been creating a single loop device
and this leads to failures.  Alternately, if something else of the
system is using the loop devices we may see failures.

The fix for this is to update the support scripts to dynamically
create loop devices as needed.  The scripts need only create a node
under /dev/ and the loop driver with create the minor.  This behavior
has been supported by the loop driver for ages.

Additionally this patch updates cleanup_loop_devices() to cleanup
loop devices which have already had their file store deleted.  This
helps prevent stale loop devices from accumulating on the system due
to test failures.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes #2249

Improve partition detection on lesser used devices

The format strings in efi_get_info() are intended to extract both the
main device and partition number. However, this is only done correctly
for hd, sd and vd devices. The format strings for ram, dm-, md and loop
devices misparse the input. This causes the partition device to be
incorrectly labelled as the main device with the partition being
labelled 0.

Reported-by: ilovezfs <ilovezfs@icloud.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2175

Allow specifying '-o <opts>' in defaults/init script.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2103

Support using overlay mounts in defaults/init script.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2103

Fix for re-reading /etc/mtab.

This is a continuation of fb5c53ea65b75c67c23f90ebbbb1134a5bb6c140:

    When /etc/mtab is updated on Linux it's done atomically with
    rename(2).  A new mtab is written, the existing mtab is unlinked,
    and the new mtab is renamed to /etc/mtab.  This means that we
    must close the old file and open the new file to get the updated
    contents.  Using rewind(3) will just move the file pointer back
    to the start of the file, freopen(3) will close and open the file.

In this commit, a few more rewind(3) calls were replaced with freopen(3)
to allow updated mtab entries to be picked up immediately.

Signed-off-by: John M. Layman <jml@frijid.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2215
Issue #1611

Fix locking order in zfs_zget()

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix deadlock in zfs_zget()

zfsonlinux/zfs#180 occurred because of a race between inode eviction and
zfs_zget(). zfsonlinux/zfs@36df284 tried to address it by making a call
to the VFS to learn whether an inode is being evicted.  If it was being
evicted the operation was retried after dropping and reacquiring the
relevant resources.  Unfortunately, this introduced another deadlock.

  INFO: task kworker/u24:6:891 blocked for more than 120 seconds.
        Tainted: P           O 3.13.6 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  kworker/u24:6   D ffff88107fcd2e80     0   891      2 0x00000000
  Workqueue: writeback bdi_writeback_workfn (flush-zfs-5)
   ffff8810370ff950 0000000000000002 ffff88103853d940 0000000000012e80
   ffff8810370fffd8 0000000000012e80 ffff88103853d940 ffff880f5c8be098
   ffff88107ffb6950 ffff8810370ff980 ffff88103a9a5b78 0000000000000000
  Call Trace:
   [<ffffffff813dd1d4>] schedule+0x24/0x70
   [<ffffffff8115fc09>] __wait_on_freeing_inode+0x99/0xc0
   [<ffffffff8115fdd8>] find_inode_fast+0x78/0xb0
   [<ffffffff811608c5>] ilookup+0x65/0xd0
   [<ffffffffa035c5ab>] zfs_zget+0xdb/0x260 [zfs]
   [<ffffffffa03589d6>] zfs_get_data+0x46/0x340 [zfs]
   [<ffffffffa035fee1>] zil_add_block+0xa31/0xc00 [zfs]
   [<ffffffffa0360642>] zil_commit+0x12/0x20 [zfs]
   [<ffffffffa036a6e4>] zpl_putpage+0x174/0x840 [zfs]
   [<ffffffff811071ec>] do_writepages+0x1c/0x40
   [<ffffffff8116df2b>] __writeback_single_inode+0x3b/0x2b0
   [<ffffffff8116ecf7>] writeback_sb_inodes+0x247/0x420
   [<ffffffff8116f5f3>] wb_writeback+0xe3/0x320
   [<ffffffff81170b8e>] bdi_writeback_workfn+0xfe/0x490
   [<ffffffff8106072c>] process_one_work+0x16c/0x490
   [<ffffffff810613f3>] worker_thread+0x113/0x390
   [<ffffffff81066edf>] kthread+0xdf/0x100

This patch implements the original fix in a slightly different manner in
order to avoid both deadlocks.  Instead of relying on a call to ilookup()
which can block in __wait_on_freeing_inode() the return value from igrab()
is used.  This gives us the information that ilookup() provided without
the risk of a deadlock.

Alternately, this race could be closed by registering an sops->drop_inode()
callback.  The callback would need to detect the active SA hold thereby
informing the VFS that this inode should not be evicted.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #180

Revert "Fixed a use-after-free bug in zfs_zget()."

This reverts commit 36df284366caa77cb40083d2e6bcce02274e2f05.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Merge branch 'zed-initial'

zed monitors ZFS events. When a zevent is posted, zed will run any
scripts that have been enabled for the corresponding zevent class.
Multiple scripts may be invoked for a given zevent. The zevent nvpairs
are passed to the scripts as environment variables. Refer to the zed(8)
manpage for details.

Events are processed synchronously by the single thread, and there is
no maximum timeout for script execution. Consequently, a misbehaving
script can delay (or forever block) the processing of subsequent
zevents. Plans are to address this in future commits.

An EID (Event IDentifier) has been added to each event to uniquely
identify it throughout the lifetime of the loaded ZFS kernel module;
it is a monotonically increasing integer that resets to 1 each time
the module is loaded.

Initial scripts have been developed to log zevents to syslog,
automatically rebuild to a hot spare device, and send email in
response to checksum / data / io / resilver.finish / scrub.finish
zevents. To enable email notifications, configure ZED_EMAIL in zed.rc
(which is serving as a config file of sorts until a proper
configuration file is implemented). To enable hot sparing, uncomment
ZED_SPARE_ON_IO_ERRORS and ZED_SPARE_ON_CHECKSUM_ERRORS in zed.rc;
note that the autoexpand property is not yet supported.

zed is a work-in-progress.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2

Replace check for _POSIX_MEMLOCK w/ HAVE_MLOCKALL

zed supports a '-M' cmdline opt to lock all pages in memory via
mlockall().  The _POSIX_MEMLOCK define is checked to determine whether
this function is supported.  The current test assumes mlockall()
is supported if _POSIX_MEMLOCK is non-zero.  However, this test is
insufficient according to mlock(2) and sysconf(3).  If _POSIX_MEMLOCK
is -1, mlockall() is not supported; but if _POSIX_MEMLOCK is 0,
availability must be checked at runtime.

This commit adds an autoconf check for mlockall() to user.m4.  The zed
code block for mlockall() is now guarded with a test for HAVE_MLOCKALL.
If defined, mlockall() will be called and its runtime availability
checked via its return value.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2

Add automatic hot spare functionality

When a vdev starts getting I/O or checksum errors it is now
possible to automatically rebuild to a hot spare device.

To cleanly support this functionality in a shell script some
additional information was added to all zevent ereports which
include a vdev.  This covers both io and checksum zevents but
may be used but other scripts.

In the Illumos FMA solution the same information is required
but it is retrieved through the libzfs library interface.
Specifically the following members were added:

  vdev_spare_paths  - List of vdev paths for all hot spares.
  vdev_spare_guids  - List of vdev guids for all hot spares.
  vdev_read_errors  - Read errors for the problematic vdev
  vdev_write_errors - Write errors for the problematic vdev
  vdev_cksum_errors - Checksum errors for the problematic vdev.

By default the required hot spare scripts are installed but this
functionality is disabled.  To enable hot sparing uncomment the
ZED_SPARE_ON_IO_ERRORS and ZED_SPARE_ON_CHECKSUM_ERRORS in the
/etc/zfs/zed.d/zed.rc configuration file.

These scripts do no add support for the autoexpand property. At
a minimum this requires adding a new udev rule to detect when
a new device is added to the system.  It also requires that the
autoexpand policy be ported from Illumos, see:

  https://github.com/illumos/illumos-gate/blob/master/usr/src/cmd/syseventd/modules/zfs_mod/zfs_mod.c

Support for detecting the correct name of a vdev when it's not
a whole disk was added by Turbo Fredriksson.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Issue #2

Add missing DATA_TYPE_STRING_ARRAY output

This functionality has always been missing. But until now there
were no zevents which included an array of strings so it wasn't
missed. However, that's now changed so to ensure this information
is output correctly by 'zpool events -v' the DATA_TYPE_STRING_ARRAY
has been implemented.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2

Make command line guid parsing more tolerant

Several of the zfs utilities allow you to pass a vdev's guid rather
than the device name.  However, the utilities are not consistent in
how they parse that guid.  For example, 'zinject' expects the guid
to be passed as a hex value while 'zpool replace' wants it as a
decimal.  The user is forced to just know what format to use.

This patch improve things by making the parsing more tolerant.
When strtol(3) is called using 0 for the base, rather than say
10 or 16, it will then accept hex, decimal, or octal input based
on the prefix.  From the man page.

    If base is zero or 16, the string may then include a "0x"
    prefix, and  the number  will  be read in base 16; otherwise,
    a zero base is taken as 10 (decimal) unless the next character
    is '0', in which case it  is  taken as 8 (octal).

NOTE: There may be additional conversions not caught be this patch.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2

Add systemd unit file for zed

This commit adds a systemd unit file for zed.service and integrates
it into the zfs.target from commit 881f45c.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2108
Issue #2

Initial implementation of zed (ZFS Event Daemon)

zed monitors ZFS events.  When a zevent is posted, zed will run any
scripts that have been enabled for the corresponding zevent class.
Multiple scripts may be invoked for a given zevent.  The zevent
nvpairs are passed to the scripts as environment variables.

Events are processed synchronously by the single thread, and there is
no maximum timeout for script execution.  Consequently, a misbehaving
script can delay (or forever block) the processing of subsequent
zevents.  Plans are to address this in future commits.

Initial scripts have been developed to log events to syslog
and send email in response to checksum/data/io errors and
resilver.finish/scrub.finish events.  By default, email will only
be sent if the ZED_EMAIL variable is configured in zed.rc (which is
serving as a config file of sorts until a proper configuration file
is implemented).

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2

Replace zpool_events_next() "block" parm w/ "flags"

zpool_events_next() can be called in blocking mode by specifying a
non-zero value for the "block" parameter.  However, the design of
the ZFS Event Daemon (zed) requires additional functionality from
zpool_events_next().  Instead of adding additional arguments to the
function, it makes more sense to use flags that can be bitwise-or'd
together.

This commit replaces the zpool_events_next() int "block" parameter with
an unsigned bitwise "flags" parameter.  It also defines ZEVENT_NONE
to specify the default behavior.  Since non-blocking mode can be
specified with the existing ZEVENT_NONBLOCK flag, the default behavior
becomes blocking mode.  This, in effect, inverts the previous use
of the "block" parameter.  Existing callers of zpool_events_next()
have been modified to check for the ZEVENT_NONBLOCK flag.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2

Add defs for makefile installation dir vars

Add macro definitions to AM_CPPFLAGS to propagate makefile installation
directory variables for libexecdir, runstatedir, sbindir, and
sysconfdir.

https://www.gnu.org/software/autoconf/manual/autoconf-2.69/html_node/Installation-Directory-Variables.html

  A corollary is that you should not use these variables except
  in makefiles. For instance, instead of trying to evaluate
  datadir in configure and hard-coding it in makefiles using e.g.,
  'AC_DEFINE_UNQUOTED([DATADIR], ["$datadir"], [Data directory.])',
  you should add -DDATADIR='$(datadir)' to your makefile's definition
  of CPPFLAGS (AM_CPPFLAGS if you are also using Automake).

The runstatedir directory is for "installing data files which the
programs modify while they run, that pertain to one specific machine,
and which need not persist longer than the execution of the program".

https://www.gnu.org/prep/standards/html_node/Directory-Variables.html

It will be defined by autoconf 2.70 or later, and default to
"$(localstatedir)/run".

http://git.savannah.gnu.org/gitweb/?p=autoconf.git;a=commit;h=a197431414088a417b407b9b20583b2e8f7363bd

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2

Clarify zpool_events_next() comment

Due to the very poorly chosen argument name 'cleanup_fd' it was
completely unclear that this file descriptor is used to track the
current cursor location.  When the file descriptor is created by
opening ZFS_DEV a private cursor is created in the kernel for the
returned file descriptor.  Subsequent calls to zpool_events_next()
and zpool_events_seek() then require the file descriptor as an
argument to reposition the cursor.  When the file descriptor is
closed the kernel state tracking the cursor is destroyed.

This patch contains no functional change, it just changes a
few variable names and clarifies the documentation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2

Add zpool_events_seek() functionality

The ZFS_IOC_EVENTS_SEEK ioctl was added to allow user space callers
to seek around the zevent file descriptor by EID. When a specific
EID is passed and it exists the cursor will be positioned there.
If the EID is no longer cached by the kernel ENOENT is returned.
The caller may also pass ZEVENT_SEEK_START or ZEVENT_SEEK_END to seek
to those respective locations.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2

Add a unique "eid" value to all zevents

Tagging each zevent with a unique monotonically increasing EID
(Event IDentifier) provides the required infrastructure for a user
space daemon to reliably process zevents. By writing the EID to
persistent storage the daemon can safely resume where it left off
in the event stream when it's restarted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2

Remount datasets for "zfs inherit".

Changing properties with "zfs inherit" should cause the datasets
to be remounted. This ensures that the modified property values
will be propagated in to the filesystem namespace where they can
be enforced. This change is modeled after an identical fix made
to zfs_prop_set().

Signed-off-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2201

Linux 3.13 compat: Handle __must_check bdi_setup_and_register

torvalds/linux@8077c0d983ab276ec5f2700df56a64d671781905 added a
__must_check to the bdi_setup_and_register(), which caused our autotools
check to break. zfsonlinux/zfs@729210564a5325e190fc4fba22bf17bacf957ace
was intended to correct that, but it depended on -Wno-unused-result,
which is unrecognized in older GCC versions. That commit has been
reverted in favor of a solution that does not require
-Wno-unused-result.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2102
Closes #2135

Revert "Properly ignore bdi_setup_and_register return value"

Older GCC versions do not obey -Wno-unused-result. This reverts commit
729210564a5325e190fc4fba22bf17bacf957ace in favor of a solution that
does not require -Wno-unused-result.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1906

Illumos #4089 NULL pointer dereference in arc_read()

4089 NULL pointer dereference in arc_read()

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
https://www.illumos.org/issues/4089
illumos/illumos-gate@57815f6b95a743697e148327725b7f568e75e6ea

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2171
Issue #2165
Closes #2198

Illumos #3120 zinject hangs in zfsdev_ioctl() due to uninitialized zc

3120 zinject hangs in zfsdev_ioctl() due to uninitialized zc

Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
https://www.illumos.org/issues/3120
illumos/illumos-gate@f4c46b1eda9212fd32ba197043d52239ef5c0a7f

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2152

config: compile test rather than run test

When testing compiler flags, we only need to do compile test. Otherwise,
configure will fail with "configure: error: cannot run test program while
cross compiling" when cross compiling.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2191

Implement -t option to zpool import for temporary pool names

Originally, users had to handle spa namespace collisions by either
exporting the already imported pool or by specifying a new name for the
pool with a conflicting name. In the case of root pools from virtual
guests, neither approach to collision resolution is reasonable. This is
addressed by extending the new name syntax with a -t option to specify
that the new name is temporary. When specified, this sets an internal
flag that is passed into the kernel to tell it that all label updates
should refer to the name used in the original label. Consequently, the
original pool name will be retained on export.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2189

Assert alignment in umem_alloc_aligned

Valgrind suggests that the address we are returning is not properly
aligned, so lets add an assertion.

==87740==  Address 0x1012a22a is 554 bytes inside a block of size 4,096
alloc'd
==87740==    at 0x4C2BBA0: memalign (in
/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==87740==    by 0x4C2BCC7: posix_memalign (in
/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==87740==    by 0x52FA845: zio_buf_alloc (umem.h:101)
==87740==    by 0x52F6226: zil_alloc_lwb (zil.c:463)
==87740==    by 0x52F8559: zil_commit (zil.c:566)
==87740==    by 0x40611D: ztest_freeze (ztest.c:5909)
==87740==    by 0x4066A7: ztest_init (ztest.c:6048)
==87740==    by 0x407AF4: main (ztest.c:6226)

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2174

sighandler() should take 2 arguments

Stopping arcstat.py with ^C always ends up with error:
TypeError: sighandler() takes no arguments (2 given)

Since no special signal handling was done in sighandler(),
it's simpler to just set SIGINT handler to SIG_DFL, which
terminates the script.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes #2179

Revert "sighandler() should take 2 arguments"

This reverts commit 0bb89b6c594259829556f6dea5a89e722f214fd3 in
favor of a cleaner implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2182

Fix regression introduced in port of Illumos #3744

Remove the redundant call to zfs_unmount_snap() which was being
done after char array was freed,

This fixes an upstream regression that was introduced in commit
zfsonlinux/zfs@d09f25dc66774959499a89bf3680d09c6e541ce8, which
ported the Illumos 3744 changes.

Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #2156

Remove unused option -r from 'zpool import'

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2178

Switch ztest mmap(2) ASSERTs to VERIFYs

This is just a small bit of cleanup to ensure ztest fails early
on systems where mmap(2) is not functioning. For the automated
testing which is the primary consumer of ztest there is no
functional change because debugging is always enabled.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2177

Free props in ztest_init()

Valgrind complained about this and it's absolutely right. The
props nvlist was not being freed in ztest_init.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #2174

sighandler() should take 2 arguments

Stopping arcstat.py with ^C always ends up with error:

TypeError: sighandler() takes no arguments (2 given)

This patch corrects the error by updating the signal handler
to take the correct number of arguments.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2182

Make arcstat.py default to one header per screen

Today arcstat.py prints one header every hdr_intr (20 by default)
lines. It would be more consistent with out utilities like vmstat
if hdr_intr defaulted to terminal window size, i.e. one header
per screenful of outputs.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2183

Fix NAME section of manpages zhack and fsck.zfs.

In Debian GNU/Linux a program called 'linitian' is used to make sure
that packages conforms to the Debian GNU/Linux packaging guide lines.
This fixes the problem reported as:

W: zfsutils: manpage-has-bad-whatis-entry usr/share/man/man1/zhack.1.gz
W: zfsutils: manpage-has-bad-whatis-entry usr/share/man/man8/fsck.zfs.8.gz

Not something that ZoL needs to addhere to, but every other man page
have their NAME section in a special way - why not these two as well?

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2161

Illumos #4088 use after free in arc_release()

4088 use after free in arc_release()

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
https://www.illumos.org/issues/4088
illumos/illumos-gate@ccc22e130479b5bd7c0002267fee1e0602d3f772

From the illumos issue:

A race-induced use after free occurs in arc_release() where the
ARC header is used outside the critical section protected by the
hash_lock.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #2162

Use KM_PUSHPAGE in spa_add() for spa_label_features.

The spa_label_features nvlist is used in the sync context during pool
version upgrade.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2168

Export symbols dsl_sync_task{_nowait}

These are needed by consumers (i.e. Lustre) who wish to perform a
callback in the syncing context. Both a blocking and non-blocking
version are available to callers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Improve reporting of tx assignment wait times

Some callers of dmu_tx_assign() use the TXG_NOWAIT flag and call
dmu_tx_wait() themselves before retrying if the assignment fails.
The wait times for such callers are not accounted for in the
dmu_tx_assign kstat histogram, because the histogram only records
time spent in dmu_tx_assign().  This change moves the histogram
update to dmu_tx_wait() to properly account for all time spent there.

One downside of this approach is that it is possible to call
dmu_tx_wait() multiple times before successfully assigning a
transaction, in which case the cumulative wait time would not be
recorded.  However, this case should not often arise in practice,
because most callers currently use one of these forms:

  dmu_tx_assign(tx, TXG_WAIT);
  dmu_tx_assign(tx, waited ?  TXG_WAITED : TXG_NOWAIT);

The first form should make just one call to dmu_tx_delay() inside of
dmu_tx_assign(). The second form retries with TXG_WAITED if the first
assignment fails and incurs a delay, in which case no further waiting
is performed.  Therefore transaction delays normally occur in one
call to dmu_tx_wait() so the histogram should be fairly accurate.

Another possible downside of this approach is that the histogram will
no longer record overhead outside of dmu_tx_wait() such as in
dmu_tx_try_assign(). While I'm not aware of any reason for concern on
this point, it is conceivable that lock contention, long list
traversal, etc. could cause assignment delays that would not be
reflected in the histogram.  Therefore the histogram should strictly
be used for visibility in to the normal delay mechanisms and not as a
profiling tool for code performance.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1915

replace nreserved with ndirty in txgs kstat

The nreserved column in the txgs kstat file always contains 0
following the write throttle restructuring of commit
e8b96c6007bf97cdf34869c1ffbd0ce753873a3d.

Prior to that commit, the nreserved column showed the number of bytes
temporarily reserved in the pool by a transaction group at sync time.
The new write throttle did away with temporary reservations and uses
the amount of dirty data instead.  To approximate the old output of
the txgs kstat, the number of dirty bytes per-txg was passed in as
the nreserved value to spa_txg_history_set_io().  This approach did
not work as intended, because the per-txg dirty value is decremented
as data is written out to disk, so it is zero by the time we call
spa_txg_history_set_io().  To fix this, save the number of dirty
bytes before calling spa_sync(), and pass this value in to
spa_txg_history_set_io().

Also, since the name "nreserved" is now a misnomer, the column
heading is now labeled "ndirty".

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1696

dmu_tx kstat cleanup

A few counters in the dmu_tx kstats are obsolete or no longer
bumped properly.

- The sync task restructuring commit
  13fe019870c8779bf2f5b3ff731b512cf89133ef removed the code
  that bumpted dmu_tx_quota. The counter is now bumped in two
  cases, instead of just the one case as before (after the result
  of dsl_dataset_check_quota call). The second case is where
  we check the requested reservation against the actual pool size,
  as this is an implicit quota of sorts.

- The write throttle restructuring commit
  e8b96c6007bf97cdf34869c1ffbd0ce753873a3d makes dmu_tx_how and
  dmu_tx_inflight obsolete, so they are removed.

Signed-off-by: Kohsuke Kawaguchi <kk@kohsuke.org>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1914

Invalidate Linux buffer cache on vdevs upon each flush

Userland tools such as blkid, grub2-probe and zdb will go through the
buffer cache. However, ZFS uses on submit_bio() to bypass the buffer
cache when performing IO operations on vdevs for efficiency purposes.
This permits the on-disk state and buffer cache to fall out of
synchronization. That causes seemingly random failures when tools
reading stale metadata from the buffer cache try to access references to
data that is no longer there.

A particularly bad failure this causes involves grub2-probe, which is
used by grub2-mkconfig. Ordinarily, a rootfs might be called
rpool/ROOT/gentoo. However, when a failure occurs in grub2-probe,
grub2-mkconfig will generate a configuration file containing
/ROOT/gentoo, which omits the pool name and causes a boot failure.

This is avoidable by calling invalidate_bdev() on each flush, which is a
simple way to ensure that all non-dirty pages are wiped. Since userland
tools rarely access vdevs directly, this should be a fancy noop >99.999%
of the time and have little impact on IO. We could have tried a finer
grained approach for the rare instances in which the vdevs are accessed
frequently by userland. However, that would require consideration of
corner cases and it is not worth the effort.

Memory-wise, it would have been better to use a Linux kernel API hook to
disable the buffer cache on such devices, but it provides us no way of
doing that, so we opt for this approach instead. We should revisit that
idea in the future when higher priority issues have been tackled.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2150

Inform OpenRC that ZFS uses mtab

p_l in #zfsonlinux reported that he had issues mounting filesystems that
were resolved by adding rc_need="mtab" to /etc/init.d/zfs. Closer
inspection revealed that we do have a race, but it is not clear how this
race caused mounting to fail. What is clear is that this race should be
fixed, so lets add the proper `use mtab` line to handle it.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2148

Illumos #4574 get_clones_stat does not call zap_count in non-debug kernel

4574 get_clones_stat does not call zap_count in non-debug kernel

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Marcel Telka <marcel@telka.sk>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
https://www.illumos.org/issues/4574
illumos/illumos-gate@03d1795fa6f720eafbee821ad37f4343c391cfe4

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2147

Fix zap_lookup() in feature_is_supported().

The length (number of integers) argument passed to zap_lookup was wrong;
likely as a result of performing stack-reduction on the function.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2141

Include l2asize in arcstat

For consistency with upstream pull in the l2asize update after
reworking it from Perl to Python.

References:
https://github.com/mharsch/arcstat/pull/11
https://github.com/mharsch/arcstat/pull/12

Signed-off-by: cburroughs <chris.burroughs@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2122

Remove recursion from dsl_dir_willuse_space()

Remove recursion from dsl_dir_willuse_space() to reduce stack usage.
Issues with stack overflow were observed in zfs recv of zvols,
likelihood of an overflow is proportional to the depth of the dataset
as dsl_dir_willuse_space() recurses to parent datasets.

Signed-off-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2069

Merge branch 'arc-changes'

This stack of patches has been empirically shown to drastically improve
the hit rate of the ARC for certain workloads. As a result, fewer reads
to disk are required, which is generally a good thing and can
drastically improve performance if the workload is disk limited.

For the impatient, I'll summarize the results of the tests performed:

    * Test 1 - Creating many empty directories. This test saw 99.9%
               fewer reads and 12.8% more inodes created when running
               *with* these changes.

    * Test 2 - Creating many empty files. This test saw 4% fewer reads
               and 0% more inodes created when running *with* these
               changes.

    * Test 3 - Creating many 4 KiB files. This test saw 96.7% fewer
               reads and 4.9% more inodes created when running *with*
               these changes.

    * Test 4 - Creating many 4096 KiB files. This test saw 99.4% fewer
               reads and 0% more inodes created (but took 6.9% fewer
               seconds to complete) when running *with* these changes.

    * Test 5 - Rsync'ing a dataset with many empty directories. This
               test saw 36.2% fewer reads and 66.2% more inodes created
               when running *with* these changes.

    * Test 6 - Rsync'ing a dataset with many empty files. This test saw
               30.9% fewer reads and 0% more inodes created (but took
               24.3% fewer seconds to complete) when running *with*
               these changes.

    * Test 7 - Rsync'ing a dataset with many 4 KiB files. This test saw
               30.8% fewer reads and 173.3% more inodes created when
               running *with* these changes.

For the patient, the following consists of more a more detailed
description of the tests performed and the results gathered.

All the tests were run using identical machines, each with a pool
consisting of 5 mirror pairs with 2TB 7200 RPM disks. Each test was run
twice, once *without* this set of patches and again *with* this set of
patches to highlight the performance changes introduced. The first four
workloads tested were:

    ** NOTE: None of these tests were run to completion. They ran for a
             set amount of time and then were terminated or hit ENOSPC.

    1. Creating many empty directories:

       * fdtree -d 10 -l 8 -s 0 -f 0 -C
         -> 111,111,111 Directories
         ->           0 Files
         ->           0 KiB File Data

    2. Creating many empty files:

       * fdtree -d 10 -l 5 -s 0 -f 10000 -C
         ->       111,111 Directories
         -> 1,111,110,000 Files
         ->             0 KiB File Data

    3. Creating many 4 KiB files:

       * fdtree -d 10 -l 5 -s 1 -f 10000 -C
         ->       111,111 Directories
         -> 1,111,110,000 Files
         -> 4,444,440,000 KiB File Data

    4. Creating many 4096 KiB files:

       * fdtree -d 10 -l 5 -s 1024 -f 10000 -C
         ->           111,111 Directories
         ->     1,111,110,000 Files
         -> 4,551,106,560,000 KiB File Data

Results for these first four tests are below:

                  | Time (s) |   inodes |  reads |    writes |
                --+----------+----------+--------+-----------+
    Test 1 Before |    65069 | 37845363 | 831975 |   3214646 |
    Test 1 After  |    65069 | 42703608 |    778 |   3327674 |
                --+----------+----------+--------+-----------+
    Test 2 Before |    65073 | 54257583 | 208647 |   2413056 |
    Test 2 After  |    65069 | 54255782 | 200038 |   2533759 |
                --+----------+----------+--------+-----------+
    Test 3 Before |    65068 | 49857744 | 487130 |   5533348 |
    Test 3 After  |    65071 | 52294311 |  16078 |   5648354 |
                --+----------+----------+--------+-----------+
    Test 4 Before |    34854 |  2448329 | 385870 | 162116572 |
    Test 4 After  |    32419 |  2448329 |   2339 | 162175706 |
                --+----------+----------+--------+-----------+

    * "Time (s)" - The run time of the test in seconds
    * "inodes"   - The number of inodes created by the test
    * "reads"    - The number of reads performed by the test
    * "writes"   - The number of writes performed by the test

As you can see from the table above, running with this patch stack
*significantly* reduced the number of reads performed in 3 out of the 4
tests (due to an improved ARC hit rate).

In addition to the tests described above, which specifically targeted
creates only, three other workloads were tested. These additional tests
were targeting rsync performance against the datasets created in the
previous tests. A brief description of the workloads and results for
these tests are below:

    ** NOTE: Aside from (6), these tests didn't run to completion. They
             ran for a set amount of time and then were terminated.

    5. Rsync the dataset created in Test 1 to a new dataset:

       * rsync -a /tank/test-1 /tank/test-5

    6. Rsync the dataset created in Test 2 to a new dataset:

       * rsync -a /tank/test-2 /tank/test-6

    7. Rsync the dataset created in Test 3 to a new dataset:

       * rsync -a /tank/test-3 /tank/test-7

Results for Test 5, 6, and 7 are below:

                  | Time (s) |   inodes |    reads |  writes |
                --+----------+----------+----------+---------+
    Test 5 Before |    93041 | 17921014 | 47632823 | 4094848 |
    Test 5 After  |    93029 | 29785847 | 30376206 | 4484459 |
                --+----------+----------+----------+---------+
    Test 6 Before |    15290 | 54264474 |  6018331 |  733087 |
    Test 6 After  |    11573 | 54260826 |  4155661 |  617285 |
                --+----------+----------+----------+---------+
    Test 7 Before |    93057 | 10093749 | 41561635 | 3659098 |
    Test 7 After  |    93045 | 27587043 | 28773151 | 5612234 |
                --+----------+----------+----------+---------+

    * "Time (s)" - The run time of the test in seconds
    * "inodes"   - The number of inodes created by the test
    * "reads"    - The number of reads performed by the test
    * "writes"   - The number of writes performed by the test

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2110

Set "arc_meta_limit" to 3/4 arc_c_max by default

Unfortunately, this change is an cheap attempt to work around a
pathological workload for the ARC. A "real" solution still needs to be
fleshed out, so this patch is intended to alleviate the situation in the
meantime. Let me try and describe the problem..

Data buffers residing in the dbuf hash table (dbuf cache) will keep a
hold on their respective dnode, this dnode will in turn keep a hold on
its backing dbuf (the physical block of the dnode object backing it).
Since the dnode has a hold on its backing dbuf, the arc buffer for this
dbuf is unevictable. What this essentially boils down to, "data" buffers
have the potential to pin "metadata" in the arc (as a result of these
dnode object buffers being unevictable).

This scenario becomes a real problem when the workload consists of many
small files (e.g. creating millions of 4K files). With this workload,
the arc's "arc_meta_used" space get filled up with buffers for any
resident directories as well as buffers for the objset's dnode object.
Once the "arc_meta_limit" is reached, the directory buffers will be
evicted and only the unevictable dnode object buffers will reside. If
the workload is simply creating new small files, these dnode object
buffers will never even be needed again, whereas the directory buffers
will be used constantly until the creates move to a new directory.

If "arc_c" and "arc_meta_limit" are sized appropriately, this
situation wont occur. This is because as the data buffers accumulate,
"arc_size" will eventually approach "arc_c" (before "arc_meta_used"
reaches "arc_meta_limit"); at that point the data buffers will be
evicted, which releases the hold on the dnode, which releases the hold
on the dnode object's dbuf, which allows that buffer to be evicted from
the arc in preference to more "useful" metadata.

So, to side step the issue, we simply need to ensure "arc_size" reaches
"arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to
pick a proper limit, we have to do some math.

To make things a little easier to follow, it is assumed that there will
only be a single data buffer per file (which is probably always the case
for "small" files anyways).

Based on the current internals of the arc, if N files residing in the
dbuf cache all pin a single dnode buffer (i.e. their dnodes all share
the same physical dnode object block), then the following amount of
"arc_meta_used" space will be consumed:

    - 16K for the dnode object's block - [        16384 bytes]
    - N * sizeof(dnode_t) -------------- [      N * 928 bytes]
    - (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) *  72 bytes]
    - (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes]
    - (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes]

To simplify, these N files will pin the following amount of
"arc_meta_used" space as unevictable:

    Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280)
    Pinned "arc_meta_used" bytes = 17000 + N * 1544

This pinned space is regardless of the size of the files, and is only
dependent on the number of pinned dnodes sharing a physical block
(i.e. N). For example, 32 512b files sharing a single dnode object
block would consume the same "arc_meta_used" space as 32 4K files
sharing a single dnode object block.

Now, given a files size of S, we can determine the total amount of
space that will be consumed in the arc:

    Total = 17000 + N * 1544 + S * N
            ^^^^^^^^^^^^^^^^   ^^^^^
                metadata        data

So, given these formulas, we can generate a table which states the ratio
of pinned metadata to total arc (meta + data) using different values of
N (number of pinned dnodes per pinned physical dnode block) and S (size
of the file).

                                  File Sizes (S)
       |    512   |   1024   |   2048   |   4096   |   8192   |   16384  |
    ---+----------+----------+----------+----------+----------+----------+
     1 | 0.973132 | 0.947670 | 0.900544 | 0.819081 | 0.693597 | 0.530921 |
     2 | 0.951497 | 0.907481 | 0.830632 | 0.710325 | 0.550779 | 0.380051 |
N   4 | 0.918807 | 0.849809 | 0.738842 | 0.585844 | 0.414271 | 0.261250 |
     8 | 0.877541 | 0.781803 | 0.641770 | 0.472505 | 0.309333 | 0.182965 |
    16 | 0.835819 | 0.717945 | 0.559996 | 0.388885 | 0.241376 | 0.137253 |
    32 | 0.802106 | 0.669597 | 0.503304 | 0.336277 | 0.202123 | 0.112423 |

As you can see, if we wanted to support the absolute worst case of 1
dnode per physical dnode block and 512b files, we would have to set the
"arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At
that point, it essentially defeats the purpose of having an
"arc_meta_limit" at all.

This patch changes the default value of "arc_meta_limit" to be 75% of
"arc_c_max", which should be good enough for "most" workloads (I think).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Split "data_size" into "meta" and "data"

Previously, the "data_size" field in the arcstats kstat contained the
amount of cached "metadata" and "data" in the ARC. The problem is this
then made it difficult to extract out just the "metadata" size, or just
the "data" size.

To make it easier to distinguish the two values, "data_size" has been
modified to count only buffers of type ARC_BUFC_DATA, and "meta_size"
was added to count only buffers of type ARC_BUFC_METADATA. If one wants
the old "data_size" value, simply sum the new "data_size" and
"meta_size" values.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Prioritize "metadata" in arc_get_data_buf

When the arc is at it's size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from over stepping
it's bounds (i.e. keep it below the size limitation placed on it).

This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.

For example, consider the following scenario:

    * the size of the arc is capped at 10G
    * the meta_limit is capped at 4G
    * 9G of the arc contains "data"
    * 1G of the arc contains "metadata"

Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.

To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer; unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Remove "arc_meta_used" from arc_adjust calculation

Using "arc_meta_used" to determine if the arc's mru list is over it's
target value of "arc_p" doesn't seem correct. The size of the mru list
and the value of "arc_meta_used", although related, are completely
independent. Buffers contained in "arc_meta_used" may not even be
contained in the arc's mru list. As such, this patch removes
"arc_meta_used" from the calculation in arc_adjust.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Prune metadata from ghost lists in arc_adjust_meta

To maintain a strict limit on the metadata contained in the arc, while
preventing the arc buffer headers from completely consuming the
"arc_meta_used" space, we need to evict metadata buffers from the arc's
ghost lists along with the regular lists.

This change modifies arc_adjust_meta such that it more closely models
the adjustments made in arc_adjust. "arc_meta_used" is used similarly to
"arc_size", and "arc_meta_limit" is used similarly to "arc_c".

Testing metadata intensive workloads (e.g. creating, copying, and
removing millions of small files and/or directories) has shown this
change to make a dramatic improvement to the hit rate maintained in the
arc. While I think there is still room for improvement, this is a big
step in the right direction.

In addition, zpl_free_cached_objects was made into a no-op as I'm not
yet sure how to properly implement that function.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Revert "Return -1 from arc_shrinker_func()"

This reverts commit c11a12bc3b2e5ee9a6bd74e26f1a396b6025fbd4.

Out of memory events were fixed by reverting this patch.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Disable arc_p adapt dampener by default

It's unclear why adjustments to arc_p need to be dampened as they are in
arc_adjust. With that said, it's removal significantly improves the arc's
ability to "warm up" to a given workload. Thus, I'm disabling by default
until its usefulness is better understood.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Allow "arc_p" to drop to zero or grow to "arc_c"

Setting a limit on the minimum value of "arc_p" has been shown to have
detrimental effects on the arc hit rate for certain "metadata" intensive
workloads. Specifically, this has been exhibited with a workload that
constantly dirties new "metadata" but also frequently touches a "small"
amount of mfu data (e.g. mkdir's).

What is seen is that the new anon data throttles the mfu list to a
negligible size (because arc_p > anon + mru in arc_get_data_buf), even
though the mfu ghost list receives a constant stream of hits. To remedy
this, arc_p is now allowed to drop to zero if the algorithm deems it
necessary.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Disable aggressive arc_p growth by default

For specific workloads consisting mainly of mfu data and new anon data
buffers, the aggressive growth of arc_p found in the arc_get_data_buf()
function can have detrimental effects on the mfu list size and ghost
list hit rate.

Running a workload consisting of two processes:

    * Process 1 is creating many small files
    * Process 2 is tar'ing a directory consisting of many small files

I've seen arc_p and the mru grow to their maximum size, while the mru
ghost list receives 100K times fewer hits than the mfu ghost list.

Ideally, as the mfu ghost list receives hits, arc_p should be driven
down and the size of the mfu should increase. Given the specific
workload I was testing with, the mfu list size should grow to a point
where almost no mfu ghost list hits would occur. Unfortunately, this
does not happen because the newly dirtied anon buffers constancy drive
arc_p to its maximum value and keep it there (effectively prioritizing
the mru list and starving the mfu list down to a negligible size).

The logic to increment arc_p from within the arc_get_data_buf() function
was introduced many years ago in this upstream commit:

    commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc
    Author: maybee <none@none>
    Date:   Wed Dec 20 15:46:12 2006 -0800

        6505658 target MRU size (arc.p) needs to be adjusted more aggressively

and since I don't fully understand the motivation for the change, I am
reluctant to completely remove it.

As a way to test out how it's removal might affect performance, I've
disabled that code by default, but left it tunable via a module option.
Thus, if its removal is found to be grossly detrimental for certain
workloads, it can be re-enabled on the fly, without a code change.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110

Adjust arc_p based on "bytes" in arc_shrink

In an attempt to prevent arc_c from collapsing "too fast", the
arc_shrink() function was updated to take a "bytes" parameter by this
change:

    commit 302f753f1657c05a4287226eeda1f53ae431b8a7
    Author: Brian Behlendorf <behlendorf1@llnl.gov>
    Date:   Tue Mar 13 14:29:16 2012 -0700

        Integrate ARC more tightly with Linux

Unfortunately, that change failed to make a similar change to the way
that arc_p was updated. So, there still exists the possibility for arc_p
to collapse to near 0 when the kernel start calling the arc's shrinkers.

This change attempts to fix this, by decrementing arc_p by the "bytes"
parameter in the same way that arc_c is updated.

In addition, a minimum value of arc_p is attempted to be maintained,
similar to the way a minimum arc_p value is maintained in arc_adapt().

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110