git.proxmox.com Git - mirror_zfs-debian.git/log

Linux 3.5 compat, eops->encode_fh() takes inodes

The export_operations member ->encode_fh() has been updated to
take both the child and parent inodes. This interface used to
take the child dentry and a bool describing if the parent is needed.

NOTE: While updating this code I noticed that we do not currently
cleanly handle the case where we're passed a connectable parent.
This code should be audited to make sure we're doing the right thing.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784

Fix NULL pointer dereference on PaX/GRSecurity patched Linux 3.3 and later kernels

Support for PaX/GRSecurity patched kernels was developed against Linux
3.2. Unfortunately, an autotools check introduced for a Linux 3.3 API
fails on PaX/GRSecurity patched kernels. This causes the module to be
built against the Linux 3.2 ABI, which results in a NULL pointer
dereference at runtime.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Closes #794
Closes #809

Disable .zfs directory on 32-bit systems

The .zfs control directory implementation currently relies on
the fact that there is a direct 1:1 mapping from an object id
to its inode number.  This works well as long as the system
uses a 64-bit value to store the inode number.

Unfortunately, the Linux kernel defines the inode number as
an 'unsigned long' type.  This means that for 32-bit systems
will only have 32-bit inode numbers but we still have 64-bit
object ids.

This problem is particularly acute for the .zfs directories
which leverage those upper 32-bits.  This is done to avoid
conflicting with object ids which are allocated monotonically
starting from 0.  This is likely to also be a problem for
datasets on 32-bit systems with more than ~2 billion files.

The right long term fix must remove the simple 1:1 mapping.
Until that's done the only safe thing to do is to disable the
.zfs directory on 32-bit systems.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add ddt_object_load() error handling

Add the missing error handling to ddt_object_load().  There's no
good reason this needs to be fatal.  It is preferable that an
error be returned.  This will allow 'zpool import -FX' to safely
attempt to rollback through previous txgs looking for a good one.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add 'inline' keyword

The '__attribute__((always_inline))' does not strictly imply
'inline'.  Newer versions of gcc detect this misuse and issue
the following warning.  Including the missing 'inline' resolves
the build warning.

    ./module/zfs/dsl_scan.c:758:1:error: always_inline function
    might not be inlinable [-Werror=attributes]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix build failures on PaX/GRSecurity patched kernels

Gentoo Hardened kernels include the PaX/GRSecurity patches. They use a
dialect of C that relies on a GCC plugin. In particular, struct
file_operations has been marked do_const in the PaX/GRSecurity dialect,
which causes GCC to consider all instances of it as const. This caused
failures in the autotools checks and the ZFS source code.

To address this, we modify the autotools checks to take into account
differences between the PaX C dialect and the regular C dialect. We also
modify struct zfs_acl's z_ops member to be a pointer to a function
pointer table. Lastly, we modify zpl_put_link() to address a PaX change
to the function prototype of nd_get_link(). This avoids compiler errors
in the PaX/GRSecurity dialect.

Note that the change in zpl_put_link() causes a warning that becomes a
build failure when debugging is enabled. Fixing that warning requires
ryao/spl@5ca50ef459c59bc74b7a7cd3af7311da2b1cd2c3.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #484

Move partition scanning from userspace to module.

Currently, zpool online -e (dynamic vdev expansion) doesn't work on
whole disks because we're invoking ioctl(BLKRRPART) from userspace
while ZFS still has a partition open on the disk, which results in
EBUSY.

This patch moves the BLKRRPART invocation from the zpool utility to the
module. Specifically, this is done just before opening the device in
vdev_disk_open() which is called inside vdev_reopen(). This requires
jumping through some hoops to get to the disk device from the partition
device, and to make sure we can still open the partition after the
BLKRRPART call.

Note that this new code path is triggered on dynamic vdev expansion
only; other actions, like creating a new pool, are unchanged and still
call BLKRRPART from userspace.

This change also depends on API changes which are available in 2.6.37
and latter kernels.  The build system has been updated to detect this,
but there is no compatibility mode for older kernels.  This means that
online expansion will NOT be available in older kernels.  However, it
will still be possible to expand the vdev offline.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #808

Move zfs.release generation to configure step

Previously, the zfs.release file was created at 'make install' time.
This is slightly problematic when the file is needed without running
'make install'. Because of this, the step creating the file was removed
from 'make install' and replaced with a more appropriate zfs.release.in
file.

As a result, the zfs.release file will now be created earlier as part
of the 'configure' step as opposed to the 'make install' step.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add PowerPC to supported VTOCs

This code was was inherited from Solaris which was careful to define
the expected VTOC for various supported architectures. While this
check may have made sense there it's something we should be able to
safely drop under Linux.

However, I'm not quite ready to do that yet. So for the moment I'm
just doing the very safe thing of adding PowerPC as a supported type.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix efi_use_whole_disk() when efi_nparts == 128.

Commit e5dc681a changed EFI_NUMPAR from 9 to 128. This means that the
on-disk EFI label has efi_nparts = 128 instead of 9. The index of the
reserved partition, however, is still 8. This breaks
efi_use_whole_disk(), which uses efi_nparts-1 as the index of the
reserved partition.

This commit fixes efi_use_whole_disk() when the index of the reserved
partition is not efi_nparts-1. It rewrites the algorithm and makes it
more robust by using the order of the partitions instead of their
numbering. It assumes that the last non-empty partition is the reserved
partition, and that the non-empty partition before that is the data
partition.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #808

Use the right device path when relabeling.

Currently, zpool_vdev_online() calls zpool_relabel_disk() with a short
partition device name, which is obviously wrong because (1)
zpool_relabel_disk() expects a full, absolute path to use with open()
and (2) efi_write() must be called on an opened disk device, not a
partition device.

With this patch, zpool_relabel_disk() gets called with a full disk
device path. The path is determined using the same algorithm as
zpool_find_vdev().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #808

Fix error handling for "zpool online -e".

The error handling code around zpool_relabel_disk() is either inexistent
or wrong. The function call itself is not checked, and
zpool_relabel_disk() is generating error messages from an unitialized
buffer.

Before:

    # zpool online -e homez sdb; echo $?
    `: cannot relabel 'sdb1': unable to open device: 2
    0

After:

    # zpool online -e homez sdb; echo $?
    cannot expand sdb: cannot relabel 'sdb1': unable to open device: 2
    1

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #808

Illumos #1949, #1953

1949 crash during reguid causes stale config
1953 allow and unallow missing from zpool history since removal of pyzfs

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Garrett D'Amore <garrett.damore@gmail.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
https://www.illumos.org/issues/1949
https://www.illumos.org/issues/1953

Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #665

Illumos #1748: desire support for reguid in zfs

Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Alexander Eremin <alexander.eremin@nexenta.com>
Reviewed by: Alexander Stetsenko <ams@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/1748

This commit modifies the user to kernel space ioctl ABI.  Extra
care should be taken when updating to ensure both the kernel
modules and utilities are updated.  If only the user space
component is updated both the 'zpool events' command and the
'zpool reguid' command will not work until the kernel modules
are updated.

Ported by:     Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #665

Relicense zfs.gentoo.in from GPLv2 to 2-clause BSD

As the Gentoo sys-fs/zfs maintainer, I receive license compatibility
questions and at times, those questions can be harassing. I feel that
the presence of the GPL in Gentoo's package metadata promotes such
questions. zfs.gentoo.in is the only GPLv2 licensed file in ZFS, so I
have taken the liberty of contacting all contributors to this file to
request permission to relicense it.

All of the contributors to this file have agreed to relicense it under
the 2-clause BSD license. I have added their Signed-offs to this commit,
in order of first contribution. Thank you everyone for being so
understanding.

Signed-off-by: devsk <devsku@gmail.com>
Signed-off-by: Alexey Shvetsov <alexxy@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrew Tselischev <andrewtselischev@gmail.com>
Signed-off-by: Zachary Bedell <zac@thebedells.org>
Signed-off-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Kyle Fuller <inbox@kylefuller.co.uk>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Closes #819

Use ULL suffix in constants

The lack of the ULL suffix causes warnings such as the following on
32-bit systems:

  In function 'zfsctl_is_snapdir':
  zfs-0.6.0//module/zfs/zfs_ctldir.c:151: warning: integer constant
  is too large for 'long' type

We add the ULL suffix to fix that.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #813

Update incorrect ddt_zap_lookup() assertion

When the ddt_zap_lookup() function was updated to dynamically
allocate memory for the cbuf variable, to save stack space, the
'csize <= sizeof (cbuf)' assertion was not updated. The result
of this was that the size of the pointer was being used in the
comparison rather than the buffer size.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>

Remove Chaos 4.x RPM support

The Chaos 4.x distribution is based on RHEL 5.x which is no longer
supported by ZoL since it uses a 2.6.18 kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Support debug and debug-devel sub packages

This commit adds support for building debug and debug-devel sub packages
of the zfs-modules main package. This is to allow building packages
which are built against a debug kernel. By default, only packages are
built against a regular non-debug kernel. This can be toggled by passing
the '--with kernel-debug' parameter to rpmbuild.

Examples:

    # To build packages against only the non-debug kernel
    $ rpmbuild --rebuild --with kernel --without kernel-debug $SRPM

    # To build packages against only the debug kernel
    $ rpmbuild --rebuild --without kernel --with kernel-debug $SRPM

    # To build packages against debug and non-debug kernel
    $ rpmbuild --rebuild --with kernel --with kernel-debug $SRPM

Note: Only the RHEL 5/6, CHAOS 5, and Fedora distributions are supported
      for building the debug and debug-devel packages.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add ZIL statistics.

The performance of the ZIL is usually the main bottleneck when dealing with
synchronous, write-heavy workloads (e.g. databases). Understanding the
behavior of the ZIL is required to diagnose performance issues for these
workloads, and to tune ZIL parameters (like zil_slog_limit) accordingly.

This commit adds a new kstat page dedicated to the ZIL with some counters
which, hopefully, scheds some light into what the ZIL is doing, and how it is
doing it.

Currently, these statistics are available in /proc/spl/kstat/zfs/zil.
A description of the fields can be found in zil.h.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #786

ZFS 0.6.0-rc9

Speed up 'zfs list -t snapshot -o name -s name'

FreeBSD #xxx:  Dramatically optimize listing snapshots when user
requests only snapshot names and wants to sort them by name, ie.
when executes:

  # zfs list -t snapshot -o name -s name

Because only name is needed we don't have to read all snapshot
properties.

Below you can find how long does it take to list 34509 snapshots
from a single disk pool before and after this change with cold and
warm cache:

    before:

        # time zfs list -t snapshot -o name -s name > /dev/null
        cold cache: 525s
        warm cache: 218s

    after:

        # time zfs list -t snapshot -o name -s name > /dev/null
        cold cache: 1.7s
        warm cache: 1.1s

NOTE: This patch only appears in FreeBSD.  If/when Illumos picks up
the change we may want to drop this patch and adopt their version.
However, for now this addresses a real issue.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #450

Add zvol_inhibit_dev module option.

ZoL can create more zvols at runtime than can be configured during
system start, which hangs the init stack at reboot.

When a slow system has more than a few hundred zvols, udev will
fork bomb during system start and spend too much time in device
detection routines, so upstart kills it.

The zfs_inhibit_dev option allows an affected system to be rescued
by skipping /dev/zd* creation and thereby avoiding the udev
overload. All zvols are made inaccessible if this option is set, but
the `zfs destroy` and `zfs send` commands still work, and ZFS
filesystems can be mounted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Make zvol_remove_link() print a more useful error message

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Mark zdev.conf as a config file

Prevent 'rpm -Uvh *.rpm" from automatically replacing your vdev.conf
file by flagging it as a non replacable config file.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #486

Workaround for failing zvol_id

This is not a proper fix. It is just a workaround for the stack
smashing detected by gcc in zvol_id. We simply disable the gcc
stack protector for now when building the zvol_id udev helper.
Once the root cause is resolved this patch should be reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issues #569

Make zil_slog_limit a tunable module parameter.

zil_slog_limit specifies the maximum commit size to be written to the separate
log device. Larger commits bypass the separate log device and go directly to
the data devices.

The optimal value for zil_slog_limit directly depends on the latency and
throughput characteristics of both the separate log device and the data disks.
Small synchronous writes are faster on low-latency separate log devices (e.g.
SSDs) whereas large synchronous writes are faster on high-latency data disks
(e.g. spindles) because of higher throughput, especially with a large array.
The point is, the line between "small" and "large" synchronous writes in this
scenario is heavily dependent on the hardware used. That's why it should be
made configurable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #783

Retry removal of busy minors

When failing to remove a zvol device link because it's busy, wait
a bit and retry in a loop instead of giving up immediately. This
technique is similar to the loop in zpool_label_disk_wait(), with
the same goal: waiting for the asynchronous udev processes to finish
their work.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #692

Include <locale.h> to avoid error: 'LC_ALL' undeclared.

When compiling ZFS with CFLAGS=-O0 it will trigger the following error.
Resolve the issue by properly including locale.h.

  ../../cmd/mount_zfs/mount_zfs.c: In function 'main':
  ../../cmd/mount_zfs/mount_zfs.c:318:2: warning: implicit declaration
      of function 'setlocale' [-Wimplicit-function-declaration]
  ../../cmd/mount_zfs/mount_zfs.c:318:19: error: 'LC_ALL' undeclared
      (first use in this function)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #724

Linux 3.4 compat, d_make_root() replaces d_alloc_root()

torvalds/linux@adc0e91ab142abe93f5b0d7980ada8a7676231fe introduced
introduced d_make_root() as a replacement for d_alloc_root(). Further
commits appear to have removed d_alloc_root() from the Linux source
tree. This causes the following failure:

  error: implicit declaration of function 'd_alloc_root'
  [-Werror=implicit-function-declaration]

To correct this we update the code to use the current d_make_root()
interface for readability.  Then we introduce an autotools check
to determine if d_make_root() is available.  If it isn't then we
define some compatibility logic which used the older d_alloc_root()
interface.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #776

Honor logbias when writing to ZVOLs.

The logbias option is not taken into account when writing to ZVOLs. We fix
that by using the same logic as in the zfs filesystem write code
(see zfs_log.c).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #774

Improve CONFIG_DEBUG_LOCK_ALLOC error message

The configure script error message for kernels built with
CONFIG_DEBUG_LOCK_ALLOC may give the impression that the issue is
strictly with license compliance. To avoid confusion add some words
indicating that the linking stage will fail if the build continues.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #773

Improve 'zpool import' EBUSY error message

When a device is already open O_EXCL by another process the
`zpool import` will correctly fail.  However, the default failure
message isn't very helpful.  It may in fact be harmful if you
take its advise and destroy your pool.

    cannot import 'tank': pool is busy
            Destroy and re-create the pool from
            a backup source.

Improve the error message in the EBUSY case to simply print a
message indicating that the devices are current in use.  The user
will need to manually identify which process has the device open
exclusively and why.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add /dev/mapper/ to search path

When creating pools short device names may be used when those
devices appear in certain well known locations under /dev/.
This change adds /dev/mapper/ to that list.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add vdev_id for JBOD-friendly udev aliases

vdev_id parses the file /etc/zfs/vdev_id.conf to map a physical path
in a storage topology to a channel name.  The channel name is combined
with a disk enclosure slot number to create an alias that reflects the
physical location of the drive.  This is particularly helpful when it
comes to tasks like replacing failed drives.  Slot numbers may also be
re-mapped in case the default numbering is unsatisfactory.  The drive
aliases will be created as symbolic links in /dev/disk/by-vdev.

The only currently supported topologies are sas_direct and sas_switch:

o  sas_direct - a channel is uniquely identified by a PCI slot and a
   HBA port

o  sas_switch - a channel is uniquely identified by a SAS switch port

A multipath mode is supported in which dm-mpath devices are handled by
examining the first running component disk, as reported by 'multipath
-l'.  In multipath mode the configuration file should contain a
channel definition with the same name for each path to a given
enclosure.

vdev_id can replace the existing zpool_id script on systems where the
storage topology conforms to sas_direct or sas_switch.  The script
could be extended to support other topologies as well.  The advantage
of vdev_id is that it is driven by a single static input file that can
be shared across multiple nodes having a common storage toplogy.
zpool_id, on the other hand, requires a unique /etc/zfs/zdev.conf per
node and a separate slot-mapping file.  However, zpool_id provides the
flexibility of using any device names that show up in
/dev/disk/by-path, so it may still be needed on some systems.

vdev_id's functionality subsumes that of the sas_switch_id script, and
it is unlikely that anyone is using it, so sas_switch_id is removed.

Finally, /dev/disk/by-vdev is added to the list of directories that
'zpool import' will scan.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #713

Extend CONFIG_DEBUG_LOCK_ALLOC check

The CONFIG_DEBUG_LOCK_ALLOC check at configure time was added to
detect when mutex_lock() is defined as a GPL-only symbol. However,
the check as written only inferred this from this configuration
setting, it never actually checked. This change introduces that
missing check to prevent false positives.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Define the needed ISA types for ARM

Add the minimum required ISA types to support the ARM architecture.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Revert "Disable direct reclaim on zvols"

This reverts commit ce90208cf9e04df966429f115d8831371ea9afce. This
change was observed to cause problems when using a zvol to back a VM
under 2.6.32.59 kernels. This issue was filed as #710.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #342
Issue #710

Add mdadm and bc dependencies

The zfs-test package additionally depends on mdadm and bc to
run the zfault.sh tests.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #690

Linux 3.3 compat, iops->create()/mkdir()/mknod()

The mode argument of iops->create()/mkdir()/mknod() was changed from
an 'int' to a 'umode_t'. To prevent a compiler warning an autoconf
check was added to detect the API change and then correctly set a
zpl_umode_t typedef. There is no functional change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #701

Disable direct reclaim on zvols

Previously, it was possible for the direct reclaim path to be invoked
when a write to a zvol was made. When a zvol is used as a swap device,
this often causes swap requests to depend on additional swap requests,
which deadlocks. We address this by disabling the direct reclaim path
on zvols.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #342

Update ARC memory limits to account for SLUB internal fragmentation

23bdb07d4e4c435205d25d3efdb5fef2d089ce5e updated the ARC memory limits
to be 1/2 of memory or all but 4GB. Unfortunately, these values assume
zero internal fragmentation in the SLUB allocator, when in reality, the
internal fragmentation could be as high as 50%, effectively doubling
memory usage. This poses clear safety issues, because it permits the
size of ARC to exceed system memory.

This patch changes this so that the default value of arc_c_max is always
1/2 of system memory. This effectively limits the ARC to the memory that
the system has physically installed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #660

Integrate ARC more tightly with Linux

Under Solaris the ARC was designed to stay one step ahead of the
VM subsystem.  It would attempt to recognize low memory situtions
before they occured and evict data from the cache.  It would also
make assessments about if there was enough free memory to perform
a specific operation.

This was all possible because Solaris exposes a fairly decent
view of the memory state of the system to other kernel threads.
Linux on the other hand does not make this information easily
available.  To avoid extensive modifications to the ARC the SPL
attempts to provide these same interfaces.  While this works it
is not ideal and problems can arise when the ARC and Linux have
different ideas about when your out of memory.  This has manifested
itself in the past as a spinning arc_reclaim_thread.

This patch abandons the emulated Solaris interfaces in favor of
the prefered Linux interface.  That means moving the bulk of the
memory reclaim logic out of the arc_reclaim_thread and in to the
evict driven shrinker callback.  The Linux VM will call this
function when it needs memory.  The ARC is then responsible for
attempting to free the requested amount of memory if possible.

Several interfaces have been modified to accomidate this approach,
however the basic user space implementation remains the same.
The following changes almost exclusively just apply to the kernel
implementation.

* Removed the hdr_recl() reclaim callback which is redundant
  with the broader arc_shrinker_func().

* Reduced arc_grow_retry to 5 seconds from 60.  This is now used
  internally in the ARC with arc_no_grow to indicate that direct
  reclaim was recently performed.  This typically indicates a
  rapid change in memory demands which the kswapd threads were
  unable to keep ahead of.  As long as direct reclaim is happening
  once every 5 seconds arc growth will be paused to avoid further
  contributing to the existing memory pressure.  The more common
  indirect reclaim paths will not set arc_no_grow.

* arc_shrink() has been extended to take the number of bytes by
  which arc_c should be reduced.  This allows for a more granual
  reduction of the arc target.  Since the kernel provides a
  reclaim value to the arc_shrinker_func() this value is used
  instead of 1<<arc_shrink_shift.

* arc_reclaim_needed() has been removed.  It was used to determine
  if the system was under memory pressure and relied extensively
  on Solaris specific VM interfaces.  In most case the new code
  just checks arc_no_grow which indicates that within the last
  arc_grow_retry seconds direct memory reclaim occurred.

* arc_memory_throttle() has been updated to always include the
  amount of evictable memory (arc and page cache) in its free
  space calculations.  This space is largely available in most
  call paths due to direct memory reclaim.

* The Solaris pageout code was also removed to avoid confusion.
  It has always been disabled due to proc_pageout being defined
  as NULL in the Linux port.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add zfs_mdcomp_disable module option

Expose the zfs_mdcomp_disable variable as a module option. This
can be used to disable compression of zfs meta data which is
enabled by default. This shouldn't need to be tuned but for
most workloads, however there may be very specific instances
where it makes sense to trade disk capacity for extra cpu cycles.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Illumos #2067: uninitialized variables in zfs(1M) may make snapshots undestroyable

Reviewed by: Joshua M. Clulow <josh@sysmgr.org>
Reviewed by: Milan Jurik <milan.jurik@xylab.cz>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
https://www.illumos.org/issues/2067

Ported by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Illumos #1946: incorrect formatting when listing output of multiple pools with zpool iostat -v

Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Joshua M. Clulow <josh@sysmgr.org>
Approved by: Richard Lowe <richlowe@richlowe.net>

Reference to Illumos issue:
https://www.illumos.org/issues/1946

Ported by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Illumos #952: separate intent logs should be obvious in 'zpool iostat' output

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

Refererce to Illumos issue:
https://www.illumos.org/issues/952

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #607

Illumos #1909: disk sync write perf regression when slog is used post oi_148

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Reviewed by: Garrett D'Amore <garrett.damore@gmail.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

Refererces to Illumos issue:
https://www.illumos.org/issues/1909

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #680

Use KM_PUSHPAGE in l2arc_write_buffers

There is potential for deadlock in the l2arc_feed thread if KM_PUSHPAGE
is not used for the allocations made in l2arc_write_buffers.
Specifically, if KM_PUSHPAGE is not used for these allocations, it is
possible for reclaim to be triggered which can cause the l2arc_feed
thread to deadlock itself on the ARC_mru mutex. An example of this is
demonstrated in the following backtrace of the l2arc_feed thread:

    crash> bt 4123
    PID: 4123   TASK: ffff88062f8c1500  CPU: 6   COMMAND: "l2arc_feed"
      0 [ffff88062511d610] schedule at ffffffff814eeee0
      1 [ffff88062511d6d8] __mutex_lock_slowpath at ffffffff814f057e
      2 [ffff88062511d748] mutex_lock at ffffffff814f041b
      3 [ffff88062511d768] arc_evict at ffffffffa05130ca [zfs]
      4 [ffff88062511d858] arc_adjust at ffffffffa05139a9 [zfs]
      5 [ffff88062511d878] arc_shrink at ffffffffa0513a95 [zfs]
      6 [ffff88062511d898] arc_kmem_reap_now at ffffffffa0513be8 [zfs]
      7 [ffff88062511d8c8] arc_shrinker_func at ffffffffa0513ccc [zfs]
      8 [ffff88062511d8f8] shrink_slab at ffffffff8112a17a
      9 [ffff88062511d958] do_try_to_free_pages at ffffffff8112bfdf
     10 [ffff88062511d9e8] try_to_free_pages at ffffffff8112c3ed
     11 [ffff88062511da98] __alloc_pages_nodemask at ffffffff8112431d
     12 [ffff88062511dbb8] kmem_getpages at ffffffff8115e632
     13 [ffff88062511dbe8] fallback_alloc at ffffffff8115f24a
     14 [ffff88062511dc68] ____cache_alloc_node at ffffffff8115efc9
     15 [ffff88062511dcc8] __kmalloc at ffffffff8115fbf9
     16 [ffff88062511dd18] kmem_alloc_debug at ffffffffa047b8cb [spl]
     17 [ffff88062511dda8] l2arc_feed_thread at ffffffffa0511e71 [zfs]
     18 [ffff88062511dea8] thread_generic_wrapper at ffffffffa047d1a1 [spl]
     19 [ffff88062511dee8] kthread at ffffffff81090a86
     20 [ffff88062511df48] kernel_thread at ffffffff8100c14a

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

ZFS list snapshot property alias

Add support for the `zfs list -t snap` alias which is available under
Oracle Solaris 11.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #640

ZFS snapshot alias

For consistency, and because it's handy, add the 'zfs snap' alias which
was introduced by Oracle Solaris 11. This includes an update to the
man page to reflect all the available alias (snap, umount, and recv).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #640

Illumos #1346: zfs incremental receive may leave behind temporary clones

1356 zfs dataset prefetch code not working
Reviewed by: Matthew Ahrens <matt@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References to Illumos issue:
https://www.illumos.org/issues/1346
https://www.illumos.org/issues/1356

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #647

Illumos #1475: zfs spill block hold can access invalid spill blkptr

Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>

References to Illumos issue:
https://www.illumos.org/issues/1475

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #648

Illumos #1951: leaking a vdev when removing an l2cache device

1952 memory leak when adding a file-based l2arc device
1954 leak in ZFS from metaslab_group_create and zfs_ereport_checksum

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References to Illumos issues:
  https://www.illumos.org/issues/1951
  https://www.illumos.org/issues/1952
  https://www.illumos.org/issues/1954

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #650

OS-926: zfs panic in zfs_fill_zplprops_impl()

This change appears to be exclusive to SmartOS. It is not present in
illumos-gate but it just adds some needed error handling. This is
clearly preferable to simply ASSERTING which is what would occur
prior to the patch.

Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #652

Illumos #1680: zfs vdev_file_io_start: validate vdev before using vdev_tsd

vdev_tsd can be NULL for certain vdev states.
At least in userland testing with ztest.

References to Illumos issue:
https://www.illumos.org/issues/1680

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #655

Improve error message consistency

Signed-off-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Document the zle compression algorithm

Signed-off-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Export additional dsl symbols

Principly these symbols were exported to get access to the
dsl_prop_register/dsl_prop_unregister functions. They allow
us to cleanly register a callback which is called when a
dataset property is modified.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fixed a NULL pointer dereference bug in zfs_preumount

When zpl_fill_super -> zfs_domount fails (e.g. because the dataset
was destroyed before it could be successfully mounted) the subsequent
call to zpl_kill_sb -> zfs_preumount would derefence a NULL pointer.

This bug can be reproduced using this shell script:

#!/bin/sh
(
while true; do
zfs create -o mountpoint=legacz tank/bar
zfs destroy tank/bar
done
) &

(
while true; do
mount -t zfs tank/bar /mnt
umount /mnt
done
) &

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #639

Make Gentoo initscript use modinfo

The -l parameter to modprobe has been removed from the latest upstream
code and this change has entered Gentoo. Using modinfo as a substitute
addresses this.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #636

Print human readable error message for ENOENT

A cryptic error code is printed when mounting a legacy dataset to a
non-existent mountpoint. This patch changes this behavior to print
"mount point '%s' does not exist", which is similar to the error
message printed when mounting procfs.

The single quotes were added to be consistent with the existing EBUSY
error message, which is the only difference between this error message
and the one that is printed when the same condition occurs when mounting
procfs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #633

Properly expose the mfu ghost list kstats

Due to a typo the mru ghost lists stats were accidentally being
exposed as the mfu ghost list stats. This was harmless but
confusing since memory usage could be over reported.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Remove hard-coded 80 column output

When stdout is detected to be a tty use the number of columns
specified by the terminal. If that fails fall back to a default
80 column width. In the non-tty case allow for 999 column lines.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

ZFS 0.6.0-rc8

Fix executable permissions

Caught by lint, this permission change was accidentally introduced
by commit 42cb3819f1a1f536105faac81ffc150f3da90a80. Restore the
correct permissions and while I'm at it add a missing whack-bang
to config/ltmain.sh.

lint: executable-not-elf-or-script: zpool_main.c zfs_main.c

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #620

Add --enable-debug-dmu-tx configure option

Allow rigorous (and expensive) tx validation to be enabled/disabled
indepentantly from the standard zfs debugging.  When enabled these
checks ensure that all txs are constructed properly and that a dbuf
is never dirtied without taking the correct tx hold.

This checking is particularly helpful when adding new dmu consumers
like Lustre.  However, for established consumers such as the zpl
with no known outstanding tx construction problems this is just
overhead.

--enable-debug-dmu-tx  - Enable/disable validation of each tx as
--disable-debug-dmu-tx   it is constructed.  By default validation
                         is disabled due to performance concerns.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Enhance a dmu_tx_dirty_buf() assertion

The following assertion is good to validate the correctness of
new DMU consumers, but it doesn't quite provide enough information.
Slightly rework the assertion so that when it is hit the actual
offending values will be included in the output.

SPLError: 4787:0:(dmu_tx.c:828:dmu_tx_dirty_buf())
ASSERTION(dn == NULL || dn->dn_assigned_txg == tx->tx_txg) failed

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add ZFS_META_RELEASE to module load/unload messages

Include the ZFS_META_RELEASE in the module load/unload messages
to more clearly indidcate exactly what version of ZFS has been
loaded.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Account for .zfs ctldir inodes

Because the .zfs ctldir inodes are not backed by physical storage
they use a different create path which was not properly accounting
for them as used. This could result in ->nr_cached_objects()
returning 0 and cause a divide by zero error in prune_super().

In my option there's a kernel bug here too which allows this to
happen. They should either be checking for 0 or adding +1 like
they correctly do earlier in the function.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #617

Add .zfs control directory

Add support for the .zfs control directory.  This was accomplished
by leveraging as much of the existing ZFS infrastructure as posible
and updating it for Linux as required.  The bulk of the core
functionality is now all there with the following limitations.

*) The .zfs/snapshot directory automount support requires a 2.6.37
   or newer kernel.  The exception is RHEL6.2 which has backported
   the d_automount patches.

*) Creating/destroying/renaming snapshots with mkdir/rmdir/mv
   in the .zfs/snapshot directory works as expected.  However,
   this functionality is only available to root until zfs
   delegations are finished.

      * mkdir - create a snapshot
      * rmdir - destroy a snapshot
      * mv    - rename a snapshot

The following issues are known defeciences, but we expect them to
be addressed by future commits.

*) Add automount support for kernels older the 2.6.37.  This should
   be possible using follow_link() which is what Linux did before.

*) Accessing the .zfs/snapshot directory via NFS is not yet possible.
   The majority of the ground work for this is complete.  However,
   finishing this work will require resolving some lingering
   integration issues with the Linux NFS kernel server.

*) The .zfs/shares directory exists but no futher smb functionality
   has yet been implemented.

Contributions-by: Rohan Puri <rohan.puri15@gmail.com>
Contributiobs-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #173

Add zio constructor/destructor

Add a standard zio constructor and destructor.  Normally, this is
done to reduce to cost of allocating a new structure by reducing
expensive operations such as memory allocations.  However, in this
case none of the operations moved out of zio_create() were really
very expensive.

This change was principly made as a debug patch (and workaround)
for a zio_destroy() race.  The is good evidence that zio_create()
is reinitializing a mutex which is really still in use by another
thread.  This would completely explain the observed symptoms in
the issue report.

This patch doesn't fix the root cause of the race, but it should
make it less likely by only initializing the mutex once in the
constructor.  Also, this particular flaw might have gone unnoticed
in other zfs implementations due to the specific implementation
details of Linux ticket spinlocks.

Once the real root cause is determined and resolved this change
can be safely reverted.  Until then this should help workaround
the issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #496

Revert "Add zio constructor/destructor"

This patch was slightly flawed and allowed for zio->io_logical
to potentially not be reinitialized for a new zio.  This could
lead to assertion failures in specific cases when debugging is
enabled (--enable-debug) and I/O errors are encountered.  It
may also have caused problems when issues logical I/Os.

Since we want to make sure this workaround can be easily removed
in the future (when we have the real fix).  I'm reverting this
change and applying a new version of the patch which includes
the zio->io_logical fix.

This reverts commit 2c6d0b1e07b0265f0661ed7851d3aa8d3e75e7a9.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #602
Issue #604

ZFS 0.6.0-rc7

Add missing NULL in zpl_xattr_handlers

The xattr_resolve_name() helper function expects the registered
list of xattr handlers to be NULL terminated. This NULL was
accidentally missing which could result in a NULL dereference.

Interestingly this issue only manifested itself on certain 32-bit
systems. Presumably on 64-bit kernels we just always happen to
get lucky and the memory following the structure is zeroed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #594

Use stderr for 'no pools/datasets available' error

The 'zfs list' and 'zpool list' commands output the message
'no datasets/pools available' to stdout. This should go to
stderr and only the available datasets/pools should go to
stdout. Returning nothing to stdout is expected behavior
when there is nothing to list.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #581

Add sa_spill_rele() interface

Add a SA interface which allows us to release the spill block
from a SA handle without destroying the handle. This is useful
because we can then ensure that a copy of the dirty spill block
is not made at sync time due to the extra hold. Susequent calls
to sa_update() or sa_lookup() with transparently refetch the
spill block dbuf from the ARC hash.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add zio constructor/destructor

Add a standard zio constructor and destructor.  Normally, this is
done to reduce to cost of allocating a new structure by reducing
expensive operations such as memory allocations.  However, in this
case none of the operations moved out of zio_create() were really
very expensive.

This change was principly made as a debug patch (and workaround)
for a zio_destroy() race.  The is good evidence that zio_create()
is reinitializing a mutex which is really still in use by another
thread.  This would completely explain the observed symptoms in
the issue report.

This patch doesn't fix the root cause of the race, but it should
make it less likely by only initializing the mutex once in the
constructor.  Also, this particular flaw might have gone unnoticed
in other zfs implementations due to the specific implementation
details of Linux ticket spinlocks.

Once the real root cause is determined and resolved this change
can be safely reverted.  Until then this should help workaround
the issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #496

Fix distribution detection

Improve the distribution detection by moving the tests for
distribution specific files first. The Ubuntu and Debian
checks are left for last because they are the least likely
to be unique. This is particularly true in the case of Debian
since so many distributions are based on Debian.

Since this is currently only used to identify the correct
packaging method for this system the result in many instances
is simply cosmetic.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Align parition end on 1 MiB boundary

Some devices have exhibited sensitivity to the ending alignment of
partitions.  In particular, even if the first partition begins at 1
MiB, we have seen many sd driver task abort errors with certain SSDs
if the first partition doesn't end on a 1 MiB boundary.  This occurs
when the vdev label is read during pool creation or importation and
causes a delay of about 30 seconds per device.  It can also be
simulated with dd when the pool isn't imported:

  dd if=/dev/sda1 of=/dev/null bs=262144 count=1

For the record, this problem was observed with SMARTMOD
SG9XCA2E200GE01 200GB SSDs.  Unfortunately I don't have a good
explanation for this behavior. It seems to have something to do with
highly fragmented single-sector requests being issued to the device,
which it may not support.  With end-aligned partitions at least
page-sized requests were queued and issued to the driver according
to blktrace. In any case, aligning the partition end is a fairly
innocuous work-around, wasting at most 1 MiB of space.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #574

Use SA_HDL_PRIVATE for SA xattrs

A private SA handle must be used to ensure we can drop the dbuf
hold on the spill block prior to calling dmu_tx_commit().  If we
call dmu_tx_commit() before sa_handle_destroy(), then our hold
will trigger a copy of the dbuf to be made.  This is done to
prevent data from leaking in to the syncing txg.  As a result
the original dirty spill block will remain cached.

Additionally, relying on the shared zp->z_sa_hdl is unsafe in
the xattr context because the znode may be asynchronously dropped
from the cache.  It's far safer and simpler just to use a private
handle for xattrs.  Plus any additional overhead is offset by
the avoidance of the previously mentioned memory copy.

These forever dirty buffers can be noticed in the arcstats under
the anon_size.  On a quiescent system the value should be zero.
Without this fix and a SA xattr write workload you will see
anon_size increase.  Eventually, if enough dirty data builds up
your system it will appear to hang.  This occurs because the dmu
won't allow new txs to be assigned until that dirty data is
flushed, and it won't be because it's not part of an assigned tx.

As an aside, I typically see anon_size lurk around 16k so I think
there is another place in the code which needs a similar fix.
However, this value doesn't grow over time so it isn't critical.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #503
Issue #513

Cleanly support debug packages

Allow a source rpm to be rebuilt with debugging enabled.  This
avoids the need to have to manually modify the spec file.  By
default debugging is still largely disabled.  To enable specific
debugging features use the following options with rpmbuild.

  '--with debug'               - Enables ASSERTs

  # For example:
  $ rpmbuild --rebuild --with debug zfs-modules-0.6.0-rc6.src.rpm

Additionally, ZFS_CONFIG has been added to zfs_config.h for
packages which build against these headers.  This is critical
to ensure both zfs and the dependant package are using the same
prototype and structure definitions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add 'dmu_tx' kstats entry

Keep counters for the various reasons that a thread may end up
in txg_wait_open() waiting on a new txg. This can be useful
when attempting to determine why a particular workload is
under performing.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add arc_state_t stats to arcstats

To ensure the arc is behaving properly we need greater visibility
in to exactly how it's managing the systems memory. This patch
takes one step in that direction be adding the current arc_state_t
for the anon, mru, mru_ghost, mfu, and mfs_ghost lists. The l2
arc_state_t is already well represented in the arcstats.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Return success from check_slice() if device doesn't exist

When creating a new pool, make_root_vdev() calls check_in_use() to
ensure that none of the consituent disks are in use.  If the disk
contains a valid vdev label it is read to retrieve the list of its
child vdevs and these are checked recursively.  However, the
partitions stored in the vdev label my no longer exist, for example
if the partition table has since been altered.  In any such case we
would want the pool creation to proceed, so this change removes the
check from check_slice() that returns an error if the device doesn't
exist.  As an added assurance, the Solaris implementation also
returns sucess on ENOENT.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Export symbols for zero-copy

Export additional symbols to make use of the DMU's zero-copy
API. This allows external modules to move data in to and out of
the ARC without incurring the cost of a memory copy.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Support ashift=13 for 8KB SSD block sizes

New SSDs are now available which use an internal 8k block size.
To make sure ZFS can get the maximum performance out of these
devices we're increasing the maximum ashift to 13 (8KB).

This value is still small enough that we can fit 16 uberblocks
in the vdev ring label. However, I don't want to increase this
any futher or it will limit the ability the safely roll back a
pool to recover it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #565

Add 'fsid' mount option to allowed options.

Resolves nfs-utils-1.0.x compatibility issue which requires
that the fsid be set in the export options.

exportfs: Warning: /tank/dir requires fsid= for NFS export

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #570

Export symbols for zero-copy

Exported the required symbols to make use of the DMU's zero-copy
API. This allows external modules to move data in to and out of
the ARC without incurring the cost of a memory copy.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Use spl_debug_* helpers

When configuring the spl debug log support use the provided wrapper
functions. This ensures that if --disable-debug-log was used when
buiding the spl the functions will have no effect.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add support for DISCARD to ZVOLs.

DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning.
It allows ZVOL clients to discard (unmap, trim) block ranges from
a ZVOL, thus optimizing disk space usage by allowing a ZVOL to
shrink instead of just grow.

We can't use zfs_space() or zfs_freesp() here, since these functions
only work on regular files, not volumes. Fortunately we can use the
low-level function dmu_free_long_range() which does exactly what we
want.

Currently the discard operation is not added to the log. That's not
a big deal since losing discard requests cannot result in data
corruption. It would however result in disk space usage higher than
it should be. Thus adding log support to zvol_discard() is probably
a good idea for a future improvement.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Support the fallocate() file operation.

Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is
supported, since it's the only one that matches the behavior of
zfs_space(). This makes it pretty much useless in its current
form, but it's a start.

To support other flag combinations we would need to modify
zfs_space() to make it more flexible, or emulate the desired
functionality in zpl_fallocate().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334

Check permissions in zfs_space().

This isn't done on Solaris because on this OS zfs_space() can
only be called with an opened file handle. Since the addition of
zpl_truncate_range() this isn't the case anymore, so we need to
enforce access rights.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334

Implement the truncate_range() inode operation.

This operation allows "hole punching" in ZFS files. On Solaris this
is done via the vop_space() system call, which maps to the zfs_space()
function. So we just need to write zpl_truncate_range() as a wrapper
around zfs_space().

Note that this only works for regular files, not ZVOLs.

This is currently an insecure implementation without permission
checking, although this isn't that big of a deal since truncate_range()
isn't even callable from userspace.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334

Fix zconfig.sh non-optimal alignment

The recent zvol improvements have changed default suggested alignment
for zvols from 512b (default) to 8k (zvol blocksize).  Because of this
the zconfig.sh tests which create paritions are now generating a
warning about non-optimal alignments.

This change updates the need zconfig.sh tests such that a partition
will be properly aligned.  In the process, it shifts from using the
sfdisk utility to the parted utility to create partitions.  It also
moves the creation of labels, partitions, and filesystems in to
generic functions in common.sh.in.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Use 32 as the default number of zvol threads.

Currently, the `zvol_threads` variable, which controls the number of worker
threads which process items from the ZVOL queues, is set to the number of
available CPUs.

This choice seems to be based on the assumption that ZVOL threads are
CPU-bound. This is not necessarily true, especially for synchronous writes.
Consider the situation described in the comments for `zil_commit()`, which is
called inside `zvol_write()` for synchronous writes:

> itxs are committed in batches. In a heavily stressed zil there will be a
> commit writer thread who is writing out a bunch of itxs to the log for a
> set of committing threads (cthreads) in the same batch as the writer.
> Those cthreads are all waiting on the same cv for that batch.
>
> There will also be a different and growing batch of threads that are
> waiting to commit (qthreads). When the committing batch completes a
> transition occurs such that the cthreads exit and the qthreads become
> cthreads. One of the new cthreads becomes he writer thread for the batch.
> Any new threads arriving become new qthreads.

We can easily deduce that, in the case of ZVOLs, there can be a maximum of
`zvol_threads` cthreads and qthreads. The default value for `zvol_threads` is
typically between 1 and 8, which is way too low in this case. This means
there will be a lot of small commits to the ZIL, which is very inefficient
compared to a few big commits, especially since we have to wait for the data
to be on stable storage. Increasing the number of threads will increase the
amount of data waiting to be commited and thus the size of the individual
commits.

On my system, in the context of VM disk image storage (lots of small
synchronous writes), increasing `zvol_threads` from 8 to 32 results in a 50%
increase in sequential synchronous write performance.

We should choose a more sensible default for `zvol_threads`. Unfortunately
the optimal value is difficult to determine automatically, since it depends
on the synchronous write latency of the underlying storage devices. In any
case, a hardcoded value of 32 would probably be better than the current
situation. Having a lot of ZVOL threads doesn't seem to have any real
downside anyway.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #392

Improve ZVOL queue behavior.

The Linux block device queue subsystem exposes a number of configurable
settings described in Linux block/blk-settings.c. The defaults for these
settings are tuned for hard drives, and are not optimized for ZVOLs. Proper
configuration of these options would allow upper layers (I/O scheduler) to
take better decisions about write merging and ordering.

Detailed rationale:

- max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to
   handle writes of any size, so there's no reason to impose a limit. Let the
   upper layer decide.

- max_segments and max_segment_size are set to unlimited. zvol_write() will
   copy the requests' contents into a dbuf anyway, so the number and size of
   the segments are irrelevant. Let the upper layer decide.

- physical_block_size and io_opt are set to the ZVOL's block size. This
   has the potential to somewhat alleviate issue #361 for ZVOLs, by warning
   the upper layers that writes smaller than the volume's block size will be
   slow.

- The NONROT flag is set to indicate this isn't a rotational device.
   Although the backing zpool might be composed of rotational devices, the
   resulting ZVOL often doesn't exhibit the same behavior due to the COW
   mechanisms used by ZFS. Setting this flag will prevent upper layers from
   making useless decisions (such as reordering writes) based on incorrect
   assumptions about the behavior of the ZVOL.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix synchronicity for ZVOLs.

zvol_write() assumes that the write request must be written to stable storage
if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed,
"sync" does *not* mean what we think it means in the context of the Linux
block layer. This is well explained in linux/fs.h:

    WRITE:       A normal async write. Device will be plugged.
    WRITE_SYNC:  Synchronous write. Identical to WRITE, but passes down
                 the hint that someone will be waiting on this IO
                 shortly.
    WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush.
    WRITE_FUA:   Like WRITE_SYNC but data is guaranteed to be on
                 non-volatile media on completion.

In other words, SYNC does not *mean* that the write must be on stable storage
on completion. It just means that someone is waiting on us to complete the
write request. Thus triggering a ZIL commit for each SYNC write request on a
ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL
users have no way to express that they actually want data to be written to
stable storage, which means the ZIL is broken for ZVOLs.

The request for stable storage is expressed by the FUA flag, so we must
commit the ZIL after the write if the FUA flag is set. In addition, we must
commit the ZIL before the write if the FLUSH flag is set.

Also, we must inform the block layer that we actually support FLUSH and FUA.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Support "sync=always" for ZVOLs.

Currently the "sync=always" property works for regular ZFS datasets, but not
for ZVOLs. This patch remedies that.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #374.

Let libnvpair be linked independently of libzfs.

Autoconf will fail to detect the ZoL libnvpair on systems that do not
implicitly link library runtime dependencies, which is anything that
has the GCC 4.5 DCO update.

Build libuutil before libnvpair, and put it on the the LDADD line of
the libnvpair automake template.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #560