git.proxmox.com Git - mirror_zfs-debian.git/log

Merge branch 'upstream'

Improve ZVOL queue behavior.

The Linux block device queue subsystem exposes a number of configurable
settings described in Linux block/blk-settings.c. The defaults for these
settings are tuned for hard drives, and are not optimized for ZVOLs. Proper
configuration of these options would allow upper layers (I/O scheduler) to
make better decisions about write merging and ordering.

Detailed rationale:

- max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to
   handle writes of any size, so there's no reason to impose a limit. Let the
   upper layer decide.

- max_segments and max_segment_size are set to unlimited. zvol_write() will
   copy the requests' contents into a dbuf anyway, so the number and size of
   the segments are irrelevant. Let the upper layer decide.

- physical_block_size and io_opt are set to the ZVOL's block size. This
   has the potential to somewhat alleviate issue #361 for ZVOLs, by warning
   the upper layers that writes smaller than the volume's block size will be
   slow.

- The NONROT flag is set to indicate this isn't a rotational device.
   Although the backing zpool might be composed of rotational devices, the
   resulting ZVOL often doesn't exhibit the same behavior due to the COW
   mechanisms used by ZFS. Setting this flag will prevent upper layers from
   making useless decisions (such as reordering writes) based on incorrect
   assumptions about the behavior of the ZVOL.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix synchronicity for ZVOLs.

zvol_write() assumes that the write request must be written to stable storage
if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed,
"sync" does *not* mean what we think it means in the context of the Linux
block layer. This is well explained in linux/fs.h:

    WRITE:       A normal async write. Device will be plugged.
    WRITE_SYNC:  Synchronous write. Identical to WRITE, but passes down
                 the hint that someone will be waiting on this IO
                 shortly.
    WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush.
    WRITE_FUA:   Like WRITE_SYNC but data is guaranteed to be on
                 non-volatile media on completion.

In other words, SYNC does not *mean* that the write must be on stable storage
on completion. It just means that someone is waiting on us to complete the
write request. Thus triggering a ZIL commit for each SYNC write request on a
ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL
users have no way to express that they actually want data to be written to
stable storage, which means the ZIL is broken for ZVOLs.

The request for stable storage is expressed by the FUA flag, so we must
commit the ZIL after the write if the FUA flag is set. In addition, we must
commit the ZIL before the write if the FLUSH flag is set.

Also, we must inform the block layer that we actually support FLUSH and FUA.
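
The ordering described above can be sketched as follows. This is a
minimal illustration, not the zvol_write() code: the flag values and
the handle_write() helper are hypothetical, and the write and ZIL
commit are passed in as plain callables.

```python
# Hypothetical flag bits for this sketch; the real kernel constants differ.
REQ_SYNC = 0x1
REQ_FLUSH = 0x2
REQ_FUA = 0x4


def handle_write(flags, write, zil_commit):
    """Perform one write, committing the ZIL only when FLUSH or FUA ask for it."""
    if flags & REQ_FLUSH:
        # FLUSH: earlier writes must reach stable storage before this one.
        zil_commit()
    write()
    if flags & REQ_FUA:
        # FUA: this write must be on stable storage when it completes.
        zil_commit()
    # REQ_SYNC alone is only a hint that someone is waiting; no commit.
```

With these rules a SYNC-only request costs a single write, while a
request carrying both FLUSH and FUA commits once before and once after
the write.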

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Support "sync=always" for ZVOLs.

Currently the "sync=always" property works for regular ZFS datasets, but not
for ZVOLs. This patch remedies that.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #374.

Let libnvpair be linked independently of libzfs.

Autoconf will fail to detect the ZoL libnvpair on systems that do not
implicitly link library runtime dependencies, which is anything that
has the GCC 4.5 DCO update.

Build libuutil before libnvpair, and put it on the LDADD line of
the libnvpair automake template.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #560

PPA 0.6.0.49-0ubuntu1 release.

Add patch: Improve the --with-spl error

If the SPL module is unavailable, then the operator gets an error
message that does not apply to installations that are managed by
DKMS. Change the error message to fit apt-get systems.

A better solution would be to implement a dependency model in DKMS,
or to bundle SPL into the ZFS package.

Refresh patches after upstream merge.

Merge branch 'upstream'

Revert "Manually sync scripts/ with the upstream release."

This reverts commit ae1cd09777a9fb4060fabf830df4c32824951f24
to keep pkg-zfs limited to the debian/ overlay.

These symlinks were dereferenced in pkg-zfs so that the zfs-linux
source package could be diffed directly onto the vanilla upstream
tarball, but the orig tarball produced by git-buildpackage from the
upstream/* git tags is now being published to the PPA instead.

Revert lib/libspl/asm-generic/atomic.S deletion.

The atomic.S file was deleted when pkg-zfs was converted to a
git-buildpackage project more than a year ago, and I don't remember
whether the deletion was pertinent or accidental.

Regardless, this is a packaging policy violation because it touches a
file outside of the debian/ overlay, and it is not currently required
for building pkg-zfs.

Restore the atomic.S file by reverting commit
pkg-zfs/dajhorn@7e4739a203c107ba58ab755eedd20c9f35c1fbf8
"Initial master branch for git-buildpackage."
and discarding all other merge conflicts.

Linux 3.3 compat, sops->show_options()

The second argument of sops->show_options() was changed from a
'struct vfsmount *' to a 'struct dentry *'. Add an autoconf check
to detect the API change and then conditionally define the expected
interface. In either case we are only interested in the zfs_sb_t.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #549

Cleanup ZFS debug infrastructure

Historically the internal zfs debug infrastructure has been
scattered throughout the code.  Since we expect to start making
more use of this code this patch performs some cleanup.

* Consolidate the zfs debug infrastructure in the zfs_debug.[ch]
  files.  This includes moving the zfs_flags and zfs_recover
  variables, plus moving the zfs_panic_recover() function.

* Remove the existing unused functionality in zfs_debug.c and
  replace it with code which correctly utilizes the spl logging
  infrastructure.

* Remove the __dprintf() function from zfs_ioctl.c.  This is
  dead code, the dprintf() functionality in the kernel relies
  on the spl log support.

* Remove dprintf() from hdr_recl().  This wasn't particularly
  useful and was missing the required format specifier anyway.

* Subsequent patches should unify the dprintf() and zfs_dbgmsg()
  functions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Allow multiple values per directory entry

When using zfs to back a Lustre filesystem it's advantageous to
store a fid with the object id in the directory zap.  The only
technical impediment to doing this is that the zpl code expects
a single value in the zap per directory entry.

This change relaxes that requirement such that multiple entries
are allowed provided the first one is the object id.  The zpl
code will just ignore additional entries.  This allows ZoL
to mount datasets which are being used as Lustre server
backends.
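
The relaxed rule can be illustrated with a small sketch.  It is not
the zpl lookup code: the dirent_object_id() helper and the list
representation of a zap entry are assumptions made for illustration.

```python
def dirent_object_id(zap_values):
    """Return the object id from a directory entry's zap values.

    The first value must be the object id; any further values (for
    example a Lustre fid) are ignored by this reader.
    """
    if not zap_values:
        raise ValueError("directory entry has no values")
    return zap_values[0]
```

A single-value entry and an entry with a trailing fid therefore
resolve to the same object id.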

Once the upstream feature flags support is merged in this change
should be updated to a read-only feature.  Until this occurs
other zfs implementations will not be able to read the zfs
filesystems created by Lustre.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Export symbol zfs_attr_table

Export the zfs_attr_table symbol so it may be used by non-zpl
consumers which are still interested in writing a zpl compatible
dataset (e.g. Lustre).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

PPA 0.6.0.48-0ubuntu1 release.

Disable dh_autotools-dev_updateconfig.

Calling `dh $@ --with autotools_dev` causes a FTBFS on Ubuntu 10.04
Lucid Lynx and is currently unnecessary for later releases.

Merge branch 'upstream'

Ignore dataset if the dds_type is DMU_OST_OTHER

Since the zpios and potentially other ZFS tests use the
DMU_OST_OTHER type to label their datasets, the zpool and
zfs commands should gracefully handle this type when it is
encountered. This patch modifies the commands' behavior
to ignore any datasets with a dds_type of DMU_OST_OTHER.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #536

PPA 0.6.0.47-0ubuntu1 release.

Delete packaging for recomposed libraries.

The spl, avl, efi, share, and unicode libraries are now part of the
uutil, nvpair, zpool, and zfs libraries.

See zfsonlinux/zfs@750562833f5009e1602e3b7d8f10497ee683f611.

Remove distdir stubbing for DKMS module sources.

Instead of implementing a new distir_modules rule, the existing
distdir rule is now patched to create the DKMS module sources in a
way that minimizes the number of Lintian complaints.

Refresh debian/patches after upstream merge.

Add libtool to Build-Depends.

The libtool package provides macros that are used by autogen.

Merge branch 'upstream'

Fix rpm dependencies

This change updates the rpm spec files to have strictly correct
package dependencies.  That means a few things:

* The zfs-modules package is now tied to a specific build of
  the spl-modules packages based on the kernel version.  This
  ensures that the correct spl-modules packages will always get
  installed and not just the newest.

* The zfs package now requires both the zfs-modules and spl
  packages.  Thus a 'yum install zfs' will pull in the minimal
  set of packages required for a functional system.

* The zfs-devel packages now require the zfs package to be
  installed which is normal behavior for -devel packages.

* Remove the redundant distribution release extension.  This
  is already added once because it is part of the kernel package
  release name.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add the release component to headers

When the original build system code was added the release
component was accidentally omitted from the development header
install path. This patch adds the missing path component so
it's always clear exactly what release you're compiling against.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

PPA 0.6.0.46-0ubuntu1 release.

Merge branch 'upstream'

Disable zfs-dracut packaging.

The dracut/ component broke deb systems when it was first added to the
upstream ZoL repository, had a near-zero download count when it was
fixed, seems to be unmaintained, and is incompatible with the
dracut-005 package that is currently published in Debian and Ubuntu.

Move debian/patches into a separate branch.

Using git to automatically rediff the patches is easier than using
`quilt refresh` and manually resolving conflicts, especially because
most submissions for this project come from git remotes.

It is also faster to pull and later discard experimental topic branches.

This also keeps the packaging master history clean and concise, and avoids
accidentally conflating the upstream master. Reverting a commit that changed
something outside of the debian/ overlay is ugly.

Allow GPT+EFI vdev replacement in boot pools.

Commit zfsonlinux/zfs@57a4eddc4d5e1e6c10d8d7dcf87a9fc27398adcd
allows the bootfs property to be set on any pool, but does not
accommodate subsequent vdev changes. For example:

# zpool replace rpool /dev/sda /dev/sdb
operation not supported on this type of pool
property 'bootfs' is not supported on EFI labeled devices

For non-Solaris builds, disable the check that emits this error.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Combine libraries: spl, avl, efi, share, unicode.

These libraries, which are an artifact of the ZoL development
process, conflict with packages that are already in distribution:

  * libspl: SPL Programming Language
  * libavl: AVL for Linux
  * libefi: GRUB

And these libraries are potential conflicts:

  * libshare: the Linux Mount Manager
  * libunicode: Perl and Python

Recompose these five ZoL components into the four libraries that are
conventionally provided by Solaris and FreeBSD systems:

  + libnvpair
  + libuutil
  + libzpool
  + libzfs

This change resolves the name conflict, makes ZoL more compatible
with existing software that uses autotools to detect ZFS, and allows
pkg-zfs to better reflect the official Debian kFreeBSD packaging.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #430

Allow setting bootfs on any pool

The vdev_is_bootable() restrictions are no longer necessary
with recent GRUB2 code. FreeBSD has implemented the same
change, except that I moved the Solaris comment to be inside
the #ifdef __sun__ block.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #317

Run autogen for packaged builds.

Add autoconf, automake, and autogen to the Build-Depends
field in the debian/control file, and run `./autogen.sh`
before `./configure` in the debian/rules file.

Reduce number of zio free threads

As described in Issue #458 and #258, unlinking large amounts of data
can cause the threads in the zio free wait queue to start spinning.
Reducing the number of z_fr_iss threads from a fixed value of 100 to 1
per cpu significantly reduces contention on the taskq spinlock and
improves throughput.

Instrumenting the taskq code showed that __taskq_dispatch() can spend
a long time holding tq->tq_lock if there are a large number of threads
in the queue.  It turns out the time spent in wake_up() scales
linearly with the number of threads in the queue.  When a large number
of short work items are dispatched, as seems to be the case with
unlink, the worker threads drain the queue faster than the dispatcher
can fill it.  They then all pile into the work wait queue to wait for
new work items.  So if 100 threads are in the queue, wake_up() takes
about 100 times as long, and the woken threads have to spin until the
dispatcher releases the lock.

Reducing the number of threads helps with the symptoms, but doesn't
get to the root of the problem.  It would seem that wake_up()
shouldn't scale linearly in time with queue depth, particularly if we
are only trying to wake up one thread.  In that vein, I tried making
all of the waiting processes exclusive to prevent the scheduler from
iterating over the entire list, but I still saw the linear time
scaling.  So further investigation is needed, but in the meantime
reducing the thread count is an easy workaround.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #258
Issue #458

PPA 0.6.0.45-0ubuntu1 release.

Merge branch 'fhs'

Conflicts:
debian/zfs-dkms.dkms

Add Allow-setting-bootfs-on-any-pool.patch

From 884959bab92d18139e40b30daa53fd6a713c995e Mon Sep 17 00:00:00 2001
From: Richard Laager <rlaager@wiktel.com>
Date: Fri, 13 Jan 2012 16:24:15 -0600
Subject: [PATCH] Allow setting bootfs on any pool

The vdev_is_bootable() restrictions are no longer necessary with recent
GRUB2 code. FreeBSD has implemented the same change, except that I
moved the Solaris comment to be inside the #ifdef __sun__ block.

Merge branch 'upstream'

FHS conformance and DKMS multiarch, ZFS interface.

Improve FHS conformance by installing intermediary build products --
currently the zfs_config.h and Module.symvers files -- into the
/var/lib/dkms area instead of /usr/src.

This has the beneficial side-effect of enabling DKMS multiarch
support for ZFS because the autoconf templates and `make install`
rules are not aware of the target architecture.

Mitigates: zfsonlinux/zfs#511

Wrap long lines in the dkms.conf file.

Increase link count limit to 2^31-1

Originally, the per-file link limit was set to 65536 because the
exact Linux VFS limit was unclear. Internally ZFS is able to
support 64-bit link counts. After a more careful investigation
the limit can be safely raised to 2^31-1.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #514

Run ZFS_AC_PACMAN only if $VENDOR is "arch"

Unfortunately, Arch's package manager `pacman` shares its name with a
popular arcade video game. Thus, in order to refrain from executing the
video game when we mean to execute the package manager, ZFS_AC_PACMAN is
now only run when $VENDOR is determined to be "arch".

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #517

PPA 0.6.0.44-0ubuntu1 release.

Refresh debian/patches after upstream merge.

Revert "Add security_inode_init_security.patch"

This reverts commit 071df06b6a6b7f22149d883b57f30ff1acdc84d5.

Merge branch 'upstream'

Add overlay(-O) mount option support

Linux supports mounting over non-empty directories by default.
In Solaris this is not the case and -O option is required for
zfs mount to mount a zfs filesystem over a non-empty directory.

For compatibility, I've added support for -O option to mount
zfs filesystems over non-empty directories if the user wants
to, just like in Solaris.

I've defined MS_OVERLAY to record it in the flags variable if
the -O option is supplied. The flags variable passes through
a few functions and it's checked before performing the empty
directory check in the zfs_mount function. If -O is given, the
check is not performed.
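
A minimal sketch of that check, assuming a made-up bit value for
MS_OVERLAY and a hypothetical may_mount_over() helper; the actual
zfs_mount code is C and is not reproduced here.

```python
import os

MS_OVERLAY = 0x1  # hypothetical bit value, chosen for this sketch only


def may_mount_over(mountpoint, flags):
    """Return True if a mount over `mountpoint` should be attempted."""
    if flags & MS_OVERLAY:
        # -O was given: skip the empty-directory check entirely.
        return True
    # Without -O, only an empty directory is acceptable.
    return len(os.listdir(mountpoint)) == 0
```

Without the flag a non-empty mountpoint is refused, which matches the
Solaris default; with it the directory contents are never examined.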

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #473

Apply the ZoL coding standard to zpl_xattr.c

Make the indenting in the zpl_xattr.c file consistent with the Sun
coding standard by removing soft tabs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Linux 3.2 compat, security_inode_init_security()

The security_inode_init_security() API has been changed to include
a filesystem specific callback to write security extended attributes.
This was done to support the initialization of multiple LSM xattrs
and the EVM xattr.

This change updates the code to use the new API when it's available.
Otherwise it falls back to the previous implementation.

In addition, the ZFS_AC_KERNEL_6ARGS_SECURITY_INODE_INIT_SECURITY
autoconf test has been made more rigorous by passing the expected
types. This is done to ensure we always properly detect the
correct form for the security_inode_init_security() API.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #516

Treat /dev/vd* as whole disks

Correctly detect /dev/vd devices as whole disks and attempt to
create an EFI partition table.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Avoid using awk in the zpool_id script.

Some implementations of `awk` incorrectly parse the \< and \> regex
symbols, so use a `while read` loop and regular globbing instead.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #259

Linux 3.1 compat, super_block->s_shrink

The Linux 3.1 kernel has introduced the concept of per-filesystem
shrinkers which are directly associated with a super block.  Prior
to this change there was one shared global shrinker.

The zfs code relied on being able to call the global shrinker when
the arc_meta_limit was exceeded.  This would cause the VFS to drop
references on a fraction of the dentries in the dcache.  The ARC
could then safely reclaim the memory used by these entries and
honor the arc_meta_limit.  Unfortunately, when per-filesystem
shrinkers were added the old interfaces were made unavailable.

This change adds support to use the new per-filesystem shrinker
interface so we can continue to honor the arc_meta_limit.  The
major benefit of the new interface is that we can now target
only the zfs filesystem for dentry and inode pruning.  Thus we
can minimize any impact on the caching of other filesystems.

In the context of making this change several other important
issues related to managing the ARC were addressed, they include:

* The dnlc_reduce_cache() function which was called by the ARC
to drop dentries for the Posix layer was replaced with a generic
zfs_prune_t callback.  The ZPL layer now registers a callback to
drop these dentries removing a layering violation which dates
back to the Solaris code.  This callback can also be used by
other ARC consumers such as Lustre.

  arc_add_prune_callback()
  arc_remove_prune_callback()

* The arc_reduce_dnlc_percent module option has been changed to
arc_meta_prune for clarity.  The dnlc functions are specific to
Solaris's VFS and have already been largely eliminated.
The replacement tunable now represents the number of bytes the
prune callback will request when invoked.

* Less aggressively invoke the prune callback.  We used to call
this whenever we exceeded the arc_meta_limit; however, that's not
strictly correct since it results in overzealous reclaim of
dentries and inodes.  It is now only called once the arc_meta_limit
is exceeded and every effort has been made to evict other data from
the ARC cache.

* More promptly manage exceeding the arc_meta_limit.  When reading
meta data in to the cache if a buffer was unable to be recycled
notify the arc_reclaim thread to invoke the required prune.

* Added arcstat_prune kstat which is incremented when the ARC
is forced to request that a consumer prune its cache.  Remember
this will only occur when the ARC has no other choice.  If it
can evict buffers safely without invoking the prune callback
it will.

* This change is also expected to resolve the unexpected collapses
of the ARC cache.  These would occur because, when just the
arc_meta_limit was exceeded, reclaim pressure would be exerted on
the arc_c value via arc_shrink().  This effectively shrunk the
entire cache when really we just needed to reclaim meta data.
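
The callback flow described in the list above can be modelled roughly
as follows.  This is a sketch under stated assumptions: the ArcSketch
class, its fields, and the byte accounting are invented for
illustration and only mirror the behavior described in this message,
not the arc.c implementation.

```python
class ArcSketch:
    """Toy model of meta-data accounting with registered prune callbacks."""

    def __init__(self, meta_limit, meta_prune):
        self.meta_limit = meta_limit
        self.meta_prune = meta_prune  # amount requested from each callback
        self.meta_used = 0
        self.evictable = 0            # meta data the ARC can free by itself
        self.prune_count = 0          # stands in for the arcstat_prune kstat
        self._callbacks = []

    def add_prune_callback(self, func):
        self._callbacks.append(func)

    def remove_prune_callback(self, func):
        self._callbacks.remove(func)

    def adjust_meta(self):
        if self.meta_used <= self.meta_limit:
            return
        # First evict whatever can be freed without bothering consumers.
        freed = min(self.evictable, self.meta_used - self.meta_limit)
        self.evictable -= freed
        self.meta_used -= freed
        # Only if the limit is still exceeded are consumers asked to prune.
        if self.meta_used > self.meta_limit:
            self.prune_count += 1
            for func in self._callbacks:
                func(self.meta_prune)
```

A consumer such as the ZPL would register one callback and drop
dentries when it is invoked; the counter only moves when eviction
alone could not bring usage back under the limit.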

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #466
Closes #292

Add /sbin/blkid to the initramfs manifest.

If the root pool contains a zvol, then udev tries /sbin/blkid and
emits a warning, which is usually invisible because it happens
after GRUB changes the console mode for Plymouth.

Adding /sbin/blkid to the initramfs manifest satisfies the udev
warning and results in a small boot time improvement.

Merge branch 'issue9'

PPA 0.6.0.43-0ubuntu1 release.

Add precise to the PPA build list.

Begin building packages for the Ubuntu 12.04 LTS Precise Pangolin
alpha release.

The linux-image-3.2.0-8-generic kernel package is the first Ubuntu
P-series release that is compatible and somewhat stable with ZFS.

Add security_inode_init_security.patch

Add an interim fix for issue #516, which is required for running ZoL
on a Linux 3.2 kernel.

Add bash command completion to zfsutils.

added contrib/zfs_completion.bash

received from Aneurin Price
http://groups.google.com/group/zfs-fuse/browse_thread/thread/fd17ab76e5bddc35

Add libselinux1-dev to build-depends.

The /sbin/zfs utility can be selinux aware if the libselinux1-dev
package is installed at build time.

The libselinux1 package is in the Debian base system.

This change is enabled by upstream commit
SHA afd7da0ce72c3b3554079644d73e90fe6d2bf955

Wrap long lines in the debian/control file.

Remove the mountall dependency from zfs-initramfs.

The mountall utility is not in the regular initrd manifest for Ubuntu,
and it is not used to start a native ZFS root filesystem.

Depending on mountall unnecessarily prevents zfs-initramfs from being
installed on vanilla Debian systems.

Closes: dajhorn/pkg-zfs#9

PPA 0.6.0.42-0ubuntu1 release.

Refresh debian/patches after upstream merge.

Merge branch 'upstream'

Move Arch Linux's VENDOR check above Ubuntu's

If the lsb-release package is installed on an Arch Linux distribution,
the configure step will incorrectly detect the running distribution as
Ubuntu. This is a result of both distributions providing an
/etc/lsb-release file, and the Ubuntu VENDOR check being performed
first.

Since the Arch Linux test checks for a file more specific to the Arch
Linux distribution, moving Arch Linux's VENDOR check above Ubuntu's
check provides a quick and easy solution.
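
Why the order matters can be shown with a small sketch.  The
detect_vendor() helper is hypothetical, and /etc/arch-release is an
assumed name for the Arch-specific file, which this message does not
spell out; the real check lives in the autoconf macros.

```python
def detect_vendor(exists):
    """Guess the distribution; `exists` reports whether a path is present."""
    # The more specific marker must be tested first, because both
    # distributions can provide /etc/lsb-release.
    if exists("/etc/arch-release"):   # assumed Arch-specific file
        return "arch"
    if exists("/etc/lsb-release"):
        return "ubuntu"
    return "unknown"
```

On an Arch system with lsb-release installed both files are present,
and only the Arch-first ordering yields "arch".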

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Add LIBSELINUX to mount_zfs_LDFLAGS.

Regenerating the autotools configuration on Debian and Ubuntu systems
causes compilation to fail with this error message:

cmd/mount_zfs/../../cmd/mount_zfs/mount_zfs.c:403:
undefined reference to `is_selinux_enabled'

In the automake template, set "mount_zfs_LDFLAGS = ... $(LIBSELINUX)"
so that the /sbin/mount.zfs utility is linked to libselinux.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Linux 3.2 compat: set_nlink()

Directly changing inode->i_nlink is deprecated in Linux 3.2 by commit

SHA: bfe8684869601dacfcb2cd69ef8cfd9045f62170

Use the new set_nlink() kernel function instead.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #462

Update the character class in the zpool man page.

ZoL and all Solaris derivatives allow pool names to contain the colon
and space characters. Update the man page to reflect current behavior.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #438

PPA 0.6.0.41-0ubuntu1 release.

Merge branch 'upstream'

Add make rule for building Arch Linux packages

Added the necessary build infrastructure for building packages
compatible with the Arch Linux distribution. As such, one can now run:

    $ ./configure
    $ make pkg     # Alternatively, one can run 'make arch' as well

on the Arch Linux machine to create two binary packages compatible with
the pacman package manager, one for the zfs userland utilities and
another for the zfs kernel modules. The new packages can then be
installed by running:

    # pacman -U $package.pkg.tar.xz

In addition, source-only packages suitable for an Arch Linux chroot
environment or remote builder can also be built using the 'sarch' make
rule.

NOTE: Since the source dist tarball is created on the fly from the head
of the build tree, its MD5 hash signature will be continually in flux.
As a result, the md5sum variable was intentionally omitted from the
PKGBUILD files, and the '--skipinteg' makepkg option is used. This may
or may not have any serious security implications, as the source tarball
is not being downloaded from an outside source.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #491

Illumos #734: Use taskq_dispatch_ent() interface

It has been observed that some of the hottest locks are those
of the zio taskqs.  Contention on these locks can limit the
rate at which zios are dispatched which limits performance.

This upstream change from Illumos uses a new interface to the
taskqs which allows them to utilize a prealloc'ed taskq_ent_t.
This removes the need to perform an allocation at dispatch
time while holding the contended lock.  This has the effect
of improving system performance.

Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Alexey Zaytsev <alexey.zaytsev@nexenta.com>
Reviewed by: Jason Brian King <jason.brian.king@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References to Illumos issue:
  https://www.illumos.org/issues/734

Ported-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #482

Set zvol_major/zvol_threads permissions

The zvol_major and zvol_threads module options were being created
with 0 permission bits.  This prevented them from being listed in
the /sys/module/zfs/parameters/ directory, although they were
visible in `modinfo zfs`.  This patch fixes the issue by updating
the permission bits to 0444.  For the moment these options must
be read-only because they are used during module initialization.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #392

PPA 0.6.0.40-0ubuntu1 release.

Revert "Depend on udev that provides stand-alone path_id."

This reverts commit ad41a1ca2de555873638d2ed042ad6a6ecd8611d.

Merge branch 'upstream'

Update default ARC memory limits

In the upstream OpenSolaris ZFS code the maximum ARC usage is
limited to 3/4 of memory or all but 1GB, whichever is larger.
Because of how Linux's VM subsystem is organized these defaults
have proven to be too large which can lead to stability issues.

To avoid making everyone manually tune the ARC the defaults are
being changed to 1/2 of memory or all but 4GB.  The rationale for
this is as follows:

* Desktop Systems (less than 8GB of memory)

  Limiting the ARC to 1/2 of memory is desirable for desktop
  systems which have highly dynamic memory requirements.  For
  example, launching your web browser can suddenly result in a
  demand for several gigabytes of memory.  This memory must be
  reclaimed from the ARC cache which can take some time.  The
  user will experience this reclaim time as a sluggish system
  with poor interactive performance.  Thus in this case it is
  preferable to leave the memory as free and available for
  immediate use.

* Server Systems (more than 8GB of memory)

  Using all but 4GB of memory for the ARC is preferable for
  server systems.  These systems often run with minimal user
  interaction and have long running daemons with relatively
  stable memory demands.  These systems will benefit most by
  having as much data cached in memory as possible.

These values should work well for most configurations.  However,
if you have a desktop system with more than 8GB of memory you may
wish to further restrict the ARC.  This can still be accomplished
by setting the 'zfs_arc_max' module option.
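
The new default can be worked through with a short sketch.  The
default_arc_max() function is illustrative only and is not the code
in arc.c; it takes "whichever is larger" from the upstream rule and
applies it to the new values.  Python integers do not overflow, so
the 64-bit handling needed in C has no counterpart here.

```python
GIB = 1 << 30


def default_arc_max(total_memory):
    """Larger of half of memory and all but 4 GiB, in bytes."""
    return max(total_memory // 2, total_memory - 4 * GIB)
```

For 4GB of memory this gives 2GB, for 8GB both terms agree on 4GB,
and for 16GB the "all but 4GB" term wins with 12GB, which matches the
desktop and server cases described above.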

Additionally, keep in mind these aren't currently hard limits.
The ARC is based on a slab implementation which can suffer from
memory fragmentation.  Because this fragmentation is not visible
from the ARC it may believe it is within the specified limits while
actually consuming slightly more memory.  How much more memory gets
consumed will be determined by how badly fragmented the slabs are.

In the long term this can be mitigated by slab defragmentation code
which was the OpenSolaris solution.  Or preferably, using the page cache
to back the ARC under Linux would be even better.  See issue #75
for the benefits of more tightly integrating with the page cache.

This change also fixes an issue where the default ARC max was being
set incorrectly for machines with less than 2GB of memory.  The
constant in the arc_c_max comparison must be explicitly cast to
a uint64_t type to prevent overflow and the wrong conditional
branch being taken.  This failure was typically observed in VMs
which are commonly created with less than 2GB of memory.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #75

Quote variables in the zfs.lsb script.

For consistency and safety, quote all variables in the zfs.lsb script.
This protects in the unlikely case that any of the file names contain
whitespace.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #439

Source /etc/default/zfs after setting defaults.

Let the administrator override all script variables by sourcing the
/etc/default/zfs file after the default values are set.

The spelling mistake in the old path name makes it unlikely that this
bug affected any users.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #371

Demote the whackbang in the zpool_id script.

The zpool_id script is posixly correct and does not use bash
features, so change its whackbang from /bin/bash to /bin/sh.

Debian policy also stipulates that system scripts be dash compatible.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Demote egrep to grep in the zpool_id script.

Direct invocation of GNU egrep is deprecated by its man page, and
its argument in the zpool_id script is not an extended expression.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Quote variables in the zpool_id script.

For consistency and safety, quote all variables in the zpool_id
script. This accommodates a `-c CONFIG` parameter value with
whitespace in the path name.

Also fix a typo in the usage synopsis for `-h`.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #439

Support path_id changes in udev 174.

The /lib/udev/path_id helper became a builtin command in the udev 174
release, so test whether path_id is external in the zpool_id script.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #429

Added comments for libshare's NFS functions.

The purpose of some of these functions wasn't immediately obvious
without additional explanation. This commit adds the missing comments.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Fix configure tests to play nice with GCC 4.6

As of GCC 4.6, specific kernel 2.6.32 header files do not compile
cleanly without warnings. One specific example of this is the
arch/x86/include/asm/percpu.h file. Thus, a few of the configure tests
were getting hung up on this and the '-Wno-unused-but-set-variable'
compile option had to be introduced.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #459

Allow xattrs on symlinks

The Solaris version of ZFS does not allow xattrs to be set on
symlinks due to the way they implemented the attropen() system
call.  Linux however implements xattrs through the lgetxattr()
and lsetxattr() system calls which do not have this limitation.

The only reason this hasn't always worked under ZFS on Linux
is that the xattr handlers were not registered for symlink type
inodes.  This was done simply to be consistent with the Solaris
behavior.

Upon further reflection I believe this should be allowed under
Linux.  The only ill effect would be that the xattrs on symlinks
will not be visible when the pool is imported on a Solaris
system.  This also has the benefit that it allows for SELinux
style security xattr labeling which expects to be able to set
xattrs on all inode types.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #272

Implement SA based xattrs

The current ZFS implementation stores xattrs on disk using a hidden
directory.  In this directory a file name represents the xattr name
and the file contents are the xattr binary data.  This approach is
very flexible and allows for arbitrarily large xattrs.  However,
it also suffers from a significant performance penalty.  Accessing
a single xattr can require up to three disk seeks.

  1) Look up the dnode object.
  2) Look up the dnode's xattr directory object.
  3) Look up the xattr object in the directory.

To avoid this performance penalty Linux filesystems such as ext3
and xfs try to store the xattr as part of the inode on disk.  When
the xattr is too large to store in the inode then a single external
block is allocated for it.  In practice most xattrs are small
and this approach works well.

The addition of System Attributes (SA) to zfs provides us a clean
way to make this optimization.  When the dataset property 'xattr=sa'
is set then xattrs will be preferentially stored as System Attributes.
This allows tiny xattrs (~100 bytes) to be stored with the dnode and
up to 64k of xattrs to be stored in the spill block.  If additional
xattr space is required, which is unlikely under Linux, they will be
stored using the traditional directory approach.

This optimization results in roughly a 3x performance improvement
when accessing xattrs which brings zfs roughly to parity with ext4
and xfs (see table below).  When multiple xattrs are stored per-file
the performance improvements are even greater because all of the
xattrs stored in the spill block will be cached.

However, by default SA based xattrs are disabled in the Linux port
to maximize compatibility with other implementations.  If you do
enable SA based xattrs then they will not be visible on platforms
which do not support this feature.

----------------------------------------------------------------------
   Time in seconds to get/set one xattr of N bytes on 100,000 files
------+--------------------------------+------------------------------
      |            setxattr            |            getxattr
bytes |  ext4     xfs zfs-dir  zfs-sa  |  ext4     xfs zfs-dir  zfs-sa
------+--------------------------------+------------------------------
1     |  2.33   31.88   21.50    4.57  |  2.35    2.64    6.29    2.43
32    |  2.79   30.68   21.98    4.60  |  2.44    2.59    6.78    2.48
256   |  3.25   31.99   21.36    5.92  |  2.32    2.71    6.22    3.14
1024  |  3.30   32.61   22.83    8.45  |  2.40    2.79    6.24    3.27
4096  |  3.57  317.46   22.52   10.73  |  2.78   28.62    6.90    3.94
16384 |   n/a 2342.39   34.30   19.20  |   n/a   45.44  145.90    7.55
65536 |   n/a 2941.39  128.15  131.32* |   n/a  141.92  256.85  262.12*

Legend:
* ext4      - Stock RHEL6.1 ext4 mounted with '-o user_xattr'.
* xfs       - Stock RHEL6.1 xfs mounted with default options.
* zfs-dir   - Directory based xattrs only.
* zfs-sa    - Prefer SAs but spill into directories as needed, a
              trailing * indicates overflow into directories occurred.

NOTE: Ext4 supports 4096 bytes of xattr name/value pairs per file.
NOTE: XFS and ZFS have no limit on xattr name/value pairs per file.
NOTE: Linux limits individual name/value pairs to 65536 bytes.
NOTE: All setxattr/getxattr calls were done after dropping the cache.
NOTE: All tests were run against a single hard drive.
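
As a quick check of the "roughly 3x" figure, the sketch below divides
the zfs-dir timings by the zfs-sa timings from the 1-byte row of the
table. The speedup helper is illustrative only.

```sh
#!/bin/sh
# Timings in seconds copied from the 1-byte row of the table above.
set_dir=21.50; set_sa=4.57
get_dir=6.29;  get_sa=2.43

# Hypothetical helper: ratio of two timings, rounded to one decimal.
speedup() {
	LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

set_speedup=$(speedup "$set_dir" "$set_sa")
get_speedup=$(speedup "$get_dir" "$get_sa")

echo "setxattr: ${set_speedup}x  getxattr: ${get_speedup}x"
```

For this row the ratios come out at about 4.7x for setxattr and 2.6x
for getxattr, which brackets the quoted figure.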

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #443

In autoconf v2.68, AC_LANG_PROGRAM must be quoted

This change updates the AC_LANG_PROGRAM autoconf macro invocations to be
wrapped in quotes. As of autoconf version 2.68, the quotes are necessary
to prevent warnings from appearing. Specifically, the autoconf v2.68
Forward Porting Notes specify:

    It is important to note that you need to ensure that the call to
    AC_LANG_SOURCE is quoted and not expanded, otherwise that will
    cause the warning to appear nonetheless.

Finally, because of the additional quoting we can drop the extra
quotes used by the ZFS_AC_CONFIG_USER_STACK_GUARD autoconf check.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #464

PPA 0.6.0.39-0ubuntu1 release.

Merge branch 'upstream'

Allow leading digits in userquota/groupquota names

While setting/getting userquota and groupquota properties, the input
was not treated as a possible username or groupname if it had a
leading digit. While useradd in Linux recommends the regexp
[a-z_][a-z0-9_-]*[$]? , it is not enforced. This causes problems for
usernames with leading digits in them. We need to be able to support
getting and setting properties for this unconventional but possible
input category.

I've updated the code to validate the username or groupname directly
via the API. Also, note that I moved this validation to the beginning
before the check for SID names with @. This also supports usernames
with an @ character in them, which are valid. Input containing @ is
interpreted as a potential SID name only when it is not a valid username.
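
The ordering of the checks can be sketched in shell as below. This is
an analogue only: the actual change is in C, and classify_name and its
lookup argument are hypothetical. The lookup command defaults to `id`,
which fails for names the system does not know.

```sh
#!/bin/sh
# Hypothetical helper: classify a userquota name.
# $1 = name to classify
# $2 = lookup command that succeeds for known users (default: id)
classify_name() {
	name=$1
	lookup=${2:-id}

	# First ask the system whether this is a real username,
	# regardless of leading digits or an embedded @.
	if "$lookup" "$name" >/dev/null 2>&1; then
		echo "user"
		return
	fi

	# Only names that failed the lookup are considered as SIDs.
	case "$name" in
	*@*) echo "sid" ;;
	*) echo "unknown" ;;
	esac
}

classify_name root
```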

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #428

Limit maximum ashift value to 12

While we initially allowed you to set your ashift as large as 17
(SPA_MAXBLOCKSIZE) that is actually unsafe. What wasn't considered
at the time is that each uberblock written to the vdev label ring
buffer will be of this size. Now the buffer is statically sized
to 128k and we need to be able to fit several uberblocks in it.
With a large ashift that becomes a problem.

Therefore I'm reducing the maximum configurable ashift value to 12.
This is large enough for the 4k sector drives and small enough that
we can still keep the most recent 32 uberblocks in the vdev label
ring buffer.
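
The arithmetic behind the limit, as a small sketch. It assumes, as
stated above, a 128k ring and one uberblock of 2^ashift bytes per
slot; slots_for_ashift is a hypothetical helper.

```sh
#!/bin/sh
# Size of the uberblock ring in each vdev label (128k, per the text).
RING_BYTES=$((128 * 1024))

# Hypothetical helper: number of uberblocks that fit for a given ashift.
slots_for_ashift() {
	echo $((RING_BYTES >> $1))
}

for ashift in 12 13 17; do
	echo "ashift=$ashift slots=$(slots_for_ashift "$ashift")"
done
```

An ashift of 12 leaves room for 32 uberblocks, while an ashift of 17
leaves room for only one.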

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #425

PPA 0.6.0.38-0ubuntu1 release.

Refresh debian/patches after upstream merge.

Merge branch 'upstream'

Fix depmod warning

The depmod utility from module-init-tools 3.12-pre3 generates a
warning when the -e option is used without -E or -F.  This was
observed under OpenSuse 11.4.  To resolve the issue when the
exact System.map-* for your kernel cannot be found fall back to
a generic safe '/sbin/depmod -a'.

  WARNING: -e needs -E or -F
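
A sketch of the fallback logic. The depmod_command helper and its
arguments are hypothetical, and it prints the command instead of
running it so the sketch is safe to try.

```sh
#!/bin/sh
# Hypothetical helper: print the depmod command for a kernel version.
# $1 = kernel version
# $2 = directory to search for System.map-<version>
depmod_command() {
	sysmap="$2/System.map-$1"
	if [ -f "$sysmap" ]; then
		# Exact System.map found: -e is valid when paired with -F.
		echo "/sbin/depmod -ae -F $sysmap $1"
	else
		# No System.map: use the generic form without -e.
		echo "/sbin/depmod -a"
	fi
}

depmod_command "$(uname -r)" /boot
```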

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Linux 3.1 compat, fops->fsync()

The Linux 3.1 kernel updated the fops->fsync() callback yet again.
They now pass the requested range and delegate the responsibility
for calling filemap_write_and_wait_range() to the callback. In
addition i_mutex is no longer held by the caller and the callback
is responsible for taking the lock if required.

This commit updates the code to provide a zpl_fsync() function
for the updated API. Implementations for the previous two APIs
are also maintained for compatibility.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #445