Apply all of Rudd-O's changes for the Fedora init script. The
initial init script was one I threw together based on Rudd-O's
original work. It worked for me but it has some flaws.
Rudd-O has invested considerable time updating it to be significantly
smarter. It now handles using ZFS as your root filesystem plus
various other quirks. Since he is familiar with the right
way to do things on Fedora and has tested this init script we
are integrating all of his changes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Disable the normal reclaim path for the txg_sync thread. This
ensures the thread will never enter dmu_tx_assign() which can
otherwise occur due to direct reclaim. If this is allowed to
happen the system can deadlock. Direct reclaim call path:
Brian Behlendorf [Wed, 30 Mar 2011 01:08:59 +0000 (18:08 -0700)]
Add direct+indirect ARC reclaim
Under OpenSolaris all memory reclaim is done asyncronously. Under
Linux memory reclaim is done asynchronously _and_ synchronously.
When a process allocates memory with GFP_KERNEL it explicitly allows
the kernel to do reclaim on its behalf to satify the allocation.
If that GFP_KERNEL allocation fails the kernel may take more drastic
measures to reclaim the memory such as killing user space processes.
This was observed to happen with ZFS because the ARC could consume
a large fraction of the system memory but no synchronous reclaim
could be performed on it. The result was GFP_KERNEL allocations
could fail resulting in OOM events, and only moments latter the
arc_reclaim thread would free unused memory from the ARC.
This change leaves the arc_thread in place to manage the fundamental
ARC behavior. But it adds a synchronous (direct) reclaim path for
the ARC which can be called when memory is badly needed. It also
adds an asynchronous (indirect) reclaim path which is called
much more frequently to prune the ARC slab caches.
Brian Behlendorf [Wed, 30 Mar 2011 06:04:39 +0000 (23:04 -0700)]
Call d_instantiate before unlocking inode
Under Linux a dentry referencing an inode must be instantiated before
the inode is unlocked. To accomplish this without overly modifing
the core ZFS code the dentry it passed via the vattr_t. There are
cases such as replay when a dentry is not available. In which case
it is obviously not initialized at inode creation time, if a dentry
is needed it will be spliced as when required via d_lookup().
Remove volume and swap handling from the init script.
Managing volumes and swap areas in the init script like
Debian kFreeBSD does is appealing, but providing an
alternative management facility for these things adds
unnecessary complexity.
Preserve the /etc/default/zfsload configuration file by
renaming it to /etc/default/zfs during the replacement
upgrade from the zfs package to the zfsutils package.
Fix `make distclean` for `./configure --with-config=user
Making distclean in module
make[1]: Entering directory `/zfs/module'
make -C SUBDIRS=`pwd` clean
make: Entering an unknown directory
make: *** SUBDIRS=/zfs/module: No such file or directory. Stop.
When using --with-config=user the 'distclean' target would fail
because it assumes the kernel configuration infrastrure is set up.
This is not the case, nor does it need to be, because the
'--with-config=user' option will prune the entire ./module subtree
from SUBDIRS. This prevents most build rules from operating in the
./module directory.
However, the 'dist*' rules will still traverse this directory
because it is listed in DIST_SUBDIRS. This is correct because we
need to ensure the dist rules package the directory contents
regardless of the configuration for the 'dist' rule. The correct
way to handle this is to only invoke the kernel build system as
part of the 'clean' rule when CONFIG_KERNEL_TRUE is set.
Initial fix provided by Darik Horn <dajhorn@vanadac.com>.
This commit is a slightly refined form of the original.
Ned Bass [Fri, 1 Apr 2011 16:47:05 +0000 (09:47 -0700)]
Call udevadm trigger more safely
Some udev hooks are not designed to be idempotent, so calling udevadm
trigger outside of the distribution's initialization scripts can have
unexpected (and potentially dangerous) side effects. For example, the
system time may change or devices may appear multiple times. See Ubuntu
launchpad bug 320200 and this mailing list post for more details:
To avoid these problems we call udevadm trigger with --action=change
--subsystem-match=block. The first argument tells udev just to refresh
devices, and make sure everything's as it should be. The second
argument limits the scope to block devices, so devices belonging to
other subsystems cannot be affected.
This doesn't fix the problem on older udev implementations that don't
provide udevadm but instead have udevtrigger as a standalone program.
In this case the above options aren't available so there's no way to
call call udevtrigger safely. But we can live with that since this
issue only exists in optional test and helper scripts, and most
zfs-on-linux users are running newer systems anyways.
Brian Behlendorf [Thu, 31 Mar 2011 20:43:49 +0000 (13:43 -0700)]
Update CHAOS 5 Packaging
The CHAOS 5 kernels are now packaged identially to the RHEL6 kernels.
Therefore we can simply use the RHEL6 rules in the spec file when
building packages.
Brian Behlendorf [Thu, 31 Mar 2011 19:16:24 +0000 (12:16 -0700)]
Fix libzpool cv_* build error
This build failure was accidentally introduced by previous commit bfd214a which fixed the load average. Unfortunately, the wrapper
for cv_wait_interruptible was not available in the zfs_context.h
user compatibility code. I failed to notice this because I didn't
rebuild everything cleanly before committing.
undefined reference to `cv_wait_interruptible'
collect2: ld returned 1 exit status
Kernel threads which sleep uninterruptibly on Linux are marked in the (D)
state. These threads are usually in the process of performing IO and are
thus counted against the load average. The txg_quiesce and txg_sync threads
were always sleeping uninterruptibly and thus inflating the load average.
This change makes them sleep interruptibly. Some care is required however
because these threads may now be woken early by signals. In this case the
callers are all careful to check that the required conditions are met after
waking up. If we're woken early due to a signal they will simply go back
to sleep. In this case these changes are safe.
Darik Horn [Wed, 30 Mar 2011 20:37:08 +0000 (15:37 -0500)]
Link /proc/mounts to /etc/mtab.
During system start, the /etc/init/mountall.conf on Ubuntu is unable to
generate the /etc/mtab file from the /proc/mounts file if a ZFS filesystem is
already mounted.
However, mountall 2.18 behaves properly if /etc/mtab is a symlink to
/proc/mounts. This version was first published in Ubuntu Maverick.
Additionally, the mountall package is peculiar to Ubuntu, so the mountall
dependency breaks Debian compatibility. A better solution would be to update
mountall to recognize ZFS mounts, or to find a proc syntax that mountall
already recognizes.
but the MOUNTPOINT prefix is preserved on descendent filesystems after the
pivot into the regular root, which later breaks things like `zfs mount -a` and
the /etc/mtab refresh.
Assuming one filesystem on the pivot and implicitly remapping it is sensible
behavior for the exec init. Keeping the -R prefix is similarly sensible
behavior for the zfs utility. Thus, the initramfs is probably the best place
to do something unusual.
Darik Horn [Tue, 29 Mar 2011 17:02:14 +0000 (12:02 -0500)]
Change BSD runlevels to Linux runlevels.
The unmodified init script from Debian kFreeBSD causes
these warnings during installation:
update-rc.d: warning: zfs start runlevel arguments (2 3 4 5)
do not match LSB Default-Start values (S)
update-rc.d: warning: zfs stop runlevel arguments (0 1 6)
do not match LSB Default-Stop values (0 6)
Fajar A. Nugraha [Fri, 25 Mar 2011 17:01:28 +0000 (10:01 -0700)]
Spec file compat, %{datadir}
The dracut change caused an error during "make rpm". The cause
is simple, RHEL5 does not recognize the %{datarootdir} macro in
zfs.spec. It was changed to %{datadir} which fixes the build.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Brian Behlendorf [Thu, 24 Mar 2011 18:34:41 +0000 (11:34 -0700)]
Set cmd paths in udev rules using --prefix
The udev/rules.d scripts must use absolute paths to their support
binaries. However, where those binaries get installed depends
on what --prefix was set to when the package was configured.
This change makes the udev/rules.d helpers to *.in files which
are processed by configure. This allows them to be dynamically
updated to include the specified --prefix.
Additionally, this change updates 60-zvol.rules to handle both
the 'add' and 'change' actions. This ensures that that all
valid zvol devices are correctly linked.
Fajar A. Nugraha [Thu, 24 Mar 2011 08:22:52 +0000 (15:22 +0700)]
Fixes to enable zvol symlink creation
This commit fixes issue on
https://github.com/behlendorf/zfs/issues/#issue/172
Changes:
- update BLKZNAME to use _IOR instead of _IO. Kernel 2.6.32 allows
read parameters (copy_to_user) with _IO, while newer kernels (tested
Archlinux's 2.6.37 kernel) enforces _IOR (which is correct)
- fix return code and message on error
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Brian Behlendorf [Tue, 22 Mar 2011 18:22:49 +0000 (11:22 -0700)]
Linux 2.6.29 compat, .freeze_fs/.unfreeze_fs
The .freeze_fs/.unfreeze_fs hooks were not added until Linux 2.6.29
Since these hooks are currently unused they are being removed to
allow support of older kernels.
Brian Behlendorf [Tue, 22 Mar 2011 18:13:41 +0000 (11:13 -0700)]
Linux 2.6.29 compat, credentials
As of Linux 2.6.29 a clean credential API was added to the Linux kernel.
Previously the credential was embedded in the task_struct. Because the
SPL already has considerable support for handling this API change the
ZPL code has been updated to use the Solaris credential API.
Brian Behlendorf [Tue, 22 Mar 2011 16:55:09 +0000 (09:55 -0700)]
Linux 2.6.28 compat, insert_inode_locked()
Added insert_inode_locked() helper function, prior to this most callers
used insert_inode_hash(). The older method doesn't check for collisions
in the inode_hashtable but it still acceptible for use. Fallback to
using insert_inode_hash() when insert_inode_locked() is unavailable.
Brian Behlendorf [Tue, 22 Mar 2011 16:26:38 +0000 (09:26 -0700)]
Linux 2.6.27 compat, blk_queue_stackable()
The blk_queue_stackable() queue flag was added in 2.6.27 to handle dm
stacking drivers. Prior to this request stacking drivers were detected
by checking (q->request_fn == NULL), for earlier kernels we revert to
this legacy behavior.
Brian Behlendorf [Mon, 21 Mar 2011 23:54:59 +0000 (16:54 -0700)]
Linux compat, umount2(2) flags
Older glibc <sys/mount.h> headers did not define all the available
umount2(2) flags. Both MNT_FORCE and MNT_DETACH are supported in the
kernel back to 2.4.11 so we define them correctly if they are missing.
Brian Behlendorf [Mon, 21 Mar 2011 17:19:30 +0000 (10:19 -0700)]
Fix evict() deadlock
Now that KM_SLEEP is not defined as GFP_NOFS there is the possibility
of synchronous reclaim deadlocks. These deadlocks never existed in the
original OpenSolaris code because all memory reclaim on Solaris is done
asyncronously. Linux does both synchronous (direct) and asynchronous
(indirect) reclaim.
This commit addresses a deadlock caused by inode eviction. A KM_SLEEP
allocation may trigger direct memory reclaim and shrink the inode cache.
This can occur while a mutex in the array of ZFS_OBJ_HOLD mutexes is
held. Through the ->shrink_icache_memory()->evict()->zfs_inactive()->
zfs_zinactive() call path the same mutex may be reacquired resulting
in a deadlock. To avoid this deadlock the process must not reacquire
the mutex when it is already holding it.
This is a reasonable fix for now but longer term the ZFS_OBJ_HOLD
mutex locking should be reevaluated. This infrastructure already
prevents us from ever using the Linux lock dependency analysis tools,
and it may limit scalability.
Brian Behlendorf [Sat, 19 Mar 2011 21:34:30 +0000 (14:34 -0700)]
Use KM_PUSHPAGE instead of KM_SLEEP
It used to be the case that all KM_SLEEP allocations were GFS_NOFS.
Unfortunately this often resulted in the kernel being unable to
reclaim the ARC, inode, and dentry caches in a timely manor.
The fix was to make KM_SLEEP a GFP_KERNEL allocation in the SPL.
However, this increases the posibility of deadlocking the system
on a zfs write thread. If a zfs write thread attempts to perform
an allocation it may trigger synchronous reclaim. This reclaim
may attempt to flush dirty data/inode to disk to free memory.
Unforunately, this write cannot finish because the write thread
which would handle it is holding the previous transaction open.
Deadlock.
To avoid this all allocations in the zfs write thread path must
use KM_PUSHPAGE which prohibits synchronous reclaim for that
thread. In this way forward progress in ensured. The risk
with this change is I missed updating an allocation for the
write threads leaving an increased posibility of deadlock. If
any deadlocks remain they will be unlikely but we'll have to
make sure they all get fixed.
Darik Horn [Mon, 21 Mar 2011 00:11:46 +0000 (19:11 -0500)]
Update debian/changelog using git-dch.
Correct typos in the debian/debuild-ppa.sh helper.
Set extend-diff-ignore='.*' in the debian/source/options file to
disable the automatic patch generated by the implied
--single-debian-patch during PPA builds for more than one series.
The debian/patches/debian-changes-* files do not contain a useful
history and only add cruft to the PPA builds. If this source
package is added to a regular repository, then the
extend-diff-ignore option should be disabled.
Darik Horn [Sun, 20 Mar 2011 22:51:59 +0000 (17:51 -0500)]
Remove the init-top/hostname script and force BUSYBOX=y instead.
The pre-init script in the initramfs environment sets the hostname
if the regular /bin/hostname binary and the /etc/hostname file are
included by the hook.
The pre-init invokes `/bin/hostname -b -F /etc/hostname`, which is
incompatible with the busybox builtin because it lacks the '-b'
switch.
This commit mysteriously resolves the console corruption that
occasionally happens on systems that boot into a native ZFS root
filesystem.
Brian Behlendorf [Fri, 18 Mar 2011 21:47:19 +0000 (14:47 -0700)]
Fix 'LDFLAGS=-Wl,--as-needed' build error
Compiling with 'LDFLAGS=-Wl,--as-needed' exposed the fact that
there were some library linking problems introduced by mount_zfs.
In particular, the libzfs library does use nvpair symbols, and
mount_zfs contains no dependencies on libzpool.
Brian Behlendorf [Fri, 18 Mar 2011 20:54:27 +0000 (13:54 -0700)]
Fix getcwd() warning
New versions glibc declare getcwd() with the warn_unused_result attribute.
This results in a warning because the updated mount helper was not
checking this return value. This issue was fixed by checking the return
type and in the case of an error simply returning the passed dataset.
One possible, but unlikely, error would be having your cwd directory
unlinked while the mount command was running.
cmd/mount_zfs/mount_zfs.c: In function ‘parse_dataset’:
cmd/mount_zfs/mount_zfs.c:223:2: error: ignoring return value of
‘getcwd’, declared with attribute warn_unused_result
To simplify the process of using zfs as your root filesystem a
zfs-drucat sub-package has been added. This sub-package adds a zfs
dracut module which allows your initramfs to be rebuilt with zfs
support. The process for doing this is still complicated but there
is clearly interest from the community about getting this working
well and documented. This should help lay some of the groundwork.
Longer term these changes should be pushed in the upstream dracut
package. Once that occurs this subpackage will no longer be
required for new systems, however we may want to conditionally
build this package in the future for systems running older
dracut versions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Brian Behlendorf [Thu, 17 Mar 2011 22:02:28 +0000 (15:02 -0700)]
Add init scripts
To support automatically mounting your zfs on filesystem on boot
a basic init script is needed. Unfortunately, every distribution
has their own idea of the _right_ way to do things. Rather than
write one very complicated portable init script, which would be
invariably replaced by the distributions own anyway. I have
instead added support to provide multiple distribution specific
init scripts.
The correct init script for your distribution will be selected
by ZFS_AC_DEFAULT_PACKAGE which will set DEFAULT_INIT_SCRIPT.
During 'make install' the correct script for your system will
be installed from zfs/etc/init.d/zfs.DEFAULT_INIT_SCRIPT to the
usual /etc/init.d/zfs location.
Currently, there is zfs.fedora and a more generic zfs.lsb init
script. Hopefully, the distribution maintainers who know best
how they want their init scripts to function will feedback their
approved versions to be included in the project.
This change does not consider upstart jobs but I'm not at all
opposed to add that sort of thing.
Brian Behlendorf [Tue, 15 Mar 2011 19:41:19 +0000 (12:41 -0700)]
Register .remount_fs handler
Register the missing .remount_fs handler. This handler isn't strictly
required because the VFS does a pretty good job updating most of the
MS_* flags. However, there's no harm in using the hook to call the
registered zpl callback for various MS_* flags. Additionaly, this
allows us to lay the ground work for more complicated argument parsing
in the future.
Brian Behlendorf [Tue, 15 Mar 2011 19:03:42 +0000 (12:03 -0700)]
Register .sync_fs handler
Register the missing .sync_fs handler. This is a noop in most cases
because the usual requirement is that sync just be initiated. As part
of the DMU's normal transaction processing txgs will be frequently
synced. However, when the 'wait' flag is set the requirement is that
.sync_fs must not return until the data is safe on disk. With the
addition of the .sync_fs handler this is now properly implemented.
Brian Behlendorf [Tue, 15 Mar 2011 18:17:33 +0000 (11:17 -0700)]
Strip 'zfsutil,remount' from /etc/mtab
When updating /etc/mtab we should be careful and strip certain
options. In particular, we need to strip 'zfsutil' because if
we don't the mount utility will helpfull provide it to the
mount helper when we issue mount(8) again. This subverts the
check that the caller is zfs(8) and not mount(8).
Brian Behlendorf [Tue, 15 Mar 2011 16:34:56 +0000 (09:34 -0700)]
Always allow '-o remount,ro'
Allow the mount(8) utility to always operate on all datasets when
remounting them read-only. This critical for rc.sysinit/umountroot
which remounts the root filesystem read-only during shutdown to
ensure everything is correctly flushed to disk.
Fix minor typo, the check to set zfsutil should use the bitwise
'&'. I must have accidentally hit the adjacent '*' and obviously
neither the compiler or my code review caught this. Fix it now.