Brian Behlendorf [Fri, 25 Feb 2011 23:48:18 +0000 (15:48 -0800)]
Add zlib regression test
A zlib regression test has been added to verify the correct behavior
of z_compress_level() and z_uncompress(). The test case simply takes
a 128k buffer, compresses it, then uncompresses it, and finally
compares the buffers after the transform. If the buffers match then
everything is fine and no data was lost. It performs this test for
all 9 zlib compression levels.
Brian Behlendorf [Fri, 25 Feb 2011 21:26:19 +0000 (13:26 -0800)]
Fix zlib compression
While portions of the code needed to support z_compress_level() and
z_uncompress() were in place, in reality the current implementation
was non-functional; it merely compiled.
The critical missing component was to set up a workspace for the
compress/uncompress stream structures to use. A kmem_cache was
added for the workspace area because we require a large chunk
of memory. This avoids the need to continually alloc/free this
memory and vmap() the pages, which is very slow. Several objects
will reside in the per-cpu kmem_cache making them quick to acquire
and release. A further optimization would be to adjust the
implementation to additionally ensure the memory is local to the cpu.
Currently that may not be the case.
Brian Behlendorf [Sun, 20 Feb 2011 22:02:48 +0000 (14:02 -0800)]
Use Linux flock struct
Rather than defining our own structure, which will conflict with
Linux's version when building 32-bit, simply set up a typedef
to always use the correct Linux version for both 32 and 64-bit
builds.
Brian Behlendorf [Wed, 23 Feb 2011 20:25:45 +0000 (12:25 -0800)]
Linux compat 2.6.37, invalidate_inodes()
In the 2.6.37 kernel the function invalidate_inodes() is no longer
exported for use by modules. This memory management functionality
is needed to invalidate the inodes attached to a super block without
unmounting the filesystem.
Because this function still exists in the kernel and the prototype
is available in a common header all we strictly need is the symbol
address. The address is obtained using spl_kallsyms_lookup_name()
and assigned to the variable invalidate_inodes_fn. Then a #define
is used to replace all instances of invalidate_inodes() with a
call to the acquired address. All the complexity is hidden behind
HAVE_INVALIDATE_INODES and invalidate_inodes() can be used as usual.
Long term we should try to get this, or another, interface made
available to modules again.
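The lookup-then-redirect pattern described above can be sketched in
plain user-space C. Everything here is a stand-in: lookup_name()
plays the role of spl_kallsyms_lookup_name(), the symbol table and
struct super_block are hypothetical, and the "hidden" function only
pretends to be the unexported kernel symbol.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the kernel's super block; only what the sketch needs. */
struct super_block {
	int nr_inodes;
};

/* Signature shared by the real function and the looked-up pointer. */
typedef int (*invalidate_inodes_t)(struct super_block *sb);

/* Pretend this lives in the kernel and is no longer exported. */
static int hidden_invalidate_inodes(struct super_block *sb)
{
	assert(sb != NULL);
	sb->nr_inodes = 0;
	return 0;		/* no busy inodes in this sketch */
}

/* Hypothetical symbol table standing in for kallsyms. */
struct sym {
	const char *name;
	invalidate_inodes_t fn;
};

static const struct sym symtab[] = {
	{ "invalidate_inodes", hidden_invalidate_inodes },
};

/* Hypothetical stand-in for spl_kallsyms_lookup_name(). */
static invalidate_inodes_t lookup_name(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(symtab) / sizeof(symtab[0]); i++)
		if (strcmp(symtab[i].name, name) == 0)
			return symtab[i].fn;
	return NULL;
}

/* Resolved once at init time, then used for every call. */
static invalidate_inodes_t invalidate_inodes_fn;

static int compat_init(void)
{
	invalidate_inodes_fn = lookup_name("invalidate_inodes");
	return invalidate_inodes_fn != NULL ? 0 : -1;
}

/* Callers keep writing invalidate_inodes(sb) as usual. */
#define invalidate_inodes(sb)	invalidate_inodes_fn(sb)
```

The important detail is that compat_init() must succeed before the
first call through the macro, otherwise the call goes through a NULL
pointer.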
Brian Behlendorf [Thu, 10 Feb 2011 22:40:57 +0000 (14:40 -0800)]
Prefer /lib/modules/$(uname -r)/ links
Preferentially use the /lib/modules/$(uname -r)/source and
/lib/modules/$(uname -r)/build links. Only if neither of these
links exists fall back to alternate methods for deducing which
kernel to build with. This resolves the need to manually
specify --with-linux= and --with-linux-obj= on Debian systems.
Roll the version forward to 0.6.0. While no major changes
really warrant this I want to keep the version in step with
ZFS for now which is the only SPL consumer.
Previously we would ASSERT in cv_destroy() if it was ever called
with active waiters. However, I've now seen several instances in
OpenSolaris code where they do the following:
cv_broadcast();
cv_destroy();
This leaves no time for active waiters to be woken up and scheduled
and we trip the ASSERT. This has not been observed to be an issue
on OpenSolaris because their cv_destroy() basically does nothing.
They still do run the risk of the memory being free'd after the
cv_destroy() and hitting a bad paging request. But in practice
this race is so small and unlikely it either doesn't happen, or
is so unlikely when it does happen the root cause has not yet been
identified.
Rather than risk the same issue in our code this change updates
cv_destroy() to block until all waiters have been woken and
scheduled. This may take some time because each waiter must
acquire the mutex.
This change may have an impact on performance for frequently
created and destroyed condition variables. That however is a price
worth paying to avoid crashing your system. If performance issues
are observed they can be addressed by the caller.
Brian Behlendorf [Wed, 22 Dec 2010 21:45:02 +0000 (13:45 -0800)]
Minor policy interface
Simply add the policy function wrappers. They are completely
non-functional and always return that everything is OK, but once
again they simplify compilation of dependent packages for now.
These can/should be removed once the security policy of the
dependent application is completely understood and integrated
as appropriate with Linux.
Brian Behlendorf [Wed, 22 Dec 2010 21:41:57 +0000 (13:41 -0800)]
Add missing headers
Dependent packages require the following missing headers to
simplify compilation. The headers are basically just stubbed
out with minimal content required.
Brian Behlendorf [Wed, 12 Jan 2011 19:29:17 +0000 (11:29 -0800)]
Add MAXUID define
For Linux the maximum uid can vary depending on how your kernel
is built. The Linux kernel still can be compiled with 16 bit uids
and gids, although I'm not aware of a major distribution which does
this (maybe an embedded one?). Given that caveat it is reasonably
safe to define the MAXUID as 2147483647.
Brian Behlendorf [Wed, 12 Jan 2011 19:22:34 +0000 (11:22 -0800)]
Minimal VFS additions
This patch simply removes the place holder vfs_t type and includes
some generic Linux VFS headers. It also makes some minor fid_t
additions for compatibility.
Brian Behlendorf [Tue, 11 Jan 2011 19:46:49 +0000 (11:46 -0800)]
Remove VN_HOLD/VN_RELE/VOP_PUTPAGE
Previously these were defined to no-ops but rather than give
the misleading impression that these are actually implemented
I'm removing them entirely for clarity.
Brian Behlendorf [Tue, 11 Jan 2011 19:57:02 +0000 (11:57 -0800)]
Make vn_cache|vn_file_cache kmem caches
Both of these caches were previously allowed to be either a
vmem or kmem cache based on the size of the object involved.
Since we know the objects won't be too large and performance is
much better for a kmem cache, force them to be kmem backed.
Neependra Khare [Mon, 6 Dec 2010 11:35:58 +0000 (17:05 +0530)]
Add cv_timedwait_interruptible() function
The cv_timedwait() function by definition must wait unconditionally
for cv_signal()/cv_broadcast() before waking. This causes processes
to go in the D state which increases the load average. The load
average is the summation of processes in D state and run queue.
To avoid this it can be desirable to sleep interruptibly. These
processes do not count against the load average but may be woken by
a signal. It is up to the caller to determine why the process
was woken; it may be for one of three reasons.
1) cv_signal()/cv_broadcast()
2) the timeout expired
3) a signal was received
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Brian Behlendorf [Mon, 10 Jan 2011 20:35:22 +0000 (12:35 -0800)]
Linux Compat: inode->i_mutex/i_sem
Create spl_inode_lock/spl_inode_unlock compatibility macros to simplify
access to the inode mutex/sem. This avoids the need to ugly up the
code with the required #define's at every call site. At the
moment the SPL only uses this in one place but higher layers can
benefit from the macro.
To validate the correct behavior of the TSD interfaces it's
important that we add a regression test. This test is designed
to minimally exercise the fundamental TSD behavior, it does not
attempt to validate all potential corner cases.
The test will first create 32 keys via tsd_create() and register
a common destructor. Next 16 wait threads will be created each
of which set/verify a random value for all 32 keys, then block
waiting to be released by the control thread. Meanwhile the
control thread verifies that none of the destructors have been
run prematurely.
The next phase of the test is to create 16 exit threads which
set/verify a random value for all 32 keys. They then immediately
exit. This is designed to verify tsd_exit() which will be
called via thread_exit(). This must result in all registered
destructors being run and the memory for the tsd being free'd.
After this tsd_destroy() is verified by destroying all 32 keys.
Once again we must see the expected number of destructors run
and the tsd memory free'd. At this point the blocked threads
are released and they exit calling tsd_exit() which should do
very little since all the tsd has already been destroyed.
If this all goes off without a hitch the test passes. To ensure
no memory has been leaked, I have manually verified that after
spl module unload no memory is reported leaked.
Brian Behlendorf [Tue, 30 Nov 2010 17:51:46 +0000 (09:51 -0800)]
Add Thread Specific Data (TSD) Implementation
Thread specific data has been implemented using a hash table; this
avoids the need to add a member to the task structure and allows maximum
portability between kernels. This implementation has been optimized
to keep the tsd_set() and tsd_get() times as small as possible.
The majority of the entries in the hash table are for specific tsd
entries. These entries are hashed by the product of their key and
pid because by design the key and pid are guaranteed to be unique.
Their product also has the desirable property that it will be uniformly
distributed over the hash bins providing neither the pid nor key is zero.
Under Linux the zero pid is always the idle task and thus won't be
used, and this implementation is careful never to assign a zero key.
By default the hash table is sized to 512 bins which is expected to
be sufficient for light to moderate usage of thread specific data.
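The key/pid hashing described above can be illustrated with a small
user-space sketch. This is not the SPL code: the entry layout, the
chained bins, and the names tsd_search(), tsd_set_value() and
tsd_get_value() are hypothetical, and there is no locking. Only the
"hash on key * pid into 512 bins" idea is taken from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TSD_HASH_BINS	512	/* default table size from the text */

typedef unsigned int uint_t;

struct tsd_entry {
	uint_t key;
	int pid;
	void *value;
	struct tsd_entry *next;	/* chain within one bin */
};

static struct tsd_entry *bins[TSD_HASH_BINS];

/* Hash on the product of key and pid; both must be non-zero. */
static size_t tsd_hash(uint_t key, int pid)
{
	assert(key != 0 && pid > 0);
	return ((unsigned long)key * (unsigned long)pid) % TSD_HASH_BINS;
}

/* Walk the bin's chain looking for an exact (key, pid) match. */
static struct tsd_entry *tsd_search(uint_t key, int pid)
{
	struct tsd_entry *e;

	for (e = bins[tsd_hash(key, pid)]; e != NULL; e = e->next)
		if (e->key == key && e->pid == pid)
			return e;
	return NULL;
}

/* Insert or update the value stored for (key, pid). */
static int tsd_set_value(uint_t key, int pid, void *value)
{
	struct tsd_entry *e = tsd_search(key, pid);

	if (e == NULL) {
		size_t bin = tsd_hash(key, pid);

		e = malloc(sizeof(*e));
		if (e == NULL)
			return -1;
		e->key = key;
		e->pid = pid;
		e->next = bins[bin];
		bins[bin] = e;
	}
	e->value = value;
	return 0;
}

/* Return the stored value, or NULL if (key, pid) was never set. */
static void *tsd_get_value(uint_t key, int pid)
{
	struct tsd_entry *e = tsd_search(key, pid);

	return e != NULL ? e->value : NULL;
}
```

Two different (key, pid) pairs can still land in the same bin, which
is why the search compares both fields rather than trusting the hash.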
The hash table contains two additional types of entries. The first
type of entry is called a 'key' entry and it is added to the hash during
tsd_create(). It is used to store the address of the destructor function
and it is used as an anchor point. All tsd entries which use the same
key will be linked to this entry. This is used during tsd_destroy() to
quickly call the destructor function for all tsd associated with the key.
The 'key' entry may be looked up with tsd_hash_search() by passing the
key you wish to lookup and DTOR_PID constant as the pid.
The second type of entry is called a 'pid' entry and it is added to the
hash the first time a process sets a key. The 'pid' entry is also used
as an anchor and all tsd for the process will be linked to it. This
list is used during tsd_exit() to ensure all registered destructors
are run for the process. The 'pid' entry may be looked up with
tsd_hash_search() by passing the PID_KEY constant as the key, and
the process pid. Note that tsd_exit() is called by thread_exit()
so if you're using the Solaris thread API you should not need to call
tsd_exit() directly.
When HAVE_MUTEX_OWNER and CONFIG_SMP are defined, kmutex_t is just
a typedef for struct mutex.
This is generally OK but has the downside that it allows mistakes
such as mutex_lock(&kmutex_var) to pass by unnoticed until someone
compiles the code without HAVE_MUTEX_OWNER or CONFIG_SMP (in which
case kmutex_t is a real struct). Note that the correct API to call
should have been mutex_enter() rather than mutex_lock().
We prevent this kind of mistake by making kmutex_t a real structure
with only one field. This makes kmutex_t typesafe and it shouldn't
have any impact on the generated assembly code.
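A minimal user-space sketch of the single-field wrapper follows.
The struct mutex here is a toy stand-in for the kernel type and the
field name m is an assumption; only the shape of the kmutex_t wrapper
reflects the change described above.

```c
#include <assert.h>

/* Toy stand-in for the kernel's struct mutex. */
struct mutex {
	int locked;
};

static void mutex_lock(struct mutex *m)
{
	assert(!m->locked);
	m->locked = 1;
}

static void mutex_unlock(struct mutex *m)
{
	assert(m->locked);
	m->locked = 0;
}

/* A real struct with a single field instead of a bare typedef. */
typedef struct {
	struct mutex m;
} kmutex_t;

/* The Solaris-style API unwraps the field explicitly. */
static void mutex_enter(kmutex_t *mp)
{
	mutex_lock(&mp->m);
}

static void mutex_exit(kmutex_t *mp)
{
	mutex_unlock(&mp->m);
}
```

With this layout a stray mutex_lock(&kmutex_var) passes a kmutex_t *
where a struct mutex * is expected, so the compiler reports an
incompatible pointer type instead of silently accepting the call.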
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Brian Behlendorf [Tue, 23 Nov 2010 18:56:55 +0000 (10:56 -0800)]
Clear cv->cv_mutex when not in use
For debugging purposes the condition variables keep track of the
mutex used during a wait. The idea is to validate that all callers
always use the same mutex. Unfortunately, we have seen cases where
the caller reuses the condition variable with a different mutex but
in a way which is known to be safe. My reading of the man pages
suggests you should not do this and always cv_destroy()/cv_init()
a new mutex. However, there is overhead in doing this and it does
appear to be allowed under Solaris.
To accommodate this behavior cv_wait_common() and __cv_timedwait()
have been modified to clear the associated mutex when the last
waiter is dropped. This ensures that while the condition variable
is in use the incorrect mutex case is detected. It also allows the
condition variable to be safely recycled without requiring the
overhead of a cv_destroy()/cv_init() as long as it isn't currently
in use.
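The bookkeeping described above can be sketched without any real
blocking. The helper names cv_wait_begin() and cv_wait_end() are
hypothetical, and where the SPL would trip an ASSERT on a mismatched
mutex this sketch returns -1 so the behavior can be checked. It
assumes, as the text does, that the caller holds the mutex around
both calls.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in mutex type; only its address matters here. */
typedef struct {
	int unused;
} kmutex_t;

typedef struct {
	kmutex_t *cv_mutex;	/* mutex shared by the current waiters */
	int cv_waiters;		/* number of threads currently waiting */
} kcondvar_t;

/* Entering a wait: record the mutex, or reject a different one. */
static int cv_wait_begin(kcondvar_t *cvp, kmutex_t *mp)
{
	if (cvp->cv_mutex == NULL)
		cvp->cv_mutex = mp;
	else if (cvp->cv_mutex != mp)
		return -1;	/* the SPL would ASSERT here */

	cvp->cv_waiters++;
	return 0;
}

/* Leaving a wait: the last waiter out clears the recorded mutex. */
static void cv_wait_end(kcondvar_t *cvp)
{
	assert(cvp->cv_waiters > 0);
	if (--cvp->cv_waiters == 0)
		cvp->cv_mutex = NULL;
}
```

Once the waiter count drops to zero the condition variable can be
reused with a different mutex without a cv_destroy()/cv_init() pair.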
Finally, spin lock cv->cv_lock was removed because it is not required.
When the condition variable is used properly the caller will always
be holding the mutex so the spin lock is redundant. The lock was
originally added because I expected to need to protect more than
just the cv->cv_mutex. It turns out that was not the case.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ned Bass [Tue, 9 Nov 2010 22:06:13 +0000 (14:06 -0800)]
Give ENOTSUP a valid user space error value
The ZFS module returns ENOTSUP for several error conditions where an operation
is not (yet) supported. The SPL defined ENOTSUP in terms of ENOTSUPP, but that
is an internal Linux kernel error code that should not be seen by user
programs. As a result the zfs utilities print a confusing error message if an
unsupported operation is attempted:
internal error: Unknown error 524
Aborted
This change defines ENOTSUP in terms of EOPNOTSUPP which is consistent with
user space.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Brian Behlendorf [Wed, 10 Nov 2010 20:58:07 +0000 (12:58 -0800)]
Linux 2.6.36 compat, use fops->unlocked_ioctl()
As of linux-2.6.36 the last in-tree consumer of fops->ioctl() has
been removed and thus fops->ioctl() has also been removed. The
replacement hook is fops->unlocked_ioctl() which has existed in
kernel since 2.6.12. Since the SPL only contains support back
to 2.6.18 vintage kernels, I'm not adding an autoconf check for
this and simply moving everything to use fops->unlocked_ioctl().
In the linux-2.6.36 kernel the fs_struct lock was changed from a
rwlock_t to a spinlock_t. If the kernel would export the set_fs_pwd()
symbol by default this would not have caused us any issues, but they
don't. So we're forced to add a new autoconf check which sets the
HAVE_FS_STRUCT_SPINLOCK define when a spinlock_t is used. We can
then correctly use either spin_lock or write_lock in our custom
set_fs_pwd() implementation.
As of linux-2.6.36 RLIM64_INFINITY is defined in linux/resource.h.
This is handled by conditionally defining RLIM64_INFINITY in the
SPL only when the kernel does not provide it.
Flagged by the default compile options on archlinux 2010.05, we should
be using the krw_t type not the krw_type_t type in the private data.
module/splat/splat-rwlock.c: In function ‘splat_rwlock_test4_func’:
module/splat/splat-rwlock.c:432:6: warning: case value ‘1’ not in
enumerated type ‘krw_type_t’
It's important to clear mp->owner after calling mutex_unlock()
because when CONFIG_DEBUG_MUTEXES is defined the mutex owner
is verified in mutex_unlock(). If we set it to NULL before the
unlock this check fails and the lockdep support is immediately
disabled.
Brian Behlendorf [Fri, 22 Oct 2010 21:16:43 +0000 (14:16 -0700)]
Fix 2.6.35 shrinker callback API change
As of linux-2.6.35 the shrinker callback API now takes an additional
argument. The shrinker struct is passed to the callback so that users
can embed the shrinker structure in private data and use container_of()
to access it. This removes the need to always use global state for the
shrinker.
To handle this we add the SPL_AC_3ARGS_SHRINKER_CALLBACK autoconf
check to properly detect the API. Then we simply setup a callback
function with the correct number of arguments. For now we do not make
use of the new 3rd argument.
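As an illustration of what the extra argument makes possible (the
SPL itself does not use it yet), here is a user-space sketch of the
container_of() pattern. The struct shrinker below is a cut-down
stand-in, and struct my_cache and my_cache_shrink() are hypothetical
names.

```c
#include <assert.h>
#include <stddef.h>

/* Recover the enclosing structure from a pointer to one member. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Cut-down stand-in for the 2.6.35 style struct shrinker. */
struct shrinker {
	int (*shrink)(struct shrinker *shrink, int nr_to_scan,
	    unsigned int gfp_mask);
	int seeks;
};

/* Private cache state with the shrinker embedded in it. */
struct my_cache {
	int nr_objects;
	struct shrinker shrinker;
};

/* Three argument callback: no global state is needed. */
static int my_cache_shrink(struct shrinker *shrink, int nr_to_scan,
    unsigned int gfp_mask)
{
	struct my_cache *c = container_of(shrink, struct my_cache, shrinker);

	(void)gfp_mask;		/* unused in this sketch */
	assert(nr_to_scan >= 0);

	if (nr_to_scan > c->nr_objects)
		nr_to_scan = c->nr_objects;
	c->nr_objects -= nr_to_scan;

	return c->nr_objects;	/* objects remaining in the cache */
}
```

Each cache instance carries its own shrinker, so the same callback
serves any number of caches.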
Brian Behlendorf [Wed, 15 Sep 2010 16:05:34 +0000 (09:05 -0700)]
Fix markdown rendering
These two lines were being rendered incorrectly on the GitHub
site. To fix the issue there needs to be leading whitespace
before each line to ensure each command is rendered on its
own line.
Brian Behlendorf [Tue, 14 Sep 2010 22:54:15 +0000 (15:54 -0700)]
Reference new zfsonlinux.org website
The wiki contents have been converted to html and made available
at their new home http://zfsonlinux.org. The wiki has also been
disabled; the html pages are now the official documentation.
One of the neat tricks an autoconf style project is capable of
is allowing configuration/building in a directory other than the
source directory. The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.
For example, this project is designed to work on various different
Linux distributions each of which works slightly differently. This
means that changes need to be verified on each of those supported
distributions, preferably before the change is committed to the
public git repo.
Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution. When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.
wget -c http://github.com/downloads/behlendorf/spl/spl-x.y.z.tar.gz
tar -xzf spl-x.y.z.tar.gz
cd spl-x.y.z
------------------------- run concurrently ----------------------
<ubuntu system>  <fedora system>  <debian system>  <rhel6 system>
mkdir ubuntu     mkdir fedora     mkdir debian     mkdir rhel6
cd ubuntu        cd fedora        cd debian        cd rhel6
../configure     ../configure     ../configure     ../configure
make             make             make             make
make check       make check       make check       make check
This is something the project has almost supported for a long time
but finishing this support should save me lots of time.
This check was previously done with a hack in config.guess.
However, since a new config.guess is copied in to place when
forcing a full autoreconf this change was easily lost and
never a good idea. This commit also updates all of the
autoconf style support scripts in config.
Full up to date build information will stay on the wiki for
now, but there is no harm in adding the bare bones instructions
to the README. They shouldn't change and are a reasonable
quick start.
Brian Behlendorf [Fri, 27 Aug 2010 20:51:25 +0000 (13:51 -0700)]
Add list_link_replace() function
The list_link_replace() function will swap a new item in to the place
of an old item in a list. It is the caller's responsibility to ensure
all lists involved are locked properly.
Brian Behlendorf [Fri, 27 Aug 2010 20:28:10 +0000 (13:28 -0700)]
Stub out kmem cache defrag API
At some point we are going to need to implement the kmem cache
move callbacks to allow for kmem cache defragmentation. This
commit simply lays a small part of the API ground work, it does
not actually implement any of this feature. This is safe for
now because the move callbacks are just an optimization. Even
if they are registered we don't ever really have to call them.
Li Wei [Thu, 12 Aug 2010 16:24:31 +0000 (09:24 -0700)]
Fix stack overflow in vn_rdwr() due to memory reclaim
Unless __GFP_IO and __GFP_FS are removed from the file mapping gfp
mask we may enter memory reclaim during IO. In this case shrink_slab()
entered another file system which is notoriously hungry for stack.
This additional stack usage may cause a stack overflow. This patch
removes __GFP_IO and __GFP_FS from the mapping gfp mask of each file
during vn_open() to avoid any reclaim in the vn_rdwr() IO path. The
original mask is then restored at vn_close() time. Hats off to the
loop driver which does something similar for the same reason.
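The save/clear/restore sequence can be sketched as plain bit
manipulation. The flag values and the struct and function names
below are hypothetical stand-ins (the kernel's __GFP_IO and __GFP_FS
have different values, and the real code works on the file's address
space mapping).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits; the kernel's real values differ. */
#define SK_GFP_WAIT	0x1u
#define SK_GFP_IO	0x2u
#define SK_GFP_FS	0x4u

/* Stand-in for the file's mapping and its allocation mask. */
struct mapping {
	unsigned int gfp_mask;
};

struct vnode_file {
	struct mapping *mapping;
	unsigned int saved_gfp;	/* original mask, restored at close */
};

/* Open: remember the mask, then forbid IO and FS reclaim. */
static void sketch_open(struct vnode_file *vf, struct mapping *m)
{
	vf->mapping = m;
	vf->saved_gfp = m->gfp_mask;
	m->gfp_mask &= ~(SK_GFP_IO | SK_GFP_FS);
}

/* Close: put the original mask back. */
static void sketch_close(struct vnode_file *vf)
{
	assert(vf->mapping != NULL);
	vf->mapping->gfp_mask = vf->saved_gfp;
	vf->mapping = NULL;
}
```

Any allocation made through the mapping between open and close sees
a mask without the IO and FS bits, so it cannot recurse in to another
filesystem during reclaim.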
Ned Bass [Tue, 10 Aug 2010 18:01:46 +0000 (11:01 -0700)]
Correctly handle rwsem_is_locked() behavior
A race condition in rwsem_is_locked() was fixed in Linux 2.6.33 and the fix was
backported to RHEL5 as of kernel 2.6.18-190.el5. Details can be found here:
The race condition was fixed in the kernel by acquiring the semaphore's
wait_lock inside rwsem_is_locked(). The SPL worked around the race condition
by acquiring the wait_lock before calling that function, but with the fix in
place it must not do that.
This commit implements an autoconf test to detect whether the fixed version of
rwsem_is_locked() is present. The previous version of rwsem_is_locked() was an
inline static function while the new version is exported as a symbol which we
can check for in module.symvers. Depending on the result we correctly
implement the needed compatibility macros for proper spinlock handling.
Finally, we do the right thing with spin locks in RW_*_HELD() by using the
new compatibility macros. We only acquire the semaphore's wait_lock if
it is calling a rwsem_is_locked() that does not itself try to acquire the lock.
Some new overhead and a small harmless race is introduced by this change.
This is because RW_READ_HELD() and RW_WRITE_HELD() now acquire and release
the wait_lock twice: once for the call to rwsem_is_locked() and once for
the call to rw_owner(). This can't be avoided if calling a rwsem_is_locked()
that takes the wait_lock, as it will in more recent kernels.
The other case which only occurs in legacy kernels could be optimized by
taking the lock only once, as was done prior to this commit. However, I
decided that the performance gain probably wasn't significant enough to
justify the messy special cases required.
The function spl_rw_get_owner() was only used to enable the afore-mentioned
optimization. Since it is no longer used, I removed it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ned Bass [Fri, 6 Aug 2010 21:04:00 +0000 (14:04 -0700)]
Correctly detect atomic64_cmpxchg support
The RHEL5 2.6.18-194.7.1.el5 kernel added atomic64_cmpxchg to
asm-x86_64/atomic.h. That macro is defined in terms of cmpxchg which
is provided by asm/system.h. However, asm/system.h is not #included by
atomic.h in this kernel nor by the autoconf test for atomic64_cmpxchg, so
the test failed with "implicit declaration of function 'cmpxchg'". This
leads the build system to erroneously conclude that the kernel does not
define atomic64_cmpxchg and enable the built-in definition. This in
turn produces a '"atomic64_cmpxchg" redefined' build warning which is fatal
when building with --enable-debug. This commit fixes this by including
asm/system.h in the autoconf test.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fix taskq code to not drop tasks when TQ_SLEEP is used.
When TQ_SLEEP is used, taskq_dispatch() should always succeed even if the
number of pending tasks is above tq->tq_maxalloc. This semantic is similar
to KM_SLEEP in kmem allocations, which also always succeed.
However, we cannot block forever otherwise there is a risk of deadlock.
Therefore, we still allow the number of pending tasks to go above
tq->tq_maxalloc with TQ_SLEEP, but we may sleep up to 1 second per task
dispatch, thereby throttling the task dispatch rate.
One of the existing splat tests was also augmented to test for this scenario.
The test would fail with the previous implementation but now it succeeds.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Brian Behlendorf [Sat, 31 Jul 2010 05:20:58 +0000 (22:20 -0700)]
Strfree() should call kfree() not kmem_free()
Using kmem_free() results in deducting X bytes from the memory
accounting when --enable-debug is set. Unfortunately, currently
the counterpart kmem_asprintf() and friends do not properly
account for memory allocated, so we must do the same on free.
If we don't then we end up with a negative number of lost bytes
reported when the module is unloaded.
A better long term fix would be to add the accounting in to the
allocation side but that's a project for another day.
Brian Behlendorf [Tue, 27 Jul 2010 17:19:44 +0000 (10:19 -0700)]
Add Debian and Slackware style packaging via alien
The long term fix for Debian and Slackware style packaging is
to add native support for building these packages. Unfortunately,
that is a large chunk of work I don't have time for right now.
That said it would be nice to have at least basic packages for
these distributions.
As a quick short/medium term solution I've settled on using alien
to convert the RPM packages to DEB or TGZ style packages. The
build system has been updated with the following build targets
which will first build RPM packages and then convert them as
needed to the target package type:
make rpm: Create .rpm packages
make deb: Create .deb packages
make tgz: Create .tgz packages
make pkg: Create the right package type for your distribution
The solution comes with a lot of caveats and your mileage may vary.
But basically the big limitations are that the resulting packages:
1) Will not have the correct dependency information.
2) Will not include the kernel version in the release.
3) Will not handle all differences between distributions.
But the resulting packages should be easy to install and remove
from your system and take care of running 'depmod -a' and such.
As I said at the top this is not the right long term solution.
If any of the upstream distribution maintainers want to jump in
and help do this right for their distribution I'd love the help.
Brian Behlendorf [Mon, 26 Jul 2010 22:47:55 +0000 (15:47 -0700)]
Ensure kmem_alloc() and vmem_alloc() never fail
The Solaris semantics for kmem_alloc() and vmem_alloc() are that they
must never fail when called with KM_SLEEP. They may only fail if
called with KM_NOSLEEP; otherwise they must block until memory is
available. This is quite different from how the Linux memory
allocators work. Under Linux a memory allocation failure is always
possible and must be dealt with.
At one point in the past the kmem code did properly implement this
behavior, however as the code evolved this behavior was overlooked
in places. This patch goes through all three implementations of
the kmem/vmem allocation functions and ensures that they will all
block in the KM_SLEEP case when memory is not available. They
may still fail in the KM_NOSLEEP case in which case the caller
is responsible for handling the failure.
Special care is taken in vmalloc_nofail() to avoid thrashing the
system on the virtual address space spin lock. The down side of
course is if you do see a failure here, which is unlikely for
64-bit systems, your allocation will delay for an entire second.
Still this is preferable to locking up your system and it is the
best we can do given the constraints.
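The KM_SLEEP versus KM_NOSLEEP contract can be sketched as a retry
loop. This is an illustration only: alloc_nofail() and the injected
allocator and delay hooks are hypothetical, and the flaky allocator
exists purely so the retry path can be exercised in user space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define KM_SLEEP	0x0	/* must not fail, may block */
#define KM_NOSLEEP	0x1	/* may fail, must not block */

/* Hypothetical hooks so the sketch can be driven from a test. */
typedef void *(*alloc_fn_t)(size_t size);
typedef void (*delay_fn_t)(void);

/* Retry until success for KM_SLEEP; fail fast for KM_NOSLEEP. */
static void *alloc_nofail(size_t size, int flags, alloc_fn_t alloc,
    delay_fn_t delay)
{
	void *ptr;

	assert(flags == KM_SLEEP || flags == KM_NOSLEEP);

	while ((ptr = alloc(size)) == NULL) {
		if (flags == KM_NOSLEEP)
			return NULL;	/* caller handles the failure */
		delay();		/* back off before retrying */
	}

	return ptr;
}

/* Test doubles: an allocator that fails a set number of times. */
static int failures_left;
static int delays_taken;

static void *flaky_alloc(size_t size)
{
	if (failures_left > 0) {
		failures_left--;
		return NULL;
	}
	return malloc(size);
}

static void count_delay(void)
{
	delays_taken++;
}
```

In the real vmalloc_nofail() case the delay is what keeps the retry
loop from hammering the virtual address space lock.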
Additionally, the code was cleaned up to be much more readable
and comments were added to describe the various kmem-debug-*
configure options. The default configure options remain:
"--enable-debug-kmem --disable-debug-kmem-tracking"
Brian Behlendorf [Mon, 26 Jul 2010 17:24:26 +0000 (10:24 -0700)]
Fix two minor compiler warnings
In cmd/splat.c there was a comparison between an __u32 and an int. To
resolve the issue simply use a __u32 and strtoul() when converting the
provided user string.
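A small sketch of the unsigned conversion, using uint32_t as the
user-space equivalent of __u32. The helper name parse_u32() and its
error convention are assumptions, not what cmd/splat.c does.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a decimal, octal or 0x-prefixed string in to a 32-bit value. */
static int parse_u32(const char *str, uint32_t *out)
{
	char *end;
	unsigned long val;

	assert(str != NULL && out != NULL);

	errno = 0;
	val = strtoul(str, &end, 0);
	if (errno != 0 || end == str || *end != '\0')
		return -1;	/* overflow, empty, or trailing junk */
	if (val > UINT32_MAX)
		return -1;	/* does not fit in 32 bits */

	*out = (uint32_t)val;
	return 0;
}
```

Keeping both sides of the later comparison unsigned is what removes
the signed/unsigned warning.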
In module/spl/spl-vnode.c we should explicitly cast nd->last.name to
a const char * which is what is expected by the prototype.
Brian Behlendorf [Wed, 21 Jul 2010 23:31:42 +0000 (16:31 -0700)]
Remove deadcode caused by removal of format1 arg
Commit 55abb0929e4fbe326a9737650a167a1a988ad86b removed the never
used format1 argument of spl_debug_msg(). That in turn resulted
in some deadcode which should be removed since it's now useless.
It was being defined as the constant 64 and at first I changed it to be
NR_CPUS instead.
However, NR_CPUS can be a large value on recent kernels (4096), and this
may cause excessively large kmem allocations to happen.
Therefore, now we use num_possible_cpus(), which should return a (typically)
small value which represents the maximum number of CPUs that can be brought
online in the running hardware (this value is determined at boot time by
arch-specific kernel code).
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When the kvasprintf() call fails they should reset the arguments
by calling va_start()/va_copy() and va_end() inside the loop,
otherwise they'll try to read more arguments rather than starting
over and reading them from the beginning.
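The same va_list hazard exists in user space, so it can be shown
with vsnprintf(). In this sketch the retry happens because the buffer
was too small rather than because an allocation failed, and the names
sketch_vasprintf() and sketch_asprintf() are hypothetical, but the
rule is identical: take a fresh va_copy() for every pass.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Format in to a heap buffer, growing it until the output fits. */
static char *sketch_vasprintf(size_t initial, const char *fmt, va_list ap)
{
	size_t size = initial;
	char *buf = NULL;

	assert(initial > 0);

	for (;;) {
		va_list aq;
		char *tmp;
		int n;

		tmp = realloc(buf, size);
		if (tmp == NULL) {
			free(buf);
			return NULL;
		}
		buf = tmp;

		/* Fresh copy so each pass starts at the first argument. */
		va_copy(aq, ap);
		n = vsnprintf(buf, size, fmt, aq);
		va_end(aq);

		if (n < 0) {
			free(buf);
			return NULL;
		}
		if ((size_t)n < size)
			return buf;	/* output fit, done */

		size = (size_t)n + 1;	/* retry with the exact size */
	}
}

/* Variadic front end; the caller frees the returned string. */
static char *sketch_asprintf(size_t initial, const char *fmt, ...)
{
	va_list ap;
	char *str;

	va_start(ap, fmt);
	str = sketch_vasprintf(initial, fmt, ap);
	va_end(ap);

	return str;
}
```

Passing ap itself to vsnprintf() on every pass would leave it in an
indeterminate state after the first call, which is the bug described
above.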
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fix compilation error due to undefined ACCESS_ONCE macro.
When CONFIG_DEBUG_MUTEXES is turned on in RHEL5's kernel config, the mutexes
store the owner for debugging purposes, therefore the SPL will enable
HAVE_MUTEX_OWNER. However, the SPL code uses ACCESS_ONCE() to access the
owner, and this macro is not defined in the RHEL5 kernel, therefore we define it
ourselves in include/linux/compiler_compat.h.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Brian Behlendorf [Tue, 20 Jul 2010 18:55:37 +0000 (11:55 -0700)]
Prefix all SPL debug macros with 'S'
To avoid conflicts with symbols defined by dependent packages
all debugging symbols have been prefixed with an 'S' for SPL.
Any dependent package needing to integrate with the SPL debug
should include the spl-debug.h header and use the 'S' prefixed
macros. They must also build with DEBUG defined.
Brian Behlendorf [Mon, 19 Jul 2010 21:16:05 +0000 (14:16 -0700)]
Split <sys/debug.h> header
To avoid symbol conflicts with dependent packages the debug
header must be split in to several parts. The <sys/debug.h>
header now only contains the Solaris macros such as ASSERT
and VERIFY. The spl-debug.h header contains the spl specific
debugging infrastructure and should be included by any package
which needs to use the spl logging. Finally the spl-trace.h
header contains internal data structures only used for the log
facility and should not be included by anything but spl-debug.c.
This way dependent packages can include the standard Solaris
headers without picking up any SPL debug macros. However, if
the dependent package wants to integrate with the SPL debugging
subsystem they can then explicitly include spl-debug.h.
Along with this change I have dropped the CHECK_STACK macros
because the upstream Linux kernel now has much better stack
depth checking built in and we don't need this complexity.
Additionally SBUG has been replaced with PANIC and provided as
part of the Solaris macro set. While the Solaris version is
really panic() that conflicts with the Linux kernel so we'll
just have to make do with PANIC. It should rarely be called
directly; the preferred usage would be an ASSERT or VERIFY.
There's lots of change here but this cleanup was overdue.
Ned Bass [Thu, 15 Jul 2010 16:49:38 +0000 (09:49 -0700)]
Proposed fix for oops on SIGINT in splat atomic:64-bit test.
The threads in the splat atomic:64-bit test share the data structure
atomic_priv_t ap, which lives on the kernel stack of the splat user-space
utility. If splat terminates before the threads, accesses to that memory
location by the other threads become invalid. Splat synchronizes with
the threads with the call:
Apparently, the SIGINT wakes and terminates splat prematurely, so that
GPFs or other bad things happen when the threads subsequently access ap.
This commit prevents this by using the uninterruptible form:
Brian Behlendorf [Wed, 14 Jul 2010 18:26:54 +0000 (11:26 -0700)]
Linux 2.6.35 compat: filp_fsync() dropped 'struct dentry *'
The prototype for filp_fsync() dropped the unused argument 'struct dentry *'.
I've fixed this by adding the needed autoconf check and moving all of
those filp related functions to file_compat.h. This will simplify
handling any further API changes in the future.
Brian Behlendorf [Wed, 14 Jul 2010 04:30:56 +0000 (21:30 -0700)]
Proposed fix for low memory ZFS deadlocks
Deadlocks in the zvol were observed when one of the ZFS threads
performing IO trys to allocate memory while the system is low
on memory. The low memory condition causes dirty pages to be
synced to the zvol but this can't progress because the original
thread is blocked waiting on a memory allocation. Thus we end
up deadlocking.
A proper solution proposed by Wizeman is to change KM_SLEEP from
GFP_KERNEL to GFP_NOFS. This will prevent the memory allocation
which is trying to allocate memory from forcing a sync to the
zvol in shrink_page_list()->pageout().
The down side to all of this is that we are using a pretty big
hammer by changing KM_SLEEP. This change means ALL of the zfs
memory allocations will be unable to trigger dirty data to be
synced. The caller still should be able to reclaim memory from
the various slab caches. We will be totally dependent on other
kernel processes which happen to be running and a small number
of asynchronous reclaim threads to trigger the reclaim of dirty
data pages. This should be OK but I think we may see some
slightly longer allocation times when under memory pressure.
Up until now no SPL consumer attempted to perform signed 64-bit
division so there was no need to support this. That has now
changed so I am adding 64-bit division support for 32-bit platforms.
The signed implementation is based on the unsigned version.
Since there have been several bug reports in the past concerning
correct 64-bit division on 32-bit platforms I added some long
overdue regression tests. Much to my surprise the unsigned
64-bit division regression tests failed.
This was surprising because __udivdi3() was implemented by simply
calling div64_u64() which is provided by the kernel. This meant
that the Linux kernel's 64-bit division algorithm on 32-bit platforms
was flawed. After some investigation this turned out to be exactly
the case.
Because of this I was forced to abandon the kernel helper and
instead fully implement 64-bit division in the spl. There are
several published implementations out there on how to do this
properly and I settled on one proposed in the book Hacker's Delight.
Their proposed algorithm is freely available without restriction
and I have just modified it to be Linux kernel friendly.
The updated implementation now passes all the unsigned and signed
regression tests. This should be functional, but not fast, which is
good enough for our purposes. If you want fast too I'd strongly
suggest you upgrade to a 64-bit platform. I have also reported the
kernel bug and we'll see if we can't get it fixed upstream.
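As a rough illustration of what a division routine that avoids the native 64-bit divide can look like, here is a minimal sketch. It uses plain shift-and-subtract long division, which is NOT the Hacker's Delight algorithm mentioned above, and the names udiv64_sketch() and sdiv64_sketch() are hypothetical. The signed version is layered on the unsigned one, as the message describes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Unsigned 64-bit division using only shifts, compares and subtracts.
 * Illustrative shift-and-subtract long division; hypothetical name,
 * not the routine actually used by the SPL.
 */
static uint64_t udiv64_sketch(uint64_t n, uint64_t d)
{
	uint64_t q = 0, r = 0;
	int i;

	assert(d != 0);
	for (i = 63; i >= 0; i--) {
		/* Bring down the next dividend bit into the remainder. */
		r = (r << 1) | ((n >> i) & 1);
		if (r >= d) {
			r -= d;
			q |= (uint64_t)1 << i;
		}
	}
	return q;
}

/* Signed division built on the unsigned routine (truncates toward zero). */
static int64_t sdiv64_sketch(int64_t n, int64_t d)
{
	/* Take magnitudes in unsigned arithmetic so INT64_MIN is representable. */
	uint64_t un = n < 0 ? 0 - (uint64_t)n : (uint64_t)n;
	uint64_t ud = d < 0 ? 0 - (uint64_t)d : (uint64_t)d;
	uint64_t uq = udiv64_sketch(un, ud);

	/* The quotient is negative only when the operand signs differ. */
	return ((n < 0) != (d < 0)) ? (int64_t)(0 - uq) : (int64_t)uq;
}
```

For example, sdiv64_sketch(-100, 7) yields -14. A loop of 64 iterations is slow compared with a hardware divide, which matches the "functional, but not fast" trade-off.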
Lars Johannsen [Thu, 1 Jul 2010 08:39:32 +0000 (10:38 +0159)]
Allow config/build to work with autoconf-2.65
As of autoconf-2.65 the AC_LANG_SOURCE macro no longer
includes the confdefs.h results when expanded. To handle this
simply include confdefs.h explicitly in conftest.c. This will
cause two copies of confdefs.h to be added to the test for
earlier autoconf versions but this is not harmful.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
For some reason when awk is invoked by the usermode helper the command
always fails. Interestingly gawk does not suffer from this problem
which is why I never observed this failure since the distros I tested
with all had gawk installed instead of awk. Anyway, the simplest
thing to do here is to just make gawk mandatory. I've added a
configure check for gawk specifically and have updated the command
to call gawk not awk.
I didn't notice at the time but user_path_dir() was not introduced
at the same time as the set_fs_pwd() change. I had lumped the two
together but in fact user_path_dir() was introduced in 2.6.27 and
set_fs_pwd() taking 2 args was introduced in 2.6.25. This means
builds against 2.6.25-2.6.26 kernels were broken.
To fix this I've added a check for user_path_dir() and no longer
assume that if set_fs_pwd() takes 2 args then user_path_dir() is
also available.
Ned Bass [Thu, 1 Jul 2010 00:34:57 +0000 (17:34 -0700)]
Implementation of a regression test for TQ_FRONT.
Use 3 threads and 8 tasks. Dispatch the final 3 tasks with TQ_FRONT.
The first three tasks keep the worker threads busy while we stuff the
queues. Use msleep() to force a known execution order, assuming
TQ_FRONT is properly honored. Verify that the expected completion
order occurs.
The splat_taskq_test5_order() function may be useful in more than
one test. This commit generalizes it by renaming the function to
splat_taskq_test_order() and adding a name argument instead of
assuming SPLAT_TASKQ_TEST5_NAME as the test name.
The documentation for splat taskq regression test #5 swaps the two required
completion orders in the diagram. This commit corrects the error.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ned Bass [Thu, 1 Jul 2010 17:12:57 +0000 (10:12 -0700)]
Initialize the /dev/splatctl device buffer
On open(), initialize the buffer with the SPL version string. The
user space splat utility expects to find the SPL version string when
it opens and reads from /dev/splatctl.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ned Bass [Thu, 1 Jul 2010 17:07:51 +0000 (10:07 -0700)]
Implementation of the TQ_FRONT flag.
Adds a task queue to receive tasks dispatched with TQ_FRONT. Worker
threads pull tasks from this high priority queue before the default
pending queue.
Executing tasks out of FIFO order potentially breaks taskq_lowest_id()
if we do not preserve the ordering of the work list by taskqid.
Therefore, instead of always appending to the work list, we search for
the appropriate place to insert a task. The common case is to append
to the list, so we make this operation efficient by searching the work
list in reverse order.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
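The reverse-order insertion described in the TQ_FRONT commit above can be sketched as follows. This is a minimal userspace illustration, assuming a doubly linked work list kept in ascending id order; struct work_item, struct work_list and work_list_insert() are hypothetical names, not the SPL's actual types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a taskq work item; only the id matters here. */
struct work_item {
	unsigned long id;
	struct work_item *prev, *next;
};

/* Hypothetical work list kept sorted by ascending id. */
struct work_list {
	struct work_item *head, *tail;
};

/*
 * Insert w so the list stays sorted by id. The search starts at the
 * tail, so the common case (w has the largest id) ends immediately.
 */
static void work_list_insert(struct work_list *wl, struct work_item *w)
{
	struct work_item *pos;

	assert(wl != NULL && w != NULL);

	/* Walk backwards to the last item whose id is not larger than ours. */
	pos = wl->tail;
	while (pos != NULL && pos->id > w->id)
		pos = pos->prev;

	/* Link w right after pos, or at the head when pos is NULL. */
	w->prev = pos;
	w->next = pos ? pos->next : wl->head;
	if (w->next)
		w->next->prev = w;
	else
		wl->tail = w;
	if (pos)
		pos->next = w;
	else
		wl->head = w;
}
```

With the list sorted this way the lowest outstanding id is always at the head, even when tasks start executing out of FIFO order.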
Whoops, I momentarily forgot I had explicitly set these as CC
options so dependent packages which need to include spl_config.h
would not end up having these defined, which can result in
accidentally having debug enabled at best, or a build failure
at worst.
Only make compiler warnings fatal with --enable-debug
While in theory I like the idea of compiler warnings always being
fatal, in practice this causes problems when small harmless errors
cause build failures for end users. To handle this I've updated
the build system such that -Werror is only used when --enable-debug
is passed to configure. This is how I always build when developing
so I'll catch all build warnings and end users will not get stuck
by minor issues.
Brian Behlendorf [Wed, 30 Jun 2010 17:47:36 +0000 (10:47 -0700)]
Linux-2.6.33 compat, O_DSYNC flag added
Prior to linux-2.6.33 only O_DSYNC semantics were implemented and
they used the O_SYNC flag. As of linux-2.6.33 this behavior was
properly split into O_SYNC and O_DSYNC respectively.
Brian Behlendorf [Wed, 30 Jun 2010 17:36:20 +0000 (10:36 -0700)]
Linux-2.6.33 compat, .ctl_name removed from struct ctl_table
As of linux-2.6.33 the ctl_name member of the ctl_table struct
has been entirely removed. The upstream code has been updated
to depend entirely on the procname member. To handle this
all references to ctl_name are wrapped in a CTL_NAME macro which
simply expands to nothing for newer kernels. Older kernels are
supported by having it expand to .ctl_name = X just as before.
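A minimal sketch of how such a wrapper macro might look is shown below. The struct, the table entry and the HAVE_CTL_NAME_SKETCH guard are hypothetical stand-ins for the kernel's ctl_table and the real autoconf result. Putting the trailing comma inside the macro is a choice made for this sketch so that the empty expansion leaves no stray comma behind.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct ctl_table. */
struct ctl_table_sketch {
#ifdef HAVE_CTL_NAME_SKETCH
	int ctl_name;		/* only present on older kernels */
#endif
	const char *procname;
};

/* Expands to an initializer on older kernels and to nothing on newer ones. */
#ifdef HAVE_CTL_NAME_SKETCH
#define CTL_NAME(cname)	.ctl_name = (cname),
#else
#define CTL_NAME(cname)
#endif

/* The same table source compiles with or without the ctl_name member. */
static struct ctl_table_sketch table_sketch[] = {
	{
		CTL_NAME(1)
		.procname = "debug",
	},
	{ .procname = NULL }	/* terminator */
};

/* Count entries up to the NULL-procname terminator. */
static size_t ctl_table_sketch_len(const struct ctl_table_sketch *t)
{
	size_t n = 0;

	assert(t != NULL);
	while (t[n].procname != NULL)
		n++;
	return n;
}
```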
Brian Behlendorf [Wed, 30 Jun 2010 16:47:57 +0000 (09:47 -0700)]
Linux-2.6.33 compat, check <generated/utsrelease.h> for UTS_RELEASE
It seems the upstream community moved the definition of UTS_RELEASE
yet again as of linux-2.6.33. Update the build system to check in
all three possible locations where your kernel version may be defined.
Brian Behlendorf [Mon, 28 Jun 2010 19:48:20 +0000 (12:48 -0700)]
Treat mutex->owner as volatile
When HAVE_MUTEX_OWNER is defined and we are directly accessing
mutex->owner treat it as volatile with the ACCESS_ONCE() helper.
Without this you may get a stale cached value when accessing it
from different cpus. This can result in incorrect behavior from
mutex_owned() and mutex_owner(). This is not a problem for the
!HAVE_MUTEX_OWNER case because in this case all the accesses
are covered by a spin lock which similarly guarantees we will
not be accessing stale data.
Secondly, check CONFIG_SMP before allowing access to mutex->owner.
I see that for non-SMP setups the kernel does not track the owner
so we cannot rely on it.
Thirdly, check CONFIG_MUTEX_DEBUG. When this is defined and
HAVE_MUTEX_OWNER is defined, surprisingly mutex->owner will
not be cleared on mutex_exit(). When this is the case the SPL
needs to make sure to do it to ensure MUTEX_HELD() behaves as
expected or you will certainly assert in mutex_destroy().
Finally, improve the mutex regression tests. For mutex_owned() we
now minimally check that it behaves correctly when checked from the
owner thread or the non-owner thread. This subtle behaviour has bitten
me before and I'd like to catch it early next time if it reappears.
As for the mutex_owned() regression test, additionally verify that
mutex->owner is always cleared on mutex_exit().
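To make the volatile access concrete, here is a minimal userspace sketch. The macro has the same shape as the kernel's classic ACCESS_ONCE() and relies on the GNU typeof extension. struct mutex_sketch and the *_sketch() helpers are hypothetical and only track an owner pointer, they do no real locking. The volatile cast forces the compiler to emit a fresh load or store each time, and it is not a memory barrier.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as the kernel's classic ACCESS_ONCE(); needs GNU typeof. */
#define ACCESS_ONCE_SKETCH(x)	(*(volatile __typeof__(x) *)&(x))

/* Hypothetical mutex that only records its owner; no real locking. */
struct mutex_sketch {
	void *owner;
};

/* Read the owner through a volatile access so the load is not cached. */
static void *mutex_owner_sketch(struct mutex_sketch *mp)
{
	assert(mp != NULL);
	return ACCESS_ONCE_SKETCH(mp->owner);
}

/* True only when the caller's token matches the recorded owner. */
static int mutex_owned_sketch(struct mutex_sketch *mp, void *self)
{
	return mutex_owner_sketch(mp) == self;
}

static void mutex_enter_sketch(struct mutex_sketch *mp, void *self)
{
	assert(mp != NULL);
	ACCESS_ONCE_SKETCH(mp->owner) = self;
}

/* Always clear the owner on exit, as the regression test expects. */
static void mutex_exit_sketch(struct mutex_sketch *mp)
{
	assert(mp != NULL);
	ACCESS_ONCE_SKETCH(mp->owner) = NULL;
}
```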
Brian Behlendorf [Mon, 28 Jun 2010 19:34:20 +0000 (12:34 -0700)]
Fix subtle race in threads test case
The call to wake_up() must be moved under the spin lock because
once we drop the lock 'tp' may no longer be valid because the
creating thread has exited. The basic thread implementation
was correct, this was simply a flaw in the test case.
Brian Behlendorf [Mon, 28 Jun 2010 18:39:43 +0000 (11:39 -0700)]
Accept but ignore TASKQ_DC_BATCH and TQ_FRONT
For the moment the SPL accepts the TASKQ_DC_BATCH and TQ_FRONT
flags however they get silently ignored. This is harmless for
the moment but it does need to be implemented at some point.
Brian Behlendorf [Thu, 24 Jun 2010 16:41:59 +0000 (09:41 -0700)]
Add kmem_vasprintf function
We might as well have both asprintf() variants. This allows us
to safely pass a va_list through several levels of the stack
using va_copy() instead of va_start().
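The va_copy() pattern can be sketched in userspace as shown below. This is only an analogue built on vsnprintf() and malloc(); vasprintf_sketch() and asprintf_sketch() are hypothetical names, not the SPL functions. The va_list variant works on copies of the caller's list, so it can be formatted twice: once to measure and once to fill the buffer.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Userspace analogue of a vasprintf-style helper; hypothetical name. */
static char *vasprintf_sketch(const char *fmt, va_list ap)
{
	va_list aq;
	char *buf;
	int len;

	assert(fmt != NULL);

	/* First pass on a copy: only measure the formatted length. */
	va_copy(aq, ap);
	len = vsnprintf(NULL, 0, fmt, aq);
	va_end(aq);
	if (len < 0)
		return NULL;

	buf = malloc((size_t)len + 1);
	if (buf == NULL)
		return NULL;

	/* Second pass on a fresh copy; the caller's ap is never consumed. */
	va_copy(aq, ap);
	vsnprintf(buf, (size_t)len + 1, fmt, aq);
	va_end(aq);
	return buf;
}

/* The variadic variant is a thin wrapper around the va_list one. */
static char *asprintf_sketch(const char *fmt, ...)
{
	va_list ap;
	char *buf;

	va_start(ap, fmt);
	buf = vasprintf_sketch(fmt, ap);
	va_end(ap);
	return buf;
}
```

A call such as asprintf_sketch("%s-%d", "spl", 42) returns a heap string the caller must free().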
Brian Behlendorf [Wed, 16 Jun 2010 22:57:04 +0000 (15:57 -0700)]
Update warnings in kmem debug code
This fix was long overdue. Most of the groundwork was laid long
ago to include the exact function and line number in the error message
when there was an issue with a memory allocation call. However,
probably due to lack of time at the moment that information never
made it into the error message. This patch fixes that and tries
to standardize the kmem debug messages as well.
Brian Behlendorf [Mon, 14 Jun 2010 21:18:48 +0000 (14:18 -0700)]
Include kstat.h from kmem.h
It turns out Solaris incidentally includes kstat.h from kmem.h. As
a side effect of this certain higher level .c files which should
explicitly include kstat.h don't because they happen to get it
via kmem.h. To make life easier for everyone I do the same.
Brian Behlendorf [Fri, 11 Jun 2010 21:48:18 +0000 (14:48 -0700)]
Add kmem_asprintf(), strfree(), strdup(), and minor cleanup.
This patch adds three missing Solaris functions: kmem_asprintf(), strfree(),
and strdup(). They are all implemented as a thin layer which just calls
their Linux counterparts. As part of this an autoconf check for kvasprintf
was added because it does not appear in older kernels. If the kernel does
not provide it then spl-generic implements it.
Additionally the dead DEBUG_KMEM_UNIMPLEMENTED code was removed to clean
things up and make the kmem.h a little more readable.
Brian Behlendorf [Fri, 11 Jun 2010 22:02:24 +0000 (15:02 -0700)]
Add xuio_* structures and typedefs.
Add the basic xuio structure and typedefs for Solaris style zero copy.
There's a decent chance this will not be the way I handle this on Linux
but providing the basic types simplifies things for now.
Brian Behlendorf [Fri, 11 Jun 2010 21:37:46 +0000 (14:37 -0700)]
Cleanly split Linux proc.h (fs) from conflicting Solaris proc.h (process)
Under Linux the proc.h header is for the /proc filesystem, and under
Solaris the proc.h header is for processes. This patch correctly
moves the Linux proc functionality into a linux/proc_compat.h header
and leaves the sys/proc.h for use by Solaris. Minor updates were
required to all the call sites where it was included of course.