Ned Bass [Thu, 1 Jul 2010 17:12:57 +0000 (10:12 -0700)]
Initialize the /dev/splatctl device buffer
On open() and initialize the buffer with the SPL version string. The
user space splat utility expects to find the SPL version string when
it opens and reads from /dev/splatctl.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ned Bass [Thu, 1 Jul 2010 17:07:51 +0000 (10:07 -0700)]
Implementation of the TQ_FRONT flag.
Adds a task queue to receive tasks dispatched with TQ_FRONT. Worker
threads pull tasks from this high priority queue before the default
pending queue.
Executing tasks out of FIFO order potentially breaks taskq_lowest_id()
if we do not preserve the ordering of the work list by taskqid.
Therefore, instead of always appending to the work list, we search for
the appropriate place to insert a task. The common case is to append
to the list, so we make this operation efficient by searching the work
list in reverse order.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Whoops, I momentarilly forgot I had explicitly set these as CC
options so dependent packages which need to include spl_config.h
would not end up having these defined which can result in
accidentally hanging debug enabled at best, or a build failure
at worst.
Only make compiler warnings fatal with --enable-debug
While in theory I like the idea of compiler warnings always being
fatal. In practice this causes problems when small harmless errors
cause build failures for end users. To handle this I've updated
the build system such that -Werror is only used when --enable-debug
is passed to configure. This is how I always build when developing
so I'll catch all build warnings and end users will not get stuck
by minor issues.
Brian Behlendorf [Wed, 30 Jun 2010 17:47:36 +0000 (10:47 -0700)]
Linux-2.6.33 compat, O_DSYNC flag added
Prior to linux-2.6.33 only O_DSYNC semantics were implemented and
they used the O_SYNC flag. As of linux-2.6.33 this behavior was
properly split in to O_SYNC and O_DSYNC respectively.
Brian Behlendorf [Wed, 30 Jun 2010 17:36:20 +0000 (10:36 -0700)]
Linux-2.6.33 compat, .ctl_name removed from struct ctl_table
As of linux-2.6.33 the ctl_name member of the ctl_table struct
has been entirely removed. The upstream code has been updated
to depend entirely on the the procname member. To handle this
all references to ctl_name are wrapped in a CTL_NAME macro which
simply expands to nothing for newer kernels. Older kernels are
supported by having it expand to .ctl_name = X just as before.
Brian Behlendorf [Wed, 30 Jun 2010 16:47:57 +0000 (09:47 -0700)]
Linux-2.6.33 compat, check <generated/utsrelease.h> for UTS_RELEASE
It seems the upstream community moved the definition of UTS_RELEASE
yet again as of linux-2.6.33. Update the build system to check in
all three possible locations where your kernel version may be defined.
Brian Behlendorf [Mon, 28 Jun 2010 19:48:20 +0000 (12:48 -0700)]
Treat mutex->owner as volatile
When HAVE_MUTEX_OWNER is defined and we are directly accessing
mutex->owner treat is as volative with the ACCESS_ONCE() helper.
Without this you may get a stale cached value when accessing it
from different cpus. This can result in incorrect behavior from
mutex_owned() and mutex_owner(). This is not a problem for the
!HAVE_MUTEX_OWNER case because in this case all the accesses
are covered by a spin lock which similarly gaurentees we will
not be accessing stale data.
Secondly, check CONFIG_SMP before allowing access to mutex->owner.
I see that for non-SMP setups the kernel does not track the owner
so we cannot rely on it.
Thirdly, check CONFIG_MUTEX_DEBUG when this is defined and the
HAVE_MUTEX_OWNER is defined surprisingly the mutex->owner will
not be cleared on mutex_exit(). When this is the case the SPL
needs to make sure to do it to ensure MUTEX_HELD() behaves as
expected or you will certainly assert in mutex_destroy().
Finally, improve the mutex regression tests. For mutex_owned() we
now minimally check that it behaves correctly when checked from the
owner thread or the non-owner thread. This subtle behaviour has bit
me before and I'd like to catch it early next time if it reappears.
As for mutex_owned() regression test additonally verify that
mutex->owner is always cleared on mutex_exit().
Brian Behlendorf [Mon, 28 Jun 2010 19:34:20 +0000 (12:34 -0700)]
Fix subtle race in threads test case
The call to wake_up() must be moved under the spin lock because
once we drop the lock 'tp' may no longer be valid because the
creating thread has exited. This basic thread implementation
was correct, this was simply a flaw in the test case.
Brian Behlendorf [Mon, 28 Jun 2010 18:39:43 +0000 (11:39 -0700)]
Accept but ignore TASKQ_DC_BATCH and TQ_FRONT
For the moment the SPL accepts the TASKQ_DC_BATCH and TQ_FRONT
flags however they get silently ignored. This is harmless for
the moment but it does need to be implemented at some point.
Brian Behlendorf [Thu, 24 Jun 2010 16:41:59 +0000 (09:41 -0700)]
Add kmem_vasprintf function
We might as well have both asprintf() variants. This allows us
to safely pass a va_list through several levels of the stack
using va_copy() instead of va_start().
Brian Behlendorf [Wed, 16 Jun 2010 22:57:04 +0000 (15:57 -0700)]
Update warnings in kmem debug code
This fix was long overdue. Most of the ground work was laid long
ago to include the exact function and line number in the error message
which there was an issue with a memory allocation call. However,
probably due to lack of time at the moment that informatin never
made it in to the error message. This patch fixes that and trys
to standardize the kmem debug messages as well.
Brian Behlendorf [Mon, 14 Jun 2010 21:18:48 +0000 (14:18 -0700)]
Include kstat.h from kmem.h
It turns out Solaris incidentally includes kstat.h from kmem.h. As
a side effect of this certain higher level .c files which should
explicitly include kstat.h don't because they happen to get it
via kmem.h. To make like easier for everyone I do the same.
Brian Behlendorf [Fri, 11 Jun 2010 21:48:18 +0000 (14:48 -0700)]
Add kmem_asprintf(), strfree(), strdup(), and minor cleanup.
This patch adds three missing Solaris functions: kmem_asprintf(), strfree(),
and strdup(). They are all implemented as a thin layer which just calls
their Linux counterparts. As part of this an autoconf check for kvasprintf
was added because it does not appear in older kernels. If the kernel does
not provide it then spl-generic implements it.
Additionally the dead DEBUG_KMEM_UNIMPLEMENTED code was removed to clean
things up and make the kmem.h a little more readable.
Brian Behlendorf [Fri, 11 Jun 2010 22:02:24 +0000 (15:02 -0700)]
Add xuio_* structures and typedefs.
Add the basic xuio structure and typedefs for Solaris style zero copy.
There's a decent chance this will not be the way I handle this on Linux
but providing the basic types simplifies things for now.
Brian Behlendorf [Fri, 11 Jun 2010 21:37:46 +0000 (14:37 -0700)]
Cleanly split Linux proc.h (fs) from conflicting Solaris proc.h (process)
Under linux the proc.h header is for the /proc filesystem, and under
Solaris the proc/h header if for processes. This patch correctly
moves the Linux proc functionality in a linux/proc_compat.h header
and leaves the sys/proc.h for use by Solaris. Minor updates were
required to all the call sites where it was included of course.
Alex Zhuravlev [Thu, 3 Jun 2010 05:01:14 +0000 (22:01 -0700)]
Stack overflow on 64-bit modulus operations on 32-bit architectures.
Running 'zpool create' on a 32-bit machine with an SPL compiled with
gcc 4.4.4 led to a stack overlow. This turned out to be due to some
sort of 'optimization' by gcc:
Brian Behlendorf [Thu, 20 May 2010 21:20:34 +0000 (14:20 -0700)]
Simplify rwlock implementation.
Remove RW_COUNT() from the rwlock implementation. The idea was that it
could be used as a generic wrapper for getting at the internal state
of a rwlock. While a good idea it's proven problematic to keep it
correct for multiple archs and internal implementation changes. In
short it hasn't been worth the trouble.
With that and simplicity in mind things have been updated to use the
rwsem_is_locked() function instead of RW_COUNT for the RW_*_HELD()
functions. As for rw_upgrade() it remains only implemented for
the generic rwsem implemenation. It remains to be determined if its
worth the effort of adding a custom implementation for each arch.
Brian Behlendorf [Thu, 20 May 2010 17:15:51 +0000 (10:15 -0700)]
Disable spl_debug_panic_on_bug by default.
While I may prefer to have the system panic on an SBUG and to get
crash dump for analysis. I suspect most peoples systems are not
configured from crash dump and the best thing to so is to simply
halt the thread and print an error to the console. This way they
have a good chance of actually saving the stack trace and debug log.
Brian Behlendorf [Wed, 19 May 2010 23:53:13 +0000 (16:53 -0700)]
Remove kmem_set_warning() interface replace with __GFP_NOWARN flag.
Remove the kmem_set_warning() hack used by the kmem-splat regression
tests with a per-allocation flag called __GFP_NOWARN. This matches
the lower level linux flag of similar by slightly different function.
The idea is you can then explicitly set this flag on requests where
you know your breaking the max 8k rule but you need/want to do it
anyway.
This is currently used by the regression tests where we intentionally
push things to the limit but don't want the log noise. Additionally,
we are forced to use it in spl_kmem_cache_create() because by default
NR_CPUS is very large and theres no easy way to handle that.
Finally, I've added a stack_dump() call to the warning when it is
trigger to make to clear exactly where the allocation is taking place.
Brian Behlendorf [Tue, 18 May 2010 16:18:20 +0000 (09:18 -0700)]
Minor spec file cleanup for srpm case.
Ensure kdevpkg is defined is srpm case before using it to define
the devel_requires macro. Interestingly this is not an issue for
rpm-4.7.1-4 but it is for rpm-4.4.2.3-18.
Brian Behlendorf [Mon, 17 May 2010 22:18:00 +0000 (15:18 -0700)]
Public Release Prep
Updated AUTHORS, COPYING, DISCLAIMER, and INSTALL files. Added
standardized headers to all source file to clearly indicate the
copyright, license, and to give credit where credit is due.
Brian Behlendorf [Fri, 14 May 2010 16:31:22 +0000 (09:31 -0700)]
Use do_posix_clock_monotonic_gettime() as described by comment.
While this does incur slightly more overhead we should be using
do_posix_clock_monotonic_gettime() for gethrtime() as described
by the existing comment.
Brian Behlendorf [Fri, 14 May 2010 16:24:51 +0000 (09:24 -0700)]
Add cv_wait_interruptible() function.
This is a minor extension to the condition variable API to allow
for reasonable signal handling on Linux. The cv_wait() function by
definition must wait unconditionally for cv_signal()/cv_broadcast()
before waking it. This makes it impossible to woken by a signal
such as SIGTERM. The cv_wait_interruptible() function was added
to handle this case. It behaves identically to cv_wait() with the
exception that it waits interruptibly allowing a signal to wake it
up. This means you do need to be careful and check issig() after
waking.
Brian Behlendorf [Fri, 23 Apr 2010 22:55:02 +0000 (15:55 -0700)]
Dump log from current process when required
When dumping a debug log first check that it is safe to create
a new thread and block waiting for it. If we are in an atomic
context or irqs and disabled it is not safe to sleep and we
must write out of the debug log from the current process.
Brian Behlendorf [Thu, 22 Apr 2010 19:48:40 +0000 (12:48 -0700)]
Update vn_set_pwd() to allow user|kernal address for filename
During module init spl_setup()->The vn_set_pwd("/") was failing
with -EFAULT because user_path_dir() and __user_walk() both
expect 'filename' to be a user space address and it's not in
this case. To handle this the data segment size is increased
to to ensure strncpy_from_user() does not fail with -EFAULT.
Additionally, I've added a printk() warning to catch this and
log it to the console if it ever reoccurs. I thought everything
was working properly here because there consequences of this
failing are subtle and usually non-critical.
Brian Behlendorf [Tue, 20 Apr 2010 22:16:27 +0000 (15:16 -0700)]
Disable rw_tryupgrade() for newer kernels
For kernels using the CONFIG_RWSEM_GENERIC_SPINLOCK implementation
nothing has changed. But if your kernel is building with arch
specific rwsems rw_tryupgrade() has been disabled until it can
be implemented correctly. In particular, the x86 implementation
now leverages atomic primatives for serialization rather than
spinlocks. So to get this working again it will need to be
implemented as a cmpxchg for x86 and likely something similiar
for other arches we are interested in. For now it's safest
to simply disable it.
Brian Behlendorf [Fri, 26 Mar 2010 22:21:06 +0000 (15:21 -0700)]
Add support for 'make -s' silent builds
The cleanest way to do this is to set AM_LIBTOOLFLAGS = --silent. However,
AM_LIBTOOLFLAGS is not honored by automake-1.9.6-2.1 which is what I have
been using. To cleanly handle this I am updating to automake-1.11-3 which
is why it looks like there is a lot of churn in the Makefiles.
Brian Behlendorf [Mon, 22 Mar 2010 21:45:33 +0000 (14:45 -0700)]
Allow spl_config.h to be included by dependant packages (updated)
We need dependent packages to be able to include spl_config.h to
build properly. This was partially solved in commit 0cbaeb1 by using
AH_BOTTOM to #undef common #defines (PACKAGE, VERSION, etc) which
autoconf always adds and cannot be easily removed. This solution
works as long as the spl_config.h is included before your projects
config.h. That turns out to be easier said than done. In particular,
this is a problem when your package includes its config.h using the
-include gcc option which ensures the first thing included is your
config.h.
To handle all cases cleanly I have removed the AH_BOTTOM hack and
replaced it with an AC_CONFIG_HEADERS command. This command runs
immediately after spl_config.h is written and with a little awk-foo
it strips the offending #defines from the file. This eliminates
the problem entirely and makes header safe for inclusion.
Also in this change I have removed the few places in the code where
spl_config.h is included. It is now added to the gcc compile line
to ensure the config results are always available.
Finally, I have also disabled the verbose kernel builds. If you
want them back you can always build with 'make V=1'. Since things
are working now they don't need to be on by default.
Brian Behlendorf [Thu, 18 Mar 2010 20:39:51 +0000 (13:39 -0700)]
Reduce max kmem based slab size
Allowing MAX_ORDER-1 sized allocations for kmem based slabs have
been observed to result in deadlocks. To help prvent this limit
max kmem based slab size to MAX_ORDER-3. Just for the record
callers should not be creating slabs like this, but if they do
we should still handle it as safely as we can.
Brian Behlendorf [Thu, 11 Mar 2010 22:29:17 +0000 (14:29 -0800)]
Ignore unsigned module build products
Along with the addition of signed kernel modules in newer kernel
we have a few new build products we need to ignore. LKLM has the
whole thread for those interested: http://lkml.org/lkml/2007/2/14/164
Fix definitions for the unknown distro/installation
If the distro/installation really is unsupported (i.e. unknown) we should
not make it look like a known distribution (i.e. RHEL) complete with
dependencies on other RPMs and trying to find kenrel source in the RH
standard location.
Additionally add 'k' prefix for kernel requires for consistency.
When no kernel source has been pointed to, first attempt to use
/lib/modules/$(uname -r)/source. This will likely fail when building
under a mock (http://fedoraproject.org/wiki/Projects/Mock) chroot
environment since `uname -r` will report the running kernel which
likely is not the kernel in your chroot. To cleanly handle this
we fallback to using the first kernel in your chroot.
The kernel-devel package which contains all the kernel headers and
a few build products such as Module.symver{s} is all the is required.
Full source is not needed.
As of linux-2.6.32 the 'struct file *filp' argument was dropped from
the proc_handle() prototype. It was apparently unused _almost_
everywhere in the kernel and this was simply cleanup.
I've added a new SPL_AC_5ARGS_PROC_HANDLER autoconf check for this and
the proper compat macros to correctly define the prototypes and some
helper functions. It's not pretty but API compat changes rarely are.
This test case verifies the correct behavior of taskq_wait_id().
In particular it ensure the the following two cases are handled
properly:
1) Task ids larger than the waited for task id can run and
complete as long as there is an available worker thread.
2) All task ids lower than the waited one must complete before
unblocking even if the waited task id itself has completed.
Optimize lowest outstanding taskqid calculation in taskq_lowest_id()
In the initial version of taskq_lowest_id() the entire pending and
work list was locked under the tq->tq_lock to determine the lowest
outstanding taskqid. At the time this done because I was rushed
and wanted to make sure it was right... fast was secondary. Well now
fast is important too so I carefully thought through the pending
and work list management and convinced myself it is safe and correct
to simply check the first entry. I added a large comment to the source
to explain this. But basically as long as we are careful to ensure the
pending and work list stay sorted this is safe and fast.
The motivation for this chance was that I was observing as much as
10% of the total CPU time go to waiting on the tq->tq_lock when the
pending list was long. This resolves that problems and frees up
that CPU time for something useful.
Brian Behlendorf [Wed, 23 Dec 2009 20:46:11 +0000 (12:46 -0800)]
Fix kmem:slab_overcommit regression test locking
This regression test could crash in splat_kmem_cache_test_reclaim()
due to a race between the slab relclaim and the normal exiting of
the thread. Specifically, the kct structure could be free'd by
the thread performing the allocations while the reclaim function
was also working on that's threads kct structure. The simplest
fix is to extend the kcp->kcp_lock over the reclaim to prevent
the kct from being freed. A better fix would be to ref count
these structures, but since is just a regression this locking
change is enough. Surprisingly this was only observed commonly
under RHEL5.4 but all platform could have hit this.
Brian Behlendorf [Thu, 17 Dec 2009 19:57:44 +0000 (11:57 -0800)]
Check for changed gaurd macro in 2.6.28+ for rwsem implementation.
As part of the 2.6.28 cleanup which moved all the linux/include/asm/
headers in to linux/arch, the guard headers for many header files
changed. The i386 rwsem implementation keys off this header to
ensure the internal members of the rwsem structure are interpreted
correctly. This change checks for the new guard macro in addition
to the only one, the implementation of the rwsem has not changed
for i386 so this is safe and correct.
Brian Behlendorf [Thu, 10 Dec 2009 23:06:07 +0000 (15:06 -0800)]
Splat vnode tests must return negative error codes.
I must have been in a hurry when I wrote the vnode regression tests
because the error code handling is not correct. The Solaris vnode
API returns positive errno's, these need to be converted to negative
errno's for Linux before being passed back to user space. Otherwise
the test hardness with report the failure but errno will not be set
with the correct error code.
Additionally tests 3, 4, 6, and 7 may fail in the test file already
exists. To avoid false positives a user mode helper has added to
remove the test files in /tmp/ before running the actual test.
Atomic64 compatibility for 32-bit systems without kernel support.
This patch is another step towards updating the code to handle the
32-bit kernels which I have not been regularly testing. This changes
do not really impact the common case I'm expected which is the latest
kernel running on an x86_64 arch.
Until the linux-2.6.31 kernel the x86 arch did not have support for
64-bit atomic operations. Additionally, the new atomic_compat.h support
for this case was wrong because it embedded a spinlock in the atomic
variable which must always and only be 64-bits total. To handle these
32-bit issues we now simply fall back to the --enable-atomic-spinlock
implementation if the kernel does not provide the 64-bit atomic funcs.
The second issue this patch addresses is the DEBUG_KMEM assumption that
there will always be atomic64 funcs available. On 32-bit archs this may
not be true, and actually that's just fine. In that case the kernel will
will never be able to allocate more the 32-bits worth anyway. So just
check if atomic64 funcs are available, if they are not it means this
is a 32-bit machine and we can safely use atomic_t's instead.
Add missing atomic64 compat helpers for 32-bit systems.
The use of these functions was added with the recent atomic work
and not tested on 32-bit systems. Add the missing compat functions:
atomic64_inc, atomic64_dec, atomic64_add_return, atomic64_sub_return,
atomic64_inc_return, atomic64_dec_return.
Brian Behlendorf [Sun, 15 Nov 2009 22:27:15 +0000 (14:27 -0800)]
Add mutex_enter_nested() as wrapper for mutex_lock_nested().
This symbol can be used by GPL modules which use the SPL to handle
cases where a call path takes a two different locks by the same
name. This is needed to avoid a false positive in the lock checker.
Brian Behlendorf [Fri, 13 Nov 2009 19:12:43 +0000 (11:12 -0800)]
Linux 2.6.31 kmem cache alignment fixes and cleanup.
The big fix here is the removal of kmalloc() in kv_alloc(). It used
to be true in previous kernels that kmallocs over PAGE_SIZE would
always be pages aligned. This is no longer true atleast in 2.6.31
there are no longer any alignment expectations. Since kv_alloc()
requires the resulting address to be page align we no only either
directly allocate pages in the KMC_KMEM case, or directly call
__vmalloc() both of which will always return a page aligned address.
Additionally, to avoid wasting memory size is always a power of two.
As for cleanup several helper functions were introduced to calculate
the aligned sizes of various data structures. This helps ensure no
case is accidentally missed where the alignment needs to be taken in
to account. The helpers now use P2ROUNDUP_TYPE instead of P2ROUNDUP
which is safer since the type will be explict and we no longer count
on the compiler to auto promote types hopefully as we expected.
Always wnforce minimum (SPL_KMEM_CACHE_ALIGN) and maximum (PAGE_SIZE)
alignment restrictions at cache creation time.
Brian Behlendorf [Thu, 12 Nov 2009 23:11:24 +0000 (15:11 -0800)]
Remove __GFP_NOFAIL in kmem and retry internally.
As of 2.6.31 it's clear __GFP_NOFAIL should no longer be used and it
may disappear from the kernel at any time. To handle this I have simply
added *_nofail wrappers in the kmem implementation which perform the
retry for non-atomic allocations.
From linux-2.6.31 mm/page_alloc.c:1166
/*
* __GFP_NOFAIL is not to be used in new code.
*
* All __GFP_NOFAIL callers should be fixed so that they
* properly detect and handle allocation failures.
*
* We most definitely don't want callers attempting to
* allocate greater than order-1 page units with
* __GFP_NOFAIL.
*/
WARN_ON_ONCE(order > 1);
Brian Behlendorf [Tue, 10 Nov 2009 22:06:57 +0000 (14:06 -0800)]
Linux 2.6.31 Compatibility Updates
SPL_AC_2ARGS_SET_FS_PWD macro updated to explicitly include
linux/fs_struct.h which was dropped from linux/sched.h.
min_wmark_pages, low_wmark_pages, high_wmark_pages macros
introduced in newer kernels. For older kernels mm_compat.h
was introduced to define them as needed as direct mappings
to per zone min_pages, low_pages, max_pages.
Brian Behlendorf [Fri, 30 Oct 2009 20:58:51 +0000 (13:58 -0700)]
Autoconf --enable-debug-* cleanup
Cleanup the --enable-debug-* configure options, this has been pending
for quite some time and I am glad I finally got to it. To summerize:
1) All SPL_AC_DEBUG_* macros were updated to be a more autoconf
friendly. This mainly involved shift to the GNU approved usage of
AC_ARG_ENABLE and ensuring AS_IF is used rather than directly using
an if [ test ] construct.
2) --enable-debug-kmem=yes by default. This simply enabled keeping
a running tally of total memory allocated and freed and reporting a
memory leak if there was one at module unload. Additionally, it
ensure /proc/spl/kmem/slab will exist by default which is handy.
The overhead is low for this and it should not impact performance.
3) --enable-debug-kmem-tracking=no by default. This option was added
to provide a configure option to enable to detailed memory allocation
tracking. This support was always there but you had to know where to
turn it on. By default this support is disabled because it is known
to badly hurt performence, however it is invaluable when chasing a
memory leak.
4) --enable-debug-kstat removed. After further reflection I can't see
why you would ever really want to turn this support off. It is now
always on which had the nice side effect of simplifying the proc handling
code in spl-proc.c. We can now always assume the top level directory
will be there.
5) --enable-debug-callb removed. This never really did anything, it was
put in provisionally because it might have been needed. It turns out
it was not so I am just removing it to prevent confusion.
Brian Behlendorf [Fri, 30 Oct 2009 20:53:17 +0000 (13:53 -0700)]
Add autoconf checks for atomic64_cmpxchg + atomic64_xchg
These functions didn't exist for all archs prior to 2.6.24. This
patch addes an autoconf test to detect this and add them when needed.
The autoconf check is needed instead of just an #ifndef because in
the most modern kernels atomic64_{cmp}xchg are implemented as in
inline function and not a #define.
Brian Behlendorf [Fri, 30 Oct 2009 17:55:25 +0000 (10:55 -0700)]
Use Linux atomic primitives by default.
Previously Solaris style atomic primitives were implemented simply by
wrapping the desired operation in a global spinlock. This was easy to
implement at the time when I wasn't 100% sure I could safely layer the
Solaris atomic primatives on the Linux counterparts. It however was
likely not good for performance.
After more investigation however it does appear the Solaris primitives
can be layered on Linux's fairly safely. The Linux atomic_t type really
just wraps a long so we can simply cast the Solaris unsigned value to
either a atomic_t or atomic64_t. The only lingering problem for both
implementations is that Solaris provides no atomic read function. This
means reading a 64-bit value on a 32-bit arch can (and will) result in
word breaking. I was very concerned about this initially, but upon
further reflection it is a limitation of the Solaris API. So really
we are just being bug-for-bug compatible here.
With this change the default implementation is layered on top of Linux
atomic types. However, because we're assuming a lot about the internal
implementation of those types I've made it easy to fall-back to the
generic approach. Simply build with --enable-atomic_spinlocks if
issues are encountered with the new implementation.
Brian Behlendorf [Tue, 27 Oct 2009 22:54:45 +0000 (15:54 -0700)]
Rebase cmn_err on vcmn_err and don't warn about missing \n
The cmn_err/vcmn_err functions are layered on top of the debug
system which usually expects a newline at the end. However, there
really doesn't need to be a newline there and there in fact should
not be for the CE_CONT case so let's just drop the warning.
Also we make a half-hearted attempt to handle a leading ! which
means only send it to the syslog not the console. In this case
we just send to the the debug logs and not the console.
This macro was removed from the default RPM macro file. Interestly,
some of the arch specific macro's add it back it based on your distro
but it should not be counted on. However, __id still exists and its
command line args have historically been fairly stable so we will
directly use %{__id} -un to get the user name.
As of 2.6.25 kobj->k_name was replaced with kobj->name. Some distros
such as RHEL5 (2.6.18) add a patch to prevent this from being a problem
but other older distros such as SLES10 (2.6.16) have not. To avoid
the whole issue I'm updating the code to use kobject_set_name() which
does what I want and has existed all the way back to 2.6.11.
Ricardo has pointed out that under Solaris the cwd is set to '/'
during module load, while under Linux it is set to the callers cwd.
To handle this cleanly I've reworked the module *_init()/_exit()
macros so they call a *_setup()/_cleanup() function when any SPL
dependent module is loaded or unloaded. This gives us a chance to
perform any needed modification of the process, in this case changing
the cwd. It also handily provides a way to avoid creating wrapper
init()/exit() functions because the Solaris and Linux prototypes
differ slightly. All dependent modules should now call the spl
helper macros spl_module_{init,exit}() instead of the native linux
versions.
Unfortunately, it appears that under Linux there has been no consistent
API in the kernel to set the cwd in a module. Because of this I have
had to add more autoconf magic than I'd like. However, what I have
done is correct and has been tested on RHEL5, SLES11, FC11, and CHAOS
kernels.
In addition, I have change the rootdir type from a 'void *' to the
correct 'vnode_t *' type. And I've set rootdir to a non-NULL value.
Brian Behlendorf [Tue, 29 Sep 2009 10:19:09 +0000 (03:19 -0700)]
Expand SEM() outside init_rwsem and directly call __init_rwsem().
We need to directly call __init_rwsem() or the name gets expanded
to SEM(lock-name). This is safe and correct for the support arches
x86/x86_64/ppc/ppc64.
The specific changes made to the mutex implemetation are as follows.
The Linux mutex structure is now directly embedded in the kmutex_t.
This allows a kmutex_t to be directly case to a mutex struct and
passed directly to the Linux primative.
Just like with the rwlocks it is critical that these functions be
implemented as '#defines to ensure the location information is
preserved. The preprocessor can then do a direct replacement of
the Solaris primative with the linux primative.
Just as with the rwlocks we need to track the lock owner. Here
things get a little more interesting because depending on your
kernel version, and how you've built your kernel Linux may already
do this for you. If your running a 2.6.29 or newer kernel on a
SMP system the lock owner will be tracked. This was added to Linux
to support adaptive mutexs, more on that shortly. Alternately, your
kernel might track the lock owner if you've set CONFIG_DEBUG_MUTEXES
in the kernel build. If neither of the above things is true for
your kernel the kmutex_t type will include and track the lock owner
to ensure correct behavior. This is all handled by a new autoconf
check called SPL_AC_MUTEX_OWNER.
Concerning adaptive mutexs these are a very recent development and
they did not make it in to either the latest FC11 of SLES11 kernels.
Ideally, I'd love to see this kernel change appear in one of these
distros because it does help performance. From Linux kernel commit: 0d66bf6d3514b35eb6897629059443132992dbd7
"Testing with Ingo's test-mutex application...
gave a 345% boost for VFS scalability on my testbox"
However, if you don't want to backport this change yourself you
can still simply export the task_curr() symbol. The kmutex_t
implementation will use this symbol when it's available to
provide it's own adaptive mutexs.
Finally, DEBUG_MUTEX support was removed including the proc handlers.
This was done because now that we are cleanly integrated with the
kernel profiling all this information and much much more is available
in debug kernel builds. This code was now redundant.
Brian Behlendorf [Fri, 25 Sep 2009 21:14:35 +0000 (14:14 -0700)]
Update rwlocks to track owner to ensure correct semantics
The behavior of RW_*_HELD was updated because it was not quite right.
It is not sufficient to return non-zero when the lock is help, we must
only do this when the current task in the holder.
This means we need to track the lock owner which is not something
tracked in a Linux semaphore. After some experimentation the
solution I settled on was to embed the Linux semaphore at the start
of a larger krwlock_t structure which includes the owner field.
This maintains good performance and allows us to cleanly intergrate
with the kernel lock analysis tools. My reasons:
1) By placing the Linux semaphore at the start of krwlock_t we can
then simply cast krwlock_t to a rw_semaphore and pass that on to
the linux kernel. This allows us to use '#defines so the preprocessor
can do direct replacement of the Solaris primative with the linux
equivilant. This is important because it then maintains the location
information for each rw_* call point.
2) Additionally, by adding the owner to krwlock_t we can keep this
needed extra information adjacent to the lock itself. This removes
the need for a fancy lookup to get the owner which is optimal for
performance. We can also leverage the existing spin lock in the
semaphore to ensure owner is updated correctly.
3) All helper functions which do not need to strictly be implemented
as a define to preserve location information can be done as a static
inline function.
4) Adding the owner to krwlock_t allows us to remove all memory
allocations done during lock initialization. This is good for all
the obvious reasons, we do give up the ability to specific the lock
name. The Linux profiling tools will stringify the lock name used
in the code via the preprocessor and use that.
Brian Behlendorf [Fri, 18 Sep 2009 23:09:47 +0000 (16:09 -0700)]
Reimplement rwlocks for Linux lock profiling/analysis.
It turns out that the previous rwlock implementation worked well but
did not integrate properly with the upstream kernel lock profiling/
analysis tools. This is a major problem since it would be awfully
nice to be able to use the automatic lock checker and profiler.
The problem is that the upstream lock tools use the pre-processor
to create a lock class for each uniquely named locked. Since the
rwsem was embedded in a wrapper structure the name was always the
same. The effect was that we only ended up with one lock class for
the entire SPL which caused the lock dependency checker to flag
nearly everything as a possible deadlock.
The solution was to directly map a krwlock to a Linux rwsem using
a typedef there by eliminating the wrapper structure. This was not
done initially because the rwsem implementation is specific to the arch.
To fully implement the Solaris krwlock API using only the provided rwsem
API is not possible. It can only be done by directly accessing some of
the internal data member of the rwsem structure.
For example, the Linux API provides a different function for dropping
a reader vs writer lock. Whereas the Solaris API uses the same function
and the caller does not pass in what type of lock it is. This means to
properly drop the lock we need to determine if the lock is currently a
reader or writer lock. Then we need to call the proper Linux API function.
Unfortunately, there is no provided API for this so we must extracted this
information directly from arch specific lock implementation. This is
all do able, and what I did, but it does complicate things considerably.
The good news is that in addition to the profiling benefits of this
change. We may see performance improvements due to slightly reduced
overhead when creating rwlocks and manipulating them.
The only function I was forced to sacrafice was rw_owner() because this
information is simply not stored anywhere in the rwsem. Luckily this
appears not to be a commonly used function on Solaris, and it is my
understanding it is mainly used for debugging anyway.
In addition to the core rwlock changes, extensive updates were made to
the rwlock regression tests. Each class of test was extended to provide
more API coverage and to be more rigerous in checking for misbehavior.
This is a pretty significant change and with that in mind I have been
careful to validate it on several platforms before committing. The full
SPLAT regression test suite was run numberous times on all of the following
platforms. This includes various kernels ranging from 2.6.16 to 2.6.29.
Brian Behlendorf [Fri, 14 Aug 2009 21:09:16 +0000 (14:09 -0700)]
Various spec file tweaks to handle rpm building of several distros.
Supported and tested distros now include SLES10, SLES11, Chaos 4.x,
RHEL5, and Fedora 11. This update was mainly to address rebuildable
kernel module rpms, and correct rpm dependencies for each distro.
Brian Behlendorf [Thu, 13 Aug 2009 22:02:34 +0000 (15:02 -0700)]
Explicit check for requires_* rpm defines
Due to different distros and/or versions of rpm mishandling the shorthand
syntax simply use the longer version which get interpreted correctly.
Brian Behlendorf [Thu, 30 Jul 2009 20:52:11 +0000 (13:52 -0700)]
Disable stack overflow checking by default.
The run time stack overflow checking is being disabled by default
because it is not safe for use with 2.6.29 and latter kernels. These
kernels do now have their own stack overflow checking so this support
has become redundant anyway. It can be re-enabled for older kernels or
arches without stack overflow checking by redefining CHECK_STACK().