Andreas Rohner [Wed, 10 Dec 2014 23:54:29 +0000 (15:54 -0800)]
nilfs2: avoid duplicate segment construction for fsync()
This patch removes filemap_write_and_wait_range() from nilfs_sync_file(),
because it triggers a data segment construction by calling
nilfs_writepages() with WB_SYNC_ALL. A data segment construction does not
remove the inode from the i_dirty list and it does not clear the
NILFS_I_DIRTY flag. Therefore nilfs_inode_dirty() still returns true,
which leads to an unnecessary duplicate segment construction in
nilfs_sync_file().
A call to filemap_write_and_wait_range() is not needed, because NILFS2
does not rely on the generic writeback mechanisms. Instead it implements
its own mechanism to collect all dirty pages and write them into segments.
It is more efficient to initiate the segment construction directly in
nilfs_sync_file() without the detour over filemap_write_and_wait_range().
Additionally the lock of i_mutex is not needed, because all code blocks
that are protected by i_mutex are also protected by a NILFS transaction:
For nilfs_lookup() i_mutex is held for the parent directory, to protect it
from modification. The segment construction does not modify directory
inodes, so no lock is needed.
nilfs_fiemap() reads the block layout on the disk, by using
nilfs_bmap_lookup_contig(). This is already protected by bmap->b_sem.
Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xunlei Pang [Wed, 10 Dec 2014 23:54:26 +0000 (15:54 -0800)]
rtc: refine rtc_timer_do_work() to consider other set alarm failures
rtc_timer_do_work() only judges -ETIME failure of__rtc_set_alarm(), but
doesn't handle other failures like -EIO, -EBUSY, etc.
If there is a failure other than -ETIME, the next rtc_timer will stay in
the timerqueue. Then later rtc_timers will be enqueued directly because
they have a later expires time, so the alarm irq will never be programmed.
When such failures happen, this patch will retry __rtc_set_alarm(), if
still can't program the alarm time, it will remove current rtc_timer from
timerqueue and fetch next one, thus preventing it from affecting other rtc
timers.
Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: John Stultz <john.stultz@linaro.org> Cc: Arnd Bergmann <arnd.bergmann@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xunlei Pang [Wed, 10 Dec 2014 23:54:23 +0000 (15:54 -0800)]
rtc/ab8500: set uie_unsupported flag
Currently, ab8500 doesn't set uie_unsupported of rtc_device, while it
doesn't support UIE, see ab8500_rtc_set_alarm().
Thus, when going through rtc_update_irq_enable()->rtc_timer_enqueue(),
there's a chance it has an alarm timer1 queued before which is going to
fired, so this update timer2 will be queued because it isn't the leftmost
one, which means rtc_timer_enqueue() will return 0.
This will result in two problems:
1) UIE EMUL will not be used.
2) When the alarm timer1 is fired, in rtc_timer_do_work() timer2 will
fail to set the alarm time, so this rtc will disfunctional due to
timer2 with the earliest expires in the timerqueue.
So, rtc drivers must set this flag if they don't support UIE.
Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: John Stultz <john.stultz@linaro.org> Cc: Arnd Bergmann <arnd.bergmann@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sanchayan Maity [Wed, 10 Dec 2014 23:54:20 +0000 (15:54 -0800)]
drivers/rtc/rtc-snvs: fix suspend/resume
The alarm interrupt handler also reads registers which are part of SNVS
and need clocks enabled. However, the resume function is called after
IRQ's have been enabled, hence this leads to a abort:
Unhandled fault: external abort on non-linefetch (0x1008) at 0x908c604c
Internal error: : 1008 [#1] ARM
Modules linked in:
CPU: 0 PID: 421 Comm: sh Not tainted 3.18.0-rc5-00135-g0689c67-dirty #1592
task: 8e03e800 ti: 8cad8000 task.ti: 8cad8000
PC is at snvs_rtc_irq_handler+0x14/0x74
LR is at handle_irq_event_percpu+0x3c/0x144
Fix this by using the .{suspend/resume}_noirq callbacks instead of
.{suspend/resume} .
Sanchayan Maity [Wed, 10 Dec 2014 23:54:17 +0000 (15:54 -0800)]
drivers/rtc/rtc-snvs: add clock support
Add clock enable and disable support for the SNVS peripheral, which is
required for using the RTC within the SNVS block.
The clock is not strictly enforced, as this would break the i.MX devices.
The clocking for the i.MX devices seems to be enabled elsewhere and
enabling RTC SNVS for Vybrid results in a crash. This patch adds the
clock support but also makes it optional so Vybrid platform can use the
clock if defined while making sure not to break i.MX.
Arnaud Ebalard [Wed, 10 Dec 2014 23:54:11 +0000 (15:54 -0800)]
drivers/rtc/rtc-isl12057.c: report error code upon failure in dev_err() calls
As pointed out by Mark, it is generally useful to log the error code when
reporting a failure. This patch improves existing calls to dev_err() in
ISL12057 driver to also report error code.
Signed-off-by: Arnaud Ebalard <arno@natisbad.org> Suggested-by: Mark Brown <broonie@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Peter Huewe <peter.huewe@infineon.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Thierry Reding <treding@nvidia.com> Cc: Grant Likely <grant.likely@linaro.org> Acked-by: Uwe Kleine-König <uwe@kleine-koenig.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Arnaud Ebalard [Wed, 10 Dec 2014 23:54:08 +0000 (15:54 -0800)]
drivers/rtc/rtc-isl12057.c: add proper handling of oscillator failure bit
As suggested by Uwe, instead of clearing oscillator failure bit
unconditionally at driver load, this patch adds proper handling of the
flag. The driver now returns -ENODATA when reading time from the device
and oscillator failure bit is set. The flag is now cleared only when the
a new time value is pushed to the device.
Signed-off-by: Arnaud Ebalard <arno@natisbad.org> Reported-by: Uwe Kleine-König <uwe@kleine-koenig.org> Acked-by: Uwe Kleine-König <uwe@kleine-koenig.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Peter Huewe <peter.huewe@infineon.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Thierry Reding <treding@nvidia.com> Cc: Mark Brown <broonie@kernel.org> Cc: Grant Likely <grant.likely@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Arnaud Ebalard [Wed, 10 Dec 2014 23:54:05 +0000 (15:54 -0800)]
drivers/rtc/rtc-isl12057.c: add support for century bit
The month register of ISL12057 RTC chip includes a century bit which
reports overflow of year register from 99 to 0. This bit can also be
written, which allows using it to extend the time interval the chip can
support from 99 to 199 years.
This patch adds support for century overflow bit in tm to regs and regs to
tm helpers in ISL12057 driver.
This was tested by putting a device 100 years in the future (using a
specific kernel due to the inability of userland tools such as date or
hwclock to pass year 2038), rebooting on a kernel w/ this patch applied
and verifying the device was still 100 years in the future.
Signed-off-by: Arnaud Ebalard <arno@natisbad.org> Suggested-by: Uwe Kleine-König <uwe@kleine-koenig.org> Acked-by: Uwe Kleine-König <uwe@kleine-koenig.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Peter Huewe <peter.huewe@infineon.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Thierry Reding <treding@nvidia.com> Cc: Mark Brown <broonie@kernel.org> Cc: Grant Likely <grant.likely@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Arnaud Ebalard [Wed, 10 Dec 2014 23:54:02 +0000 (15:54 -0800)]
drivers/rtc/rtc-isl12057.c: fix masking of register values
When Intersil ISL12057 support was added by commit 70e123373c05 ("rtc: Add
support for Intersil ISL12057 I2C RTC chip"), two masks for time registers
values imported from the device were either wrong or omitted, leading to
additional bits from those registers to impact read values:
- mask for hour register value when reading it in AM/PM mode. As
AM/PM mode is not the usual mode used by the driver, this error
would only have an impact on an externally configured RTC hour
later read by the driver.
- mask for month value. The lack of masking would provide an
erroneous value if century bit is set.
This patch fixes those two masks.
Fixes: 70e123373c05 ("rtc: Add support for Intersil ISL12057 I2C RTC chip") Signed-off-by: Arnaud Ebalard <arno@natisbad.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Peter Huewe <peter.huewe@infineon.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Thierry Reding <treding@nvidia.com> Cc: Mark Brown <broonie@kernel.org> Cc: Grant Likely <grant.likely@linaro.org> Acked-by: Uwe Kleine-König <uwe@kleine-koenig.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Søren Andersen [Wed, 10 Dec 2014 23:53:54 +0000 (15:53 -0800)]
drivers/rtc/rtc-ds1374.c: add watchdog support
Add support for the watchdog functionality of the DS1374 rtc. Based on
the m41t80 watchdog functionality Note: watchdog uses the same registers
as alarm.
Barry Song [Wed, 10 Dec 2014 23:53:51 +0000 (15:53 -0800)]
drivers/rtc/rtc-sirfsoc.c: replace local_irq_disable by spin_lock_irq for SMP safety
Signed-off-by: Barry Song <Baohua.Song@csr.com> Cc: hao liu <hao.liu@csr.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kardell [Wed, 10 Dec 2014 23:53:46 +0000 (15:53 -0800)]
rtc: pcf8563: clear expired alarm at boot time
In case the card is woken up of the rtc alarm, the
devm_rtc_device_register function detects it as a pending alarm about a
month in the future. Fix this by clearing the alarm in module probe.
Signed-off-by: Jan Kardell <jan.kardell@telliq.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Grant Likely <grant.likely@linaro.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Vincent Donnefort <vdonnefort@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kardell [Wed, 10 Dec 2014 23:53:40 +0000 (15:53 -0800)]
rtc: pcf8563: handle consequeces of lacking second alarm reg
To guarantee that a set alarm occurs in the future, the set alarm time
is rounded up to the nearest minute. Also we cannot handle UIE as it
requires second precision.
Signed-off-by: Jan Kardell <jan.kardell@telliq.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Grant Likely <grant.likely@linaro.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Vincent Donnefort <vdonnefort@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johan Hovold [Wed, 10 Dec 2014 23:53:29 +0000 (15:53 -0800)]
ARM: dts: am335x-boneblack: enable power off and rtc wake up
Configure the RTC as system-power controller, which allows the system to
be powered off as well as woken up again on subsequent RTC alarms.
Note that the PMIC needs to be put in SLEEP (rather than OFF) mode to
maintain RTC power. Specifically, this means that the PMIC
ti,pmic-shutdown-controller property must be left unset in order to be
able to wake up on RTC alarms.
Tested on BeagleBone Black (rev A5).
Signed-off-by: Johan Hovold <johan@kernel.org> Reviewed-by: Felipe Balbi <balbi@ti.com> Tested-by: Felipe Balbi <balbi@ti.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Tony Lindgren <tony@atomide.com> Cc: Benot Cousson <bcousson@baylibre.com> Cc: Lokesh Vutla <lokeshvutla@ti.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Sekhar Nori <nsekhar@ti.com> Cc: Tero Kristo <t-kristo@ti.com> Cc: Keerthy J <j-keerthy@ti.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johan Hovold [Wed, 10 Dec 2014 23:53:13 +0000 (15:53 -0800)]
rtc: omap: add support for pmic_power_en
Add new property "ti,system-power-controller" to register the RTC as a
power-off handler.
Some RTC IP revisions can control an external PMIC via the pmic_power_en
pin, which can be configured to transition to OFF on ALARM2 events and
back to ON on subsequent ALARM (wakealarm) events.
This is based on earlier work by Colin Foe-Parker and AnilKumar Ch. [1]
Johan Hovold [Wed, 10 Dec 2014 23:53:04 +0000 (15:53 -0800)]
rtc: omap: silence bogus power-up reset message at probe
Some legacy RTC IP revisions has a power-up reset flag in the status
register that later revisions lack.
As this flag is always read back as set on later revisions (or is
overloaded with a different flag), make sure to only clear the flag and
print the info message on legacy platforms.
Signed-off-by: Johan Hovold <johan@kernel.org> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Tony Lindgren <tony@atomide.com> Cc: Benot Cousson <bcousson@baylibre.com> Cc: Lokesh Vutla <lokeshvutla@ti.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Sekhar Nori <nsekhar@ti.com> Cc: Tero Kristo <t-kristo@ti.com> Cc: Keerthy J <j-keerthy@ti.com> Tested-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johan Hovold [Wed, 10 Dec 2014 23:52:58 +0000 (15:52 -0800)]
rtc: omap: remove DRIVER_NAME macro
Remove DRIVER_NAME macro which was used for unrelated strings (e.g.
id-table entry and module name), but not for related ones (e.g. module
name and alias).
Also move the module alias to the other module-info entries.
Signed-off-by: Johan Hovold <johan@kernel.org> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Tony Lindgren <tony@atomide.com> Cc: Benot Cousson <bcousson@baylibre.com> Cc: Lokesh Vutla <lokeshvutla@ti.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Sekhar Nori <nsekhar@ti.com> Cc: Tero Kristo <t-kristo@ti.com> Tested-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johan Hovold [Wed, 10 Dec 2014 23:52:43 +0000 (15:52 -0800)]
rtc: omap: fix class-device registration
Make sure not to register the class device until after the device has
been configured.
Currently, the device is not fully configured (e.g. 24-hour mode) when
the class device is registered, something which involves driver
callbacks for example to read the current time.
Signed-off-by: Johan Hovold <johan@kernel.org> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Tony Lindgren <tony@atomide.com> Cc: Benot Cousson <bcousson@baylibre.com> Cc: Lokesh Vutla <lokeshvutla@ti.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Sekhar Nori <nsekhar@ti.com> Cc: Tero Kristo <t-kristo@ti.com> Cc: Keerthy J <j-keerthy@ti.com> Tested-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johan Hovold [Wed, 10 Dec 2014 23:52:30 +0000 (15:52 -0800)]
rtc: omap: fix clock-source configuration
This series fixes a few issues with the omap rtc-driver, cleans up a
bit, adds device abstraction, and finally adds support for the PMIC
control feature found in some revisions of this RTC IP block.
Ultimately, this allows for powering off the Beaglebone and waking it up
again on RTC alarms.
This patch (of 20):
Make sure not to reset the clock-source configuration when enabling the
32kHz clock mux.
Until the clock source can be configured through device tree we must not
overwrite settings made by the bootloader (e.g. clock-source
selection).
Fixes: cd914bba03d8 ("drivers/rtc/rtc-omap.c: add support for enabling 32khz clock") Signed-off-by: Johan Hovold <johan@kernel.org> Reviewed-by: Felipe Balbi <balbi@ti.com> Tested-by: Felipe Balbi <balbi@ti.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Tony Lindgren <tony@atomide.com> Cc: Benot Cousson <bcousson@baylibre.com> Cc: Lokesh Vutla <lokeshvutla@ti.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Sekhar Nori <nsekhar@ti.com> Cc: Tero Kristo <t-kristo@ti.com> Cc: Keerthy J <j-keerthy@ti.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hyogi Gim [Wed, 10 Dec 2014 23:52:27 +0000 (15:52 -0800)]
drivers/rtc/interface.c: check the validation of rtc_time in __rtc_read_time
Some rtc devices always return '0' when rtc_class_ops.read_time is
called. So if rtc_time isn't verified in callback, rtc interface cannot
know whether rtc_time is valid.
Check rtc_time by using 'rtc_valid_tm' in '__rtc_read_time'. And add
the message for debugging.
Signed-off-by: Hyogi Gim <hyogi.gim@lge.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Guo Zeng [Wed, 10 Dec 2014 23:52:24 +0000 (15:52 -0800)]
drivers/rtc/rtc-sirfsoc.c: move hardware initilization earlier in probe
Move rtc register to be later than hardware initialization. The reason
is that devm_rtc_device_register() will do read_time() which is a
callback accessing hardware. This sometimes causes a hang in the
hardware related callback.
Signed-off-by: Guo Zeng <guo.zeng@csr.com> Signed-off-by: Barry Song <Baohua.Song@csr.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 10 Dec 2014 23:52:22 +0000 (15:52 -0800)]
ncpfs: return proper error from NCP_IOC_SETROOT ioctl
If some error happens in NCP_IOC_SETROOT ioctl, the appropriate error
return value is then (in most cases) just overwritten before we return.
This can result in reporting success to userspace although error happened.
This bug was introduced by commit 2e54eb96e2c8 ("BKL: Remove BKL from
ncpfs"). Propagate the errors correctly.
Fixes: 2e54eb96e2c80 ("BKL: Remove BKL from ncpfs") Signed-off-by: Jan Kara <jack@suse.cz> Cc: Petr Vandrovec <petr@vandrovec.name> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andy Lutomirski [Wed, 10 Dec 2014 23:52:19 +0000 (15:52 -0800)]
init: allow CONFIG_INIT_FALLBACK=n to disable defaults if init= fails
If a user puts init=/whatever on the command line and /whatever can't be
run, then the kernel will try a few default options before giving up. If
init=/whatever came from a bootloader prompt, then this is unexpected but
probably harmless. On the other hand, if it comes from a script (e.g. a
tool like virtme or perhaps a future kselftest script), then the fallbacks
are likely to exist, but they'll do the wrong thing. For example, they
might unexpectedly invoke systemd.
This adds a config option CONFIG_INIT_FALLBACK. If unset, then a failure
to run the specified init= process be fatal.
The tentative plan is to remove CONFIG_INIT_FALLBACK for 3.20.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Rob Landley <rob@landley.net> Cc: Chuck Ebbert <cebbert.lkml@gmail.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shuah Khan <shuah.kh@samsung.com> Cc: Frank Rowand <frowand.list@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jungseung Lee [Wed, 10 Dec 2014 23:52:16 +0000 (15:52 -0800)]
fs/binfmt_elf.c: fix internal inconsistency relating to vma dump size
vma_dump_size() has been used several times on actual dumper and it is
supposed to return the same value for the same vma. But vma_dump_size()
could return different values for same vma.
The known problem case is concurrent shared memory removal. If a vma is
used for a shared memory and that shared memory is removed between
writing program header and dumping vma memory, this will result in a
dump file which is internally consistent.
To fix the problem, we set baseline to get dump size and store the size
into vma_filesz and always use the same vma dump size which is stored in
vma_filsz. The consistnecy with reality is not actually guranteed, but
it's tolerable since that is fully consistent with base line.
Signed-off-by: Jungseung Lee <js07.lee@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Frysinger [Wed, 10 Dec 2014 23:52:10 +0000 (15:52 -0800)]
binfmt_misc: clean up code style a bit
Clean up various coding style issues that checkpatch complains about.
No functional changes here.
Signed-off-by: Mike Frysinger <vapier@gentoo.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Frysinger [Wed, 10 Dec 2014 23:52:08 +0000 (15:52 -0800)]
binfmt_misc: add comments & debug logs
When trying to develop a custom format handler, the errors returned all
effectively get bucketed as EINVAL with no kernel messages. The other
errors (ENOMEM/EFAULT) are internal/obvious and basic. Thus any time a
bad handler is rejected, the developer has to walk the dense code and
try to guess where it went wrong. Needing to dive into kernel code is
itself a fairly high barrier for a lot of people.
To improve this situation, let's deploy extensive pr_debug markers at
logical parse points, and add comments to the dense parsing logic. It
let's you see exactly where the parsing aborts, the string the kernel
received (useful when dealing with shell code), how it translated the
buffers to binary data, and how it will apply the mask at runtime.
The [raw] lines show us exactly what was received from userspace. The
lines after that show us how the kernel has decoded things.
Signed-off-by: Mike Frysinger <vapier@gentoo.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 10 Dec 2014 23:52:05 +0000 (15:52 -0800)]
checkpatch: add ability to --fix (coalesce) string fragments on multiple lines
Add --fix option to coalesce string fragments.
This does not coalesce string fragments that have newline terminations or
are otherwise exempted.
Other miscellanea:
o move all the string tests together.
o fix get_quoted_string function for tab characters
o fix concatination typo
Signed-off-by: Joe Perches <joe@perches.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 10 Dec 2014 23:52:02 +0000 (15:52 -0800)]
checkpatch: add --strict "pointer comparison to NULL" test
It seems there are more and more uses of "if (!ptr)" in preference to "if
(ptr == NULL)" so add a --strict test to emit a message when using the
latter form.
This also finds (ptr != NULL).
Fix it too if desired.
Signed-off-by: Joe Perches <joe@perches.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Greg KH <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 10 Dec 2014 23:51:57 +0000 (15:51 -0800)]
checkpatch: add --strict preference for #defines using BIT(foo)
Using BIT(foo) and BIT_ULL(bar) is more common now. Suggest using these
macros over #defines with 1<<value.
Add a --fix option too.
Signed-off-by: Joe Perches <joe@perches.com> Cc: David Miller <davem@davemloft.net> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Julius Werner [Wed, 10 Dec 2014 23:51:54 +0000 (15:51 -0800)]
checkpatch: allow certain SI units with three characters
Checkpatch flags CamelCase identifiers in strict mode, but it has a
feature to ignore parts with only two characters to allow for SI units
like mV or uA. Unfortunately, not all SI units fit in two characters, and
not all are lower case followed by upper case.
This patch adds hardcoded detection for frequency and 1024-based size
units (Hz/KHz/MHz/GHz/THz and KiB/MiB/GiB/TiB), since allowing any three
character combinations might be too lenient. The list can later be
expanded as needed.
Signed-off-by: Julius Werner <jwerner@chromium.org> Acked-by: Joe Perches <joe@perches.com> Cc: Andy Whitcroft <apw@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 10 Dec 2014 23:51:49 +0000 (15:51 -0800)]
checkpatch: reduce MAINTAINERS update message frequency
When files are being added/moved/deleted and a patch contains an update to
the MAINTAINERS file, assume it's to update the MAINTAINERS file correctly
and do not emit the "does MAINTAINERS need updating?" message.
Reported by many people.
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 10 Dec 2014 23:51:37 +0000 (15:51 -0800)]
checkpatch: improve test for no space after cast
sizeof(foo) is not a cast, allow a space after it.
Signed-off-by: Joe Perches <joe@perches.com> Tested-by: Kalle Valo <kvalo@qca.qualcomm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rasmus Villemoes [Wed, 10 Dec 2014 23:51:27 +0000 (15:51 -0800)]
lib/lcm.c: ensure correct result whenever it fits
Ensure that lcm(a,b) returns the mathematically correct result, provided
it fits in an unsigned long. The current version returns garbage if a*b
overflows, even if the final result would fit.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alex Elder [Wed, 10 Dec 2014 23:51:21 +0000 (15:51 -0800)]
printk: drop logbuf_cpu volatile qualifier
Pranith Kumar posted a patch in which removed the "volatile"
qualifier for the "logbuf_cpu" variable in vprintk_emit().
https://lkml.org/lkml/2014/11/13/894
In his patch, he used ACCESS_ONCE() for all references to
that symbol to provide whatever protection was intended.
There was some discussion that followed, and in the end Steven Rostedt
concluded that not only was "volatile" not needed, neither was it
required to use ACCESS_ONCE(). I offered an elaborate description that
concluded Steven was right, and Pranith asked me to submit an
alternative patch. And this is it.
The basic reason "volatile" is not needed is that "logbuf_cpu" has
static storage duration, and vprintk_emit() is an exported
interface. This means that the value of logbuf_cpu must be read
from memory the first time it is used in a particular call of
vprintk_emit(). The variable's value is read only once in that
function, when it's read it'll be the copy from memory (or cache).
In addition, the value of "logbuf_cpu" is only ever written under
protection of a spinlock. So the value that is read is the "real"
value (and not an out-of-date cached one). If its value is not
UINT_MAX, it is the current CPU's processor id, and it will have
been last written by the running CPU.
Signed-off-by: Alex Elder <elder@linaro.org> Reported-by: Pranith Kumar <bobby.prani@gmail.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Petr Mladek <pmladek@suse.cz> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prarit Bhargava [Wed, 10 Dec 2014 23:45:50 +0000 (15:45 -0800)]
kernel: add panic_on_warn
There have been several times where I have had to rebuild a kernel to
cause a panic when hitting a WARN() in the code in order to get a crash
dump from a system. Sometimes this is easy to do, other times (such as
in the case of a remote admin) it is not trivial to send new images to
the user.
A much easier method would be a switch to change the WARN() over to a
panic. This makes debugging easier in that I can now test the actual
image the WARN() was seen on and I do not have to engage in remote
debugging.
This patch adds a panic_on_warn kernel parameter and
/proc/sys/kernel/panic_on_warn calls panic() in the
warn_slowpath_common() path. The function will still print out the
location of the warning.
An example of the panic_on_warn output:
The first line below is from the WARN_ON() to output the WARN_ON()'s
location. After that the panic() output is displayed.
WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 init_dummy+0x1f/0x30 [dummy_module]()
Kernel panic - not syncing: panic_on_warn set ...
hpa said: There is another very valid use for this: many operators would
rather a machine shuts down than being potentially compromised either
functionally or security-wise.
Signed-off-by: Prarit Bhargava <prarit@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Acked-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Macro get_unused_fd() is used to allocate a file descriptor with default
flags. Those default flags (0) don't enable close-on-exec.
This can be seen as an unsafe default: in most case close-on-exec should
be enabled to not leak file descriptor across exec().
It would be better to have a "safer" default set of flags, eg. O_CLOEXEC
must be used to enable close-on-exec.
Instead this patch removes get_unused_fd() so that out of tree modules
won't be affect by a runtime behavor change which might introduce other
kind of bugs: it's better to catch the change at build time, making it
easier to fix.
Removing the macro will also promote use of get_unused_fd_flags() (or
anon_inode_getfd()) with flags provided by userspace. Or, if flags cannot
be given by userspace, with flags set to O_CLOEXEC by default.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yann Droneaud [Wed, 10 Dec 2014 23:45:44 +0000 (15:45 -0800)]
fs/file.c: replace get_unused_fd() with get_unused_fd_flags(0)
This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0) to preserve current behavor for existing code.
In a further patch, get_unused_fd() will be removed so that new code
start using get_unused_fd_flags(), with the hope O_CLOEXEC could be
used, either by default or choosen by userspace.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yann Droneaud [Wed, 10 Dec 2014 23:45:42 +0000 (15:45 -0800)]
binfmt_misc: replace get_unused_fd() with get_unused_fd_flags(0)
This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0) to preserve current behavor for existing code.
In a further patch, get_unused_fd() will be removed so that new code start
using get_unused_fd_flags(), with the hope O_CLOEXEC could be used, either
by default or choosen by userspace.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yann Droneaud [Wed, 10 Dec 2014 23:45:39 +0000 (15:45 -0800)]
ppc/cell: replace get_unused_fd() with get_unused_fd_flags(0)
This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0) to preserve current behavor for existing code.
In a further patch, get_unused_fd() will be removed so that new code start
using get_unused_fd_flags(), with the hope O_CLOEXEC could be used, either
by default or choosen by userspace.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yann Droneaud [Wed, 10 Dec 2014 23:45:36 +0000 (15:45 -0800)]
ia64: replace get_unused_fd() with get_unused_fd_flags(0)
This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0) to preserve current behavor for existing code.
In a further patch, get_unused_fd() will be removed so that new code start
using get_unused_fd_flags(), with the hope O_CLOEXEC could be used, either
by default or choosen by userspace.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 10 Dec 2014 23:45:33 +0000 (15:45 -0800)]
exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()
Now that forget_original_parent() uses ->ptrace_entry for EXIT_DEAD tasks,
we can simply pass "dead_children" list to exit_ptrace() and remove
another release_task() loop. Plus this way we do not need to drop and
reacquire tasklist_lock.
Also shift the list_empty(ptraced) check, if we want this optimization it
makes sense to eliminate the function call altogether.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com>, Cc: Sterling Alexander <stalexan@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roland McGrath <roland@hack.frob.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 10 Dec 2014 23:45:30 +0000 (15:45 -0800)]
exit: reparent: cleanup the usage of reparent_leader()
1. Now that reparent_leader() doesn't abuse ->sibling we can shift
list_move_tail() from reparent_leader() to forget_original_parent()
and turn it into a single list_splice_tail_init(). This also makes
BUG_ON(!list_empty()) and list_for_each_entry_safe() unnecessary.
2. This also allows to shift the same_thread_group() check, it looks
a bit more clear in the caller.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com>, Cc: Sterling Alexander <stalexan@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roland McGrath <roland@hack.frob.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 10 Dec 2014 23:45:27 +0000 (15:45 -0800)]
exit: reparent: cleanup the changing of ->parent
1. Cosmetic, but "if (t->parent == father)" looks a bit confusing.
We need to change t->parent if and only if t is not traced.
2. If we actually want this BUG_ON() to ensure that parent/ptrace
match each other, then we should also take ptrace_reparented()
case into account too.
3. Change this code to use for_each_thread() instead of deprecated
while_each_thread().
[dan.carpenter@oracle.com: silence a bogus static checker warning] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com>, Cc: Sterling Alexander <stalexan@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roland McGrath <roland@hack.frob.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 10 Dec 2014 23:45:24 +0000 (15:45 -0800)]
exit: reparent: use ->ptrace_entry rather than ->sibling for EXIT_DEAD tasks
reparent_leader() reuses ->sibling as a list node to add an EXIT_DEAD task
into dead_children list we are going to release. This obviously removes
the dead task from its real_parent->children list and this is even good;
the parent can do nothing with the EXIT_DEAD reparented zombie, it only
makes do_wait() slower.
But, this also means that it can not be reparented once again, so if its
new parent dies too nobody will update ->parent/real_parent, they can
point to the freed memory even before release_task() we are going to call,
this breaks the code which relies on pid_alive() to access
->real_parent/parent.
Fortunately this is mostly theoretical, this can only happen if init or
PR_SET_CHILD_SUBREAPER process ignores SIGCHLD and the new parent
sub-thread exits right after we drop tasklist_lock.
Change this code to use ->ptrace_entry instead, we know that the child is
not traced so nobody can ever use this member. This also allows to unify
this logic with exit_ptrace(), see the next changes.
Note: we really need to change release_task() to nullify real_parent/
parent/group_leader pointers, but we need to change the current users
first somehow. And it would be better to reap this zombie immediately but
release_task_locked() we need is complicated by proc_flush_task().
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com>, Cc: Sterling Alexander <stalexan@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roland McGrath <roland@hack.frob.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 10 Dec 2014 23:45:21 +0000 (15:45 -0800)]
sched_show_task: fix unsafe usage of ->real_parent
rcu_read_lock() can not protect p->real_parent if release_task(p) was
already called, change sched_show_task() to check pis_alive() like other
users do.
Note: we need some helpers to cleanup the code like this. And it seems
that that the usage of cpu_curr(cpu) in dump_cpu_task() is not safe too.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com>, Cc: Sterling Alexander <stalexan@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Roland McGrath <roland@hack.frob.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 10 Dec 2014 23:45:18 +0000 (15:45 -0800)]
proc: task_state: ptrace_parent() doesn't need pid_alive() check
p->ptrace != 0 means that release_task(p) was not called, so pid_alive()
buys nothing and we can remove this check. Other callers already use it
directly without additional checks.
Note: with or without this patch ptrace_parent() can return the pointer to
the freed task, this will be explained/fixed later.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com>, Cc: Sterling Alexander <stalexan@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roland McGrath <roland@hack.frob.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 10 Dec 2014 23:45:15 +0000 (15:45 -0800)]
proc: task_state: move the main seq_printf() outside of rcu_read_lock()
task_state() does seq_printf() under rcu_read_lock(), but this is only
needed for task_tgid_nr_ns() and task_numa_group_id(). We can calculate
tgid/ngid and drop rcu lock.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com>, Cc: Sterling Alexander <stalexan@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roland McGrath <roland@hack.frob.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 10 Dec 2014 23:45:10 +0000 (15:45 -0800)]
proc: task_state: read cred->group_info outside of task_lock()
task_state() reads cred->group_info under task_lock() because a long ago
it was task_struct->group_info and it was actually protected by
task->alloc_lock. Today this task_unlock() after rcu_read_unlock() just
adds the confusion, move task_unlock() up.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com>, Cc: Sterling Alexander <stalexan@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roland McGrath <roland@hack.frob.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nicolas Dichtel [Wed, 10 Dec 2014 23:45:01 +0000 (15:45 -0800)]
fs/proc: use a rb tree for the directory entries
When a lot of netdevices are created, one of the bottleneck is the
creation of proc entries. This serie aims to accelerate this part.
The current implementation for the directories in /proc is using a single
linked list. This is slow when handling directories with large numbers of
entries (eg netdevice-related entries when lots of tunnels are opened).
This patch replaces this linked list by a red-black tree.
Here are some numbers:
dummy30000.batch contains 30 000 times 'link add type dummy'.
Before the patch:
$ time ip -b dummy30000.batch
real 2m31.950s
user 0m0.440s
sys 2m21.440s
$ time rmmod dummy
real 1m35.764s
user 0m0.000s
sys 1m24.088s
After the patch:
$ time ip -b dummy30000.batch
real 2m0.874s
user 0m0.448s
sys 1m49.720s
$ time rmmod dummy
real 1m13.988s
user 0m0.000s
sys 1m1.008s
The idea of improving this part was suggested by Thierry Herbelot.
[akpm@linux-foundation.org: initialise proc_root.subdir at compile time] Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: Thierry Herbelot <thierry.herbelot@6wind.com>. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 10 Dec 2014 23:44:58 +0000 (15:44 -0800)]
mm: move page->mem_cgroup bad page handling into generic code
Now that the external page_cgroup data structure and its lookup is
gone, let the generic bad_page() check for page->mem_cgroup sanity.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 10 Dec 2014 23:44:55 +0000 (15:44 -0800)]
mm: page_cgroup: rename file to mm/swap_cgroup.c
Now that the external page_cgroup data structure and its lookup is gone,
the only code remaining in there is swap slot accounting.
Rename it and move the conditional compilation into mm/Makefile.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 10 Dec 2014 23:44:52 +0000 (15:44 -0800)]
mm: embed the memcg pointer directly into struct page
Memory cgroups used to have 5 per-page pointers. To allow users to
disable that amount of overhead during runtime, those pointers were
allocated in a separate array, with a translation layer between them and
struct page.
There is now only one page pointer remaining: the memcg pointer, that
indicates which cgroup the page is associated with when charged. The
complexity of runtime allocation and the runtime translation overhead is
no longer justified to save that *potential* 0.19% of memory. With
CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding
after the page->private member and doesn't even increase struct page,
and then this patch actually saves space. Remaining users that care can
still compile their kernels without CONFIG_MEMCG.
[mhocko@suse.cz: update Documentation/cgroups/memory.txt] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Wed, 10 Dec 2014 23:44:39 +0000 (15:44 -0800)]
mm, memcg: fix potential undefined behaviour in page stat accounting
Since commit d7365e783edb ("mm: memcontrol: fix missed end-writeback
page accounting") mem_cgroup_end_page_stat consumes locked and flags
variables directly rather than via pointers which might trigger C
undefined behavior as those variables are initialized only in the slow
path of mem_cgroup_begin_page_stat.
Although mem_cgroup_end_page_stat handles parameters correctly and
touches them only when they hold a sensible value it is caller which
loads a potentially uninitialized value which then might allow compiler
to do crazy things.
I haven't seen any warning from gcc and it seems that the current
version (4.9) doesn't exploit this type undefined behavior but Sasha has
reported the following:
UBSan: Undefined behaviour in mm/rmap.c:1084:2
load of value 255 is not a valid value for type '_Bool'
CPU: 4 PID: 8304 Comm: rngd Not tainted 3.18.0-rc2-next-20141029-sasha-00039-g77ed13d-dirty #1427
Call Trace:
dump_stack (lib/dump_stack.c:52)
ubsan_epilogue (lib/ubsan.c:159)
__ubsan_handle_load_invalid_value (lib/ubsan.c:482)
page_remove_rmap (mm/rmap.c:1084 mm/rmap.c:1096)
unmap_page_range (./arch/x86/include/asm/atomic.h:27 include/linux/mm.h:463 mm/memory.c:1146 mm/memory.c:1258 mm/memory.c:1279 mm/memory.c:1303)
unmap_single_vma (mm/memory.c:1348)
unmap_vmas (mm/memory.c:1377 (discriminator 3))
exit_mmap (mm/mmap.c:2837)
mmput (kernel/fork.c:659)
do_exit (./arch/x86/include/asm/thread_info.h:168 kernel/exit.c:462 kernel/exit.c:747)
do_group_exit (include/linux/sched.h:775 kernel/exit.c:873)
SyS_exit_group (kernel/exit.c:901)
tracesys_phase2 (arch/x86/kernel/entry_64.S:529)
Fix this by using pointer parameters for both locked and flags and be
more robust for future compiler changes even though the current code is
implemented correctly.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As a small zero page, huge zero page should not be accounted in smaps
report as normal page.
For small pages we rely on vm_normal_page() to filter out zero page, but
vm_normal_page() is not designed to handle pmds. We only get here due
hackish cast pmd to pte in smaps_pte_range() -- pte and pmd format is not
necessary compatible on each and every architecture.
Let's add separate codepath to handle pmds. follow_trans_huge_pmd() will
detect huge zero page for us.
We would need pmd_dirty() helper to do this properly. The patch adds it
to THP-enabled architectures which don't yet have one.
[akpm@linux-foundation.org: use do_div to fix 32-bit build] Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Tested-by: Fengwei Yin <yfw.kernel@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 10 Dec 2014 23:44:33 +0000 (15:44 -0800)]
mm: memcontrol: drop bogus RCU locking from mem_cgroup_same_or_subtree()
None of the mem_cgroup_same_or_subtree() callers actually require it to
take the RCU lock, either because they hold it themselves or they have css
references. Remove it.
To make the API change clear, rename the leftover helper to
mem_cgroup_is_descendant() to match cgroup_is_descendant().
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Wed, 10 Dec 2014 23:44:30 +0000 (15:44 -0800)]
mm: memcontrol: pull the NULL check from __mem_cgroup_same_or_subtree()
The NULL in mm_match_cgroup() comes from a possibly exiting mm->owner. It
makes a lot more sense to check where it's looked up, rather than check
for it in __mem_cgroup_same_or_subtree() where it's unexpected.
No other callsite passes NULL to __mem_cgroup_same_or_subtree().
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>