]> git.proxmox.com Git - mirror_ubuntu-bionic-kernel.git/log
mirror_ubuntu-bionic-kernel.git
7 years agoMerge tag 'platform-drivers-x86-v4.12-1' of git://git.infradead.org/linux-platform...
Linus Torvalds [Thu, 4 May 2017 18:56:59 +0000 (11:56 -0700)]
Merge tag 'platform-drivers-x86-v4.12-1' of git://git.infradead.org/linux-platform-drivers-x86

Pull x86 platform-drivers update from Darren Hart:
 "This represents a significantly larger and more complex set of changes
  than those of prior merge windows.

  In particular, we had several changes with dependencies on other
  subsystems which we felt were best managed through merges of immutable
  branches, including one each from input, i2c, and leds. Two patches
  for the watchdog subsystem are included after discussion with Wim and
  Guenter following a collision in linux-next (this should be resolved
  and you should only see these two appear in this pull request). These
  are called out in the "External" section below.

  Summary of changes:
   - significant further cleanup of fujitsu-laptop and hp-wmi
   - new model support for ideapad, asus, silead, and xiaomi
   - new hotkeys for thinkpad and models using intel-vbtn
   - dell keyboard backlight improvements
   - build and dependency improvements
   - intel * ipc fixes, cleanups, and api updates
   - single isolated fixes noted below

  External:
   - watchdog: iTCO_wdt: Add PMC specific noreboot update api
   - watchdog: iTCO_wdt: cleanup set/unset no_reboot_bit functions
   - Merge branch 'ib/4.10-sparse-keymap-managed'
   - Merge branch 'i2c/for-INT33FE'
   - Merge branch 'linux-leds/dell-laptop-changes-for-4.12'

  platform/x86:
   - Add Intel Cherry Trail ACPI INT33FE device driver
   - remove sparse_keymap_free() calls
   - Make SILEAD_DMI depend on TOUCHSCREEN_SILEAD

  asus-wmi:
   - try to set als by default
   - fix cpufv sysfs file permission

  acer-wmi:
   - setup accelerometer when ACPI device was found

  ideapad-laptop:
   - Add IdeaPad V310-15ISK to no_hw_rfkill
   - Add IdeaPad 310-15IKB to no_hw_rfkill

  intel_pmc_ipc:
   - use gcr mem base for S0ix counter read
   - Fix iTCO_wdt GCS memory mapping failure
   - Add pmc gcr read/write/update api's
   - fix gcr offset

  dell-laptop:
   - Add keyboard backlight timeout AC settings
   - Handle return error form dell_get_intensity.
   - Protect kbd_state against races
   - Refactor kbd_led_triggers_store()

  hp-wireless:
   - reuse module_acpi_driver
   - add Xiaomi's hardware id to the supported list

  intel-vbtn:
   - add volume up and down

  INT33FE:
   - add i2c dependency

  hp-wmi:
   - Cleanup exit paths
   - Do not shadow errors in sysfs show functions
   - Use DEVICE_ATTR_(RO|RW) helper macros
   - Refactor dock and tablet state fetchers
   - Cleanup wireless get_(hw|sw)state functions
   - Refactor redundant HPWMI_READ functions
   - Standardize enum usage for constants
   - Cleanup local variable declarations
   - Do not shadow error values
   - Fix detection for dock and tablet mode
   - Fix error value for hp_wmi_tablet_state

  fujitsu-laptop:
   - simplify error handling in acpi_fujitsu_laptop_add()
   - do not log LED registration failures
   - switch to managed LED class devices
   - reorganize LED-related code
   - refactor LED registration
   - select LEDS_CLASS
   - remove redundant fields from struct fujitsu_bl
   - account for backlight power when determining brightness
   - do not log set_lcd_level() failures in bl_update_status()
   - ignore errors when setting backlight power
   - make disable_brightness_adjust a boolean
   - clean up use_alt_lcd_levels handling
   - sync brightness in set_lcd_level()
   - simplify set_lcd_level()
   - merge set_lcd_level_alt() into set_lcd_level()
   - switch to a managed backlight device
   - only handle backlight when appropriate
   - update debug message logged by call_fext_func()
   - rename call_fext_func() arguments
   - simplify call_fext_func()
   - clean up local variables in call_fext_func()
   - remove keycode fields from struct fujitsu_bl
   - model-dependent sparse keymap overrides
   - use a sparse keymap for hotkey event generation
   - switch to a managed hotkey input device
   - refactor hotkey input device setup
   - use a sparse keymap for brightness key events
   - switch to a managed backlight input device
   - refactor backlight input device setup
   - remove pf_device field from struct fujitsu_bl
   - only register platform device if FUJ02E3 is present
   - add and remove platform device in separate functions
   - simplify platform device attribute definitions
   - remove backlight-related attributes from the platform device
   - cleanup error labels in fujitsu_init()
   - only register backlight device if FUJ02B1 is present
   - sync backlight power status in acpi_fujitsu_laptop_add()
   - register backlight device in a separate function
   - simplify brightness key event generation logic
   - decrease indentation in acpi_fujitsu_bl_notify()

  intel-hid:
   - Add missing ->thaw callback
   - do not set parents of input devices explicitly
   - remove redundant set_bit() call
   - use devm_input_allocate_device() for HID events input device
   - make intel_hid_set_enable() take a boolean argument
   - simplify enabling/disabling HID events

  silead_dmi:
   - Add touchscreen info for Surftab Wintron 7.0
   - Abort early if DMI does not match
   - Do not treat all devices as i2c_clients
   - Add entry for Insyde 7W tablets
   - Constify properties arrays

  intel_scu_ipc:
   - Introduce intel_scu_ipc_raw_command()
   - Introduce SCU_DEVICE() macro
   - Remove redundant subarch check
   - Rearrange init sequence
   - Platform data is mandatory

  asus-nb-wmi:
   - Add wapf4 quirk for the X302UA

  dell-*:
   - Call new led hw_changed API on kbd brightness change
   - Add a generic dell-laptop notifier chain

  eeepc-laptop:
   - Skip unknown key messages 0x50 0x51

  thinkpad_acpi:
   - add mapping for new hotkeys
   - guard generic hotkey case"

* tag 'platform-drivers-x86-v4.12-1' of git://git.infradead.org/linux-platform-drivers-x86: (108 commits)
  platform/x86: Make SILEAD_DMI depend on TOUCHSCREEN_SILEAD
  platform/x86: asus-wmi: try to set als by default
  platform/x86: asus-wmi: fix cpufv sysfs file permission
  platform/x86: acer-wmi: setup accelerometer when ACPI device was found
  platform/x86: ideapad-laptop: Add IdeaPad V310-15ISK to no_hw_rfkill
  platform/x86: intel_pmc_ipc: use gcr mem base for S0ix counter read
  platform/x86: intel_pmc_ipc: Fix iTCO_wdt GCS memory mapping failure
  watchdog: iTCO_wdt: Add PMC specific noreboot update api
  watchdog: iTCO_wdt: cleanup set/unset no_reboot_bit functions
  platform/x86: intel_pmc_ipc: Add pmc gcr read/write/update api's
  platform/x86: intel_pmc_ipc: fix gcr offset
  platform/x86: dell-laptop: Add keyboard backlight timeout AC settings
  platform/x86: dell-laptop: Handle return error form dell_get_intensity.
  platform/x86: hp-wireless: reuse module_acpi_driver
  platform/x86: intel-vbtn: add volume up and down
  platform/x86: INT33FE: add i2c dependency
  platform/x86: hp-wmi: Cleanup exit paths
  platform/x86: hp-wmi: Do not shadow errors in sysfs show functions
  platform/x86: hp-wmi: Use DEVICE_ATTR_(RO|RW) helper macros
  platform/x86: hp-wmi: Refactor dock and tablet state fetchers
  ...

7 years agoMerge tag 'vfio-v4.12-rc1' of git://github.com/awilliam/linux-vfio
Linus Torvalds [Thu, 4 May 2017 18:53:24 +0000 (11:53 -0700)]
Merge tag 'vfio-v4.12-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - Updates for SPAPR IOMMU backend including compatibility test and
   memory allocation check (Alexey Kardashevskiy)

 - Updates for type1 IOMMU backend to remove asynchronous locked page
   accounting and remove redundancy (Alex Williamson)

* tag 'vfio-v4.12-rc1' of git://github.com/awilliam/linux-vfio:
  vfio/type1: Reduce repetitive calls in vfio_pin_pages_remote()
  vfio/type1: Prune vfio_pin_page_external()
  vfio/type1: Remove locked page accounting workqueue
  vfio/spapr_tce: Check kzalloc() return when preregistering memory
  vfio/powerpc/spapr_tce: Enforce IOMMU type compatibility check

7 years agoorangefs: count directory pieces correctly
Martin Brandenburg [Thu, 4 May 2017 17:16:04 +0000 (13:16 -0400)]
orangefs: count directory pieces correctly

A large directory full of differently sized file names triggered this.
Most directories, even very large directories with shorter names, would
be lucky enough to fit in one server response.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
7 years agoorangefs: invalidate stored directory on seek
Martin Brandenburg [Tue, 2 May 2017 16:15:11 +0000 (12:15 -0400)]
orangefs: invalidate stored directory on seek

If an application seeks to a position before the point which has been
read, it must want updates which have been made to the directory.  So
delete the copy stored in the kernel so it will be fetched again.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
7 years agoorangefs: skip forward to the next directory entry if seek is short
Martin Brandenburg [Tue, 2 May 2017 16:15:10 +0000 (12:15 -0400)]
orangefs: skip forward to the next directory entry if seek is short

If userspace seeks to a position in the stream which is not correct, it
would have returned EIO because the data in the buffer at that offset
would be incorrect.  This and the userspace daemon returning a corrupt
directory are indistinguishable.

Now if the data does not look right, skip forward to the next chunk and
try again.  The motivation is that if the directory changes, an
application may seek to a position that was valid and no longer is valid.

It is not yet possible for a directory to change.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
7 years agoMerge tag 'for-linus-4.12b-rc0b-tag' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Thu, 4 May 2017 18:37:09 +0000 (11:37 -0700)]
Merge tag 'for-linus-4.12b-rc0b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen updates from Juergen Gross:
 "Xen fixes and featrues for 4.12. The main changes are:

   - enable building the kernel with Xen support but without enabling
     paravirtualized mode (Vitaly Kuznetsov)

   - add a new 9pfs xen frontend driver (Stefano Stabellini)

   - simplify Xen's cpuid handling by making use of cpu capabilities
     (Juergen Gross)

   - add/modify some headers for new Xen paravirtualized devices
     (Oleksandr Andrushchenko)

   - EFI reset_system support under Xen (Julien Grall)

   - and the usual cleanups and corrections"

* tag 'for-linus-4.12b-rc0b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (57 commits)
  xen: Move xen_have_vector_callback definition to enlighten.c
  xen: Implement EFI reset_system callback
  arm/xen: Consolidate calls to shutdown hypercall in a single helper
  xen: Export xen_reboot
  xen/x86: Call xen_smp_intr_init_pv() on BSP
  xen: Revert commits da72ff5bfcb0 and 72a9b186292d
  xen/pvh: Do not fill kernel's e820 map in init_pvh_bootparams()
  xen/scsifront: use offset_in_page() macro
  xen/arm,arm64: rename __generic_dma_ops to xen_get_dma_ops
  xen/arm,arm64: fix xen_dma_ops after 815dd18 "Consolidate get_dma_ops..."
  xen/9pfs: select CONFIG_XEN_XENBUS_FRONTEND
  x86/cpu: remove hypervisor specific set_cpu_features
  vmware: set cpu capabilities during platform initialization
  x86/xen: use capabilities instead of fake cpuid values for xsave
  x86/xen: use capabilities instead of fake cpuid values for x2apic
  x86/xen: use capabilities instead of fake cpuid values for mwait
  x86/xen: use capabilities instead of fake cpuid values for acpi
  x86/xen: use capabilities instead of fake cpuid values for acc
  x86/xen: use capabilities instead of fake cpuid values for mtrr
  x86/xen: use capabilities instead of fake cpuid values for aperf
  ...

7 years agoof: fix sparse warning in of_pci_range_parser_one
Rob Herring [Thu, 4 May 2017 17:34:30 +0000 (12:34 -0500)]
of: fix sparse warning in of_pci_range_parser_one

sparse gives the following warning for 'pci_space':

../drivers/of/address.c:266:26: warning: incorrect type in assignment (different base types)
../drivers/of/address.c:266:26:    expected unsigned int [unsigned] [usertype] pci_space
../drivers/of/address.c:266:26:    got restricted __be32 const [usertype] <noident>

It appears that pci_space is only ever accessed on powerpc, so the endian
swap is often not needed.

Cc: stable@vger.kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
7 years agoof: fix sparse warnings in of_find_next_cache_node
Rob Herring [Thu, 4 May 2017 17:30:07 +0000 (12:30 -0500)]
of: fix sparse warnings in of_find_next_cache_node

sparse gives a warning that 'handle' is not a __be32:

../drivers/of/base.c:2261:61: warning: incorrect type in argument 1 (different base types)
../drivers/of/base.c:2261:61:    expected restricted __be32 const [usertype] *p
../drivers/of/base.c:2261:61:    got unsigned int const [usertype] *[assigned] handle

We could just change the type, but the code can be improved by using
of_parse_phandle instead of open coding it with of_get_property and
of_find_node_by_phandle.

Signed-off-by: Rob Herring <robh@kernel.org>
7 years agocfg80211: make RATE_INFO_BW_20 the default
Johannes Berg [Thu, 4 May 2017 06:42:30 +0000 (08:42 +0200)]
cfg80211: make RATE_INFO_BW_20 the default

Due to the way I did the RX bitrate conversions in mac80211 with
spatch, going setting flags to setting the value, many drivers now
don't set the bandwidth value for 20 MHz, since with the flags it
wasn't necessary to (there was no 20 MHz flag, only the others.)

Rather than go through and try to fix up all the drivers, instead
renumber the enum so that 20 MHz, which is the typical bandwidth,
actually has the value 0, making those drivers all work again.

If VHT was hit used with a driver not reporting it, e.g. iwlmvm,
this manifested in hitting the bandwidth warning in
cfg80211_calculate_bitrate_vht().

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoipv6: initialize route null entry in addrconf_init()
WANG Cong [Thu, 4 May 2017 05:07:31 +0000 (22:07 -0700)]
ipv6: initialize route null entry in addrconf_init()

Andrey reported a crash on init_net.ipv6.ip6_null_entry->rt6i_idev
since it is always NULL.

This is clearly wrong, we have code to initialize it to loopback_dev,
unfortunately the order is still not correct.

loopback_dev is registered very early during boot, we lose a chance
to re-initialize it in notifier. addrconf_init() is called after
ip6_route_init(), which means we have no chance to correct it.

Fix it by moving this initialization explicitly after
ipv6_add_dev(init_net.loopback_dev) in addrconf_init().

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'qed-fixes'
David S. Miller [Thu, 4 May 2017 16:31:03 +0000 (12:31 -0400)]
Merge branch 'qed-fixes'

Sudarsana Reddy Kalluru says:

====================
qed*: Bug fix series.

The series contains minor bug fixes for qed/qede drivers.

Please consider applying it to 'net' branch.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoqede: Fix possible misconfiguration of advertised autoneg value.
sudarsana.kalluru@cavium.com [Thu, 4 May 2017 15:15:05 +0000 (08:15 -0700)]
qede: Fix possible misconfiguration of advertised autoneg value.

Fail the configuration of advertised speed-autoneg value if the config
update is not supported.

Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoqed: Fix overriding of supported autoneg value.
sudarsana.kalluru@cavium.com [Thu, 4 May 2017 15:15:04 +0000 (08:15 -0700)]
qed: Fix overriding of supported autoneg value.

Driver currently uses advertised-autoneg value to populate the
supported-autoneg field. When advertised field is updated, user gets
the same value for supported field. Supported-autoneg value need to be
populated from the link capabilities value returned by the MFW.

Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoqed*: Fix possible overflow for status block id field.
sudarsana.kalluru@cavium.com [Thu, 4 May 2017 15:15:03 +0000 (08:15 -0700)]
qed*: Fix possible overflow for status block id field.

Value for status block id could be more than 256 in 100G mode, need to
update its data type from u8 to u16.

Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoof/unittest: Missing unlocks on error
Dan Carpenter [Wed, 3 May 2017 19:49:50 +0000 (22:49 +0300)]
of/unittest: Missing unlocks on error

Static checkers complain that we should unlock before returning.  Which
is true.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Frank Rowand <frank.rowand@sony.com>
Signed-off-by: Rob Herring <robh@kernel.org>
7 years agofscrypt: correct collision claim for digested names
Eric Biggers [Mon, 1 May 2017 18:43:32 +0000 (11:43 -0700)]
fscrypt: correct collision claim for digested names

As I noted on the mailing list, it's easier than I originally thought to
create intentional collisions in the digested names.  Unfortunately it's
not too easy to solve this, so for now just fix the comment to not lie.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
7 years agoMAINTAINERS: fscrypt: update mailing list, patchwork, and git
Eric Biggers [Wed, 26 Apr 2017 23:43:00 +0000 (16:43 -0700)]
MAINTAINERS: fscrypt: update mailing list, patchwork, and git

Now that there has been a dedicated mailing list, patchwork project, and
git repository set up for filesystem encryption, update the MAINTAINERS
file accordingly.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
7 years agoext4: clean up ext4_match() and callers
Eric Biggers [Mon, 24 Apr 2017 17:00:13 +0000 (10:00 -0700)]
ext4: clean up ext4_match() and callers

When ext4 encryption was originally merged, we were encrypting the
user-specified filename in ext4_match(), introducing a lot of additional
complexity into ext4_match() and its callers.  This has since been
changed to encrypt the filename earlier, so we can remove the gunk
that's no longer needed.  This more or less reverts ext4_search_dir()
and ext4_find_dest_de() to the way they were in the v4.0 kernel.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
7 years agof2fs: switch to using fscrypt_match_name()
Eric Biggers [Mon, 24 Apr 2017 17:00:12 +0000 (10:00 -0700)]
f2fs: switch to using fscrypt_match_name()

Switch f2fs directory searches to use the fscrypt_match_name() helper
function.  There should be no functional change.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
7 years agoext4: switch to using fscrypt_match_name()
Eric Biggers [Mon, 24 Apr 2017 17:00:11 +0000 (10:00 -0700)]
ext4: switch to using fscrypt_match_name()

Switch ext4 directory searches to use the fscrypt_match_name() helper
function.  There should be no functional change.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
7 years agofscrypt: introduce helper function for filename matching
Eric Biggers [Mon, 24 Apr 2017 17:00:10 +0000 (10:00 -0700)]
fscrypt: introduce helper function for filename matching

Introduce a helper function fscrypt_match_name() which tests whether a
fscrypt_name matches a directory entry.  Also clean up the magic numbers
and document things properly.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
7 years agofscrypt: avoid collisions when presenting long encrypted filenames
Eric Biggers [Mon, 24 Apr 2017 17:00:09 +0000 (10:00 -0700)]
fscrypt: avoid collisions when presenting long encrypted filenames

When accessing an encrypted directory without the key, userspace must
operate on filenames derived from the ciphertext names, which contain
arbitrary bytes.  Since we must support filenames as long as NAME_MAX,
we can't always just base64-encode the ciphertext, since that may make
it too long.  Currently, this is solved by presenting long names in an
abbreviated form containing any needed filesystem-specific hashes (e.g.
to identify a directory block), then the last 16 bytes of ciphertext.
This needs to be sufficient to identify the actual name on lookup.

However, there is a bug.  It seems to have been assumed that due to the
use of a CBC (ciphertext block chaining)-based encryption mode, the last
16 bytes (i.e. the AES block size) of ciphertext would depend on the
full plaintext, preventing collisions.  However, we actually use CBC
with ciphertext stealing (CTS), which handles the last two blocks
specially, causing them to appear "flipped".  Thus, it's actually the
second-to-last block which depends on the full plaintext.

This caused long filenames that differ only near the end of their
plaintexts to, when observed without the key, point to the wrong inode
and be undeletable.  For example, with ext4:

    # echo pass | e4crypt add_key -p 16 edir/
    # seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch
    # find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
    100000
    # sync
    # echo 3 > /proc/sys/vm/drop_caches
    # keyctl new_session
    # find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
    2004
    # rm -rf edir/
    rm: cannot remove 'edir/_A7nNFi3rhkEQlJ6P,hdzluhODKOeWx5V': Structure needs cleaning
    ...

To fix this, when presenting long encrypted filenames, encode the
second-to-last block of ciphertext rather than the last 16 bytes.

Although it would be nice to solve this without depending on a specific
encryption mode, that would mean doing a cryptographic hash like SHA-256
which would be much less efficient.  This way is sufficient for now, and
it's still compatible with encryption modes like HEH which are strong
pseudorandom permutations.  Also, changing the presented names is still
allowed at any time because they are only provided to allow applications
to do things like delete encrypted directories.  They're not designed to
be used to persistently identify files --- which would be hard to do
anyway, given that they're encrypted after all.

For ease of backports, this patch only makes the minimal fix to both
ext4 and f2fs.  It leaves ubifs as-is, since ubifs doesn't compare the
ciphertext block yet.  Follow-on patches will clean things up properly
and make the filesystems use a shared helper function.

Fixes: 5de0b4d0cd15 ("ext4 crypto: simplify and speed up filename encryption")
Reported-by: Gwendal Grignou <gwendal@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
7 years agof2fs: check entire encrypted bigname when finding a dentry
Jaegeuk Kim [Mon, 24 Apr 2017 17:00:08 +0000 (10:00 -0700)]
f2fs: check entire encrypted bigname when finding a dentry

If user has no key under an encrypted dir, fscrypt gives digested dentries.
Previously, when looking up a dentry, f2fs only checks its hash value with
first 4 bytes of the digested dentry, which didn't handle hash collisions fully.
This patch enhances to check entire dentry bytes likewise ext4.

Eric reported how to reproduce this issue by:

 # seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch
 # find edir -type f | xargs stat -c %i | sort | uniq | wc -l
100000
 # sync
 # echo 3 > /proc/sys/vm/drop_caches
 # keyctl new_session
 # find edir -type f | xargs stat -c %i | sort | uniq | wc -l
99999

Cc: <stable@vger.kernel.org>
Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
(fixed f2fs_dentry_hash() to work even when the hash is 0)
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
7 years agoubifs: check for consistent encryption contexts in ubifs_lookup()
Eric Biggers [Fri, 7 Apr 2017 17:58:40 +0000 (10:58 -0700)]
ubifs: check for consistent encryption contexts in ubifs_lookup()

As ext4 and f2fs do, ubifs should check for consistent encryption
contexts during ->lookup() in an encrypted directory.  This protects
certain users of filesystem encryption against certain types of offline
attacks.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
7 years agof2fs: sync f2fs_lookup() with ext4_lookup()
Eric Biggers [Fri, 7 Apr 2017 17:58:39 +0000 (10:58 -0700)]
f2fs: sync f2fs_lookup() with ext4_lookup()

As for ext4, now that fscrypt_has_permitted_context() correctly handles
the case where we have the key for the parent directory but not the
child, f2fs_lookup() no longer has to work around it.  Also add the same
warning message that ext4 uses.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
7 years agoext4: remove "nokey" check from ext4_lookup()
Eric Biggers [Fri, 7 Apr 2017 17:58:38 +0000 (10:58 -0700)]
ext4: remove "nokey" check from ext4_lookup()

Now that fscrypt_has_permitted_context() correctly handles the case
where we have the key for the parent directory but not the child, we
don't need to try to work around this in ext4_lookup().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
7 years agofscrypt: fix context consistency check when key(s) unavailable
Eric Biggers [Fri, 7 Apr 2017 17:58:37 +0000 (10:58 -0700)]
fscrypt: fix context consistency check when key(s) unavailable

To mitigate some types of offline attacks, filesystem encryption is
designed to enforce that all files in an encrypted directory tree use
the same encryption policy (i.e. the same encryption context excluding
the nonce).  However, the fscrypt_has_permitted_context() function which
enforces this relies on comparing struct fscrypt_info's, which are only
available when we have the encryption keys.  This can cause two
incorrect behaviors:

1. If we have the parent directory's key but not the child's key, or
   vice versa, then fscrypt_has_permitted_context() returned false,
   causing applications to see EPERM or ENOKEY.  This is incorrect if
   the encryption contexts are in fact consistent.  Although we'd
   normally have either both keys or neither key in that case since the
   master_key_descriptors would be the same, this is not guaranteed
   because keys can be added or removed from keyrings at any time.

2. If we have neither the parent's key nor the child's key, then
   fscrypt_has_permitted_context() returned true, causing applications
   to see no error (or else an error for some other reason).  This is
   incorrect if the encryption contexts are in fact inconsistent, since
   in that case we should deny access.

To fix this, retrieve and compare the fscrypt_contexts if we are unable
to set up both fscrypt_infos.

While this slightly hurts performance when accessing an encrypted
directory tree without the key, this isn't a case we really need to be
optimizing for; access *with* the key is much more important.
Furthermore, the performance hit is barely noticeable given that we are
already retrieving the fscrypt_context and doing two keyring searches in
fscrypt_get_encryption_info().  If we ever actually wanted to optimize
this case we might start by caching the fscrypt_contexts.

Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
7 years agortnetlink: NUL-terminate IFLA_PHYS_PORT_NAME string
Michal Schmidt [Thu, 4 May 2017 14:48:58 +0000 (16:48 +0200)]
rtnetlink: NUL-terminate IFLA_PHYS_PORT_NAME string

IFLA_PHYS_PORT_NAME is a string attribute, so terminate it with \0.
Otherwise libnl3 fails to validate netlink messages with this attribute.
"ip -detail a" assumes too that the attribute is NUL-terminated when
printing it. It often was, due to padding.

I noticed this as libvirtd failing to start on a system with sfc driver
after upgrading it to Linux 4.11, i.e. when sfc added support for
phys_port_name.

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonetvsc: make sure napi enabled before vmbus_open
stephen hemminger [Wed, 3 May 2017 23:59:21 +0000 (16:59 -0700)]
netvsc: make sure napi enabled before vmbus_open

This fixes a race where vmbus callback for new packet arriving
could occur before NAPI is initialized.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoaquantia: Fix driver name reported by ethtool
Pavel Belous [Wed, 3 May 2017 18:17:44 +0000 (21:17 +0300)]
aquantia: Fix driver name reported by ethtool

V2: using "aquantia" subsystem tag.

The command "ethtool -i ethX" should display driver name (driver: atlantic)
instead vendor name (driver: aquantia).

Signed-off-by: Pavel Belous <pavel.belous@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoipv4, ipv6: ensure raw socket message is big enough to hold an IP header
Alexander Potapenko [Wed, 3 May 2017 15:06:58 +0000 (17:06 +0200)]
ipv4, ipv6: ensure raw socket message is big enough to hold an IP header

raw_send_hdrinc() and rawv6_send_hdrinc() expect that the buffer copied
from the userspace contains the IPv4/IPv6 header, so if too few bytes are
copied, parts of the header may remain uninitialized.

This bug has been detected with KMSAN.

For the record, the KMSAN report:

==================================================================
BUG: KMSAN: use of unitialized memory in nf_ct_frag6_gather+0xf5a/0x44a0
inter: 0
CPU: 0 PID: 1036 Comm: probe Not tainted 4.11.0-rc5+ #2455
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:16
 dump_stack+0x143/0x1b0 lib/dump_stack.c:52
 kmsan_report+0x16b/0x1e0 mm/kmsan/kmsan.c:1078
 __kmsan_warning_32+0x5c/0xa0 mm/kmsan/kmsan_instr.c:510
 nf_ct_frag6_gather+0xf5a/0x44a0 net/ipv6/netfilter/nf_conntrack_reasm.c:577
 ipv6_defrag+0x1d9/0x280 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:68
 nf_hook_entry_hookfn ./include/linux/netfilter.h:102
 nf_hook_slow+0x13f/0x3c0 net/netfilter/core.c:310
 nf_hook ./include/linux/netfilter.h:212
 NF_HOOK ./include/linux/netfilter.h:255
 rawv6_send_hdrinc net/ipv6/raw.c:673
 rawv6_sendmsg+0x2fcb/0x41a0 net/ipv6/raw.c:919
 inet_sendmsg+0x3f8/0x6d0 net/ipv4/af_inet.c:762
 sock_sendmsg_nosec net/socket.c:633
 sock_sendmsg net/socket.c:643
 SYSC_sendto+0x6a5/0x7c0 net/socket.c:1696
 SyS_sendto+0xbc/0xe0 net/socket.c:1664
 do_syscall_64+0x72/0xa0 arch/x86/entry/common.c:285
 entry_SYSCALL64_slow_path+0x25/0x25 arch/x86/entry/entry_64.S:246
RIP: 0033:0x436e03
RSP: 002b:00007ffce48baf38 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00000000004002b0 RCX: 0000000000436e03
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00007ffce48baf90 R08: 00007ffce48baf50 R09: 000000000000001c
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000401790 R14: 0000000000401820 R15: 0000000000000000
origin: 00000000d9400053
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:362
 kmsan_internal_poison_shadow+0xb1/0x1a0 mm/kmsan/kmsan.c:257
 kmsan_poison_shadow+0x6d/0xc0 mm/kmsan/kmsan.c:270
 slab_alloc_node mm/slub.c:2735
 __kmalloc_node_track_caller+0x1f4/0x390 mm/slub.c:4341
 __kmalloc_reserve net/core/skbuff.c:138
 __alloc_skb+0x2cd/0x740 net/core/skbuff.c:231
 alloc_skb ./include/linux/skbuff.h:933
 alloc_skb_with_frags+0x209/0xbc0 net/core/skbuff.c:4678
 sock_alloc_send_pskb+0x9ff/0xe00 net/core/sock.c:1903
 sock_alloc_send_skb+0xe4/0x100 net/core/sock.c:1920
 rawv6_send_hdrinc net/ipv6/raw.c:638
 rawv6_sendmsg+0x2918/0x41a0 net/ipv6/raw.c:919
 inet_sendmsg+0x3f8/0x6d0 net/ipv4/af_inet.c:762
 sock_sendmsg_nosec net/socket.c:633
 sock_sendmsg net/socket.c:643
 SYSC_sendto+0x6a5/0x7c0 net/socket.c:1696
 SyS_sendto+0xbc/0xe0 net/socket.c:1664
 do_syscall_64+0x72/0xa0 arch/x86/entry/common.c:285
 return_from_SYSCALL_64+0x0/0x6a arch/x86/entry/entry_64.S:246
==================================================================

, triggered by the following syscalls:
  socket(PF_INET6, SOCK_RAW, IPPROTO_RAW) = 3
  sendto(3, NULL, 0, 0, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "ff00::", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EPERM

A similar report is triggered in net/ipv4/raw.c if we use a PF_INET socket
instead of a PF_INET6 one.

Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agojbd2: cleanup write flags handling from jbd2_write_superblock()
Jan Kara [Thu, 4 May 2017 15:01:31 +0000 (11:01 -0400)]
jbd2: cleanup write flags handling from jbd2_write_superblock()

Currently jbd2_write_superblock() silently adds REQ_SYNC to flags with
which journal superblock is written. Make this explicit by making flags
passed down to jbd2_write_superblock() contain REQ_SYNC.

CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
7 years agonet/sched: remove redundant null check on head
Colin Ian King [Wed, 3 May 2017 13:50:40 +0000 (14:50 +0100)]
net/sched: remove redundant null check on head

head is previously null checked and so the 2nd null check on head
is redundant and therefore can be removed.

Detected by CoverityScan, CID#1399505 ("Logically dead code")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotcp: do not inherit fastopen_req from parent
Eric Dumazet [Wed, 3 May 2017 13:39:31 +0000 (06:39 -0700)]
tcp: do not inherit fastopen_req from parent

Under fuzzer stress, it is possible that a child gets a non NULL
fastopen_req pointer from its parent at accept() time, when/if parent
morphs from listener to active session.

We need to make sure this can not happen, by clearing the field after
socket cloning.

BUG: Double free or freeing an invalid pointer
Unexpected shadow byte: 0xFB
CPU: 3 PID: 20933 Comm: syz-executor3 Not tainted 4.11.0+ #306
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x292/0x395 lib/dump_stack.c:52
 kasan_object_err+0x1c/0x70 mm/kasan/report.c:164
 kasan_report_double_free+0x5c/0x70 mm/kasan/report.c:185
 kasan_slab_free+0x9d/0xc0 mm/kasan/kasan.c:580
 slab_free_hook mm/slub.c:1357 [inline]
 slab_free_freelist_hook mm/slub.c:1379 [inline]
 slab_free mm/slub.c:2961 [inline]
 kfree+0xe8/0x2b0 mm/slub.c:3882
 tcp_free_fastopen_req net/ipv4/tcp.c:1077 [inline]
 tcp_disconnect+0xc15/0x13e0 net/ipv4/tcp.c:2328
 inet_child_forget+0xb8/0x600 net/ipv4/inet_connection_sock.c:898
 inet_csk_reqsk_queue_add+0x1e7/0x250
net/ipv4/inet_connection_sock.c:928
 tcp_get_cookie_sock+0x21a/0x510 net/ipv4/syncookies.c:217
 cookie_v4_check+0x1a19/0x28b0 net/ipv4/syncookies.c:384
 tcp_v4_cookie_check net/ipv4/tcp_ipv4.c:1384 [inline]
 tcp_v4_do_rcv+0x731/0x940 net/ipv4/tcp_ipv4.c:1421
 tcp_v4_rcv+0x2dc0/0x31c0 net/ipv4/tcp_ipv4.c:1715
 ip_local_deliver_finish+0x4cc/0xc20 net/ipv4/ip_input.c:216
 NF_HOOK include/linux/netfilter.h:257 [inline]
 ip_local_deliver+0x1ce/0x700 net/ipv4/ip_input.c:257
 dst_input include/net/dst.h:492 [inline]
 ip_rcv_finish+0xb1d/0x20b0 net/ipv4/ip_input.c:396
 NF_HOOK include/linux/netfilter.h:257 [inline]
 ip_rcv+0xd8c/0x19c0 net/ipv4/ip_input.c:487
 __netif_receive_skb_core+0x1ad1/0x3400 net/core/dev.c:4210
 __netif_receive_skb+0x2a/0x1a0 net/core/dev.c:4248
 process_backlog+0xe5/0x6c0 net/core/dev.c:4868
 napi_poll net/core/dev.c:5270 [inline]
 net_rx_action+0xe70/0x18e0 net/core/dev.c:5335
 __do_softirq+0x2fb/0xb99 kernel/softirq.c:284
 do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:899
 </IRQ>
 do_softirq.part.17+0x1e8/0x230 kernel/softirq.c:328
 do_softirq kernel/softirq.c:176 [inline]
 __local_bh_enable_ip+0x1cf/0x1e0 kernel/softirq.c:181
 local_bh_enable include/linux/bottom_half.h:31 [inline]
 rcu_read_unlock_bh include/linux/rcupdate.h:931 [inline]
 ip_finish_output2+0x9ab/0x15e0 net/ipv4/ip_output.c:230
 ip_finish_output+0xa35/0xdf0 net/ipv4/ip_output.c:316
 NF_HOOK_COND include/linux/netfilter.h:246 [inline]
 ip_output+0x1f6/0x7b0 net/ipv4/ip_output.c:404
 dst_output include/net/dst.h:486 [inline]
 ip_local_out+0x95/0x160 net/ipv4/ip_output.c:124
 ip_queue_xmit+0x9a8/0x1a10 net/ipv4/ip_output.c:503
 tcp_transmit_skb+0x1ade/0x3470 net/ipv4/tcp_output.c:1057
 tcp_write_xmit+0x79e/0x55b0 net/ipv4/tcp_output.c:2265
 __tcp_push_pending_frames+0xfa/0x3a0 net/ipv4/tcp_output.c:2450
 tcp_push+0x4ee/0x780 net/ipv4/tcp.c:683
 tcp_sendmsg+0x128d/0x39b0 net/ipv4/tcp.c:1342
 inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:762
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 SYSC_sendto+0x660/0x810 net/socket.c:1696
 SyS_sendto+0x40/0x50 net/socket.c:1664
 entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x446059
RSP: 002b:00007faa6761fb58 EFLAGS: 00000282 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000000017 RCX: 0000000000446059
RDX: 0000000000000001 RSI: 0000000020ba3fcd RDI: 0000000000000017
RBP: 00000000006e40a0 R08: 0000000020ba4ff0 R09: 0000000000000010
R10: 0000000020000000 R11: 0000000000000282 R12: 0000000000708150
R13: 0000000000000000 R14: 00007faa676209c0 R15: 00007faa67620700
Object at ffff88003b5bbcb8, in cache kmalloc-64 size: 64
Allocated:
PID = 20909
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:513
 set_track mm/kasan/kasan.c:525 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:616
 kmem_cache_alloc_trace+0x82/0x270 mm/slub.c:2745
 kmalloc include/linux/slab.h:490 [inline]
 kzalloc include/linux/slab.h:663 [inline]
 tcp_sendmsg_fastopen net/ipv4/tcp.c:1094 [inline]
 tcp_sendmsg+0x221a/0x39b0 net/ipv4/tcp.c:1139
 inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:762
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 SYSC_sendto+0x660/0x810 net/socket.c:1696
 SyS_sendto+0x40/0x50 net/socket.c:1664
 entry_SYSCALL_64_fastpath+0x1f/0xbe
Freed:
PID = 20909
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:513
 set_track mm/kasan/kasan.c:525 [inline]
 kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:589
 slab_free_hook mm/slub.c:1357 [inline]
 slab_free_freelist_hook mm/slub.c:1379 [inline]
 slab_free mm/slub.c:2961 [inline]
 kfree+0xe8/0x2b0 mm/slub.c:3882
 tcp_free_fastopen_req net/ipv4/tcp.c:1077 [inline]
 tcp_disconnect+0xc15/0x13e0 net/ipv4/tcp.c:2328
 __inet_stream_connect+0x20c/0xf90 net/ipv4/af_inet.c:593
 tcp_sendmsg_fastopen net/ipv4/tcp.c:1111 [inline]
 tcp_sendmsg+0x23a8/0x39b0 net/ipv4/tcp.c:1139
 inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:762
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 SYSC_sendto+0x660/0x810 net/socket.c:1696
 SyS_sendto+0x40/0x50 net/socket.c:1664
 entry_SYSCALL_64_fastpath+0x1f/0xbe

Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
Fixes: 7db92362d2fe ("tcp: fix potential double free issue for fastopen_req")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodm integrity: improve the Kconfig help text for DM_INTEGRITY
Mike Snitzer [Thu, 4 May 2017 14:32:07 +0000 (10:32 -0400)]
dm integrity: improve the Kconfig help text for DM_INTEGRITY

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Milan Broz <gmazyland@gmail.com>
7 years agoext4: mark superblock writes synchronous for nobarrier mounts
Jan Kara [Thu, 4 May 2017 14:58:03 +0000 (10:58 -0400)]
ext4: mark superblock writes synchronous for nobarrier mounts

Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_FUA implementation.
generic_make_request_checks() however strips REQ_FUA flag from a bio
when the storage doesn't report volatile write cache and thus write
effectively becomes asynchronous which can lead to performance
regressions. This affects superblock writes for ext4. Fix the problem
by marking superblock writes always as synchronous.

Fixes: b685d3d65ac791406e0dfd8779cc9b3707fea5a3
CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
7 years agoforcedeth: remove unnecessary carrier status check
Zhu Yanjun [Wed, 3 May 2017 04:43:42 +0000 (00:43 -0400)]
forcedeth: remove unnecessary carrier status check

Since netif_carrier_on() will do nothing if device's
carrier is already on, so it's unnecessary to do
carrier status check.

It's the same for netif_carrier_off().

Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodm cache policy smq: cleanup free_target_met() and clean_target_met()
Mike Snitzer [Thu, 4 May 2017 14:04:18 +0000 (10:04 -0400)]
dm cache policy smq: cleanup free_target_met() and clean_target_met()

Depending on the passed @idle arg, there may be no need to calculate
'nr_free' or 'nr_clean' respectively in free_target_met() and
clean_target_met().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
7 years agodm cache policy smq: allow demotions to happen even during continuous IO
Joe Thornber [Wed, 3 May 2017 16:01:27 +0000 (12:01 -0400)]
dm cache policy smq: allow demotions to happen even during continuous IO

dm-cache's smq policy tries hard to do it's work during the idle periods
when there is no IO.  But if there are no idle periods (eg, a long fio
run) we still need to allow some demotions and promotions to occur.

To achieve this, pass @idle=true to queue_promotion()'s
free_target_met() call so that free_target_met() doesn't short-circuit
the possibility of demotion simply because it isn't an idle period.

Fixes: b29d4986d0 ("dm cache: significant rework to leverage dm-bio-prison-v2")
Reported-by: John Harrigan <jharriga@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
7 years agomq-deadline: add debugfs attributes
Omar Sandoval [Thu, 4 May 2017 07:31:34 +0000 (00:31 -0700)]
mq-deadline: add debugfs attributes

Expose the fifo lists, cached next requests, batching state, and
dispatch list. It'd also be possible to add the sorted lists, but there
aren't already seq_file helpers for rbtrees.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agokyber: add debugfs attributes
Omar Sandoval [Thu, 4 May 2017 07:31:33 +0000 (00:31 -0700)]
kyber: add debugfs attributes

Expose the domain token pools, asynchronous sbitmap depth, domain
request lists, and batching state.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq-debugfs: allow schedulers to register debugfs attributes
Omar Sandoval [Thu, 4 May 2017 14:24:40 +0000 (08:24 -0600)]
blk-mq-debugfs: allow schedulers to register debugfs attributes

This provides the infrastructure for schedulers to expose their internal
state through debugfs. We add a list of queue attributes and a list of
hctx attributes to struct elevator_type and wire them up when switching
schedulers.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Add missing seq_file.h header in blk-mq-debugfs.h

Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq: untangle debugfs and sysfs
Omar Sandoval [Thu, 4 May 2017 14:17:21 +0000 (08:17 -0600)]
blk-mq: untangle debugfs and sysfs

Originally, I tied debugfs registration/unregistration together with
sysfs. There's no reason to do this, and it's getting in the way of
letting schedulers define their own debugfs attributes. Instead, tie the
debugfs registration to the lifetime of the structures themselves.

The saner lifetimes mean we can also get rid of the extra mq directory
and move everything one level up. I.e., nvme0n1/mq/hctx0/tags is now
just nvme0n1/hctx0/tags.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq: move debugfs declarations to a separate header file
Omar Sandoval [Thu, 4 May 2017 07:31:30 +0000 (00:31 -0700)]
blk-mq: move debugfs declarations to a separate header file

Preparation for adding more declarations.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq: Do not invoke queue operations on a dead queue
Bart Van Assche [Thu, 4 May 2017 07:31:29 +0000 (00:31 -0700)]
blk-mq: Do not invoke queue operations on a dead queue

In commit e869b5462f83 ("blk-mq: Unregister debugfs attributes
earlier"), we shuffled the debugfs cleanup around so that the "state"
attribute was removed before we freed the blk-mq data structures.
However, later changes are going to undo that, so we need to explicitly
disallow running a dead queue.

[Omar: rebased and updated commit message]
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq-debugfs: get rid of a bunch of boilerplate
Omar Sandoval [Thu, 4 May 2017 07:31:28 +0000 (00:31 -0700)]
blk-mq-debugfs: get rid of a bunch of boilerplate

A large part of blk-mq-debugfs.c is file_operations and seq_file
boilerplate. This sucks as is but will suck even more when schedulers
can define their own debugfs entries. Factor it all out into a single
blk_mq_debugfs_fops which multiplexes as needed. We store the
request_queue, blk_mq_hw_ctx, or blk_mq_ctx in the parent directory
dentry, which is kind of hacky, but it works.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq-debugfs: rename hw queue directories from <n> to hctx<n>
Omar Sandoval [Thu, 4 May 2017 07:31:27 +0000 (00:31 -0700)]
blk-mq-debugfs: rename hw queue directories from <n> to hctx<n>

It's not clear what these numbered directories represent unless you
consult the code. We're about to get rid of the intermediate "mq"
directory, so these would be even more confusing without that context.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq-debugfs: don't open code strstrip()
Omar Sandoval [Thu, 4 May 2017 07:31:26 +0000 (00:31 -0700)]
blk-mq-debugfs: don't open code strstrip()

Slightly more readable, plus we also strip leading spaces.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq-debugfs: error on long write to queue "state" file
Omar Sandoval [Thu, 4 May 2017 07:31:25 +0000 (00:31 -0700)]
blk-mq-debugfs: error on long write to queue "state" file

blk_queue_flags_store() currently truncates and returns a short write if
the operation being written is too long. This can give us weird results,
like here:

$ echo "run            bar"
echo: write error: invalid argument
$ dmesg
[ 1103.075435] blk_queue_flags_store: unsupported operation bar. Use either 'run' or 'start'

Instead, return an error if the user does this. While we're here, make
the argument names consistent with everywhere else in this file.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq-debugfs: clean up flag definitions
Omar Sandoval [Thu, 4 May 2017 07:31:24 +0000 (00:31 -0700)]
blk-mq-debugfs: clean up flag definitions

Make sure the spelled out flag names match the definition. This also
adds a missing hctx state, BLK_MQ_S_START_ON_RUN, and a missing
cmd_flag, __REQ_NOUNMAP.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblk-mq-debugfs: separate flags with |
Omar Sandoval [Thu, 4 May 2017 07:31:23 +0000 (00:31 -0700)]
blk-mq-debugfs: separate flags with |

This reads more naturally than spaces.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agonfs: Fix bdi handling for cloned superblocks
Jan Kara [Thu, 4 May 2017 07:02:42 +0000 (09:02 +0200)]
nfs: Fix bdi handling for cloned superblocks

In commit 0d3b12584972 "nfs: Convert to separately allocated bdi" I have
wrongly cloned bdi reference in nfs_clone_super(). Further inspection
has shown that originally the code was actually allocating a new bdi (in
->clone_server callback) which was later registered in
nfs_fs_mount_common() and used for sb->s_bdi in nfs_initialise_sb().
This could later result in bdi for the original superblock not getting
unregistered when that superblock got shutdown (as the cloned sb still
held bdi reference) and later when a new superblock was created under
the same anonymous device number, a clash in sysfs has happened on bdi
registration:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 10284 at /linux-next/fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x74
sysfs: cannot create duplicate filename '/devices/virtual/bdi/0:32'
Modules linked in: axp20x_usb_power gpio_axp209 nvmem_sunxi_sid sun4i_dma sun4i_ss virt_dma
CPU: 1 PID: 10284 Comm: mount.nfs Not tainted 4.11.0-rc4+ #14
Hardware name: Allwinner sun7i (A20) Family
[<c010f19c>] (unwind_backtrace) from [<c010bc74>] (show_stack+0x10/0x14)
[<c010bc74>] (show_stack) from [<c03c6e24>] (dump_stack+0x78/0x8c)
[<c03c6e24>] (dump_stack) from [<c0122200>] (__warn+0xe8/0x100)
[<c0122200>] (__warn) from [<c0122250>] (warn_slowpath_fmt+0x38/0x48)
[<c0122250>] (warn_slowpath_fmt) from [<c02ac178>] (sysfs_warn_dup+0x64/0x74)
[<c02ac178>] (sysfs_warn_dup) from [<c02ac254>] (sysfs_create_dir_ns+0x84/0x94)
[<c02ac254>] (sysfs_create_dir_ns) from [<c03c8b8c>] (kobject_add_internal+0x9c/0x2ec)
[<c03c8b8c>] (kobject_add_internal) from [<c03c8e24>] (kobject_add+0x48/0x98)
[<c03c8e24>] (kobject_add) from [<c048d75c>] (device_add+0xe4/0x5a0)
[<c048d75c>] (device_add) from [<c048ddb4>] (device_create_groups_vargs+0xac/0xbc)
[<c048ddb4>] (device_create_groups_vargs) from [<c048dde4>] (device_create_vargs+0x20/0x28)
[<c048dde4>] (device_create_vargs) from [<c02075c8>] (bdi_register_va+0x44/0xfc)
[<c02075c8>] (bdi_register_va) from [<c023d378>] (super_setup_bdi_name+0x48/0xa4)
[<c023d378>] (super_setup_bdi_name) from [<c0312ef4>] (nfs_fill_super+0x1a4/0x204)
[<c0312ef4>] (nfs_fill_super) from [<c03133f0>] (nfs_fs_mount_common+0x140/0x1e8)
[<c03133f0>] (nfs_fs_mount_common) from [<c03335cc>] (nfs4_remote_mount+0x50/0x58)
[<c03335cc>] (nfs4_remote_mount) from [<c023ef98>] (mount_fs+0x14/0xa4)
[<c023ef98>] (mount_fs) from [<c025cba0>] (vfs_kern_mount+0x54/0x128)
[<c025cba0>] (vfs_kern_mount) from [<c033352c>] (nfs_do_root_mount+0x80/0xa0)
[<c033352c>] (nfs_do_root_mount) from [<c0333818>] (nfs4_try_mount+0x28/0x3c)
[<c0333818>] (nfs4_try_mount) from [<c0313874>] (nfs_fs_mount+0x2cc/0x8c4)
[<c0313874>] (nfs_fs_mount) from [<c023ef98>] (mount_fs+0x14/0xa4)
[<c023ef98>] (mount_fs) from [<c025cba0>] (vfs_kern_mount+0x54/0x128)
[<c025cba0>] (vfs_kern_mount) from [<c02600f0>] (do_mount+0x158/0xc7c)
[<c02600f0>] (do_mount) from [<c0260f98>] (SyS_mount+0x8c/0xb4)
[<c0260f98>] (SyS_mount) from [<c0107840>] (ret_fast_syscall+0x0/0x3c)

Fix the problem by always creating new bdi for a superblock as we used
to do.

Reported-and-tested-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Fixes: 0d3b12584972ce5781179ad3f15cca3cdb5cae05
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoblock/mq: Cure cpu hotplug lock inversion
Peter Zijlstra [Thu, 4 May 2017 13:05:26 +0000 (15:05 +0200)]
block/mq: Cure cpu hotplug lock inversion

By poking at /debug/sched_features I triggered the following splat:

 [] ======================================================
 [] WARNING: possible circular locking dependency detected
 [] 4.11.0-00873-g964c8b7-dirty #694 Not tainted
 [] ------------------------------------------------------
 [] bash/2109 is trying to acquire lock:
 []  (cpu_hotplug_lock.rw_sem){++++++}, at: [<ffffffff8120cb8b>] static_key_slow_dec+0x1b/0x50
 []
 [] but task is already holding lock:
 []  (&sb->s_type->i_mutex_key#4){+++++.}, at: [<ffffffff81140216>] sched_feat_write+0x86/0x170
 []
 [] which lock already depends on the new lock.
 []
 []
 [] the existing dependency chain (in reverse order) is:
 []
 [] -> #2 (&sb->s_type->i_mutex_key#4){+++++.}:
 []        lock_acquire+0x100/0x210
 []        down_write+0x28/0x60
 []        start_creating+0x5e/0xf0
 []        debugfs_create_dir+0x13/0x110
 []        blk_mq_debugfs_register+0x21/0x70
 []        blk_mq_register_dev+0x64/0xd0
 []        blk_register_queue+0x6a/0x170
 []        device_add_disk+0x22d/0x440
 []        loop_add+0x1f3/0x280
 []        loop_init+0x104/0x142
 []        do_one_initcall+0x43/0x180
 []        kernel_init_freeable+0x1de/0x266
 []        kernel_init+0xe/0x100
 []        ret_from_fork+0x31/0x40
 []
 [] -> #1 (all_q_mutex){+.+.+.}:
 []        lock_acquire+0x100/0x210
 []        __mutex_lock+0x6c/0x960
 []        mutex_lock_nested+0x1b/0x20
 []        blk_mq_init_allocated_queue+0x37c/0x4e0
 []        blk_mq_init_queue+0x3a/0x60
 []        loop_add+0xe5/0x280
 []        loop_init+0x104/0x142
 []        do_one_initcall+0x43/0x180
 []        kernel_init_freeable+0x1de/0x266
 []        kernel_init+0xe/0x100
 []        ret_from_fork+0x31/0x40

 []  *** DEADLOCK ***
 []
 [] 3 locks held by bash/2109:
 []  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff81292bcd>] vfs_write+0x17d/0x1a0
 []  #1:  (debugfs_srcu){......}, at: [<ffffffff8155a90d>] full_proxy_write+0x5d/0xd0
 []  #2:  (&sb->s_type->i_mutex_key#4){+++++.}, at: [<ffffffff81140216>] sched_feat_write+0x86/0x170
 []
 [] stack backtrace:
 [] CPU: 9 PID: 2109 Comm: bash Not tainted 4.11.0-00873-g964c8b7-dirty #694
 [] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
 [] Call Trace:

 []  lock_acquire+0x100/0x210
 []  get_online_cpus+0x2a/0x90
 []  static_key_slow_dec+0x1b/0x50
 []  static_key_disable+0x20/0x30
 []  sched_feat_write+0x131/0x170
 []  full_proxy_write+0x97/0xd0
 []  __vfs_write+0x28/0x120
 []  vfs_write+0xb5/0x1a0
 []  SyS_write+0x49/0xa0
 []  entry_SYSCALL_64_fastpath+0x23/0xc2

This is because of the cpu hotplug lock rework. Break the chain at #1
by reversing the lock acquisition order. This way i_mutex_key#4 no
longer depends on cpu_hotplug_lock and things are good.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agolightnvm: fix bad back free on error path
Javier González [Wed, 3 May 2017 09:19:05 +0000 (11:19 +0200)]
lightnvm: fix bad back free on error path

Free memory correctly when an allocation fails on a loop and we free
backwards previously successful allocations.

Signed-off-by: Javier González <javier@cnexlabs.com>
Reviewed-by: Matias Bjørling <matias@cnexlabs.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agolightnvm: create cmd before allocating request
Javier González [Wed, 3 May 2017 09:19:04 +0000 (11:19 +0200)]
lightnvm: create cmd before allocating request

Create nvme command before allocating a request using
nvme_alloc_request, which uses the command direction. Up until now, the
command has been zeroized, so all commands have been allocated as a
read operation.

Signed-off-by: Javier González <javier@cnexlabs.com>
Reviewed-by: Matias Bjørling <matias@cnexlabs.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
7 years agoKVM: put back #ifndef CONFIG_S390 around kvm_vcpu_kick
Paolo Bonzini [Thu, 4 May 2017 13:14:13 +0000 (15:14 +0200)]
KVM: put back #ifndef CONFIG_S390 around kvm_vcpu_kick

The #ifndef was removed in 75aaafb79f73516b69d5639ad30a72d72e75c8b4,
but it was also protecting smp_send_reschedule() in kvm_vcpu_kick().

Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
7 years agoSMB3: Work around mount failure when using SMB3 dialect to Macs
Steve French [Thu, 4 May 2017 02:12:20 +0000 (21:12 -0500)]
SMB3: Work around mount failure when using SMB3 dialect to Macs

Macs send the maximum buffer size in response on ioctl to validate
negotiate security information, which causes us to fail the mount
as the response buffer is larger than the expected response.

Changed ioctl response processing to allow for padding of validate
negotiate ioctl response and limit the maximum response size to
maximum buffer size.

Signed-off-by: Steve French <steve.french@primarydata.com>
CC: Stable <stable@vger.kernel.org>
7 years agoMerge tag 'modules-for-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu...
Linus Torvalds [Thu, 4 May 2017 02:12:27 +0000 (19:12 -0700)]
Merge tag 'modules-for-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull modules updates from Jessica Yu:

 - Minor code cleanups

 - Fix section alignment for .init_array

* tag 'modules-for-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  kallsyms: Use bounded strnchr() when parsing string
  module: Unify the return value type of try_module_get
  module: set .init_array alignment to 8

7 years agof2fs: fix a mount fail for wrong next_scan_nid
Yunlei He [Wed, 26 Apr 2017 07:56:52 +0000 (15:56 +0800)]
f2fs: fix a mount fail for wrong next_scan_nid

-write_checkpoint
   -do_checkpoint
      -next_free_nid    <--- something wrong with next free nid

-f2fs_fill_super
   -build_node_manager
      -build_free_nids
          -get_current_nat_page
             -__get_meta_page   <--- attempt to access beyond end of device

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agof2fs: enhance scalability of trace macro
Chao Yu [Thu, 4 May 2017 01:35:43 +0000 (09:35 +0800)]
f2fs: enhance scalability of trace macro

Use __print_flags in show_bio_op_flags and show_cpreason instead of
__print_symbolic, it enables tracer function traverses and shows all
bits in the flag.

Additionally, add missing REQ_FUA into F2FS_OP_FLAGS.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
7 years agoMerge tag 'trace-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt...
Linus Torvalds [Thu, 4 May 2017 01:41:21 +0000 (18:41 -0700)]
Merge tag 'trace-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "New features for this release:

   - Pretty much a full rewrite of the processing of function plugins.
     i.e. echo do_IRQ:stacktrace > set_ftrace_filter

   - The rewrite was needed to add plugins to be unique to tracing
     instances. i.e. mkdir instance/foo; cd instances/foo; echo
     do_IRQ:stacktrace > set_ftrace_filter The old way was written very
     hacky. This removes a lot of those hacks.

   - New "function-fork" tracing option. When set, pids in the
     set_ftrace_pid will have their children added when the processes
     with their pids listed in the set_ftrace_pid file forks.

   - Exposure of "maxactive" for kretprobe in kprobe_events

   - Allow for builtin init functions to be traced by the function
     tracer (via the kernel command line). Module init function tracing
     will come in the next release.

   - Added more selftests, and have selftests also test in an instance"

* tag 'trace-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (60 commits)
  ring-buffer: Return reader page back into existing ring buffer
  selftests: ftrace: Allow some event trigger tests to run in an instance
  selftests: ftrace: Have some basic tests run in a tracing instance too
  selftests: ftrace: Have event tests also run in an tracing instance
  selftests: ftrace: Make func_event_triggers and func_traceonoff_triggers tests do instances
  selftests: ftrace: Allow some tests to be run in a tracing instance
  tracing/ftrace: Allow for instances to trigger their own stacktrace probes
  tracing/ftrace: Allow for the traceonoff probe be unique to instances
  tracing/ftrace: Enable snapshot function trigger to work with instances
  tracing/ftrace: Allow instances to have their own function probes
  tracing/ftrace: Add a better way to pass data via the probe functions
  ftrace: Dynamically create the probe ftrace_ops for the trace_array
  tracing: Pass the trace_array into ftrace_probe_ops functions
  tracing: Have the trace_array hold the list of registered func probes
  ftrace: If the hash for a probe fails to update then free what was initialized
  ftrace: Have the function probes call their own function
  ftrace: Have each function probe use its own ftrace_ops
  ftrace: Have unregister_ftrace_function_probe_func() return a value
  ftrace: Add helper function ftrace_hash_move_and_update_ops()
  ftrace: Remove data field from ftrace_func_probe structure
  ...

7 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek...
Linus Torvalds [Thu, 4 May 2017 01:29:28 +0000 (18:29 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk

Pull printk updates from Petr Mladek:

 - There is a situation when early console is not deregistered because
   the preferred one matches a wrong entry. It caused messages to appear
   twice.

   This is the 2nd attempt to fix it. The first one was wrong, see the
   commit c6c7d83b9c9e ('Revert "console: don't prefer first registered
   if DT specifies stdout-path"').

   The fix is coupled with some small code clean up. Well, the console
   registration code would deserve a big one. We need to think about it.

 - Do not lose information about the preemtive context when the console
   semaphore is re-taken.

 - Do not block CPU hotplug when someone else is already pushing
   messages to the console.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
  printk: fix double printing with earlycon
  printk: rename selected_console -> preferred_console
  printk: fix name/type/scope of preferred_console var
  printk: Correctly handle preemption in console_unlock()
  printk: use console_trylock() in console_cpu_notify()

7 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Thu, 4 May 2017 00:55:59 +0000 (17:55 -0700)]
Merge branch 'akpm' (patches from Andrew)

Merge misc updates from Andrew Morton:

 - a few misc things

 - most of MM

 - KASAN updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
  kasan: separate report parts by empty lines
  kasan: improve double-free report format
  kasan: print page description after stacks
  kasan: improve slab object description
  kasan: change report header
  kasan: simplify address description logic
  kasan: change allocation and freeing stack traces headers
  kasan: unify report headers
  kasan: introduce helper functions for determining bug type
  mm: hwpoison: call shake_page() after try_to_unmap() for mlocked page
  mm: hwpoison: call shake_page() unconditionally
  mm/swapfile.c: fix swap space leak in error path of swap_free_entries()
  mm/gup.c: fix access_ok() argument type
  mm/truncate: avoid pointless cleancache_invalidate_inode() calls.
  mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty
  fs/block_dev: always invalidate cleancache in invalidate_bdev()
  fs: fix data invalidation in the cleancache during direct IO
  zram: reduce load operation in page_same_filled
  zram: use zram_free_page instead of open-coded
  zram: introduce zram data accessor
  ...

7 years agocifs: fix CIFS_IOC_GET_MNT_INFO oops
David Disseldorp [Wed, 3 May 2017 22:41:13 +0000 (00:41 +0200)]
cifs: fix CIFS_IOC_GET_MNT_INFO oops

An open directory may have a NULL private_data pointer prior to readdir.

Fixes: 0de1f4c6f6c0 ("Add way to query server fs info for smb3")
Cc: stable@vger.kernel.org
Signed-off-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Steve French <smfrench@gmail.com>
7 years agoCIFS: fix mapping of SFM_SPACE and SFM_PERIOD
Björn Jacke [Wed, 3 May 2017 21:47:44 +0000 (23:47 +0200)]
CIFS: fix mapping of SFM_SPACE and SFM_PERIOD

- trailing space maps to 0xF028
- trailing period maps to 0xF029

This fix corrects the mapping of file names which have a trailing character
that would otherwise be illegal (period or space) but is allowed by POSIX.

Signed-off-by: Bjoern Jacke <bjacke@samba.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
7 years agokasan: separate report parts by empty lines
Andrey Konovalov [Wed, 3 May 2017 21:56:50 +0000 (14:56 -0700)]
kasan: separate report parts by empty lines

Makes the report easier to read.

Link: http://lkml.kernel.org/r/20170302134851.101218-10-andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agokasan: improve double-free report format
Andrey Konovalov [Wed, 3 May 2017 21:56:47 +0000 (14:56 -0700)]
kasan: improve double-free report format

Changes double-free report header from

  BUG: Double free or freeing an invalid pointer
  Unexpected shadow byte: 0xFB

to

  BUG: KASAN: double-free or invalid-free in kmalloc_oob_left+0xe5/0xef

This makes a bug uniquely identifiable by the first report line.  To
account for removing of the unexpected shadow value, print shadow bytes
at the end of the report as in reports for other kinds of bugs.

Link: http://lkml.kernel.org/r/20170302134851.101218-9-andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agokasan: print page description after stacks
Andrey Konovalov [Wed, 3 May 2017 21:56:44 +0000 (14:56 -0700)]
kasan: print page description after stacks

Moves page description after the stacks since it's less important.

Link: http://lkml.kernel.org/r/20170302134851.101218-8-andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agokasan: improve slab object description
Andrey Konovalov [Wed, 3 May 2017 21:56:41 +0000 (14:56 -0700)]
kasan: improve slab object description

Changes slab object description from:

  Object at ffff880068388540, in cache kmalloc-128 size: 128

to:

  The buggy address belongs to the object at ffff880068388540
   which belongs to the cache kmalloc-128 of size 128
  The buggy address is located 123 bytes inside of
   128-byte region [ffff880068388540ffff8800683885c0)

Makes it more explanatory and adds information about relative offset of
the accessed address to the start of the object.

Link: http://lkml.kernel.org/r/20170302134851.101218-7-andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agokasan: change report header
Andrey Konovalov [Wed, 3 May 2017 21:56:38 +0000 (14:56 -0700)]
kasan: change report header

Change report header format from:

  BUG: KASAN: use-after-free in unwind_get_return_address+0x28a/0x2c0 at addr ffff880069437950
  Read of size 8 by task insmod/3925

to:

  BUG: KASAN: use-after-free in unwind_get_return_address+0x28a/0x2c0
  Read of size 8 at addr ffff880069437950 by task insmod/3925

The exact access address is not usually important, so move it to the
second line.  This also makes the header look visually balanced.

Link: http://lkml.kernel.org/r/20170302134851.101218-6-andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agokasan: simplify address description logic
Andrey Konovalov [Wed, 3 May 2017 21:56:34 +0000 (14:56 -0700)]
kasan: simplify address description logic

Simplify logic for describing a memory address.  Add addr_to_page()
helper function.

Makes the code easier to follow.

Link: http://lkml.kernel.org/r/20170302134851.101218-5-andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agokasan: change allocation and freeing stack traces headers
Andrey Konovalov [Wed, 3 May 2017 21:56:31 +0000 (14:56 -0700)]
kasan: change allocation and freeing stack traces headers

Change stack traces headers from:

  Allocated:
  PID = 42

to:

  Allocated by task 42:

Makes the report one line shorter and look better.

Link: http://lkml.kernel.org/r/20170302134851.101218-4-andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agokasan: unify report headers
Andrey Konovalov [Wed, 3 May 2017 21:56:28 +0000 (14:56 -0700)]
kasan: unify report headers

Unify KASAN report header format for different kinds of bad memory
accesses.  Makes the code simpler.

Link: http://lkml.kernel.org/r/20170302134851.101218-3-andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agokasan: introduce helper functions for determining bug type
Andrey Konovalov [Wed, 3 May 2017 21:56:25 +0000 (14:56 -0700)]
kasan: introduce helper functions for determining bug type

Patch series "kasan: improve error reports", v2.

This patchset improves KASAN reports by making them easier to read and a
little more detailed.  Also improves mm/kasan/report.c readability.

Effectively changes a use-after-free report to:

  ==================================================================
  BUG: KASAN: use-after-free in kmalloc_uaf+0xaa/0xb6 [test_kasan]
  Write of size 1 at addr ffff88006aa59da8 by task insmod/3951

  CPU: 1 PID: 3951 Comm: insmod Tainted: G    B           4.10.0+ #84
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Call Trace:
   dump_stack+0x292/0x398
   print_address_description+0x73/0x280
   kasan_report.part.2+0x207/0x2f0
   __asan_report_store1_noabort+0x2c/0x30
   kmalloc_uaf+0xaa/0xb6 [test_kasan]
   kmalloc_tests_init+0x4f/0xa48 [test_kasan]
   do_one_initcall+0xf3/0x390
   do_init_module+0x215/0x5d0
   load_module+0x54de/0x82b0
   SYSC_init_module+0x3be/0x430
   SyS_init_module+0x9/0x10
   entry_SYSCALL_64_fastpath+0x1f/0xc2
  RIP: 0033:0x7f22cfd0b9da
  RSP: 002b:00007ffe69118a78 EFLAGS: 00000206 ORIG_RAX: 00000000000000af
  RAX: ffffffffffffffda RBX: 0000555671242090 RCX: 00007f22cfd0b9da
  RDX: 00007f22cffcaf88 RSI: 000000000004df7e RDI: 00007f22d0399000
  RBP: 00007f22cffcaf88 R08: 0000000000000003 R09: 0000000000000000
  R10: 00007f22cfd07d0a R11: 0000000000000206 R12: 0000555671243190
  R13: 000000000001fe81 R14: 0000000000000000 R15: 0000000000000004

  Allocated by task 3951:
   save_stack_trace+0x16/0x20
   save_stack+0x43/0xd0
   kasan_kmalloc+0xad/0xe0
   kmem_cache_alloc_trace+0x82/0x270
   kmalloc_uaf+0x56/0xb6 [test_kasan]
   kmalloc_tests_init+0x4f/0xa48 [test_kasan]
   do_one_initcall+0xf3/0x390
   do_init_module+0x215/0x5d0
   load_module+0x54de/0x82b0
   SYSC_init_module+0x3be/0x430
   SyS_init_module+0x9/0x10
   entry_SYSCALL_64_fastpath+0x1f/0xc2

  Freed by task 3951:
   save_stack_trace+0x16/0x20
   save_stack+0x43/0xd0
   kasan_slab_free+0x72/0xc0
   kfree+0xe8/0x2b0
   kmalloc_uaf+0x85/0xb6 [test_kasan]
   kmalloc_tests_init+0x4f/0xa48 [test_kasan]
   do_one_initcall+0xf3/0x390
   do_init_module+0x215/0x5d0
   load_module+0x54de/0x82b0
   SYSC_init_module+0x3be/0x430
   SyS_init_module+0x9/0x10
   entry_SYSCALL_64_fastpath+0x1f/0xc

  The buggy address belongs to the object at ffff88006aa59da0
   which belongs to the cache kmalloc-16 of size 16
  The buggy address is located 8 bytes inside of
   16-byte region [ffff88006aa59da0ffff88006aa59db0)
  The buggy address belongs to the page:
  page:ffffea0001aa9640 count:1 mapcount:0 mapping:          (null) index:0x0
  flags: 0x100000000000100(slab)
  raw: 0100000000000100 0000000000000000 0000000000000000 0000000180800080
  raw: ffffea0001abe380 0000000700000007 ffff88006c401b40 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff88006aa59c80: 00 00 fc fc 00 00 fc fc 00 00 fc fc 00 00 fc fc
   ffff88006aa59d00: 00 00 fc fc 00 00 fc fc 00 00 fc fc 00 00 fc fc
  >ffff88006aa59d80: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc
                                    ^
   ffff88006aa59e00: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc
   ffff88006aa59e80: fb fb fc fc 00 00 fc fc 00 00 fc fc 00 00 fc fc
  ==================================================================

from:

  ==================================================================
  BUG: KASAN: use-after-free in kmalloc_uaf+0xaa/0xb6 [test_kasan] at addr ffff88006c4dcb28
  Write of size 1 by task insmod/3984
  CPU: 1 PID: 3984 Comm: insmod Tainted: G    B           4.10.0+ #83
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Call Trace:
   dump_stack+0x292/0x398
   kasan_object_err+0x1c/0x70
   kasan_report.part.1+0x20e/0x4e0
   __asan_report_store1_noabort+0x2c/0x30
   kmalloc_uaf+0xaa/0xb6 [test_kasan]
   kmalloc_tests_init+0x4f/0xa48 [test_kasan]
   do_one_initcall+0xf3/0x390
   do_init_module+0x215/0x5d0
   load_module+0x54de/0x82b0
   SYSC_init_module+0x3be/0x430
   SyS_init_module+0x9/0x10
   entry_SYSCALL_64_fastpath+0x1f/0xc2
  RIP: 0033:0x7feca0f779da
  RSP: 002b:00007ffdfeae5218 EFLAGS: 00000206 ORIG_RAX: 00000000000000af
  RAX: ffffffffffffffda RBX: 000055a064c13090 RCX: 00007feca0f779da
  RDX: 00007feca1236f88 RSI: 000000000004df7e RDI: 00007feca1605000
  RBP: 00007feca1236f88 R08: 0000000000000003 R09: 0000000000000000
  R10: 00007feca0f73d0a R11: 0000000000000206 R12: 000055a064c14190
  R13: 000000000001fe81 R14: 0000000000000000 R15: 0000000000000004
  Object at ffff88006c4dcb20, in cache kmalloc-16 size: 16
  Allocated:
  PID = 3984
   save_stack_trace+0x16/0x20
   save_stack+0x43/0xd0
   kasan_kmalloc+0xad/0xe0
   kmem_cache_alloc_trace+0x82/0x270
   kmalloc_uaf+0x56/0xb6 [test_kasan]
   kmalloc_tests_init+0x4f/0xa48 [test_kasan]
   do_one_initcall+0xf3/0x390
   do_init_module+0x215/0x5d0
   load_module+0x54de/0x82b0
   SYSC_init_module+0x3be/0x430
   SyS_init_module+0x9/0x10
   entry_SYSCALL_64_fastpath+0x1f/0xc2
  Freed:
  PID = 3984
   save_stack_trace+0x16/0x20
   save_stack+0x43/0xd0
   kasan_slab_free+0x73/0xc0
   kfree+0xe8/0x2b0
   kmalloc_uaf+0x85/0xb6 [test_kasan]
   kmalloc_tests_init+0x4f/0xa48 [test_kasan]
   do_one_initcall+0xf3/0x390
   do_init_module+0x215/0x5d0
   load_module+0x54de/0x82b0
   SYSC_init_module+0x3be/0x430
   SyS_init_module+0x9/0x10
   entry_SYSCALL_64_fastpath+0x1f/0xc2
  Memory state around the buggy address:
   ffff88006c4dca00: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc
   ffff88006c4dca80: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc
  >ffff88006c4dcb00: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc
                                    ^
   ffff88006c4dcb80: fb fb fc fc 00 00 fc fc fb fb fc fc fb fb fc fc
   ffff88006c4dcc00: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc
  ==================================================================

This patch (of 9):

Introduce get_shadow_bug_type() function, which determines bug type
based on the shadow value for a particular kernel address.  Introduce
get_wild_bug_type() function, which determines bug type for addresses
which don't have a corresponding shadow value.

Link: http://lkml.kernel.org/r/20170302134851.101218-2-andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agomm: hwpoison: call shake_page() after try_to_unmap() for mlocked page
Naoya Horiguchi [Wed, 3 May 2017 21:56:22 +0000 (14:56 -0700)]
mm: hwpoison: call shake_page() after try_to_unmap() for mlocked page

Memory error handler calls try_to_unmap() for error pages in various
states.  If the error page is a mlocked page, error handling could fail
with "still referenced by 1 users" message.  This is because the page is
linked to and stays in lru cache after the following call chain.

  try_to_unmap_one
    page_remove_rmap
      clear_page_mlock
        putback_lru_page
          lru_cache_add

memory_failure() calls shake_page() to hanlde the similar issue, but
current code doesn't cover because shake_page() is called only before
try_to_unmap().  So this patches adds shake_page().

Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace")
Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
Link: http://lkml.kernel.org/r/1493197841-23986-3-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Xiaolong Ye <xiaolong.ye@intel.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agomm: hwpoison: call shake_page() unconditionally
Naoya Horiguchi [Wed, 3 May 2017 21:56:19 +0000 (14:56 -0700)]
mm: hwpoison: call shake_page() unconditionally

shake_page() is called before going into core error handling code in
order to ensure that the error page is flushed from lru_cache lists
where pages stay during transferring among LRU lists.

But currently it's not fully functional because when the page is linked
to lru_cache by calling activate_page(), its PageLRU flag is set and
shake_page() is skipped.  The result is to fail error handling with
"still referenced by 1 users" message.

When the page is linked to lru_cache by isolate_lru_page(), its PageLRU
is clear, so that's fine.

This patch makes shake_page() unconditionally called to avoild the
failure.

Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace")
Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
Link: http://lkml.kernel.org/r/1493197841-23986-2-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Xiaolong Ye <xiaolong.ye@intel.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agomm/swapfile.c: fix swap space leak in error path of swap_free_entries()
Huang Ying [Wed, 3 May 2017 21:56:16 +0000 (14:56 -0700)]
mm/swapfile.c: fix swap space leak in error path of swap_free_entries()

In swapcache_free_entries(), if swap_info_get_cont() returns NULL,
something wrong occurs for the swap entry.  But we should still continue
to free the following swap entries in the array instead of skip them to
avoid swap space leak.  This is just problem in error path, where system
may be in an inconsistent state, but it is still good to fix it.

Link: http://lkml.kernel.org/r/20170421124739.24534-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agomm/gup.c: fix access_ok() argument type
Arnd Bergmann [Wed, 3 May 2017 21:56:12 +0000 (14:56 -0700)]
mm/gup.c: fix access_ok() argument type

MIPS just got changed to only accept a pointer argument for access_ok(),
causing one warning in drivers/scsi/pmcraid.c.  I tried changing x86 the
same way and found the same warning in __get_user_pages_fast() and
nowhere else in the kernel during randconfig testing:

  mm/gup.c: In function '__get_user_pages_fast':
  mm/gup.c:1578:6: error: passing argument 1 of '__chk_range_not_ok' makes pointer from integer without a cast [-Werror=int-conversion]

It would probably be a good idea to enforce type-safety in general, so
let's change this file to not cause a warning if we do that.

I don't know why the warning did not appear on MIPS.

Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
Link: http://lkml.kernel.org/r/20170421162659.3314521-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agomm/truncate: avoid pointless cleancache_invalidate_inode() calls.
Andrey Ryabinin [Wed, 3 May 2017 21:56:09 +0000 (14:56 -0700)]
mm/truncate: avoid pointless cleancache_invalidate_inode() calls.

cleancache_invalidate_inode() called truncate_inode_pages_range() and
invalidate_inode_pages2_range() twice - on entry and on exit.  It's
stupid and waste of time.  It's enough to call it once at exit.

Link: http://lkml.kernel.org/r/20170424164135.22350-5-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agomm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty
Andrey Ryabinin [Wed, 3 May 2017 21:56:06 +0000 (14:56 -0700)]
mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty

If mapping is empty (both ->nrpages and ->nrexceptional is zero) we can
avoid pointless lookups in empty radix tree and bail out immediately
after cleancache invalidation.

Link: http://lkml.kernel.org/r/20170424164135.22350-4-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agofs/block_dev: always invalidate cleancache in invalidate_bdev()
Andrey Ryabinin [Wed, 3 May 2017 21:56:02 +0000 (14:56 -0700)]
fs/block_dev: always invalidate cleancache in invalidate_bdev()

invalidate_bdev() calls cleancache_invalidate_inode() iff ->nrpages != 0
which doen't make any sense.

Make sure that invalidate_bdev() always calls cleancache_invalidate_inode()
regardless of mapping->nrpages value.

Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
Link: http://lkml.kernel.org/r/20170424164135.22350-3-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agofs: fix data invalidation in the cleancache during direct IO
Andrey Ryabinin [Wed, 3 May 2017 21:55:59 +0000 (14:55 -0700)]
fs: fix data invalidation in the cleancache during direct IO

Patch series "Properly invalidate data in the cleancache", v2.

We've noticed that after direct IO write, buffered read sometimes gets
stale data which is coming from the cleancache.  The reason for this is
that some direct write hooks call call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero, so we may not invalidate
data in the cleancache.

Another odd thing is that we check only for ->nrpages and don't check
for ->nrexceptional, but invalidate_inode_pages2[_range] also
invalidates exceptional entries as well.  So we invalidate exceptional
entries only if ->nrpages != 0? This doesn't feel right.

 - Patch 1 fixes direct IO writes by removing ->nrpages check.
 - Patch 2 fixes similar case in invalidate_bdev().
     Note: I only fixed conditional cleancache_invalidate_inode() here.
       Do we also need to add ->nrexceptional check in into invalidate_bdev()?

 - Patches 3-4: some optimizations.

This patch (of 4):

Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero.  This can't be right,
because invalidate_inode_pages2[_range]() also invalidate data in the
cleancache via cleancache_invalidate_inode() call.  So if page cache is
empty but there is some data in the cleancache, buffered read after
direct IO write would get stale data from the cleancache.

Also it doesn't feel right to check only for ->nrpages because
invalidate_inode_pages2[_range] invalidates exceptional entries as well.

Fix this by calling invalidate_inode_pages2[_range]() regardless of
nrpages state.

Note: nfs,cifs,9p doesn't need similar fix because the never call
cleancache_get_page() (nor directly, nor via mpage_readpage[s]()), so
they are not affected by this bug.

Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agozram: reduce load operation in page_same_filled
Sangwoo Park [Wed, 3 May 2017 21:55:56 +0000 (14:55 -0700)]
zram: reduce load operation in page_same_filled

In page_same_filled function, all elements in the page is compared with
next index value.  The current comparison routine compares the (i)th and
(i+1)th values of the page.

In this case, two load operaions occur for each comparison.  But if we
store first value of the page stores at 'val' variable and using it to
compare with others, the load opearation is reduced.  It reduce load
operation per page by up to 64times.

Link: http://lkml.kernel.org/r/1488428104-7257-1-git-send-email-sangwoo2.park@lge.com
Signed-off-by: Sangwoo Park <sangwoo2.park@lge.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agozram: use zram_free_page instead of open-coded
Minchan Kim [Wed, 3 May 2017 21:55:53 +0000 (14:55 -0700)]
zram: use zram_free_page instead of open-coded

The zram_free_page already handles NULL handle case and same page so use
it to reduce error probability.  (Acutaully, I made a mistake when I
handled same page feature)

Link: http://lkml.kernel.org/r/1492052365-16169-7-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agozram: introduce zram data accessor
Minchan Kim [Wed, 3 May 2017 21:55:50 +0000 (14:55 -0700)]
zram: introduce zram data accessor

With element, sometime I got confused handle and element access.  It
might be my bad but I think it's time to introduce accessor to prevent
future idiot like me.  This patch is just clean-up patch so it shouldn't
change any behavior.

Link: http://lkml.kernel.org/r/1492052365-16169-6-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agozram: remove zram_meta structure
Minchan Kim [Wed, 3 May 2017 21:55:47 +0000 (14:55 -0700)]
zram: remove zram_meta structure

It's redundant now.  Instead, remove it and use zram structure directly.

Link: http://lkml.kernel.org/r/1492052365-16169-5-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agozram: use zram_slot_lock instead of raw bit_spin_lock op
Minchan Kim [Wed, 3 May 2017 21:55:44 +0000 (14:55 -0700)]
zram: use zram_slot_lock instead of raw bit_spin_lock op

With this clean-up phase, I want to use zram's wrapper function to lock
table access which is more consistent with other zram's functions.

Link: http://lkml.kernel.org/r/1492052365-16169-4-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agozram: partial IO refactoring
Minchan Kim [Wed, 3 May 2017 21:55:41 +0000 (14:55 -0700)]
zram: partial IO refactoring

For architecture(PAGE_SIZE > 4K), zram have supported partial IO.
However, the mixed code for handling normal/partial IO is too mess,
error-prone to modify IO handler functions with upcoming feature so this
patch aims for cleaning up zram's IO handling functions.

Link: http://lkml.kernel.org/r/1492052365-16169-3-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agozram: handle multiple pages attached bio's bvec
Minchan Kim [Wed, 3 May 2017 21:55:38 +0000 (14:55 -0700)]
zram: handle multiple pages attached bio's bvec

Patch series "zram clean up", v2.

This patchset aims to clean up zram .

[1] clean up multiple pages's bvec handling.
[2] clean up partial IO handling
[3-6] clean up zram via using accessor and removing pointless structure.

With [2-6] applied, we can get a few hundred bytes as well as huge
readibility enhance.

x86: 708 byte save

    add/remove: 1/1 grow/shrink: 0/11 up/down: 478/-1186 (-708)
    function                                     old     new   delta
    zram_special_page_read                         -     478    +478
    zram_reset_device                            317     314      -3
    mem_used_max_store                           131     128      -3
    compact_store                                 96      93      -3
    mm_stat_show                                 203     197      -6
    zram_add                                     719     712      -7
    zram_slot_free_notify                        229     214     -15
    zram_make_request                            819     803     -16
    zram_meta_free                               128     111     -17
    zram_free_page                               180     151     -29
    disksize_store                               432     361     -71
    zram_decompress_page.isra                    504       -    -504
    zram_bvec_rw                                2592    2080    -512
    Total: Before=25350773, After=25350065, chg -0.00%

ppc64: 231 byte save

    add/remove: 2/0 grow/shrink: 1/9 up/down: 681/-912 (-231)
    function                                     old     new   delta
    zram_special_page_read                         -     480    +480
    zram_slot_lock                                 -     200    +200
    vermagic                                      39      40      +1
    mm_stat_show                                 256     248      -8
    zram_meta_free                               200     184     -16
    zram_add                                     944     912     -32
    zram_free_page                               348     308     -40
    disksize_store                               572     492     -80
    zram_decompress_page                         664     564    -100
    zram_slot_free_notify                        292     160    -132
    zram_make_request                           1132    1000    -132
    zram_bvec_rw                                2768    2396    -372
    Total: Before=17565825, After=17565594, chg -0.00%

This patch (of 6):

Johannes Thumshirn reported system goes the panic when using NVMe over
Fabrics loopback target with zram.

The reason is zram expects each bvec in bio contains a single page
but nvme can attach a huge bulk of pages attached to the bio's bvec
so that zram's index arithmetic could be wrong so that out-of-bound
access makes system panic.

[1] in mainline solved solved the problem by limiting max_sectors with
SECTORS_PER_PAGE but it makes zram slow because bio should split with
each pages so this patch makes zram aware of multiple pages in a bvec
so it could solve without any regression(ie, bio split).

[1] 0bc315381fe9, zram: set physical queue limits to avoid array out of
    bounds accesses

Link: http://lkml.kernel.org/r/20170413134057.GA27499@bbox
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Johannes Thumshirn <jthumshirn@suse.de>
Tested-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agomm, page_alloc: remove debug_guardpage_minorder() test in warn_alloc()
Tetsuo Handa [Wed, 3 May 2017 21:55:34 +0000 (14:55 -0700)]
mm, page_alloc: remove debug_guardpage_minorder() test in warn_alloc()

Commit c0a32fc5a2e4 ("mm: more intensive memory corruption debugging")
changed to check debug_guardpage_minorder() > 0 when reporting
allocation failures.  The reasoning was

  When we use guard page to debug memory corruption, it shrinks
  available pages to 1/2, 1/4, 1/8 and so on, depending on parameter
  value. In such case memory allocation failures can be common and
  printing errors can flood dmesg. If somebody debug corruption,
  allocation failures are not the things he/she is interested about.

but this is misguided.

Allocation requests with __GFP_NOWARN flag by definition do not cause
flooding of allocation failure messages.  Allocation requests with
__GFP_NORETRY flag likely also have __GFP_NOWARN flag.  Costly
allocation requests likely also have __GFP_NOWARN flag.

Allocation requests without __GFP_DIRECT_RECLAIM flag likely also have
__GFP_NOWARN flag or __GFP_HIGH flag.  Non-costly allocation requests
with __GFP_DIRECT_RECLAIM flag basically retry forever due to the "too
small to fail" memory-allocation rule.

Therefore, as a whole, shrinking available pages by
debug_guardpage_minorder= kernel boot parameter might cause flooding of
OOM killer messages but unlikely causes flooding of allocation failure
messages.  Let's remove debug_guardpage_minorder() > 0 check which would
likely be pointless.

Link: http://lkml.kernel.org/r/1491910035-4231-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Rafael J . Wysocki" <rafael.j.wysocki@intel.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agomm/memory-failure.c: add page flag description in error paths
Anshuman Khandual [Wed, 3 May 2017 21:55:31 +0000 (14:55 -0700)]
mm/memory-failure.c: add page flag description in error paths

It helps to provide page flag description along with the raw value in
error paths during soft offline process.  From sample experiments

Before the patch:

  soft offline: 0x6100: migration failed 1, type 3ffff800008018
  soft offline: 0x7400: migration failed 1, type 3ffff800008018

After the patch:

  soft offline: 0x5900: migration failed 1, type 3ffff800008018 (uptodate|dirty|head)
  soft offline: 0x6c00: migration failed 1, type 3ffff800008018 (uptodate|dirty|head)

Link: http://lkml.kernel.org/r/20170409023829.10788-1-khandual@linux.vnet.ibm.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agomm/madvise: move up the behavior parameter validation
Anshuman Khandual [Wed, 3 May 2017 21:55:28 +0000 (14:55 -0700)]
mm/madvise: move up the behavior parameter validation

madvise_behavior_valid() should be called before acting upon the
behavior parameter.  Hence move up the function.  This also includes
MADV_SOFT_OFFLINE and MADV_HWPOISON options as valid behavior parameter
for the system call madvise().

Link: http://lkml.kernel.org/r/20170418052844.24891-1-khandual@linux.vnet.ibm.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agomm/madvise.c: clean up MADV_SOFT_OFFLINE and MADV_HWPOISON
Anshuman Khandual [Wed, 3 May 2017 21:55:25 +0000 (14:55 -0700)]
mm/madvise.c: clean up MADV_SOFT_OFFLINE and MADV_HWPOISON

This cleans up handling MADV_SOFT_OFFLINE and MADV_HWPOISON called
through madvise() system call.

* madvise_memory_failure() was misleading to accommodate handling of
  both memory_failure() as well as soft_offline_page() functions.
  Basically it handles memory error injection from user space which
  can go either way as memory failure or soft offline. Renamed as
  madvise_inject_error() instead.

* Renamed struct page pointer 'p' to 'page'.

* pr_info() was essentially printing PFN value but it said 'page'
  which was misleading. Made the process virtual address explicit.

Before the patch:

Soft offlining page 0x15e3e at 0x3fff8c230000
Soft offlining page 0x1f3 at 0x3fffa0da0000
Soft offlining page 0x744 at 0x3fff7d200000
Soft offlining page 0x1634d at 0x3fff95e20000
Soft offlining page 0x16349 at 0x3fff95e30000
Soft offlining page 0x1d6 at 0x3fff9e8b0000
Soft offlining page 0x5f3 at 0x3fff91bd0000

Injecting memory failure for page 0x15c8b at 0x3fff83280000
Injecting memory failure for page 0x16190 at 0x3fff83290000
Injecting memory failure for page 0x740 at 0x3fff9a2e0000
Injecting memory failure for page 0x741 at 0x3fff9a2f0000

After the patch:

Soft offlining pfn 0x1484e at process virtual address 0x3fff883c0000
Soft offlining pfn 0x1484f at process virtual address 0x3fff883d0000
Soft offlining pfn 0x14850 at process virtual address 0x3fff883e0000
Soft offlining pfn 0x14851 at process virtual address 0x3fff883f0000
Soft offlining pfn 0x14852 at process virtual address 0x3fff88400000
Soft offlining pfn 0x14853 at process virtual address 0x3fff88410000
Soft offlining pfn 0x14854 at process virtual address 0x3fff88420000
Soft offlining pfn 0x1521c at process virtual address 0x3fff6bc70000

Injecting memory failure for pfn 0x10fcf at process virtual address 0x3fff86310000
Injecting memory failure for pfn 0x10fd0 at process virtual address 0x3fff86320000
Injecting memory failure for pfn 0x10fd1 at process virtual address 0x3fff86330000
Injecting memory failure for pfn 0x10fd2 at process virtual address 0x3fff86340000
Injecting memory failure for pfn 0x10fd3 at process virtual address 0x3fff86350000
Injecting memory failure for pfn 0x10fd4 at process virtual address 0x3fff86360000
Injecting memory failure for pfn 0x10fd5 at process virtual address 0x3fff86370000

Link: http://lkml.kernel.org/r/20170410084701.11248-1-khandual@linux.vnet.ibm.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agoDocumentation: vm, add hugetlbfs reservation overview
Mike Kravetz [Wed, 3 May 2017 21:55:22 +0000 (14:55 -0700)]
Documentation: vm, add hugetlbfs reservation overview

Adding a brief overview of hugetlbfs reservation design and
implementation as an aid to those making code modifications in this
area.

Link: http://lkml.kernel.org/r/1491586995-13085-1-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agomm, swap: remove unused function prototype
Huang Ying [Wed, 3 May 2017 21:55:19 +0000 (14:55 -0700)]
mm, swap: remove unused function prototype

This is a code cleanup patch, no functionality changes.  There are 2
unused function prototype in swap.h, they are removed.

Link: http://lkml.kernel.org/r/20170405071017.23677-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agomm: memcontrol: use node page state naming scheme for memcg
Johannes Weiner [Wed, 3 May 2017 21:55:16 +0000 (14:55 -0700)]
mm: memcontrol: use node page state naming scheme for memcg

The memory controllers stat function names are awkwardly long and
arbitrarily different from the zone and node stat functions.

The current interface is named:

  mem_cgroup_read_stat()
  mem_cgroup_update_stat()
  mem_cgroup_inc_stat()
  mem_cgroup_dec_stat()
  mem_cgroup_update_page_stat()
  mem_cgroup_inc_page_stat()
  mem_cgroup_dec_page_stat()

This patch renames it to match the corresponding node stat functions:

  memcg_page_state() [node_page_state()]
  mod_memcg_state() [mod_node_state()]
  inc_memcg_state() [inc_node_state()]
  dec_memcg_state() [dec_node_state()]
  mod_memcg_page_state() [mod_node_page_state()]
  inc_memcg_page_state() [inc_node_page_state()]
  dec_memcg_page_state() [dec_node_page_state()]

Link: http://lkml.kernel.org/r/20170404220148.28338-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agomm: memcontrol: re-use node VM page state enum
Johannes Weiner [Wed, 3 May 2017 21:55:13 +0000 (14:55 -0700)]
mm: memcontrol: re-use node VM page state enum

The current duplication is a high-maintenance mess, and it's painful to
add new items or query memcg state from the rest of the VM.

This increases the size of the stat array marginally, but we should aim
to track all these stats on a per-cgroup level anyway.

Link: http://lkml.kernel.org/r/20170404220148.28338-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agomm: memcontrol: re-use global VM event enum
Johannes Weiner [Wed, 3 May 2017 21:55:10 +0000 (14:55 -0700)]
mm: memcontrol: re-use global VM event enum

The current duplication is a high-maintenance mess, and it's painful to
add new items.

This increases the size of the event array, but we'll eventually want
most of the VM events tracked on a per-cgroup basis anyway.

Link: http://lkml.kernel.org/r/20170404220148.28338-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agomm: memcontrol: clean up memory.events counting function
Johannes Weiner [Wed, 3 May 2017 21:55:07 +0000 (14:55 -0700)]
mm: memcontrol: clean up memory.events counting function

We only ever count single events, drop the @nr parameter.  Rename the
function accordingly.  Remove low-information kerneldoc.

Link: http://lkml.kernel.org/r/20170404220148.28338-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7 years agomm: vmscan: fix IO/refault regression in cache workingset transition
Johannes Weiner [Wed, 3 May 2017 21:55:03 +0000 (14:55 -0700)]
mm: vmscan: fix IO/refault regression in cache workingset transition

Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
list") we noticed bigger IO spikes during changes in cache access
patterns.

The patch in question shrunk the inactive list size to leave more room
for the current workingset in the presence of streaming IO.  However,
workingset transitions that previously happened on the inactive list are
now pushed out of memory and incur more refaults to complete.

This patch disables active list protection when refaults are being
observed.  This accelerates workingset transitions, and allows more of
the new set to establish itself from memory, without eating into the
ability to protect the established workingset during stable periods.

The workloads that were measurably affected for us were hit pretty bad
by it, with refault/majfault rates doubling and tripling during cache
transitions, and the machines sustaining half-hour periods of 100% IO
utilization, where they'd previously have sub-minute peaks at 60-90%.

Stateful services that handle user data tend to be more conservative
with kernel upgrades.  As a result we hit most page cache issues with
some delay, as was the case here.

The severity seemed to warrant a stable tag.

Fixes: 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list")
Link: http://lkml.kernel.org/r/20170404220052.27593-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org> [4.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>