  1. May 15, 2019
• ipc: do cyclic id allocation for the ipc object. · 99db46ea
  Manfred Spraul authored
      For ipcmni_extend mode, the sequence number space is only 7 bits.  So
      the chance of id reuse is relatively high compared with the non-extended
      mode.
      
      To alleviate this id reuse problem, this patch enables cyclic allocation
      for the index to the radix tree (idx).  The disadvantage is that this
      can cause a slight slow-down of the fast path, as the radix tree could
      be higher than necessary.
      
      To limit the radix tree height, I have chosen the following limits:
       1) The cycling is done over in_use*1.5.
 2) At a minimum, the cycling is done over:
   "normal" ipcmni mode: RADIX_TREE_MAP_SIZE elements
   "ipcmni_extend" mode: 4096 elements
      
      Result:
      - for normal mode:
      	No change for <= 42 active ipc elements. With more than 42
      	active ipc elements, a 2nd level would be added to the radix
      	tree.
      	Without cyclic allocation, a 2nd level would be added only with
      	more than 63 active elements.
      
      - for extended mode:
	Cycling always creates at least a 2-level radix tree.
	With more than 2730 active objects, a 3rd level is added;
	without cyclic allocation, the 3rd level is added only with
	more than 4095 active objects.
      
      For a 2-level radix tree compared to a 1-level radix tree, I have
      observed < 1% performance impact.
      
      Notes:
1) Normal "x=semget();y=semget();" is unaffected: the idx values are
  then e.g. a and a+1, regardless of whether idr_alloc() or
  idr_alloc_cyclic() is used.
      
2) The -1% happens in a microbenchmark after this situation (a
   minimal userspace sketch follows these notes):
	x=semget();
	for(i=0;i<4000;i++) {t=semget();semctl(t,0,IPC_RMID);}
	y=semget();
	Now perform semop() calls on x and y that do not sleep.
      
      3) The worst-case reuse cycle time is unfortunately unaffected:
         If you have 2^24-1 ipc objects allocated, and get/remove the last
         possible element in a loop, then the id is reused after 128
         get/remove pairs.
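
A minimal userspace sketch of the scenario in note 2 (a hypothetical
harness, not the original benchmark; a wait-for-zero semop() on a
zero-valued semaphore returns immediately, i.e. it does not sleep):

	#include <sys/types.h>
	#include <sys/ipc.h>
	#include <sys/sem.h>

	int main(void)
	{
		struct sembuf op = { .sem_num = 0, .sem_op = 0, .sem_flg = 0 };
		int x, y, t, i;

		x = semget(IPC_PRIVATE, 1, 0600);
		for (i = 0; i < 4000; i++) {
			t = semget(IPC_PRIVATE, 1, 0600);
			semctl(t, 0, IPC_RMID);
		}
		y = semget(IPC_PRIVATE, 1, 0600);

		/* time this loop to see the cost of the extra tree level */
		for (i = 0; i < 10000000; i++)
			semop(i & 1 ? x : y, &op, 1);

		semctl(x, 0, IPC_RMID);
		semctl(y, 0, IPC_RMID);
		return 0;
	}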
      
      Performance check:
A microbenchmark that performs no-op semop() randomly on two IDs,
      with only these two IDs allocated.
      The IDs were set using /proc/sys/kernel/sem_next_id.
      The test was run 5 times, averages are shown.
      
1 & 2: Base (6.22 seconds for 10,000,000 semops)
1 & 40: -0.2%
1 & 3348: -0.8%
1 & 27348: -1.6%
1 & 15777204: -3.2%
      
Or: ~12.6 cpu cycles per additional radix tree level.
The CPU is an Intel i3-5010U. ~1300 cpu cycles/syscall is slower
than what I remember (Spectre impact?).
      
      V2 of the patch:
      - use "min" and "max"
      - use RADIX_TREE_MAP_SIZE * RADIX_TREE_MAP_SIZE instead of
      	(2<<12).
      
      [akpm@linux-foundation.org: fix max() warning]
      Link: http://lkml.kernel.org/r/20190329204930.21620-3-longman@redhat.com
      
      
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Waiman Long <longman@redhat.com>
      Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• ipc: conserve sequence numbers in ipcmni_extend mode · 3278a2c2
  Manfred Spraul authored
      Rewrite, based on the patch from Waiman Long:
      
The mixing of a sequence number into the IPC IDs is probably meant to
avoid ID reuse in userspace as much as possible.  With ipcmni_extend
mode, the number of usable sequence numbers is greatly reduced, leading
to a higher chance of ID reuse.
      
To address this issue, we need to conserve the sequence number space as
much as possible.  Right now, the sequence number is incremented for
every new ID created.  In reality, we only need to increment the
sequence number when the newly allocated ID is not greater than the
last one allocated; only in that case can the new ID collide with an
existing one.  This is done irrespective of the ipcmni mode.
      
      In order to avoid any races, the index is first allocated and then the
      pointer is replaced.
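
A sketch of the resulting allocation sequence (field names such as
last_idx are illustrative):

	/* allocate the index first, with a placeholder entry ... */
	idx = idr_alloc(&ids->ipcs_idr, NULL, 0, 0, GFP_NOWAIT);
	if (idx >= 0) {
		/* ... bump the sequence number only on wrap-around ... */
		if (idx <= ids->last_idx) {
			ids->seq++;
			if (ids->seq >= ipcmni_seq_max())
				ids->seq = 0;
		}
		ids->last_idx = idx;
		new->seq = ids->seq;
		/* ... and only then publish the object */
		idr_replace(&ids->ipcs_idr, new, idx);
	}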
      
      Changes compared to the initial patch:
       - Handle failures from idr_alloc().
       - Avoid that concurrent operations can see the wrong sequence number.
         (This is achieved by using idr_replace()).
       - IPCMNI_SEQ_SHIFT is not a constant, thus renamed to
         ipcmni_seq_shift().
       - IPCMNI_SEQ_MAX is not a constant, thus renamed to ipcmni_seq_max().
      
      Link: http://lkml.kernel.org/r/20190329204930.21620-2-longman@redhat.com
      
      
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
      Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• ipc: allow boot time extension of IPCMNI from 32k to 16M · 5ac893b8
  Waiman Long authored
      The maximum number of unique System V IPC identifiers was limited to
      32k.  That limit should be big enough for most use cases.
      
However, there are some users out there requesting more, especially
those migrating from Solaris, which uses 24 bits for unique
identifiers.  To satisfy their needs, a new boot time kernel option
"ipcmni_extend" is added to extend the IPCMNI value to 16M.  This is a
512X increase, which should be big enough for users who need a large
number of unique IPC identifiers.
      
The use of this new option will change the pattern of the IPC
identifiers returned by functions like shmget(2).  An application that
depends on such a pattern may not work properly, so this option should
only be used if users really need more than 32k unique IPC numbers.
      
      This new option does have the side effect of reducing the maximum number
      of unique sequence numbers from 64k down to 128.  So it is a trade-off.
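
For illustration, an IPC id packs the sequence number above the index;
the shift grows from 15 to 24 bits with ipcmni_extend (a sketch, with
helper naming following the companion patches rather than this one):

	#define IPCMNI_SHIFT		15	/* 32k ids, 64k sequence numbers */
	#define IPCMNI_EXTEND_SHIFT	24	/* 16M ids, 128 sequence numbers */

	/* id = (seq << shift) + idx, with the shift picked at boot */
	new->id = (new->seq << ipcmni_seq_shift()) + idx;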
      
      The computation of a new IPC id is not done in the performance critical
      path.  So a little bit of additional overhead shouldn't have any real
      performance impact.
      
      Link: http://lkml.kernel.org/r/20190329204930.21620-1-longman@redhat.com
      
      
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• ipc/mqueue: optimize msg_get() · a5091fda
  Davidlohr Bueso authored
Our msg priorities became an rbtree as of d6629859 ("ipc/mqueue:
improve performance of send/recv").  However, consuming a msg in
msg_get() remains logarithmic (still better than the situation before,
of course).  By applying the well-known technique of caching a pointer
we can find the node with the highest priority in O(1), which is
especially nice for the RT cases.  Furthermore, some callers can call
msg_get() in a loop.
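
A sketch of the caching idea (the field name msg_tree_rightmost is
illustrative):

	/* on insert: a node added at the right edge of the rbtree is
	 * the new highest-priority entry, so remember it */
	if (rightmost)
		info->msg_tree_rightmost = &leaf->rb_node;

	/* in msg_get(): the best node in O(1) instead of a tree walk */
	parent = info->msg_tree_rightmost;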
      
A new msg_tree_erase() helper is also added to encapsulate the tree
removal and node_cache handling.  Passes the LTP mq testcases.
      
      Link: http://lkml.kernel.org/r/20190321190216.1719-2-dave@stgolabs.net
      
      
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• ipc/mqueue: remove redundant wq task assignment · 0ecb5821
  Davidlohr Bueso authored
We already store the current task for the new waiter before calling
wq_sleep() in both the send and recv paths.  Trivially remove the
redundant assignment.
      
      Link: http://lkml.kernel.org/r/20190321190216.1719-1-dave@stgolabs.net
      
      
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• ipc: prevent lockup on alloc_msg and free_msg · d6a2946a
  Li Rongqing authored
The msgctl10 test of LTP triggers the following lockup when
CONFIG_KASAN is enabled on large memory SMP systems: page
initialization can take a long time if msgctl10 requests a huge block
of memory, and it blocks the rcu scheduler, so release the cpu
actively.

After adding schedule() to free_msg(), free_msg() can no longer be
called while holding a spinlock, so add the msgs to a tmp list and free
them outside the spinlock.
      
        rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
        rcu:     Tasks blocked on level-1 rcu_node (CPUs 16-31): P32505
        rcu:     Tasks blocked on level-1 rcu_node (CPUs 48-63): P34978
        rcu:     (detected by 11, t=35024 jiffies, g=44237529, q=16542267)
        msgctl10        R  running task    21608 32505   2794 0x00000082
        Call Trace:
         preempt_schedule_irq+0x4c/0xb0
         retint_kernel+0x1b/0x2d
        RIP: 0010:__is_insn_slot_addr+0xfb/0x250
        Code: 82 1d 00 48 8b 9b 90 00 00 00 4c 89 f7 49 c1 ee 03 e8 59 83 1d 00 48 b8 00 00 00 00 00 fc ff df 4c 39 eb 48 89 9d 58 ff ff ff <41> c6 04 06 f8 74 66 4c 8d 75 98 4c 89 f1 48 c1 e9 03 48 01 c8 48
        RSP: 0018:ffff88bce041f758 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
        RAX: dffffc0000000000 RBX: ffffffff8471bc50 RCX: ffffffff828a2a57
        RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88bce041f780
        RBP: ffff88bce041f828 R08: ffffed15f3f4c5b3 R09: ffffed15f3f4c5b3
        R10: 0000000000000001 R11: ffffed15f3f4c5b2 R12: 000000318aee9b73
        R13: ffffffff8471bc50 R14: 1ffff1179c083ef0 R15: 1ffff1179c083eec
         kernel_text_address+0xc1/0x100
         __kernel_text_address+0xe/0x30
         unwind_get_return_address+0x2f/0x50
         __save_stack_trace+0x92/0x100
         create_object+0x380/0x650
         __kmalloc+0x14c/0x2b0
         load_msg+0x38/0x1a0
         do_msgsnd+0x19e/0xcf0
         do_syscall_64+0x117/0x400
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
        rcu:     Tasks blocked on level-1 rcu_node (CPUs 0-15): P32170
        rcu:     (detected by 14, t=35016 jiffies, g=44237525, q=12423063)
        msgctl10        R  running task    21608 32170  32155 0x00000082
        Call Trace:
         preempt_schedule_irq+0x4c/0xb0
         retint_kernel+0x1b/0x2d
        RIP: 0010:lock_acquire+0x4d/0x340
        Code: 48 81 ec c0 00 00 00 45 89 c6 4d 89 cf 48 8d 6c 24 20 48 89 3c 24 48 8d bb e4 0c 00 00 89 74 24 0c 48 c7 44 24 20 b3 8a b5 41 <48> c1 ed 03 48 c7 44 24 28 b4 25 18 84 48 c7 44 24 30 d0 54 7a 82
        RSP: 0018:ffff88af83417738 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
        RAX: dffffc0000000000 RBX: ffff88bd335f3080 RCX: 0000000000000002
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88bd335f3d64
        RBP: ffff88af83417758 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000001 R11: ffffed13f3f745b2 R12: 0000000000000000
        R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
         is_bpf_text_address+0x32/0xe0
         kernel_text_address+0xec/0x100
         __kernel_text_address+0xe/0x30
         unwind_get_return_address+0x2f/0x50
         __save_stack_trace+0x92/0x100
         save_stack+0x32/0xb0
         __kasan_slab_free+0x130/0x180
         kfree+0xfa/0x2d0
         free_msg+0x24/0x50
         do_msgrcv+0x508/0xe60
         do_syscall_64+0x117/0x400
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Davidlohr said:
       "So after releasing the lock, the msg rbtree/list is empty and new
        calls will not see those in the newly populated tmp_msg list, and
        therefore they cannot access the delayed msg freeing pointers, which
        is good. Also the fact that the node_cache is now freed before the
        actual messages seems to be harmless as this is wanted for
        msg_insert() avoiding GFP_ATOMIC allocations, and after releasing the
        info->lock the thing is freed anyway so it should not change things"
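
A sketch of the resulting shape of the fix (lock and list names
simplified for illustration):

	struct msg_msg *msg, *n;
	LIST_HEAD(tmp_msg);

	spin_lock(&info->lock);
	/* detach all queued messages while holding the lock ... */
	list_splice_init(&queue, &tmp_msg);
	spin_unlock(&info->lock);

	/* ... and free them afterwards; free_msg() may now resched */
	list_for_each_entry_safe(msg, n, &tmp_msg, m_list)
		free_msg(msg);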
      
      Link: http://lkml.kernel.org/r/1552029161-4957-1-git-send-email-lirongqing@baidu.com
      
      
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. May 02, 2019
  3. Apr 08, 2019
• rhashtable: use bit_spin_locks to protect hash bucket. · 8f0db018
  NeilBrown authored
      
      This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
      bucket pointer to lock the hash chain for that bucket.
      
      The benefits of a bit spin_lock are:
       - no need to allocate a separate array of locks.
       - no need to have a configuration option to guide the
         choice of the size of this array
       - locking cost is often a single test-and-set in a cache line
         that will have to be loaded anyway.  When inserting at, or removing
         from, the head of the chain, the unlock is free - writing the new
         address in the bucket head implicitly clears the lock bit.
         For __rhashtable_insert_fast() we ensure this always happens
         when adding a new key.
 - even when locking costs 2 updates (lock and unlock), they are
   in a cacheline that needs to be read anyway.
      
      The cost of using a bit spin_lock is a little bit of code complexity,
      which I think is quite manageable.
      
      Bit spin_locks are sometimes inappropriate because they are not fair -
if multiple CPUs repeatedly contend for the same lock, one CPU can
      easily be starved.  This is not a credible situation with rhashtable.
      Multiple CPUs may want to repeatedly add or remove objects, but they
      will typically do so at different buckets, so they will attempt to
      acquire different locks.
      
      As we have more bit-locks than we previously had spinlocks (by at
      least a factor of two) we can expect slightly less contention to
      go with the slightly better cache behavior and reduced memory
      consumption.
      
To enhance type checking, a new struct is introduced to represent the
pointer plus lock-bit that is stored in the bucket table.  This is
"struct rhash_lock_head" and is empty.  A pointer to it needs to be
cast to either an unsigned long or a "struct rhash_head *" to be
useful.  Variables of this type are most often called "bkt".
      
      Previously "pprev" would sometimes point to a bucket, and sometimes a
      ->next pointer in an rhash_head.  As these are now different types,
      pprev is NULL when it would have pointed to the bucket. In that case,
'bkt' is used, together with the correct locking protocol.
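
A sketch of the helpers this implies (close to, but not necessarily
identical to, what the patch adds):

	#include <linux/bit_spinlock.h>

	static inline void rht_lock(struct rhash_lock_head **bkt)
	{
		/* BIT(1) of the bucket pointer is the chain lock */
		bit_spin_lock(1, (unsigned long *)bkt);
	}

	static inline void rht_unlock(struct rhash_lock_head **bkt)
	{
		bit_spin_unlock(1, (unsigned long *)bkt);
	}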
      
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
  4. Mar 08, 2019
  5. Feb 28, 2019
• ipc: Convert mqueue fs to fs_context · 935c6912
  David Howells authored
      
      Convert the mqueue filesystem to use the filesystem context stuff.
      
      Notes:
      
(1) The relevant ipc namespace is selected when the context is
           initialised (and it defaults to the current task's ipc namespace).
           The caller can override this before calling vfs_get_tree().
      
       (2) Rather than simply calling kern_mount_data(), mq_init_ns() and
           mq_internal_mount() create a context, adjust it and then do the rest
           of the mount procedure.
      
       (3) The lazy mqueue mounting on creation of a new namespace is retained
           from a previous patch, but the avoidance of sget() if no superblock
           yet exists is reverted and the superblock is again keyed on the
           namespace pointer.
      
           Yes, there was a performance gain in not searching the superblock
hash, but that cost is only paid once per ipc namespace - and only if someone
           uses mqueue within that namespace, so I'm not sure it's worth it,
           especially as calling sget() allows avoidance of recursion.
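
A sketch of the pattern in note (2) (the mqueue_fs_context layout is an
assumption for illustration):

	struct fs_context *fc;
	struct mqueue_fs_context *ctx;
	struct vfsmount *mnt;

	fc = fs_context_for_mount(&mqueue_fs_type, SB_KERNMOUNT);
	if (IS_ERR(fc))
		return ERR_CAST(fc);

	/* override the default (current task's) ipc namespace */
	ctx = fc->fs_private;
	put_ipc_ns(ctx->ipc_ns);
	ctx->ipc_ns = get_ipc_ns(ns);

	mnt = fc_mount(fc);
	put_fs_context(fc);
	return mnt;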
      
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  6. Feb 06, 2019
• y2038: syscalls: rename y2038 compat syscalls · 8dabe724
  Arnd Bergmann authored
      
      A lot of system calls that pass a time_t somewhere have an implementation
      using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
      been reworked so that this implementation can now be used on 32-bit
      architectures as well.
      
      The missing step is to redefine them using the regular SYSCALL_DEFINEx()
      to get them out of the compat namespace and make it possible to build them
      on 32-bit architectures.
      
      Any system call that ends in 'time' gets a '32' suffix on its name for
      that version, while the others get a '_time32' suffix, to distinguish
      them from the normal version, which takes a 64-bit time argument in the
      future.
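
For example (sketched; the exact signatures follow the types introduced
earlier in the series):

	/* 'clock_gettime' ends in "time": it gets a plain '32' suffix */
	SYSCALL_DEFINE2(clock_gettime32, clockid_t, which_clock,
			struct old_timespec32 __user *, tp)

	/* 'nanosleep' does not: it gets a '_time32' suffix */
	SYSCALL_DEFINE2(nanosleep_time32, struct old_timespec32 __user *, rqtp,
			struct old_timespec32 __user *, rmtp)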
      
In this step, only 64-bit architectures are changed; doing this rename
first lets us avoid touching the 32-bit architectures twice.
      
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  7. Jan 25, 2019
• ipc: rename old-style shmctl/semctl/msgctl syscalls · 275f2214
  Arnd Bergmann authored
      
      The behavior of these system calls is slightly different between
      architectures, as determined by the CONFIG_ARCH_WANT_IPC_PARSE_VERSION
      symbol. Most architectures that implement the split IPC syscalls don't set
      that symbol and only get the modern version, but alpha, arm, microblaze,
      mips-n32, mips-n64 and xtensa expect the caller to pass the IPC_64 flag.
      
      For the architectures that so far only implement sys_ipc(), i.e. m68k,
      mips-o32, powerpc, s390, sh, sparc, and x86-32, we want the new behavior
      when adding the split syscalls, so we need to distinguish between the
      two groups of architectures.
      
      The method I picked for this distinction is to have a separate system call
      entry point: sys_old_*ctl() now uses ipc_parse_version, while sys_*ctl()
      does not. The system call tables of the five architectures are changed
      accordingly.
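
A sketch of the two semctl entry points (the shared ksys_semctl()
backend is an assumption for illustration):

	/* modern variant: always IPC_64 semantics, no version parsing */
	SYSCALL_DEFINE4(semctl, int, semid, int, semnum, int, cmd,
			unsigned long, arg)
	{
		return ksys_semctl(semid, semnum, cmd, arg, IPC_64);
	}

	/* old variant: the caller selects the version via the cmd flag */
	SYSCALL_DEFINE4(old_semctl, int, semid, int, semnum, int, cmd,
			unsigned long, arg)
	{
		return ksys_semctl(semid, semnum, cmd, arg,
				   ipc_parse_version(&cmd));
	}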
      
      As an additional benefit, we no longer need the configuration specific
      definition for ipc_parse_version(), it always does the same thing now,
      but simply won't get called on architectures with the modern interface.
      
      A small downside is that on architectures that do set
      ARCH_WANT_IPC_PARSE_VERSION, we now have an extra set of entry points
      that are never called. They only add a few bytes of bloat, so it seems
better to keep them than to add yet another Kconfig symbol.
      I considered adding new syscall numbers for the IPC_64 variants for
      consistency, but decided against that for now.
      
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  8. Jan 18, 2019
  9. Oct 31, 2018
  10. Oct 05, 2018
  11. Oct 03, 2018
• signal: Distinguish between kernel_siginfo and siginfo · ae7795bc
  Eric W. Biederman authored
      
      Linus recently observed that if we did not worry about the padding
      member in struct siginfo it is only about 48 bytes, and 48 bytes is
      much nicer than 128 bytes for allocating on the stack and copying
      around in the kernel.
      
      The obvious thing of only adding the padding when userspace is
      including siginfo.h won't work as there are sigframe definitions in
      the kernel that embed struct siginfo.
      
So split siginfo in two: kernel_siginfo and siginfo, keeping the
traditional name for the userspace definition, while the version that
is used internally in the kernel, and that ultimately will not be
padded to 128 bytes, is called kernel_siginfo.
      
I have put the definition of struct kernel_siginfo in include/signal_types.h.
      
      A set of buildtime checks has been added to verify the two structures have
      the same field offsets.
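
A sketch of such a check (the macro name is illustrative):

	#define CHECK_OFFSET(field)					\
		BUILD_BUG_ON(offsetof(siginfo_t, field) !=		\
			     offsetof(kernel_siginfo_t, field))

	CHECK_OFFSET(si_signo);
	CHECK_OFFSET(si_errno);
	CHECK_OFFSET(si_code);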
      
      To make it easy to verify the change kernel_siginfo retains the same
      size as siginfo.  The reduction in size comes in a following change.
      
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  12. Sep 04, 2018
  13. Aug 27, 2018
• y2038: globally rename compat_time to old_time32 · 9afc5eee
  Arnd Bergmann authored
      
      Christoph Hellwig suggested a slightly different path for handling
      backwards compatibility with the 32-bit time_t based system calls:
      
      Rather than simply reusing the compat_sys_* entry points on 32-bit
      architectures unchanged, we get rid of those entry points and the
      compat_time types by renaming them to something that makes more sense
      on 32-bit architectures (which don't have a compat mode otherwise),
      and then share the entry points under the new name with the 64-bit
      architectures that use them for implementing the compatibility.
      
      The following types and interfaces are renamed here, and moved
      from linux/compat_time.h to linux/time32.h:
      
      old				new
      ---				---
      compat_time_t			old_time32_t
      struct compat_timeval		struct old_timeval32
      struct compat_timespec		struct old_timespec32
      struct compat_itimerspec	struct old_itimerspec32
      ns_to_compat_timeval()		ns_to_old_timeval32()
      get_compat_itimerspec64()	get_old_itimerspec32()
      put_compat_itimerspec64()	put_old_itimerspec32()
      compat_get_timespec64()		get_old_timespec32()
      compat_put_timespec64()		put_old_timespec32()
      
      As we already have aliases in place, this patch addresses only the
      instances that are relevant to the system call interface in particular,
      not those that occur in device drivers and other modules. Those
      will get handled separately, while providing the 64-bit version
      of the respective interfaces.
      
      I'm not renaming the timex, rusage and itimerval structures, as we are
      still debating what the new interface will look like, and whether we
      will need a replacement at all.
      
      This also doesn't change the names of the syscall entry points, which can
      be done more easily when we actually switch over the 32-bit architectures
      to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to
      SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.
      
Suggested-by: Christoph Hellwig <hch@infradead.org>
      Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
      
      
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  14. Aug 22, 2018
  15. Aug 02, 2018
• ipc/shm.c add ->pagesize function to shm_vm_ops · eec3636a
  Jane Chu authored
      Commit 05ea8860 ("mm, hugetlbfs: introduce ->pagesize() to
      vm_operations_struct") adds a new ->pagesize() function to
      hugetlb_vm_ops, intended to cover all hugetlbfs backed files.
      
With the System V shared memory model, if "huge page" is specified, the
"shared memory" is backed by hugetlbfs files, but the mappings initiated
via shmget/shmat have their original vm_ops overwritten with shm_vm_ops,
so we need to add a ->pagesize function to shm_vm_ops.  Otherwise,
vma_kernel_pagesize() returns PAGE_SIZE for a hugetlbfs backed vma,
hitting the BUG_ON below:
      
        fs/hugetlbfs/inode.c
              443             if (unlikely(page_mapped(page))) {
              444                     BUG_ON(truncate_op);
      
      resulting in
      
        hugetlbfs: oracle (4592): Using mlock ulimits for SHM_HUGETLB is deprecated
        ------------[ cut here ]------------
        kernel BUG at fs/hugetlbfs/inode.c:444!
        Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 ...
        CPU: 35 PID: 5583 Comm: oracle_5583_sbt Not tainted 4.14.35-1829.el7uek.x86_64 #2
        RIP: 0010:remove_inode_hugepages+0x3db/0x3e2
        ....
        Call Trace:
          hugetlbfs_evict_inode+0x1e/0x3e
          evict+0xdb/0x1af
          iput+0x1a2/0x1f7
          dentry_unlink_inode+0xc6/0xf0
          __dentry_kill+0xd8/0x18d
          dput+0x1b5/0x1ed
          __fput+0x18b/0x216
          ____fput+0xe/0x10
          task_work_run+0x90/0xa7
          exit_to_usermode_loop+0xdd/0x116
          do_syscall_64+0x187/0x1ae
          entry_SYSCALL_64_after_hwframe+0x150/0x0
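
A sketch of the hook this adds (very close to the description above,
but treat it as an illustration rather than the exact patch):

	static unsigned long shm_pagesize(struct vm_area_struct *vma)
	{
		struct file *file = vma->vm_file;
		struct shm_file_data *sfd = shm_file_data(file);

		/* forward to the backing file's vm_ops (hugetlbfs) */
		if (sfd->vm_ops->pagesize)
			return sfd->vm_ops->pagesize(vma);
		return PAGE_SIZE;
	}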
      
      [jane.chu@oracle.com: relocate comment]
        Link: http://lkml.kernel.org/r/20180731044831.26036-1-jane.chu@oracle.com
      Link: http://lkml.kernel.org/r/20180727211727.5020-1-jane.chu@oracle.com
      
      
      Fixes: 05ea8860 ("mm, hugetlbfs: introduce ->pagesize() to vm_operations_struct")
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. Jul 27, 2018
  17. Jul 12, 2018
  18. Jun 22, 2018
• rhashtable: split rhashtable.h · 0eb71a9d
  NeilBrown authored
      
Due to the use of rhashtables in net namespaces,
rhashtable.h is included in much of the kernel,
so a small change can require a large recompilation.
This makes development painful.
      
This patch splits out rhashtable-types.h, which contains just the
major type declarations and no (non-trivial) inline code.
rhashtable.h is no longer included by anything in the include/
directory.  Common include files only include rhashtable-types.h, so a
large recompilation is only triggered when that changes.
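
The intended include pattern, for illustration (the embedding struct is
hypothetical):

	/* in a widely shared header: only embeds the structure */
	#include <linux/rhashtable-types.h>

	struct netns_example {
		struct rhashtable table;
	};

	/* in the one .c file that operates on the table */
	#include <linux/rhashtable.h>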
      
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
  19. Jun 14, 2018
  20. Jun 12, 2018
• treewide: kvmalloc() -> kvmalloc_array() · 344476e1
  Kees Cook authored
      
      The kvmalloc() function has a 2-factor argument form, kvmalloc_array(). This
      patch replaces cases of:
      
              kvmalloc(a * b, gfp)
      
      with:
        kvmalloc_array(a, b, gfp)
      
      as well as handling cases of:
      
              kvmalloc(a * b * c, gfp)
      
      with:
      
              kvmalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kvmalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kvmalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
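
The point of the 2-factor form is overflow safety: kvmalloc_array()
returns NULL if count * size would overflow, instead of silently
under-allocating.  A sketch of the difference:

	arr = kvmalloc(count * sizeof(*arr), GFP_KERNEL);      /* may wrap */
	arr = kvmalloc_array(count, sizeof(*arr), GFP_KERNEL); /* checked */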
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kvmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kvmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kvmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kvmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kvmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kvmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kvmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kvmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kvmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kvmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kvmalloc
      + kvmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kvmalloc
      + kvmalloc_array
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kvmalloc
      + kvmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kvmalloc
      + kvmalloc_array
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kvmalloc
      + kvmalloc_array
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kvmalloc
      + kvmalloc_array
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kvmalloc
      + kvmalloc_array
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kvmalloc
      + kvmalloc_array
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kvmalloc
      + kvmalloc_array
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kvmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kvmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kvmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kvmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kvmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kvmalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kvmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kvmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kvmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kvmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kvmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kvmalloc(C1 * C2 * C3, ...)
      |
        kvmalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kvmalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kvmalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kvmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kvmalloc(sizeof(THING) * C2, ...)
      |
        kvmalloc(sizeof(TYPE) * C2, ...)
      |
        kvmalloc(C1 * C2 * C3, ...)
      |
        kvmalloc(C1 * C2, ...)
      |
      - kvmalloc
      + kvmalloc_array
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kvmalloc
      + kvmalloc_array
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kvmalloc
      + kvmalloc_array
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kvmalloc
      + kvmalloc_array
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kvmalloc
      + kvmalloc_array
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kvmalloc
      + kvmalloc_array
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kvmalloc
      + kvmalloc_array
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      
Signed-off-by: Kees Cook <keescook@chromium.org>
  21. May 26, 2018