  1. May 05, 2009
      Ignore madvise(MADV_WILLNEED) for hugetlbfs-backed regions · a425a638
      Mel Gorman authored
      
      madvise(MADV_WILLNEED) forces page cache readahead on a range of memory
      backed by a file.  The assumption is made that the page required is
      order-0 and "normal" page cache.
      
      On hugetlbfs this assumption is not true: order-0 pages are
      allocated and inserted into the hugetlbfs page cache.  This leaks
      hugetlbfs page reservations and can trigger BUGs related to
      corrupted page tables.
      
      This patch causes MADV_WILLNEED to be ignored for hugetlbfs-backed
      regions.
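
      A minimal sketch of the kind of guard this implies, using the existing
      is_vm_hugetlb_page() helper (the placement and return value here are
      illustrative; the real patch may differ):

          /* In madvise_willneed() or an equivalent madvise path: */
          if (is_vm_hugetlb_page(vma)) {
                  /*
                   * Readahead assumes order-0 page cache pages.  On a
                   * hugetlbfs VMA it would allocate and insert small
                   * pages, leaking huge page reservations.
                   */
                  return -EIO;
          }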
      
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. May 02, 2009
      vmscan: avoid multiplication overflow in shrink_zone() · 8713e012
      Andrew Morton authored
      
      Local variable `scan' can overflow on zones which are larger than
      
      	(2G * 4k) / 100 = 80GB.
      
      Making it 64-bit on 64-bit platforms fixes that up.
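
      To see the arithmetic, an illustrative user-space demo (not the kernel
      code itself; the variable names are made up):

          #include <stdio.h>

          int main(void)
          {
                  /* ~164GB of 4KB pages; the kernel's signed case wraps
                   * even earlier, at the ~80GB noted above */
                  unsigned int pages = 43000000;

                  unsigned int scan32 = pages * 100 / 100;    /* wraps mod 2^32 */
                  unsigned long long scan64 =
                          (unsigned long long)pages * 100 / 100;  /* fits */

                  printf("32-bit: %u, 64-bit: %llu\n", scan32, scan64);
                  return 0;
          }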
      
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: fix Committed_AS underflow on large NR_CPUS environment · 00a62ce9
      KOSAKI Motohiro authored
      
      The Committed_AS field can underflow in certain situations:
      
      >         # while true; do cat /proc/meminfo  | grep _AS; sleep 1; done | uniq -c
      >               1 Committed_AS: 18446744073709323392 kB
      >              11 Committed_AS: 18446744073709455488 kB
      >               6 Committed_AS:    35136 kB
      >               5 Committed_AS: 18446744073709454400 kB
      >               7 Committed_AS:    35904 kB
      >               3 Committed_AS: 18446744073709453248 kB
      >               2 Committed_AS:    34752 kB
      >               9 Committed_AS: 18446744073709453248 kB
      >               8 Committed_AS:    34752 kB
      >               3 Committed_AS: 18446744073709320960 kB
      >               7 Committed_AS: 18446744073709454080 kB
      >               3 Committed_AS: 18446744073709320960 kB
      >               5 Committed_AS: 18446744073709454080 kB
      >               6 Committed_AS: 18446744073709320960 kB
      
      This happens because NR_CPUS can be greater than 1000 and
      meminfo_proc_show() does not check for underflow.

      Scaling by NR_CPUS is also the wrong calculation: in general, the
      likelihood of lock contention is proportional to the number of online
      cpus, not the theoretical maximum (NR_CPUS).

      The kernel already has generic percpu-counter infrastructure; using
      it is the right approach.  It simplifies the code, and
      percpu_counter_read_positive() cannot produce the underflow.
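
      A sketch of the approach (simplified; the generic API calls are real,
      but the committed_pages() helper and the surrounding wiring here are
      illustrative):

          #include <linux/percpu_counter.h>

          static struct percpu_counter vm_committed_as;

          void vm_acct_memory(long pages)
          {
                  percpu_counter_add(&vm_committed_as, pages);
          }

          /* meminfo then reads the clamped value: transient negative sums
           * report as 0 instead of wrapping to a huge unsigned number. */
          static unsigned long committed_pages(void)
          {
                  return percpu_counter_read_positive(&vm_committed_as);
          }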
      
      Reported-by: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Eric B Munson <ebmunson@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: <stable@kernel.org>		[All kernel versions]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      memcg: fix mem_cgroup_shrink_usage() · ae3abae6
      Daisuke Nishimura authored
      
      Current mem_cgroup_shrink_usage() has two problems.
      
      1. It doesn't call mem_cgroup_out_of_memory and doesn't update
         last_oom_jiffies, so pagefault_out_of_memory invokes global OOM.
      
      2. Considering hierarchy, shrinking has to be done from the
         mem_over_limit, not from the memcg which the page would be charged to.
      
      mem_cgroup_try_charge_swapin() does all of these things properly, so we
      use it and call cancel_charge_swapin() when it succeeds.

      The name "shrink_usage" is not appropriate for this behavior, so we
      rename it as well.
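
      A sketch of the reworked call sequence (simplified):

          /* Charge via the swap-in path so that OOM handling and the
           * hierarchical mem_over_limit target are handled properly. */
          ret = mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &memcg);
          if (!ret)
                  /* We only wanted the reclaim side effects: undo it. */
                  mem_cgroup_cancel_charge_swapin(memcg);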
      
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.cn>
      Cc: Paul Menage <menage@google.com>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: close page_mkwrite races · b827e496
      Nicholas Piggin authored
      Change page_mkwrite to allow implementations to return with the page
      locked, and also change its callers (in page fault paths) to hold the
      lock until the page is marked dirty.  This allows the filesystem to have
      full control of page dirtying events coming from the VM.
      
      Rather than simply hold the page locked over the page_mkwrite call, we
      call page_mkwrite with the page unlocked and allow callers to return with
      it locked, so filesystems can avoid LOR conditions with page lock.
      
      The problem with the current scheme is this: a filesystem that wants to
      associate some metadata with a page as long as the page is dirty, will
      perform this manipulation in its ->page_mkwrite.  It currently then must
      return with the page unlocked and may not hold any other locks (according
      to existing page_mkwrite convention).
      
      In this window, the VM could write out the page, clearing page-dirty.  The
      filesystem has no good way to detect that a dirty pte is about to be
      attached, so it will happily write out the page, at which point, the
      filesystem may manipulate the metadata to reflect that the page is no
      longer dirty.
      
      It is not always possible to perform the required metadata manipulation in
      ->set_page_dirty, because that function cannot block or fail.  The
      filesystem may need to allocate some data structure, for example.
      
      And the VM cannot mark the pte dirty before page_mkwrite, because
      page_mkwrite is allowed to fail, so we must not allow any window where the
      page could be written to if page_mkwrite does fail.
      
      This solution of holding the page locked over the 3 critical operations
      (page_mkwrite, setting the pte dirty, and finally setting the page dirty)
      closes out races nicely, preventing page cleaning for writeout being
      initiated in that window.  This provides the filesystem with a strong
      synchronisation against the VM here.
      
      - Sage needs this race closed for ceph filesystem.
      - Trond needs it for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
      - I need it for fsblock.
      - I suspect other filesystems may need it too (eg. btrfs).
      - I have converted buffer.c to the new locking. Even simple block allocation
        under dirty pages might be susceptible to i_size changing under partial page
        at the end of file (we also have a buffer.c-side problem here, but it cannot
        be fixed properly without this patch).
      - Other filesystems (eg. NFS, maybe btrfs) will need to change their
        page_mkwrite functions themselves.
      
      [ This also moves page_mkwrite another step closer to fault, which should
        eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
        filesystem calldown and page lock/unlock cycle in __do_fault. ]
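
      A sketch of the resulting fault-path convention (simplified):

          ret = vma->vm_ops->page_mkwrite(vma, &vmf);
          if (!(ret & VM_FAULT_LOCKED))
                  lock_page(page);
          /* ... make the pte writable and dirty under the page lock ... */
          set_page_dirty(page);
          unlock_page(page);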
      
      [akpm@linux-foundation.org: fix derefs of NULL ->mapping]
      Cc: Sage Weil <sage@newdream.net>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      memcg: fix try_get_mem_cgroup_from_swapcache() · c0bd3f63
      Daisuke Nishimura authored
      
      This is a bugfix for commit 3c776e64
      ("memcg: charge swapcache to proper memcg").
      
      The used bit of a swapcache page is stable under the page lock, but,
      considering move_account, pc->mem_cgroup is not.
      
      We need lock_page_cgroup() anyway.
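
      A sketch of the fixed lookup (simplified):

          lock_page_cgroup(pc);
          if (PageCgroupUsed(pc)) {
                  mem = pc->mem_cgroup;            /* now stable */
                  if (mem && !css_tryget(&mem->css))
                          mem = NULL;
          }
          unlock_page_cgroup(pc);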
      
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: fix pageref leak in do_swap_page() · bc43f75c
      Johannes Weiner authored
      
      By the time the memory cgroup code is notified about a swapin we
      already hold a reference on the fault page.
      
      If the cgroup callback fails, make sure to unlock AND release the page
      reference taken by lookup_swap_cache(), or we leak the reference.
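
      A sketch of the corrected error path (simplified):

          if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
                  ret = VM_FAULT_OOM;
                  goto out_page;                   /* unlock AND release */
          }
          ...
      out_page:
          unlock_page(page);
          page_cache_release(page);        /* drop lookup_swap_cache() ref */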
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. Apr 07, 2009
      mm: add /proc controls for pdflush threads · fafd688e
      Peter W Morreale authored
      
      Add /proc entries to give the admin the ability to control the minimum and
      maximum number of pdflush threads.  This allows finer control of pdflush
      on both large and small machines.
      
      The rationale is simply that one size does not fit all.  Admins on large and/or
      small systems may want to tune the min/max pdflush thread count to best
      suit their needs.  Right now the min/max is hardcoded to 2/8.  While
      probably a fair estimate for smaller machines, large machines with large
      numbers of CPUs and large numbers of filesystems/block devices may benefit
      from larger numbers of threads working on different block devices.
      
      Even if the background flushing algorithm is radically changed, it is
      still likely that multiple threads will be involved and admins would still
      desire finer control on the min/max other than to have to recompile the
      kernel.
      
      The patch adds '/proc/sys/vm/nr_pdflush_threads_min' and
      '/proc/sys/vm/nr_pdflush_threads_max' with r/w permissions.
      
      The minimum value for nr_pdflush_threads_min is 1 and the maximum value is
      the current value of nr_pdflush_threads_max.  This minimum is required
      since additional thread creation is performed in a pdflush thread itself.
      
      The minimum value for nr_pdflush_threads_max is the current value of
      nr_pdflush_threads_min and the maximum value can be 1000.
      
      Documentation/sysctl/vm.txt is also updated.
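
      The entries plausibly take the usual sysctl form, with the min/max
      cross-limits enforced via the minmax handler (illustrative sketch,
      not the exact patch; assumes a "static int one = 1;" nearby):

          {
                  .procname       = "nr_pdflush_threads_min",
                  .data           = &nr_pdflush_threads_min,
                  .maxlen         = sizeof(int),
                  .mode           = 0644,
                  .proc_handler   = &proc_dointvec_minmax,
                  .extra1         = &one,                     /* lower bound */
                  .extra2         = &nr_pdflush_threads_max,  /* upper bound */
          },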
      
      [akpm@linux-foundation.org: fix comment, fix whitespace, use __read_mostly]
      Signed-off-by: Peter W Morreale <pmorreale@novell.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: fix pdflush thread creation upper bound · a56ed663
      Peter W Morreale authored
      
      Fix a race on creating pdflush threads.  Without the patch, it is possible
      to create more than MAX_PDFLUSH_THREADS threads, and this has been
      observed in practice on IO loaded SMP machines.
      
      The fix involves moving the lock around to protect the check against the
      thread count and correctly dealing with thread creation failure.
      
      This fix also _mostly_ repairs a race condition on how quickly the
      threads are created.  The original intent was to create a pdflush
      thread (up to the max allowed) every second.  Without this patch it is
      possible to create NCPUS pdflush threads concurrently.  The 'mostly'
      caveat is because an assumption is made that thread creation will be
      successful.  If we fail to create the thread, the miss is not
      considered fatal.  (We will try again in 1 second.)
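
      A sketch of the corrected create path (illustrative; the helper name
      start_one_pdflush_thread() follows mm/pdflush.c, but the locking
      details here are simplified):

          spin_lock_irq(&pdflush_lock);
          if (nr_pdflush_threads < nr_pdflush_threads_max) {
                  /* Reserve the slot while still holding the lock so
                   * concurrent CPUs cannot all pass the check at once. */
                  nr_pdflush_threads++;
                  spin_unlock_irq(&pdflush_lock);
                  if (start_one_pdflush_thread() < 0) {
                          /* Creation failed: undo; retry in a second. */
                          spin_lock_irq(&pdflush_lock);
                          nr_pdflush_threads--;
                          spin_unlock_irq(&pdflush_lock);
                  }
          } else {
                  spin_unlock_irq(&pdflush_lock);
          }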
      
      Signed-off-by: Peter W Morreale <pmorreale@novell.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. Apr 03, 2009
      Staging: pohmelfs: kconfig/makefile and vfs changes. · 18bc0bbd
      Evgeniy Polyakov authored
      
      This patch adds Kconfig and Makefile entries and exports VFS
      functions to be used by POHMELFS.
      
      Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      CacheFiles: Permit the page lock state to be monitored · 385e1ca5
      David Howells authored
      
      Add a function to install a monitor on the page lock waitqueue for a
      particular page, thus allowing the unlocking of that page to be
      detected.
      
      This is used by CacheFiles to detect read completion on a page in the backing
      filesystem so that it can then copy the data to the waiting netfs page.
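
      A sketch of how a monitor is installed with the add_page_wait_queue()
      helper this patch introduces (the waiter callback name follows
      fs/cachefiles; details simplified):

          wait_queue_t monitor;

          init_waitqueue_func_entry(&monitor, cachefiles_read_waiter);
          add_page_wait_queue(backing_page, &monitor);
          /* cachefiles_read_waiter() then fires when the page is unlocked */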
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Steve Dickson <steved@redhat.com>
      Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
      FS-Cache: Recruit a page flag for cache management · 266cf658
      David Howells authored
      
      Recruit a page flag to aid in cache management.  The following extra flag is
      defined:
      
       (1) PG_fscache (PG_private_2)
      
           The marked page is backed by a local cache and is pinning resources in the
           cache driver.
      
      If PG_fscache is set, then things that checked for PG_private will now also
      check for that.  This includes things like truncation and page invalidation.
      The function page_has_private() has been added to make the checks for both
      PG_private and PG_private_2 at the same time.
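
      Essentially, the combined check is (sketch):

          static inline int page_has_private(struct page *page)
          {
                  return !!(page->flags &
                            ((1 << PG_private) | (1 << PG_private_2)));
          }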
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Steve Dickson <steved@redhat.com>
      Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
      FS-Cache: Release page->private after failed readahead · 03fb3d2a
      David Howells authored
      
      The attached patch causes read_cache_pages() to release page-private data on a
      page for which add_to_page_cache() fails.  If the filler function fails, then
      the problematic page is left attached to the pagecache (with appropriate flags
      set, one presumes) and the remaining to-be-attached pages are invalidated and
      discarded.  This permits pages with caching references associated with them to
      be cleaned up.
      
      The invalidatepage() address space op is called (indirectly) to do the honours.
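
      A sketch of the cleanup step (the release_failed_page() helper name
      here is illustrative):

          static void release_failed_page(struct page *page)
          {
                  if (page_has_private(page))
                          /* let the aop drop its cache references */
                          do_invalidatepage(page, 0);
                  page_cache_release(page);
          }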
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Steve Dickson <steved@redhat.com>
      Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
      kmemtrace: trace kfree() calls with NULL or zero-length objects · 2121db74
      Pekka Enberg authored
      
      Impact: also output kfree(NULL) entries
      
      This patch moves the trace_kfree() calls before the ZERO_OR_NULL_PTR
      check so that we can trace call sites that call kfree() with NULL many
      times, which might be an indication of a bug.
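
      A sketch of the reordering in the slab allocators (simplified):

          void kfree(const void *objp)
          {
                  trace_kfree(_RET_IP_, objp);   /* now traces NULL too */

                  if (unlikely(ZERO_OR_NULL_PTR(objp)))
                          return;
                  /* ... actual free ... */
          }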
      
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
      LKML-Reference: <1237971957.30175.18.camel@penberg-laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      kmemtrace: use tracepoints · ca2b84cb
      Eduard - Gabriel Munteanu authored
      
      kmemtrace now uses tracepoints instead of markers. We no longer need to
      use format specifiers to pass arguments.
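
      With the folded TP_PROTO/TP_ARGS fix, the declarations take a form
      like this (sketch for the kfree tracepoint):

          DECLARE_TRACE(kfree,
                  TP_PROTO(unsigned long call_site, const void *ptr),
                  TP_ARGS(call_site, ptr));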
      
      Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
      [ folded: Use the new TP_PROTO and TP_ARGS to fix the build.     ]
      [ folded: fix build when CONFIG_KMEMTRACE is disabled.           ]
      [ folded: define tracepoints when CONFIG_TRACEPOINTS is enabled. ]
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      LKML-Reference: <ae61c0f37156db8ec8dc0d5778018edde60a92e3.1237813499.git.eduard.munteanu@linux360.ro>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      kmemtrace, mm: fix slab.h dependency problem in mm/failslab.c · 255d11bc
      Pekka Enberg authored
      
      Impact: cleanup
      
      mm/failslab.c depends on slab.h without including it:
      
          CC      mm/failslab.o
        mm/failslab.c: In function ‘should_failslab’:
        mm/failslab.c:16: error: ‘__GFP_NOFAIL’ undeclared (first use in this function)
        mm/failslab.c:16: error: (Each undeclared identifier is reported only once
        mm/failslab.c:16: error: for each function it appears in.)
        mm/failslab.c:19: error: ‘__GFP_WAIT’ undeclared (first use in this function)
        make[1]: *** [mm/failslab.o] Error 1
        make: *** [mm] Error 2
      
      It gets included implicitly currently - but this will not be the
      case with upcoming kmemtrace changes.
      
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
      LKML-Reference: <1237888761.25315.69.camel@penberg-laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      memcg: cleanup cache_charge · 83aae4c7
      Daisuke Nishimura authored
      
      The current mem_cgroup_cache_charge() is a bit complicated, especially
      in the case of shmem's swap-in.
      
      This patch cleans it up by using try_charge_swapin and commit_charge_swapin.
      
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      memcg: remove redundant message at swapon · 627991a2
      KAMEZAWA Hiroyuki authored
      
      It has been pointed out that swap_cgroup's message at swapon() is
      pointless, because:

        * It can be calculated very easily if all the necessary information
          is written in Kconfig.

        * There is no need to annoy people at every swapon().

      Moreover, memory usage per swp_entry has now been reduced from 8 bytes
      (on 64-bit) to 2 bytes, which I think is reasonably small.
      
      Reported-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cgroups: use css id in swap cgroup for saving memory v5 · a3b2d692
      KAMEZAWA Hiroyuki authored
      
      Try to use CSS IDs for the records in swap_cgroup.  With this, on a
      64-bit machine, the size of a swap_cgroup record goes down from 8
      bytes to 2 bytes.

      This means, when 2GB of swap is equipped (assuming a page size of
      4096 bytes):

      	From size of swap_cgroup = 2G/4k * 8 = 4Mbytes.
      	To   size of swap_cgroup = 2G/4k * 2 = 1Mbytes.

      The reduction is large.  Of course, there are trade-offs: the CSS ID
      lookup adds overhead to swap-in/swap-out/swap-free.
      
      But in general,
        - swap is a resource which the user tend to avoid use.
        - If swap is never used, swap_cgroup area is not used.
        - Reading traditional manuals, size of swap should be proportional to
          size of memory. Memory size of machine is increasing now.
      
      I think reducing the size of swap_cgroup makes sense.
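
      The record then plausibly shrinks to just the ID (sketch):

          struct swap_cgroup {
                  unsigned short  id;     /* CSS ID: 2 bytes instead of an
                                           * 8-byte mem_cgroup pointer */
          };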
      
      Note:
        - The ID->CSS lookup routine takes no locks; it runs under the RCU
          read side.
        - A memcg can become obsolete at rmdir() but is not freed while a
          refcount from swap_cgroup is outstanding.
      
      Changelog v4->v5:
       - reworked on to memcg-charge-swapcache-to-proper-memcg.patch
      Changelog ->v4:
       - fixed not configured case.
       - deleted unnecessary comments.
       - fixed NULL pointer bug.
       - fixed message in dmesg.
      
      [nishimura@mxp.nes.nec.co.jp: css_tryget can be called twice in !PageCgroupUsed case]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      memcg: charge swapcache to proper memcg · 3c776e64
      Daisuke Nishimura authored
      
      memcg_test.txt says at 4.1:
      
      	This swap-in is one of the most complicated work. In do_swap_page(),
      	following events occur when pte is unchanged.
      
      	(1) the page (SwapCache) is looked up.
      	(2) lock_page()
      	(3) try_charge_swapin()
      	(4) reuse_swap_page() (may call delete_swap_cache())
      	(5) commit_charge_swapin()
      	(6) swap_free().
      
      	Considering following situation for example.
      
      	(A) The page has not been charged before (2) and reuse_swap_page()
      	    doesn't call delete_from_swap_cache().
      	(B) The page has not been charged before (2) and reuse_swap_page()
      	    calls delete_from_swap_cache().
      	(C) The page has been charged before (2) and reuse_swap_page() doesn't
      	    call delete_from_swap_cache().
      	(D) The page has been charged before (2) and reuse_swap_page() calls
      	    delete_from_swap_cache().
      
      	    memory.usage/memsw.usage changes to this page/swp_entry will be
      	 Case          (A)      (B)       (C)     (D)
               Event
             Before (2)     0/ 1     0/ 1      1/ 1    1/ 1
                ===========================================
                (3)        +1/+1    +1/+1     +1/+1   +1/+1
                (4)          -       0/ 0       -     -1/ 0
                (5)         0/-1     0/ 0     -1/-1    0/ 0
                (6)          -       0/-1       -      0/-1
                ===========================================
             Result         1/ 1     1/ 1      1/ 1    1/ 1
      
             In any cases, charges to this page should be 1/ 1.
      
      In case (D), try_get_mem_cgroup_from_swapcache() returns NULL
      (because lookup_swap_cgroup() returns NULL), so "+1/+1" at (3) means
      charges to the memcg ("foo") to which "current" belongs.
      OTOH, "-1/0" at (4) and "0/-1" at (6) mean uncharges from the memcg
      ("baa") to which the page had been charged.

      So, if "foo" and "baa" are different (for example, because of a task
      move), this charge will be moved from "baa" to "foo".

      I think this is unexpected behavior.
      
      This patch fixes this by modifying try_get_mem_cgroup_from_swapcache()
      to return the memcg to which the swapcache page has been charged when
      the PCG_USED bit is set.
      IIUC, checking the PCG_USED bit of a swapcache page is safe under the
      page lock.
      
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      memcg: remove mem_cgroup_calc_mapped_ratio() · c137b5ec
      KOSAKI Motohiro authored
      
      Currently, mem_cgroup_calc_mapped_ratio() is not used at all, so it
      can be removed; KAMEZAWA-san suggested doing so.
      
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      memcg: show memcg information during OOM · e222432b
      Balbir Singh authored
      
      Add RSS and swap to the OOM output from memcg.

      Display memcg values like failcnt, usage and limit when an OOM occurs
      due to a memcg limit.
      
      Thanks to Johannes Weiner, Li Zefan, David Rientjes, Kamezawa Hiroyuki,
      Daisuke Nishimura and KOSAKI Motohiro for review.
      
      Sample output
      -------------
      
      Task in /a/x killed as a result of limit of /a
      memory: usage 1048576kB, limit 1048576kB, failcnt 4183
      memory+swap: usage 1400964kB, limit 9007199254740991kB, failcnt 0
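
      Such a line is plausibly produced from the res_counter values along
      these lines (sketch):

          printk(KERN_INFO "memory: usage %llukB, limit %llukB, failcnt %llu\n",
                 res_counter_read_u64(&memcg->res, RES_USAGE) >> 10,
                 res_counter_read_u64(&memcg->res, RES_LIMIT) >> 10,
                 res_counter_read_u64(&memcg->res, RES_FAILCNT));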
      
      [akpm@linux-foundation.org: compilation fix]
      [akpm@linux-foundation.org: fix kerneldoc and whitespace]
      [akpm@linux-foundation.org: add printk facility level]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      memcg: fix OOM killer under memcg · 0b7f569e
      KAMEZAWA Hiroyuki authored
      
      This patch tries to fix OOM killer problems caused by hierarchy.
      Currently, memcg itself has an OOM-kill function (in oom_kill.c) and
      tries to kill a task in the memcg.

      But when hierarchy is used, this is broken and the correct task
      cannot be killed. For example, in the following cgroup

      	/groupA/	hierarchy=1, limit=1G,
      		01	nolimit
      		02	nolimit

      all tasks' memory usage under /groupA, /groupA/01 and /groupA/02 is
      limited to groupA's 1GB, but the OOM killer just kills tasks in
      groupA.

      This patch makes the bad process be selected from all tasks under
      the hierarchy. BTW, currently, oom_jiffies is updated only against
      groupA in the above case; the oom_jiffies of the whole tree should
      be updated.

      To see how oom_jiffies is used, please check the callers of
      mem_cgroup_oom_called().
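
      A sketch of hierarchy-aware victim selection (simplified; the real
      patch works through css-ancestry checks in task_in_mem_cgroup()):

          for_each_process(p) {
                  /* consider every task whose memcg sits under the
                   * hierarchy root that hit its limit */
                  if (!task_in_mem_cgroup(p, mem))
                          continue;
                  /* ... score with badness() and track the worst task ... */
          }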
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: const fix]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      memcg: fix shrinking memory to return -EBUSY by fixing retry algorithm · 81d39c20
      KAMEZAWA Hiroyuki authored
      
      As pointed out, shrinking a memcg's limit should return -EBUSY after
      a reasonable number of retries.  This patch tries to fix the current
      behavior of shrink_usage.

      Before looking into the "shrink should return -EBUSY" problem, we
      should fix the hierarchical reclaim code.  It compares current usage
      and current limit, but that only makes sense when the kernel reclaims
      memory because a limit was hit.  This is also a problem.

      What this patch does:

        1. Add a new argument "shrink" to hierarchical reclaim.  If
           "shrink==true", hierarchical reclaim returns immediately and the
           caller checks whether the kernel should shrink more.
           (While shrinking memory, usage is always smaller than limit, so
            the check for usage < limit is useless.)

        2. To adjust to the above change, make 2 changes in "shrink"'s
           retry path (see the sketch below).
           2-a. retry_count depends on the number of children because the
      	  kernel visits the children under the hierarchy one by one.
           2-b. rather than checking the return value of hierarchical
      	  reclaim's progress, compare usage-before-shrink with
      	  usage-after-shrink.  If usage-before-shrink <=
      	  usage-after-shrink, retry_count is decremented.
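
      A sketch of the retry logic in 2-b (the reclaim call's argument list
      here is simplified and illustrative):

          while (retry_count) {
                  oldusage = res_counter_read_u64(&memcg->res, RES_USAGE);
                  /* one pass of hierarchical reclaim in "shrink" mode */
                  mem_cgroup_hierarchical_reclaim(memcg, GFP_KERNEL, true);
                  curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
                  if (curusage >= oldusage)
                          retry_count--;          /* no progress */
          }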
      
      Reported-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      memcg: hierarchical stat · 14067bb3
      KAMEZAWA Hiroyuki authored
      
      Clean up the memory.stat file routine and show "total" hierarchical
      statistics.

      This patch:
        - renames get_all_zonestat to get_local_zonestat.
        - removes the old mem_cgroup_stat_desc, which was only for per-cpu
          stats.
        - adds mcs_stat to cover both per-cpu and per-lru stats.
        - adds a "total" stat for the hierarchy (*).
        - adds a callback system to scan all memcgs under a root.
      == "total" is added.
      [kamezawa@localhost ~]$ cat /opt/cgroup/xxx/memory.stat
      cache 0
      rss 0
      pgpgin 0
      pgpgout 0
      inactive_anon 0
      active_anon 0
      inactive_file 0
      active_file 0
      unevictable 0
      hierarchical_memory_limit 50331648
      hierarchical_memsw_limit 9223372036854775807
      total_cache 65536
      total_rss 192512
      total_pgpgin 218
      total_pgpgout 155
      total_inactive_anon 0
      total_active_anon 135168
      total_inactive_file 61440
      total_active_file 4096
      total_unevictable 0
      ==
      (*) Maybe users can compute the hierarchical stats with their own
         userland programs, but if it can be done in a clean way in the
         kernel, it's worth showing, I think.
      
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      memcg: use CSS ID · 04046e1a
      KAMEZAWA Hiroyuki authored
      
      Assign a CSS ID to each memcg and use css_get_next() for scanning the
      hierarchy.

      	Assume the following tree.

      	group_A (ID=3)
      		/01 (ID=4)
      		   /0A (ID=7)
      		/02 (ID=10)
      	group_B (ID=5)

      	and a task in group_A/01/0A hits the limit at group_A.

      	Reclaim will then be done in the following order (round-robin):
      	group_A(3) -> group_A/01 (4) -> group_A/01/0A (7) -> group_A/02(10)
      	-> group_A -> .....

      	Round-robin by ID: the last visited cgroup is recorded, and the
      	next reclaim pass restarts from it (a smarter algorithm could be
      	implemented).

      	No cgroup_mutex or hierarchy_mutex is required.
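
      A sketch of the round-robin scan (css_get_next() is the CSS ID
      iterator used here; the surrounding fields are simplified):

          rcu_read_lock();
          css = css_get_next(&mem_cgroup_subsys, root->last_scanned_child + 1,
                             &root->css, &found);
          rcu_read_unlock();
          if (css)
                  root->last_scanned_child = found;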
      
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cgroup: fix frequent -EBUSY at rmdir · ec64f515
      KAMEZAWA Hiroyuki authored
      
      In the following situation, with the memory subsystem,

      	/groupA use_hierarchy==1
      		/01 some tasks
      		/02 some tasks
      		/03 some tasks
      		/04 empty

      when tasks under 01/02/03 hit the limit on /groupA, hierarchical
      reclaim is triggered and the kernel walks the tree under groupA.  In
      this case, rmdir /groupA/04 frequently fails with -EBUSY because of a
      temporary refcount taken by the kernel.

      In general, a cgroup can be rmdir'd if it has no child groups and no
      tasks.  Frequent failures of rmdir() are not useful to users.
      (And the reason for the -EBUSY is unknown to users.....in most cases.)

      This patch tries to modify the above behavior by
      	- retrying if the css refcount is held by someone.
      	- adding a "return value" to pre_destroy() so a subsystem can
      	  say "we're really busy!" (see the sketch below)
      
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      workqueue: add to_delayed_work() helper function · bf6aede7
      Jean Delvare authored
      
      It is a fairly common operation to have a pointer to a work_struct and
      to need a pointer to the delayed_work it is contained in.  In
      particular, all delayed works which want to rearm themselves will have
      to do that.  So it seems fair to offer a helper function for this
      operation.
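
      The helper is presumably the usual container_of() wrapper, used from a
      work function like this (my_work_fn() is an illustrative example):

          static inline struct delayed_work *to_delayed_work(struct work_struct *work)
          {
                  return container_of(work, struct delayed_work, work);
          }

          static void my_work_fn(struct work_struct *work)
          {
                  struct delayed_work *dwork = to_delayed_work(work);

                  /* ... do the work, then rearm ourselves ... */
                  schedule_delayed_work(dwork, HZ);
          }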
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Jean Delvare <khali@linux-fr.org>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>