Skip to content
Snippets Groups Projects
  1. Feb 11, 2018
    • Linus Torvalds's avatar
      vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Linus Torvalds authored
      
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But they keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      
      Scripted-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a9a08845
  2. Feb 06, 2018
  3. Feb 01, 2018
  4. Jan 31, 2018
    • Masatake YAMATO's avatar
      kvm: embed vcpu id to dentry of vcpu anon inode · e46b4692
      Masatake YAMATO authored
      
      All d-entries for vcpu have the same, "anon_inode:kvm-vcpu". That means
      it is impossible to know the mapping between fds for vcpu and vcpu
      from userland.
      
          # LC_ALL=C ls -l /proc/617/fd | grep vcpu
          lrwx------. 1 qemu qemu 64 Jan  7 16:50 18 -> anon_inode:kvm-vcpu
          lrwx------. 1 qemu qemu 64 Jan  7 16:50 19 -> anon_inode:kvm-vcpu
      
      It is also impossible to know the mapping between vma for kvm_run
      structure and vcpu from userland.
      
          # LC_ALL=C grep vcpu /proc/617/maps
          7f9d842d0000-7f9d842d3000 rw-s 00000000 00:0d 20393                      anon_inode:kvm-vcpu
          7f9d842d3000-7f9d842d6000 rw-s 00000000 00:0d 20393                      anon_inode:kvm-vcpu
      
      This change adds vcpu id to d-entries for vcpu. With this change
      you can get the following output:
      
          # LC_ALL=C ls -l /proc/617/fd | grep vcpu
          lrwx------. 1 qemu qemu 64 Jan  7 16:50 18 -> anon_inode:kvm-vcpu:0
          lrwx------. 1 qemu qemu 64 Jan  7 16:50 19 -> anon_inode:kvm-vcpu:1
      
          # LC_ALL=C grep vcpu /proc/617/maps
          7f9d842d0000-7f9d842d3000 rw-s 00000000 00:0d 20393                      anon_inode:kvm-vcpu:0
          7f9d842d3000-7f9d842d6000 rw-s 00000000 00:0d 20393                      anon_inode:kvm-vcpu:1
      
      With the mappings known from the output, a tool like strace can report more details
      of qemu-kvm process activities. Here is the strace output of my local prototype:
      
          # ./strace -KK -f -p 617 2>&1 | grep 'KVM_RUN\| K'
          ...
          [pid   664] ioctl(18, KVM_RUN, 0)       = 0 (KVM_EXIT_MMIO)
           K ready_for_interrupt_injection=1, if_flag=0, flags=0, cr8=0000000000000000, apic_base=0x000000fee00d00
           K phys_addr=0, len=1634035803, [33, 0, 0, 0, 0, 0, 0, 0], is_write=112
          [pid   664] ioctl(18, KVM_RUN, 0)       = 0 (KVM_EXIT_MMIO)
           K ready_for_interrupt_injection=1, if_flag=1, flags=0, cr8=0000000000000000, apic_base=0x000000fee00d00
           K phys_addr=0, len=1634035803, [33, 0, 0, 0, 0, 0, 0, 0], is_write=112
          ...
      
      Signed-off-by: default avatarMasatake YAMATO <yamato@redhat.com>
      Acked-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      e46b4692
    • KarimAllah Ahmed's avatar
      kvm: Map PFN-type memory regions as writable (if possible) · a340b3e2
      KarimAllah Ahmed authored
      
      For EPT-violations that are triggered by a read, the pages are also mapped with
      write permissions (if their memory region is also writable). That would avoid
      getting yet another fault on the same page when a write occurs.
      
      This optimization only happens when you have a "struct page" backing the memory
      region. So also enable it for memory regions that do not have a "struct page".
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarKarimAllah Ahmed <karahmed@amazon.de>
      Reviewed-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      a340b3e2
    • Christoffer Dall's avatar
      KVM: arm/arm64: Fixup userspace irqchip static key optimization · cd15d205
      Christoffer Dall authored
      
      When I introduced a static key to avoid work in the critical path for
      userspace irqchips which is very rarely used, I accidentally messed up
      my logic and used && where I should have used ||, because the point was
      to short-circuit the evaluation in case userspace irqchips weren't even
      in use.
      
      This fixes an issue when running in-kernel irqchip VMs alongside
      userspace irqchip VMs.
      
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Fixes: c44c232ee2d3 ("KVM: arm/arm64: Avoid work when userspace iqchips are not used")
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      cd15d205
    • Christoffer Dall's avatar
      KVM: arm/arm64: Fix userspace_irqchip_in_use counting · f1d7231c
      Christoffer Dall authored
      
      We were not decrementing the static key count in the right location.
      kvm_arch_vcpu_destroy() is only called to clean up after a failed
      VCPU create attempt, whereas kvm_arch_vcpu_free() is called on teardown
      of the VM as well.  Move the static key decrement call to
      kvm_arch_vcpu_free().
      
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      f1d7231c
    • Christoffer Dall's avatar
      KVM: arm/arm64: Fix incorrect timer_is_pending logic · 13e59ece
      Christoffer Dall authored
      
      After the recently introduced support for level-triggered mapped
      interrupt, I accidentally left the VCPU thread busily going back and
      forward between the guest and the hypervisor whenever the guest was
      blocking, because I would always incorrectly report that a timer
      interrupt was pending.
      
      This is because the timer->irq.level field is not valid for mapped
      interrupts, where we offload the level state to the hardware, and as a
      result this field is always true.
      
      Luckily the problem can be relatively easily solved by not checking the
      cached signal state of either timer in kvm_timer_should_fire() but
      instead compute the timer state on the fly, which we do already if the
      cached signal state wasn't high.  In fact, the only reason for checking
      the cached signal state was a tiny optimization which would only be
      potentially faster when the polling loop detects a pending timer
      interrupt, which is quite unlikely.
      
      Instead of duplicating the logic from kvm_arch_timer_handler(), we
      enlighten kvm_timer_should_fire() to report something valid when the
      timer state is loaded onto the hardware.  We can then call this from
      kvm_arch_timer_handler() as well and avoid the call to
      __timer_snapshot_state() in kvm_arch_timer_get_input_level().
      
      Reported-by: default avatarTomasz Nowicki <tn@semihalf.com>
      Tested-by: default avatarTomasz Nowicki <tn@semihalf.com>
      Reviewed-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      13e59ece
  5. Jan 23, 2018
    • James Morse's avatar
      KVM: arm/arm64: Handle CPU_PM_ENTER_FAILED · 58d6b15e
      James Morse authored
      
      cpu_pm_enter() calls the pm notifier chain with CPU_PM_ENTER, then if
      there is a failure: CPU_PM_ENTER_FAILED.
      
      When KVM receives CPU_PM_ENTER it calls cpu_hyp_reset() which will
      return us to the hyp-stub. If we subsequently get a CPU_PM_ENTER_FAILED,
      KVM does nothing, leaving the CPU running with the hyp-stub, at odds
      with kvm_arm_hardware_enabled.
      
      Add CPU_PM_ENTER_FAILED as a fallthrough for CPU_PM_EXIT, this reloads
      KVM based on kvm_arm_hardware_enabled. This is safe even if CPU_PM_ENTER
      never gets as far as KVM, as cpu_hyp_reinit() calls cpu_hyp_reset()
      to make sure the hyp-stub is loaded before reloading KVM.
      
      Fixes: 67f69197 ("arm64: kvm: allows kvm cpu hotplug")
      Cc: <stable@vger.kernel.org> # v4.7+
      CC: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Reviewed-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarJames Morse <james.morse@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      58d6b15e
  6. Jan 16, 2018
    • James Morse's avatar
      KVM: arm64: Handle RAS SErrors from EL1 on guest exit · 3368bd80
      James Morse authored
      
      We expect to have firmware-first handling of RAS SErrors, with errors
      notified via an APEI method. For systems without firmware-first, add
      some minimal handling to KVM.
      
      There are two ways KVM can take an SError due to a guest, either may be a
      RAS error: we exit the guest due to an SError routed to EL2 by HCR_EL2.AMO,
      or we take an SError from EL2 when we unmask PSTATE.A from __guest_exit.
      
      For SError that interrupt a guest and are routed to EL2 the existing
      behaviour is to inject an impdef SError into the guest.
      
      Add code to handle RAS SError based on the ESR. For uncontained and
      uncategorized errors arm64_is_fatal_ras_serror() will panic(), these
      errors compromise the host too. All other error types are contained:
      For the fatal errors the vCPU can't make progress, so we inject a virtual
      SError. We ignore contained errors where we can make progress as if
      we're lucky, we may not hit them again.
      
      If only some of the CPUs support RAS the guest will see the cpufeature
      sanitised version of the id registers, but we may still take RAS SError
      on this CPU. Move the SError handling out of handle_exit() into a new
      handler that runs before we can be preempted. This allows us to use
      this_cpu_has_cap(), via arm64_is_ras_serror().
      
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarJames Morse <james.morse@arm.com>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      3368bd80
    • James Morse's avatar
      KVM: arm/arm64: mask/unmask daif around VHE guests · 4f5abad9
      James Morse authored
      
      Non-VHE systems take an exception to EL2 in order to world-switch into the
      guest. When returning from the guest KVM implicitly restores the DAIF
      flags when it returns to the kernel at EL1.
      
      With VHE none of this exception-level jumping happens, so KVMs
      world-switch code is exposed to the host kernel's DAIF values, and KVM
      spills the guest-exit DAIF values back into the host kernel.
      On entry to a guest we have Debug and SError exceptions unmasked, KVM
      has switched VBAR but isn't prepared to handle these. On guest exit
      Debug exceptions are left disabled once we return to the host and will
      stay this way until we enter user space.
      
      Add a helper to mask/unmask DAIF around VHE guests. The unmask can only
      happen after the hosts VBAR value has been synchronised by the isb in
      __vhe_hyp_call (via kvm_call_hyp()). Masking could be as late as
      setting KVMs VBAR value, but is kept here for symmetry.
      
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarJames Morse <james.morse@arm.com>
      Reviewed-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      4f5abad9
  7. Jan 15, 2018
  8. Jan 13, 2018
  9. Jan 12, 2018
  10. Jan 11, 2018
  11. Jan 08, 2018
  12. Jan 02, 2018
    • Christoffer Dall's avatar
      KVM: arm/arm64: Avoid work when userspace iqchips are not used · 61bbe380
      Christoffer Dall authored
      
      We currently check if the VM has a userspace irqchip in several places
      along the critical path, and if so, we do some work which is only
      required for having an irqchip in userspace.  This is unfortunate, as we
      could avoid doing any work entirely, if we didn't have to support
      irqchip in userspace.
      
      Realizing the userspace irqchip on ARM is mostly a developer or hobby
      feature, and is unlikely to be used in servers or other scenarios where
      performance is a priority, we can use a refcounted static key to only
      check the irqchip configuration when we have at least one VM that uses
      an irqchip in userspace.
      
      Reviewed-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      61bbe380
    • Christoffer Dall's avatar
      KVM: arm/arm64: Provide a get_input_level for the arch timer · 4c60e360
      Christoffer Dall authored
      
      The VGIC can now support the life-cycle of mapped level-triggered
      interrupts, and we no longer have to read back the timer state on every
      exit from the VM if we had an asserted timer interrupt signal, because
      the VGIC already knows if we hit the unlikely case where the guest
      disables the timer without ACKing the virtual timer interrupt.
      
      This means we rework a bit of the code to factor out the functionality
      to snapshot the timer state from vtimer_save_state(), and we can reuse
      this functionality in the sync path when we have an irqchip in
      userspace, and also to support our implementation of the
      get_input_level() function for the timer.
      
      This change also means that we can no longer rely on the timer's view of
      the interrupt line to set the active state, because we no longer
      maintain this state for mapped interrupts when exiting from the guest.
      Instead, we only set the active state if the virtual interrupt is
      active, and otherwise we simply let the timer fire again and raise the
      virtual interrupt from the ISR.
      
      Reviewed-by: default avatarEric Auger <eric.auger@redhat.com>
      Reviewed-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      4c60e360
    • Christoffer Dall's avatar
      KVM: arm/arm64: Support VGIC dist pend/active changes for mapped IRQs · df635c5b
      Christoffer Dall authored
      
      For mapped IRQs (with the HW bit set in the LR) we have to follow some
      rules of the architecture.  One of these rules is that VM must not be
      allowed to deactivate a virtual interrupt with the HW bit set unless the
      physical interrupt is also active.
      
      This works fine when injecting mapped interrupts, because we leave it up
      to the injector to either set EOImode==1 or manually set the active
      state of the physical interrupt.
      
      However, the guest can set virtual interrupt to be pending or active by
      writing to the virtual distributor, which could lead to deactivating a
      virtual interrupt with the HW bit set without the physical interrupt
      being active.
      
      We could set the physical interrupt to active whenever we are about to
      enter the VM with a HW interrupt either pending or active, but that
      would be really slow, especially on GICv2.  So we take the long way
      around and do the hard work when needed, which is expected to be
      extremely rare.
      
      When the VM sets the pending state for a HW interrupt on the virtual
      distributor we set the active state on the physical distributor, because
      the virtual interrupt can become active and then the guest can
      deactivate it.
      
      When the VM clears the pending state we also clear it on the physical
      side, because the injector might otherwise raise the interrupt.  We also
      clear the physical active state when the virtual interrupt is not
      active, since otherwise a SPEND/CPEND sequence from the guest would
      prevent signaling of future interrupts.
      
      Changing the state of mapped interrupts from userspace is not supported,
      and it's expected that userspace unmaps devices from VFIO before
      attempting to set the interrupt state, because the interrupt state is
      driven by hardware.
      
      Reviewed-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: default avatarEric Auger <eric.auger@redhat.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      df635c5b
    • Christoffer Dall's avatar
      KVM: arm/arm64: Support a vgic interrupt line level sample function · b6909a65
      Christoffer Dall authored
      
      The GIC sometimes need to sample the physical line of a mapped
      interrupt.  As we know this to be notoriously slow, provide a callback
      function for devices (such as the timer) which can do this much faster
      than talking to the distributor, for example by comparing a few
      in-memory values.  Fall back to the good old method of poking the
      physical GIC if no callback is provided.
      
      Reviewed-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: default avatarEric Auger <eric.auger@redhat.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      b6909a65
    • Christoffer Dall's avatar
      KVM: arm/arm64: vgic: Support level-triggered mapped interrupts · e40cc57b
      Christoffer Dall authored
      Level-triggered mapped IRQs are special because we only observe rising
      edges as input to the VGIC, and we don't set the EOI flag and therefore
      are not told when the level goes down, so that we can re-queue a new
      interrupt when the level goes up.
      
      One way to solve this problem is to side-step the logic of the VGIC and
      special case the validation in the injection path, but it has the
      unfortunate drawback of having to peak into the physical GIC state
      whenever we want to know if the interrupt is pending on the virtual
      distributor.
      
      Instead, we can maintain the current semantics of a level triggered
      interrupt by sort of treating it as an edge-triggered interrupt,
      following from the fact that we only observe an asserting edge.  This
      requires us to be a bit careful when populating the LRs and when folding
      the state back in though:
      
       * We lower the line level when populating the LR, so that when
         subsequently observing an asserting edge, the VGIC will do the right
         thing.
      
       * If the guest never acked the interrupt while running (for example if
         it had masked interrupts at the CPU level while running), we have
         to preserve the pending state of the LR and move it back to the
         line_level field of the struct irq when folding LR state.
      
         If the guest never acked the interrupt while running, but changed the
         device state and lowered the line (again with interrupts masked) then
         we need to observe this change in the line_level.
      
         Both of the above situations are solved by sampling the physical line
         and set the line level when folding the LR back.
      
       * Finally, if the guest never acked the interrupt while running and
         sampling the line reveals that the device state has changed and the
         line has been lowered, we must clear the physical active state, since
         we will otherwise never be told when the interrupt becomes asserted
         again.
      
      This has the added benefit of making the timer optimization patches
      (https://lists.cs.columbia.edu/pipermail/kvmarm/2017-July/026343.html
      
      ) a
      bit simpler, because the timer code doesn't have to clear the active
      state on the sync anymore.  It also potentially improves the performance
      of the timer implementation because the GIC knows the state or the LR
      and only needs to clear the
      active state when the pending bit in the LR is still set, where the
      timer has to always clear it when returning from running the guest with
      an injected timer interrupt.
      
      Reviewed-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: default avatarEric Auger <eric.auger@redhat.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      e40cc57b
    • Christoffer Dall's avatar
      KVM: arm/arm64: Don't cache the timer IRQ level · 70450a9f
      Christoffer Dall authored
      
      The timer logic was designed after a strict idea of modeling an
      interrupt line level in software, meaning that only transitions in the
      level need to be reported to the VGIC.  This works well for the timer,
      because the arch timer code is in complete control of the device and can
      track the transitions of the line.
      
      However, as we are about to support using the HW bit in the VGIC not
      just for the timer, but also for VFIO which cannot track transitions of
      the interrupt line, we have to decide on an interface between the GIC
      and other subsystems for level triggered mapped interrupts, which both
      the timer and VFIO can use.
      
      VFIO only sees an asserting transition of the physical interrupt line,
      and tells the VGIC when that happens.  That means that part of the
      interrupt flow is offloaded to the hardware.
      
      To use the same interface for VFIO devices and the timer, we therefore
      have to change the timer (we cannot change VFIO because it doesn't know
      the details of the device it is assigning to a VM).
      
      Luckily, changing the timer is simple, we just need to stop 'caching'
      the line level, but instead let the VGIC know the state of the timer
      every time there is a potential change in the line level, and when the
      line level should be asserted from the timer ISR.  The VGIC can ignore
      extra notifications using its validate mechanism.
      
      Reviewed-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: default avatarAndre Przywara <andre.przywara@arm.com>
      Reviewed-by: default avatarJulien Thierry <julien.thierry@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      70450a9f
    • Christoffer Dall's avatar
      KVM: arm/arm64: Factor out functionality to get vgic mmio requester_vcpu · 6c1b7521
      Christoffer Dall authored
      
      We are about to distinguish between userspace accesses and mmio traps
      for a number of the mmio handlers.  When the requester vcpu is NULL, it
      means we are handling a userspace access.
      
      Factor out the functionality to get the request vcpu into its own
      function, mostly so we have a common place to document the semantics of
      the return value.
      
      Also take the chance to move the functionality outside of holding a
      spinlock and instead explicitly disable and enable preemption.  This
      supports PREEMPT_RT kernels as well.
      
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: default avatarAndre Przywara <andre.przywara@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      6c1b7521
    • Christoffer Dall's avatar
      KVM: arm/arm64: Remove redundant preemptible checks · 5a245750
      Christoffer Dall authored
      
      The __this_cpu_read() and __this_cpu_write() functions already implement
      checks for the required preemption levels when using
      CONFIG_DEBUG_PREEMPT which gives you nice error messages and such.
      Therefore there is no need to explicitly check this using a BUG_ON() in
      the code (which we don't do for other uses of per cpu variables either).
      
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: default avatarAndre Przywara <andre.przywara@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      5a245750
    • Vasyl Gomonovych's avatar
      KVM: arm: Use PTR_ERR_OR_ZERO() · 4404b336
      Vasyl Gomonovych authored
      
      Fix ptr_ret.cocci warnings:
      virt/kvm/arm/vgic/vgic-its.c:971:1-3: WARNING: PTR_ERR_OR_ZERO can be used
      
      Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR
      
      Generated by: scripts/coccinelle/api/ptr_ret.cocci
      
      Signed-off-by: default avatarVasyl Gomonovych <gomonovych@gmail.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      4404b336
  13. Dec 22, 2017
Loading