Skip to content
Snippets Groups Projects
  1. Jul 11, 2024
    • Breno Leitao's avatar
      KVM: Fix a data race on last_boosted_vcpu in kvm_vcpu_on_spin() · 4eb069df
      Breno Leitao authored and Frieder Schrempf's avatar Frieder Schrempf committed
      
      commit 49f683b4 upstream.
      
      Use {READ,WRITE}_ONCE() to access kvm->last_boosted_vcpu to ensure the
      loads and stores are atomic.  In the extremely unlikely scenario the
      compiler tears the stores, it's theoretically possible for KVM to attempt
      to get a vCPU using an out-of-bounds index, e.g. if the write is split
      into multiple 8-bit stores, and is paired with a 32-bit load on a VM with
      257 vCPUs:
      
        CPU0                              CPU1
        last_boosted_vcpu = 0xff;
      
                                          (last_boosted_vcpu = 0x100)
                                          last_boosted_vcpu[15:8] = 0x01;
        i = (last_boosted_vcpu = 0x1ff)
                                          last_boosted_vcpu[7:0] = 0x00;
      
        vcpu = kvm->vcpu_array[0x1ff];
      
      As detected by KCSAN:
      
        BUG: KCSAN: data-race in kvm_vcpu_on_spin [kvm] / kvm_vcpu_on_spin [kvm]
      
        write to 0xffffc90025a92344 of 4 bytes by task 4340 on cpu 16:
        kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4112) kvm
        handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
        vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:?
      		 arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
        vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
        kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
        kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
        __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
        __x64_sys_ioctl (fs/ioctl.c:890)
        x64_sys_call (arch/x86/entry/syscall_64.c:33)
        do_syscall_64 (arch/x86/entry/common.c:?)
        entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
      
        read to 0xffffc90025a92344 of 4 bytes by task 4342 on cpu 4:
        kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4069) kvm
        handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
        vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:?
      			arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
        vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
        kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
        kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
        __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
        __x64_sys_ioctl (fs/ioctl.c:890)
        x64_sys_call (arch/x86/entry/syscall_64.c:33)
        do_syscall_64 (arch/x86/entry/common.c:?)
        entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
      
        value changed: 0x00000012 -> 0x00000000
      
      Fixes: 217ece61 ("KVM: use yield_to instead of sleep in kvm_vcpu_on_spin")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarBreno Leitao <leitao@debian.org>
      Link: https://lore.kernel.org/r/20240510092353.2261824-1-leitao@debian.org
      
      
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4eb069df
  2. Apr 11, 2024
    • Sean Christopherson's avatar
      KVM: Always flush async #PF workqueue when vCPU is being destroyed · 8b509dd5
      Sean Christopherson authored and Frieder Schrempf's avatar Frieder Schrempf committed
      
      [ Upstream commit 3d75b8aa ]
      
      Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
      completion queue, e.g. when a VM and all its vCPUs is being destroyed.
      KVM must ensure that none of its workqueue callbacks is running when the
      last reference to the KVM _module_ is put.  Gifting a reference to the
      associated VM prevents the workqueue callback from dereferencing freed
      vCPU/VM memory, but does not prevent the KVM module from being unloaded
      before the callback completes.
      
      Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
      async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
      result in deadlock.  async_pf_execute() can't return until kvm_put_kvm()
      finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:
      
       WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
       Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
       CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
       Workqueue: events async_pf_execute [kvm]
       RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
       Call Trace:
        <TASK>
        async_pf_execute+0x198/0x260 [kvm]
        process_one_work+0x145/0x2d0
        worker_thread+0x27e/0x3a0
        kthread+0xba/0xe0
        ret_from_fork+0x2d/0x50
        ret_from_fork_asm+0x11/0x20
        </TASK>
       ---[ end trace 0000000000000000 ]---
       INFO: task kworker/8:1:251 blocked for more than 120 seconds.
             Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       task:kworker/8:1     state:D stack:0     pid:251   ppid:2      flags:0x00004000
       Workqueue: events async_pf_execute [kvm]
       Call Trace:
        <TASK>
        __schedule+0x33f/0xa40
        schedule+0x53/0xc0
        schedule_timeout+0x12a/0x140
        __wait_for_common+0x8d/0x1d0
        __flush_work.isra.0+0x19f/0x2c0
        kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
        kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
        kvm_put_kvm+0x1c1/0x320 [kvm]
        async_pf_execute+0x198/0x260 [kvm]
        process_one_work+0x145/0x2d0
        worker_thread+0x27e/0x3a0
        kthread+0xba/0xe0
        ret_from_fork+0x2d/0x50
        ret_from_fork_asm+0x11/0x20
        </TASK>
      
      If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
      then there's no need to gift async_pf_execute() a reference because all
      invocations of async_pf_execute() will be forced to complete before the
      vCPU and its VM are destroyed/freed.  And that in turn fixes the module
      unloading bug as __fput() won't do module_put() on the last vCPU reference
      until the vCPU has been freed, e.g. if closing the vCPU file also puts the
      last reference to the KVM module.
      
      Note that kvm_check_async_pf_completion() may also take the work item off
      the completion queue and so also needs to flush the work queue, as the
      work will not be seen by kvm_clear_async_pf_completion_queue().  Waiting
      on the workqueue could theoretically delay a vCPU due to waiting for the
      work to complete, but that's a very, very small chance, and likely a very
      small delay.  kvm_arch_async_page_present_queued() unconditionally makes a
      new request, i.e. will effectively delay entering the guest, so the
      remaining work is really just:
      
              trace_kvm_async_pf_completed(addr, cr2_or_gpa);
      
              __kvm_vcpu_wake_up(vcpu);
      
              mmput(mm);
      
      and mmput() can't drop the last reference to the page tables if the vCPU is
      still alive, i.e. the vCPU won't get stuck tearing down page tables.
      
      Add a helper to do the flushing, specifically to deal with "wakeup all"
      work items, as they aren't actually work items, i.e. are never placed in a
      workqueue.  Trying to flush a bogus workqueue entry rightly makes
      __flush_work() complain (kudos to whoever added that sanity check).
      
      Note, commit 5f6de5cb ("KVM: Prevent module exit until all VMs are
      freed") *tried* to fix the module refcounting issue by having VMs grab a
      reference to the module, but that only made the bug slightly harder to hit
      as it gave async_pf_execute() a bit more time to complete before the KVM
      module could be unloaded.
      
      Fixes: af585b92 ("KVM: Halt vcpu if page it tries to access is swapped out")
      Cc: stable@vger.kernel.org
      Cc: David Matlack <dmatlack@google.com>
      Reviewed-by: default avatarXu Yilun <yilun.xu@intel.com>
      Reviewed-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Link: https://lore.kernel.org/r/20240110011533.503302-2-seanjc@google.com
      
      
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      8b509dd5
  3. Oct 12, 2023
  4. Aug 17, 2023
    • Sean Christopherson's avatar
      KVM: Grab a reference to KVM for VM and vCPU stats file descriptors · c08e013d
      Sean Christopherson authored and Frieder Schrempf's avatar Frieder Schrempf committed
      
      commit eed3013f upstream.
      
      Grab a reference to KVM prior to installing VM and vCPU stats file
      descriptors to ensure the underlying VM and vCPU objects are not freed
      until the last reference to any and all stats fds are dropped.
      
      Note, the stats paths manually invoke fd_install() and so don't need to
      grab a reference before creating the file.
      
      Fixes: ce55c049 ("KVM: stats: Support binary stats retrieval for a VCPU")
      Fixes: fcfe1bae ("KVM: stats: Support binary stats retrieval for a VM")
      Reported-by: default avatarZheng Zhang <zheng.zhang@email.ucr.edu>
      Closes: https://lore.kernel.org/all/CAC_GQSr3xzZaeZt85k_RCBd5kfiOve8qXo7a81Cq53LuVQ5r=Q@mail.gmail.com
      
      
      Cc: stable@vger.kernel.org
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Message-Id: <20230711230131.648752-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c08e013d
    • Gavin Shan's avatar
      KVM: Avoid illegal stage2 mapping on invalid memory slot · a7e51f80
      Gavin Shan authored and Frieder Schrempf's avatar Frieder Schrempf committed
      
      commit 2230f9e1 upstream.
      
      We run into guest hang in edk2 firmware when KSM is kept as running on
      the host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash
      device (TYPE_PFLASH_CFI01) during the operation of sector erasing or
      buffered write. The status is returned by reading the memory region of
      the pflash device and the read request should have been forwarded to QEMU
      and emulated by it. Unfortunately, the read request is covered by an
      illegal stage2 mapping when the guest hang issue occurs. The read request
      is completed with QEMU bypassed and wrong status is fetched. The edk2
      firmware runs into an infinite loop with the wrong status.
      
      The illegal stage2 mapping is populated due to same page sharing by KSM
      at (C) even the associated memory slot has been marked as invalid at (B)
      when the memory slot is requested to be deleted. It's notable that the
      active and inactive memory slots can't be swapped when we're in the middle
      of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count
      is elevated, and kvm_swap_active_memslots() will busy loop until it reaches
      to zero again. Besides, the swapping from the active to the inactive memory
      slots is also avoided by holding &kvm->srcu in __kvm_handle_hva_range(),
      corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots().
      
        CPU-A                    CPU-B
        -----                    -----
                                 ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
                                 kvm_vm_ioctl_set_memory_region
                                 kvm_set_memory_region
                                 __kvm_set_memory_region
                                 kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
                                   kvm_invalidate_memslot
                                     kvm_copy_memslot
                                     kvm_replace_memslot
                                     kvm_swap_active_memslots        (A)
                                     kvm_arch_flush_shadow_memslot   (B)
        same page sharing by KSM
        kvm_mmu_notifier_invalidate_range_start
              :
        kvm_mmu_notifier_change_pte
          kvm_handle_hva_range
          __kvm_handle_hva_range
          kvm_set_spte_gfn            (C)
              :
        kvm_mmu_notifier_invalidate_range_end
      
      Fix the issue by skipping the invalid memory slot at (C) to avoid the
      illegal stage2 mapping so that the read request for the pflash's status
      is forwarded to QEMU and emulated by it. In this way, the correct pflash's
      status can be returned from QEMU to break the infinite loop in the edk2
      firmware.
      
      We tried a git-bisect and the first problematic commit is cd4c7183 ("
      KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With this,
      clean_dcache_guest_page() is called after the memory slots are iterated
      in kvm_mmu_notifier_change_pte(). clean_dcache_guest_page() is called
      before the iteration on the memory slots before this commit. This change
      literally enlarges the racy window between kvm_mmu_notifier_change_pte()
      and memory slot removal so that we're able to reproduce the issue in a
      practical test case. However, the issue exists since commit d5d8184d
      ("KVM: ARM: Memory virtualization setup").
      
      Cc: stable@vger.kernel.org # v3.9+
      Fixes: d5d8184d ("KVM: ARM: Memory virtualization setup")
      Reported-by: default avatarShuai Hu <hshuai@redhat.com>
      Reported-by: default avatarZhenyu Zhang <zhenyzha@redhat.com>
      Signed-off-by: default avatarGavin Shan <gshan@redhat.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarOliver Upton <oliver.upton@linux.dev>
      Reviewed-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarShaoqin Huang <shahuang@redhat.com>
      Message-Id: <20230615054259.14911-1-gshan@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a7e51f80
    • Michal Luczaj's avatar
      KVM: Fix vcpu_array[0] races · 79486642
      Michal Luczaj authored and Frieder Schrempf's avatar Frieder Schrempf committed
      
      commit afb2acb2 upstream.
      
      In kvm_vm_ioctl_create_vcpu(), add vcpu to vcpu_array iff it's safe to
      access vcpu via kvm_get_vcpu() and kvm_for_each_vcpu(), i.e. when there's
      no failure path requiring vcpu removal and destruction. Such order is
      important because vcpu_array accessors may end up referencing vcpu at
      vcpu_array[0] even before online_vcpus is set to 1.
      
      When online_vcpus=0, any call to kvm_get_vcpu() goes through
      array_index_nospec() and ends with an attempt to xa_load(vcpu_array, 0):
      
      	int num_vcpus = atomic_read(&kvm->online_vcpus);
      	i = array_index_nospec(i, num_vcpus);
      	return xa_load(&kvm->vcpu_array, i);
      
      Similarly, when online_vcpus=0, a kvm_for_each_vcpu() does not iterate over
      an "empty" range, but actually [0, ULONG_MAX]:
      
      	xa_for_each_range(&kvm->vcpu_array, idx, vcpup, 0, \
      			  (atomic_read(&kvm->online_vcpus) - 1))
      
      In both cases, such online_vcpus=0 edge case, even if leading to
      unnecessary calls to XArray API, should not be an issue; requesting
      unpopulated indexes/ranges is handled by xa_load() and xa_for_each_range().
      
      However, this means that when the first vCPU is created and inserted in
      vcpu_array *and* before online_vcpus is incremented, code calling
      kvm_get_vcpu()/kvm_for_each_vcpu() already has access to that first vCPU.
      
      This should not pose a problem assuming that once a vcpu is stored in
      vcpu_array, it will remain there, but that's not the case:
      kvm_vm_ioctl_create_vcpu() first inserts to vcpu_array, then requests a
      file descriptor. If create_vcpu_fd() fails, newly inserted vcpu is removed
      from the vcpu_array, then destroyed:
      
      	vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus);
      	r = xa_insert(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, GFP_KERNEL_ACCOUNT);
      	kvm_get_kvm(kvm);
      	r = create_vcpu_fd(vcpu);
      	if (r < 0) {
      		xa_erase(&kvm->vcpu_array, vcpu->vcpu_idx);
      		kvm_put_kvm_no_destroy(kvm);
      		goto unlock_vcpu_destroy;
      	}
      	atomic_inc(&kvm->online_vcpus);
      
      This results in a possible race condition when a reference to a vcpu is
      acquired (via kvm_get_vcpu() or kvm_for_each_vcpu()) moments before said
      vcpu is destroyed.
      
      Signed-off-by: default avatarMichal Luczaj <mhal@rbox.co>
      Message-Id: <20230510140410.1093987-2-mhal@rbox.co>
      Cc: stable@vger.kernel.org
      Fixes: c5b07754 ("KVM: Convert the kvm->vcpus array to a xarray", 2021-12-08)
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      79486642
  5. Mar 10, 2023
  6. Feb 01, 2023
    • Yi Liu's avatar
      kvm/vfio: Fix potential deadlock on vfio group_lock · 6eb0fc92
      Yi Liu authored
      
      [ Upstream commit 51cdc8bc ]
      
      Currently it is possible that the final put of a KVM reference comes from
      vfio during its device close operation.  This occurs while the vfio group
      lock is held; however, if the vfio device is still in the kvm device list,
      then the following call chain could result in a deadlock:
      
      VFIO holds group->group_lock/group_rwsem
        -> kvm_put_kvm
         -> kvm_destroy_vm
          -> kvm_destroy_devices
           -> kvm_vfio_destroy
            -> kvm_vfio_file_set_kvm
             -> vfio_file_set_kvm
              -> try to hold group->group_lock/group_rwsem
      
      The key function is the kvm_destroy_devices() which triggers destroy cb
      of kvm_device_ops. It calls back to vfio and try to hold group_lock. So
      if this path doesn't call back to vfio, this dead lock would be fixed.
      Actually, there is a way for it. KVM provides another point to free the
      kvm-vfio device which is the point when the device file descriptor is
      closed. This can be achieved by providing the release cb instead of the
      destroy cb. Also rename kvm_vfio_destroy() to be kvm_vfio_release().
      
      	/*
      	 * Destroy is responsible for freeing dev.
      	 *
      	 * Destroy may be called before or after destructors are called
      	 * on emulated I/O regions, depending on whether a reference is
      	 * held by a vcpu or other kvm component that gets destroyed
      	 * after the emulated I/O.
      	 */
      	void (*destroy)(struct kvm_device *dev);
      
      	/*
      	 * Release is an alternative method to free the device. It is
      	 * called when the device file descriptor is closed. Once
      	 * release is called, the destroy method will not be called
      	 * anymore as the device is removed from the device list of
      	 * the VM. kvm->lock is held.
      	 */
      	void (*release)(struct kvm_device *dev);
      
      Fixes: 421cfe65 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM")
      Reported-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Suggested-by: default avatarKevin Tian <kevin.tian@intel.com>
      Reviewed-by: default avatarJason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: default avatarYi Liu <yi.l.liu@intel.com>
      Reviewed-by: default avatarMatthew Rosato <mjrosato@linux.ibm.com>
      Link: https://lore.kernel.org/r/20230114000351.115444-1-mjrosato@linux.ibm.com
      Link: https://lore.kernel.org/r/20230120150528.471752-1-yi.l.liu@intel.com
      
      
      [aw: update comment as well, s/destroy/release/]
      Signed-off-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      6eb0fc92
  7. Nov 23, 2022
  8. Nov 17, 2022
    • David Matlack's avatar
      KVM: Obey kvm.halt_poll_ns in VMs not using KVM_CAP_HALT_POLL · 9eb8ca04
      David Matlack authored
      
      Obey kvm.halt_poll_ns in VMs not using KVM_CAP_HALT_POLL on every halt,
      rather than just sampling the module parameter when the VM is first
      created. This restore the original behavior of kvm.halt_poll_ns for VMs
      that have not opted into KVM_CAP_HALT_POLL.
      
      Notably, this change restores the ability for admins to disable or
      change the maximum halt-polling time system wide for VMs not using
      KVM_CAP_HALT_POLL.
      
      Reported-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Fixes: acd05785 ("kvm: add capability for halt polling")
      Signed-off-by: default avatarDavid Matlack <dmatlack@google.com>
      Message-Id: <20221117001657.1067231-4-dmatlack@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      9eb8ca04
    • David Matlack's avatar
      KVM: Avoid re-reading kvm->max_halt_poll_ns during halt-polling · 175d5dc7
      David Matlack authored
      
      Avoid re-reading kvm->max_halt_poll_ns multiple times during
      halt-polling except when it is explicitly useful, e.g. to check if the
      max time changed across a halt. kvm->max_halt_poll_ns can be changed at
      any time by userspace via KVM_CAP_HALT_POLL.
      
      This bug is unlikely to cause any serious side-effects. In the worst
      case one halt polls for shorter or longer than it should, and then is
      fixed up on the next halt. Furthmore, this is still possible since
      kvm->max_halt_poll_ns are not synchronized with halts.
      
      Fixes: acd05785 ("kvm: add capability for halt polling")
      Signed-off-by: default avatarDavid Matlack <dmatlack@google.com>
      Message-Id: <20221117001657.1067231-3-dmatlack@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      175d5dc7
    • David Matlack's avatar
      KVM: Cap vcpu->halt_poll_ns before halting rather than after · 97b6847a
      David Matlack authored
      
      Cap vcpu->halt_poll_ns based on the max halt polling time just before
      halting, rather than after the last halt. This arguably provides better
      accuracy if an admin disables halt polling in between halts, although
      the improvement is nominal.
      
      A side-effect of this change is that grow_halt_poll_ns() no longer needs
      to access vcpu->kvm->max_halt_poll_ns, which will be useful in a future
      commit where the max halt polling time can come from the module parameter
      halt_poll_ns instead.
      
      Signed-off-by: default avatarDavid Matlack <dmatlack@google.com>
      Message-Id: <20221117001657.1067231-2-dmatlack@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      97b6847a
  9. Oct 31, 2022
  10. Oct 27, 2022
    • Sean Christopherson's avatar
      KVM: Reject attempts to consume or refresh inactive gfn_to_pfn_cache · ecbcf030
      Sean Christopherson authored
      
      Reject kvm_gpc_check() and kvm_gpc_refresh() if the cache is inactive.
      Not checking the active flag during refresh is particularly egregious, as
      KVM can end up with a valid, inactive cache, which can lead to a variety
      of use-after-free bugs, e.g. consuming a NULL kernel pointer or missing
      an mmu_notifier invalidation due to the cache not being on the list of
      gfns to invalidate.
      
      Note, "active" needs to be set if and only if the cache is on the list
      of caches, i.e. is reachable via mmu_notifier events.  If a relevant
      mmu_notifier event occurs while the cache is "active" but not on the
      list, KVM will not acquire the cache's lock and so will not serailize
      the mmu_notifier event with active users and/or kvm_gpc_refresh().
      
      A race between KVM_XEN_ATTR_TYPE_SHARED_INFO and KVM_XEN_HVM_EVTCHN_SEND
      can be exploited to trigger the bug.
      
      1. Deactivate shinfo cache:
      
      kvm_xen_hvm_set_attr
      case KVM_XEN_ATTR_TYPE_SHARED_INFO
       kvm_gpc_deactivate
        kvm_gpc_unmap
         gpc->valid = false
         gpc->khva = NULL
        gpc->active = false
      
      Result: active = false, valid = false
      
      2. Cause cache refresh:
      
      kvm_arch_vm_ioctl
      case KVM_XEN_HVM_EVTCHN_SEND
       kvm_xen_hvm_evtchn_send
        kvm_xen_set_evtchn
         kvm_xen_set_evtchn_fast
          kvm_gpc_check
          return -EWOULDBLOCK because !gpc->valid
         kvm_xen_set_evtchn_fast
          return -EWOULDBLOCK
         kvm_gpc_refresh
          hva_to_pfn_retry
           gpc->valid = true
           gpc->khva = not NULL
      
      Result: active = false, valid = true
      
      3. Race ioctl KVM_XEN_HVM_EVTCHN_SEND against ioctl
      KVM_XEN_ATTR_TYPE_SHARED_INFO:
      
      kvm_arch_vm_ioctl
      case KVM_XEN_HVM_EVTCHN_SEND
       kvm_xen_hvm_evtchn_send
        kvm_xen_set_evtchn
         kvm_xen_set_evtchn_fast
          read_lock gpc->lock
                                                kvm_xen_hvm_set_attr case
                                                KVM_XEN_ATTR_TYPE_SHARED_INFO
                                                 mutex_lock kvm->lock
                                                 kvm_xen_shared_info_init
                                                  kvm_gpc_activate
                                                   gpc->khva = NULL
          kvm_gpc_check
           [ Check passes because gpc->valid is
             still true, even though gpc->khva
             is already NULL. ]
          shinfo = gpc->khva
          pending_bits = shinfo->evtchn_pending
          CRASH: test_and_set_bit(..., pending_bits)
      
      Fixes: 982ed0de ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
      Cc: stable@vger.kernel.org
      Reported-by: default avatar: Michal Luczaj <mhal@rbox.co>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20221013211234.1318131-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ecbcf030
    • Michal Luczaj's avatar
      KVM: Initialize gfn_to_pfn_cache locks in dedicated helper · 52491a38
      Michal Luczaj authored
      
      Move the gfn_to_pfn_cache lock initialization to another helper and
      call the new helper during VM/vCPU creation.  There are race
      conditions possible due to kvm_gfn_to_pfn_cache_init()'s
      ability to re-initialize the cache's locks.
      
      For example: a race between ioctl(KVM_XEN_HVM_EVTCHN_SEND) and
      kvm_gfn_to_pfn_cache_init() leads to a corrupted shinfo gpc lock.
      
                      (thread 1)                |           (thread 2)
                                                |
       kvm_xen_set_evtchn_fast                  |
        read_lock_irqsave(&gpc->lock, ...)      |
                                                | kvm_gfn_to_pfn_cache_init
                                                |  rwlock_init(&gpc->lock)
        read_unlock_irqrestore(&gpc->lock, ...) |
      
      Rename "cache_init" and "cache_destroy" to activate+deactivate to
      avoid implying that the cache really is destroyed/freed.
      
      Note, there more races in the newly named kvm_gpc_activate() that will
      be addressed separately.
      
      Fixes: 982ed0de ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
      Cc: stable@vger.kernel.org
      Suggested-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarMichal Luczaj <mhal@rbox.co>
      [sean: call out that this is a bug fix]
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20221013211234.1318131-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      52491a38
    • Hou Wenlong's avatar
      KVM: debugfs: Return retval of simple_attr_open() if it fails · 180418e2
      Hou Wenlong authored
      
      Although simple_attr_open() fails only with -ENOMEM with current code
      base, it would be nicer to return retval of simple_attr_open() directly
      in kvm_debugfs_open().
      
      No functional change intended.
      
      Signed-off-by: default avatarHou Wenlong <houwenlong.hwl@antgroup.com>
      Message-Id: <69d64d93accd1f33691b8a383ae555baee80f943.1665975828.git.houwenlong.hwl@antgroup.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      180418e2
  11. Oct 22, 2022
  12. Oct 07, 2022
  13. Sep 29, 2022
  14. Sep 26, 2022
  15. Aug 19, 2022
    • Li kunyu's avatar
      KVM: Drop unnecessary initialization of "ops" in kvm_ioctl_create_device() · eceb6e1d
      Li kunyu authored
      
      The variable is initialized but it is only used after its assignment.
      
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarLi kunyu <kunyu@nfschina.com>
      Message-Id: <20220819021535.483702-1-kunyu@nfschina.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      eceb6e1d
    • Li kunyu's avatar
      KVM: Drop unnecessary initialization of "npages" in hva_to_pfn_slow() · 28249139
      Li kunyu authored
      
      The variable is initialized but it is only used after its assignment.
      
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarLi kunyu <kunyu@nfschina.com>
      Message-Id: <20220819022804.483914-1-kunyu@nfschina.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      28249139
    • Chao Peng's avatar
      KVM: Rename mmu_notifier_* to mmu_invalidate_* · 20ec3ebd
      Chao Peng authored
      
      The motivation of this renaming is to make these variables and related
      helper functions less mmu_notifier bound and can also be used for non
      mmu_notifier based page invalidation. mmu_invalidate_* was chosen to
      better describe the purpose of 'invalidating' a page that those
      variables are used for.
      
        - mmu_notifier_seq/range_start/range_end are renamed to
          mmu_invalidate_seq/range_start/range_end.
      
        - mmu_notifier_retry{_hva} helper functions are renamed to
          mmu_invalidate_retry{_hva}.
      
        - mmu_notifier_count is renamed to mmu_invalidate_in_progress to
          avoid confusion with mn_active_invalidate_count.
      
        - While here, also update kvm_inc/dec_notifier_count() to
          kvm_mmu_invalidate_begin/end() to match the change for
          mmu_notifier_count.
      
      No functional change intended.
      
      Signed-off-by: default avatarChao Peng <chao.p.peng@linux.intel.com>
      Message-Id: <20220816125322.1110439-3-chao.p.peng@linux.intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      20ec3ebd
    • Sean Christopherson's avatar
      KVM: Move coalesced MMIO initialization (back) into kvm_create_vm() · c2b82397
      Sean Christopherson authored
      
      Invoke kvm_coalesced_mmio_init() from kvm_create_vm() now that allocating
      and initializing coalesced MMIO objects is separate from registering any
      associated devices.  Moving coalesced MMIO cleans up the last oddity
      where KVM does VM creation/initialization after kvm_create_vm(), and more
      importantly after kvm_arch_post_init_vm() is called and the VM is added
      to the global vm_list, i.e. after the VM is fully created as far as KVM
      is concerned.
      
      Originally, kvm_coalesced_mmio_init() was called by kvm_create_vm(), but
      the original implementation was completely devoid of error handling.
      Commit 6ce5a090 ("KVM: coalesced_mmio: fix kvm_coalesced_mmio_init()'s
      error handling" fixed the various bugs, and in doing so rightly moved the
      call to after kvm_create_vm() because kvm_coalesced_mmio_init() also
      registered the coalesced MMIO device.  Commit 2b3c246a ("KVM: Make
      coalesced mmio use a device per zone") cleaned up that mess by having
      each zone register a separate device, i.e. moved device registration to
      its logical home in kvm_vm_ioctl_register_coalesced_mmio().  As a result,
      kvm_coalesced_mmio_init() is now a "pure" initialization helper and can
      be safely called from kvm_create_vm().
      
      Opportunstically drop the #ifdef, KVM provides stubs for
      kvm_coalesced_mmio_{init,free}() when CONFIG_KVM_MMIO=n (s390).
      
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220816053937.2477106-4-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      c2b82397
    • Sean Christopherson's avatar
      KVM: Unconditionally get a ref to /dev/kvm module when creating a VM · 405294f2
      Sean Christopherson authored
      
      Unconditionally get a reference to the /dev/kvm module when creating a VM
      instead of using try_get_module(), which will fail if the module is in
      the process of being forcefully unloaded.  The error handling when
      try_get_module() fails doesn't properly unwind all that has been done,
      e.g. doesn't call kvm_arch_pre_destroy_vm() and doesn't remove the VM
      from the global list.  Not removing VMs from the global list tends to be
      fatal, e.g. leads to use-after-free explosions.
      
      The obvious alternative would be to add proper unwinding, but the
      justification for using try_get_module(), "rmmod --wait", is completely
      bogus as support for "rmmod --wait", i.e. delete_module() without
      O_NONBLOCK, was removed by commit 3f2b9c9c ("module: remove rmmod
      --wait option.") nearly a decade ago.
      
      It's still possible for try_get_module() to fail due to the module dying
      (more like being killed), as the module will be tagged MODULE_STATE_GOING
      by "rmmod --force", i.e. delete_module(..., O_TRUNC), but playing nice
      with forced unloading is an exercise in futility and gives a falsea sense
      of security.  Using try_get_module() only prevents acquiring _new_
      references, it doesn't magically put the references held by other VMs,
      and forced unloading doesn't wait, i.e. "rmmod --force" on KVM is all but
      guaranteed to cause spectacular fireworks; the window where KVM will fail
      try_get_module() is tiny compared to the window where KVM is building and
      running the VM with an elevated module refcount.
      
      Addressing KVM's inability to play nice with "rmmod --force" is firmly
      out-of-scope.  Forcefully unloading any module taints kernel (for obvious
      reasons)  _and_ requires the kernel to be built with
      CONFIG_MODULE_FORCE_UNLOAD=y, which is off by default and comes with the
      amusing disclaimer that it's "mainly for kernel developers and desperate
      users".  In other words, KVM is free to scoff at bug reports due to using
      "rmmod --force" while VMs may be running.
      
      Fixes: 5f6de5cb ("KVM: Prevent module exit until all VMs are freed")
      Cc: stable@vger.kernel.org
      Cc: David Matlack <dmatlack@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220816053937.2477106-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      405294f2
    • Sean Christopherson's avatar
      KVM: Properly unwind VM creation if creating debugfs fails · 4ba4f419
      Sean Christopherson authored
      
      Properly unwind VM creation if kvm_create_vm_debugfs() fails.  A recent
      change to invoke kvm_create_vm_debug() in kvm_create_vm() was led astray
      by buggy try_get_module() handling adding by commit 5f6de5cb ("KVM:
      Prevent module exit until all VMs are freed").  The debugfs error path
      effectively inherits the bad error path of try_module_get(), e.g. KVM
      leaves the to-be-free VM on vm_list even though KVM appears to do the
      right thing by calling module_put() and falling through.
      
      Opportunistically hoist kvm_create_vm_debugfs() above the call to
      kvm_arch_post_init_vm() so that the "post-init" arch hook is actually
      invoked after the VM is initialized (ignoring kvm_coalesced_mmio_init()
      for the moment).  x86 is the only non-nop implementation of the post-init
      hook, and it doesn't allocate/initialize any objects that are reachable
      via debugfs code (spawns a kthread worker for the NX huge page mitigation).
      
      Leave the buggy try_get_module() alone for now, it will be fixed in a
      separate commit.
      
      Fixes: b74ed7a6 ("KVM: Actually create debugfs in kvm_create_vm()")
      Reported-by: default avatar <syzbot+744e173caec2e1627ee0@syzkaller.appspotmail.com>
      Cc: Oliver Upton <oliver.upton@linux.dev>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarOliver Upton <oliver.upton@linux.dev>
      Message-Id: <20220816053937.2477106-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      4ba4f419
  16. Aug 10, 2022
  17. Jul 29, 2022
  18. Jun 24, 2022
    • Vineeth Pillai's avatar
      KVM: debugfs: expose pid of vcpu threads · e36de87d
      Vineeth Pillai authored
      
      Add a new debugfs file to expose the pid of each vcpu threads. This
      is very helpful for userland tools to get the vcpu pids without
      worrying about thread naming conventions of the VMM.
      
      Signed-off-by: default avatarVineeth Pillai (Google) <vineeth@bitbyteword.org>
      Message-Id: <20220523190327.2658-1-vineeth@bitbyteword.org>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      e36de87d
    • David Matlack's avatar
      KVM: Allow for different capacities in kvm_mmu_memory_cache structs · 837f66c7
      David Matlack authored
      
      Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
      declaration time rather than being fixed for all declarations. This will
      be used in a follow-up commit to declare an cache in x86 with a capacity
      of 512+ objects without having to increase the capacity of all caches in
      KVM.
      
      This change requires each cache now specify its capacity at runtime,
      since the cache struct itself no longer has a fixed capacity known at
      compile time. To protect against someone accidentally defining a
      kvm_mmu_memory_cache struct directly (without the extra storage), this
      commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().
      
      In order to support different capacities, this commit changes the
      objects pointer array to be dynamically allocated the first time the
      cache is topped-up.
      
      While here, opportunistically clean up the stack-allocated
      kvm_mmu_memory_cache structs in riscv and arm64 to use designated
      initializers.
      
      No functional change intended.
      
      Reviewed-by: default avatarMarc Zyngier <maz@kernel.org>
      Signed-off-by: default avatarDavid Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-22-dmatlack@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      837f66c7
  19. Jun 20, 2022
    • Sean Christopherson's avatar
      KVM: Do not zero initialize 'pfn' in hva_to_pfn() · 943dfea8
      Sean Christopherson authored
      
      Drop the unnecessary initialization of the local 'pfn' variable in
      hva_to_pfn().  First and foremost, '0' is not an invalid pfn, it's a
      perfectly valid pfn on most architectures.  I.e. if hva_to_pfn() were to
      return an "uninitializd" pfn, it would actually be interpeted as a legal
      pfn by most callers.
      
      Second, hva_to_pfn() can't return an uninitialized pfn as hva_to_pfn()
      explicitly sets pfn to an error value (or returns an error value directly)
      if a helper returns failure, and all helpers set the pfn on success.
      
      The zeroing of 'pfn' was introduced by commit 2fc84311 ("KVM:
      reorganize hva_to_pfn"), probably to avoid "uninitialized variable"
      warnings on statements that return pfn.  However, no compiler seems
      to produce them, making the initialization unnecessary.
      
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      943dfea8
Loading