泰晓科技 (TinyLab) -- Focused on Linux: trace back to the source, see the big picture in the smallest detail!
Website: https://tinylab.org


RISC-V Linux Kernel and Related Technology News, Issue 77

Written by 呀呀呀 on 2024/02/08

Date: 20240204
Editor: 晓怡
Repository: RISC-V Linux Kernel Technology Research Activity
Sponsor: PLCT Lab, ISCAS

Kernel Updates

RISC-V Architecture Support

v8: KVM: selftests: Add SEV smoke test

Add a basic SEV smoke test. Unlike the intra-host migration tests, this one actually runs a small chunk of code in the guest.

v3: riscv: Use CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to set misaligned access speed

If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is enabled, no time needs to be spent probing misaligned access speed. Disable the probe in this case and report misaligned accesses as “fast”. On riscv, this config is selected if RISCV_EFFICIENT_UNALIGNED_ACCESS is selected, which depends on NONPORTABLE.
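
The idea can be sketched as follows; this is not the exact patch, and run_timed_probe() is a hypothetical stand-in for the existing timed probe (misaligned_access_speed and RISCV_HWPROBE_MISALIGNED_FAST are existing riscv symbols):

    /* Sketch: skip the runtime probe when the kernel is built for
     * hardware known to handle misaligned accesses efficiently. */
    static int check_unaligned_access_speed_all_cpus(void)
    {
        int cpu;

        if (IS_ENABLED(CONFIG_RISCV_EFFICIENT_UNALIGNED_ACCESS)) {
            /* Report "fast" for every CPU without measuring. */
            for_each_online_cpu(cpu)
                per_cpu(misaligned_access_speed, cpu) =
                    RISCV_HWPROBE_MISALIGNED_FAST;
            return 0;
        }

        return run_timed_probe();    /* hypothetical: the existing probe */
    }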

v1: riscv: Only flush the mm icache when setting an exec pte

We used to emit a flush_icache_all() whenever a dirty executable mapping is set in the page table, but we can instead call flush_icache_mm(), which only sends IPIs to cores currently running this mm and defers the icache flush for the others.
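
Roughly, the change looks like this sketch (the actual call site and signatures in the patch may differ):

    /* Sketch: only executable mappings need the icache flushed, and
     * flush_icache_mm() limits IPIs to CPUs currently running this mm,
     * deferring the flush for the rest. */
    static inline void set_pte_sketch(struct mm_struct *mm, pte_t *ptep, pte_t pte)
    {
        if (pte_dirty(pte) && pte_exec(pte))
            flush_icache_mm(mm, false);    /* was: flush_icache_all() */
        set_pte(ptep, pte);
    }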

v9: riscv: sophgo: add clock support for sg2042

This series adds clock controller support for sophgo sg2042.

v2: riscv: add CALLER_ADDRx support

CALLER_ADDRx returns the caller’s address at the specified level; these macros are used by several tracers. They eventually use __builtin_return_address(n) to get the caller’s address if the arch doesn’t define its own implementation.
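
The generic fallback is roughly what include/linux/ftrace.h already provides; architectures that cannot rely on __builtin_return_address() for n > 0 override ftrace_return_address():

    #ifndef ftrace_return_address0
    # define ftrace_return_address0 __builtin_return_address(0)
    #endif
    #ifndef ftrace_return_address
    # define ftrace_return_address(n) __builtin_return_address(n)
    #endif
    #define CALLER_ADDR0 ((unsigned long)ftrace_return_address0)
    #define CALLER_ADDR1 ((unsigned long)ftrace_return_address(1))
    #define CALLER_ADDR2 ((unsigned long)ftrace_return_address(2))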

v1: riscv: hwprobe: export VA_BITS

Some userspace applications (OpenJDK, for instance) use the free bits in pointers to insert additional information for their own logic. Currently they rely on parsing /proc/cpuinfo to obtain the number of virtual address bits in use [1]. Exporting VA_BITS through hwprobe provides a more stable interface.
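
For illustration, a consumer could query the value and pack a tag into the unused high bits. This is only a sketch: RISCV_HWPROBE_KEY_VA_BITS is the key proposed by this series and may change before merging; the riscv_hwprobe syscall itself is existing ABI.

    #include <stdint.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <asm/hwprobe.h>    /* struct riscv_hwprobe */

    static unsigned long query_va_bits(void)
    {
        /* RISCV_HWPROBE_KEY_VA_BITS: proposed key, subject to change */
        struct riscv_hwprobe pair = { .key = RISCV_HWPROBE_KEY_VA_BITS };

        if (syscall(__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0))
            return 39;    /* conservative Sv39 fallback */
        return pair.value;
    }

    /* Pack a small tag into the bits above the used VA range. */
    static uintptr_t tag_ptr(void *p, uintptr_t tag, unsigned long va_bits)
    {
        return (uintptr_t)p | (tag << va_bits);
    }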

Process Scheduling

v1: sched: Add trace events for Proxy Execution (PE)

Add sched_[start, finish]_task_selection trace events to measure the latency of PE patches in task selection.

Moreover, introduce trace events for interesting events in PE (a sketch of one such event follows the list):

  1. sched_pe_enqueue_sleeping_task: a task gets enqueued on wait queue of a sleeping task (mutex owner).
  2. sched_pe_cross_remote_cpu: dependency chain crosses remote CPU.
  3. sched_pe_task_is_migrating: mutex owner task migrates.
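
A sketch of what one of these tracepoints could look like; the field names are illustrative, not taken from the patch:

    TRACE_EVENT(sched_pe_enqueue_sleeping_task,
        TP_PROTO(struct task_struct *owner, struct task_struct *blocked),
        TP_ARGS(owner, blocked),
        TP_STRUCT__entry(
            __field(pid_t, owner_pid)
            __field(pid_t, blocked_pid)
        ),
        TP_fast_assign(
            __entry->owner_pid   = owner->pid;
            __entry->blocked_pid = blocked->pid;
        ),
        TP_printk("owner=%d blocked=%d",
                  __entry->owner_pid, __entry->blocked_pid)
    );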

v2: sched/fair: Defer CFS throttle to user entry

CFS tasks can end up throttled while holding locks that other, non-throttled tasks are blocking on.

For !PREEMPT_RT, this can be a source of latency due to the throttling causing a resource acquisition denial.

For PREEMPT_RT, this is worse and can lead to a deadlock:

  • A CFS task p0 gets throttled while holding read_lock(&lock)
  • A task p1 blocks on write_lock(&lock), making further readers enter the slowpath
  • A ktimers or ksoftirqd task blocks on read_lock(&lock)

v5: net/sched: Load modules via alias

These modules may be loaded lazily without user’s awareness and control. Add respective aliases to modules and request them under these aliases so that modprobe’s blacklisting mechanism (through aliases) works for them. (The same pattern exists e.g. for filesystem modules.)

For example (before the change):

    $ tc filter add dev lo parent 10: protocol ip prio 10 handle 1: cgroup
    # cls_cgroup module is loaded despite a blacklist cls_cgroup entry
    # in /etc/modprobe.d/*.conf
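
The fix follows the filesystem-module pattern: give the module a prefixed alias and request it by that alias, so a blacklist line keyed on the alias takes effect. A sketch (the exact alias prefix chosen by the series may differ):

    /* In cls_cgroup (sketch): */
    MODULE_ALIAS("net-cls-cgroup");

    /* At the request site, load by alias rather than bare name: */
    request_module("net-cls-%s", kind);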

Memory Management

v1: lib/bch.c: increase bitrev single conversion length

Optimize the performance of three functions (load_ecc8, store_ecc8, and bch_encode) by using a larger conversion length per step.
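
The core trick, sketched below, is to reverse 32 bits per step instead of 8. Reversing a whole word also reverses its byte order, so pairing bitrev32() with swab32() yields exactly a per-byte bitrev8() (load_ecc8_word() is a hypothetical helper name):

    #include <linux/bitrev.h>
    #include <linux/swab.h>
    #include <linux/string.h>

    static u32 load_ecc8_word(const u8 *src)
    {
        u32 w;

        memcpy(&w, src, sizeof(w));    /* native-endian load */
        /* swab32 + bitrev32 == bitrev8 applied to each byte in place */
        return bitrev32(swab32(w));
    }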

v1: mm/zswap: invalidate old entry when store fail or !zswap_enabled

We may encounter a duplicate entry in zswap_store():

  1. A swap slot freed to the per-cpu swap cache without invalidating its zswap entry, then reused. This has been fixed.

  2. In !exclusive load mode, a swapped-in folio leaves its zswap entry on the tree, then is swapped out again. This mode has been removed.

  3. A folio can be dirtied again after zswap_store(), so zswap_store() must run again. This should be handled correctly.

v1: meminfo: provide estimated per-node’s available memory

The system offers an estimate of each node’s available memory, in addition to the system-wide available memory provided by /proc/meminfo.

Like commit 34e431b0ae39 (“/proc/meminfo: provide estimated available memory”), it is more convenient to provide such an estimate in /sys/bus/node/devices/nodeX/meminfo. If things change in the future, we only have to change it in one place.

v5: -next: minor improvements for x86 mce processing

In this patchset, we remove the unused macro EX_TYPE_COPY and centralize the processing of memory-failure to do_machine_check() to avoid calling memory_failure_queue() separately for different MC-Safe scenarios. In addition, MCE_IN_KERNEL_COPYIN is renamed MCE_IN_KERNEL_COPY_MC to expand its usage scope.

v11: ACPI: APEI: handle synchronous exceptions in task work to send correct SIGBUS si_code

Changes since v5, addressing comments from Kefeng:

  • document return value of memory_failure()
  • drop redundant comments in call site of memory_failure()
  • make ghes_do_proc void and handle abnormal case within it
  • pick up reviewed-by tag from Kefeng Wang

v8: arm64/gcs: Provide support for GCS in userspace

The arm64 Guarded Control Stack (GCS) feature provides support for hardware protected stacks of return addresses, intended to provide hardening against return oriented programming (ROP) attacks and to make it easier to gather call stacks for applications such as profiling.

v3: mm/mmap: pass vma to vma_merge()

The vma_merge() callers pass mm, anon_vma and file, all of which come from the same vma, so there is no need to pass three parameters at the same time.

Pass the vma to vma_merge() instead of mm, anon_vma and file, saving two parameters.
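
The interface change can be sketched as below (parameter lists abbreviated relative to the real prototypes):

    /* Before: mm, anon_vma and file all derive from the same vma. */
    struct vm_area_struct *vma_merge(struct vma_iterator *vmi,
                                     struct mm_struct *mm,
                                     struct vm_area_struct *prev,
                                     unsigned long addr, unsigned long end,
                                     unsigned long vm_flags,
                                     struct anon_vma *anon_vma,
                                     struct file *file, pgoff_t pgoff);

    /* After: pass the vma once; look up mm, anon_vma and file inside. */
    struct vm_area_struct *vma_merge(struct vma_iterator *vmi,
                                     struct vm_area_struct *prev,
                                     struct vm_area_struct *vma,
                                     unsigned long addr, unsigned long end,
                                     unsigned long vm_flags, pgoff_t pgoff);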

v3: mm: memcg: Use larger batches for proactive reclaim

Before 0388536ac291 (“mm: vmscan: fix inaccurate reclaim during proactive reclaim”) we passed the number of pages for the reclaim request directly to try_to_free_mem_cgroup_pages, which could lead to significant overreclaim. After 0388536ac291 the number of pages was limited to a maximum of 32 (SWAP_CLUSTER_MAX) to reduce the amount of overreclaim. However, such a small batch size caused a regression in reclaim performance due to many more reclaim start/stop cycles inside memory_reclaim.

v1: mm: Reduce dependencies on <linux/kernel.h>

“page_counter.h” does not need <linux/kernel.h>. <linux/limits.h> is enough to get LONG_MAX.

Only a limited set of files include page_counter.h; they have been compile-tested or checked.

v2: iommu/iova: use named kmem_cache for iova magazines

The magazine buffers can take gigabytes of kmem memory, dominating all other allocations. For observability purposes, create a named slab cache so the iova magazine memory overhead can be clearly observed.
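
A sketch of the change; the cache name is an assumption, but kmem_cache_create() is the standard way to get a slab that shows up by name in /proc/slabinfo and slabtop:

    static struct kmem_cache *iova_magazine_cache;

    static int iova_magazine_cache_init(void)
    {
        iova_magazine_cache = kmem_cache_create("iommu_iova_magazine",
                                                sizeof(struct iova_magazine),
                                                0, SLAB_HWCACHE_ALIGN, NULL);
        return iova_magazine_cache ? 0 : -ENOMEM;
    }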

v5: mm/mempolicy: weighted interleave mempolicy and sysfs extension

(v5: style, retry interleave w/ mems_allowed cookie, fix sparse warnings, style, review tags)

v3: Enable >0 order folio memory compaction

This patchset enables >0 order folio memory compaction, which is one of the prerequisites for large folio support [1]. It includes the fix [4] for V2 and is on top of mm-everything-2024-01-29-07-19.

v2: Handle delay slot for extable lookup

This series fixes extable handling for architectures with delay slots (MIPS).

Please see previous discussions at [1].

There are some other places in the kernel that don’t handle delay slots properly, such as uprobe and kgdb; I’ll sort them out later.

v1: kasan: add atomic tests

Test that KASAN can detect some unsafe atomic accesses.

As discussed in the linked thread below, these tests attempt to cover the most common uses of atomics and, therefore, aren’t exhaustive.

Closes: https://bugzilla.kernel.org/show_bug.cgi?id=214055
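
A sketch of the kind of case such a test covers, written in the style of the existing KASAN KUnit tests (the test name is illustrative):

    static void atomic_use_after_free(struct kunit *test)
    {
        atomic_t *a = kmalloc(sizeof(*a), GFP_KERNEL);

        KUNIT_ASSERT_NOT_ERR_OR_NULL(test, a);
        kfree(a);
        /* KASAN should report a use-after-free for the atomic op. */
        KUNIT_EXPECT_KASAN_FAIL(test, atomic_inc(a));
    }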

v5: Transparent Contiguous PTEs for User Mappings

This is a series to opportunistically and transparently use contpte mappings (set the contiguous bit in ptes) for user memory when those mappings meet the requirements. The change benefits arm64, but there is some minor refactoring for x86 and powerpc to enable its integration with core-mm.

v1: regset: use vmalloc() for regset_get_alloc()

An order-7 allocation is (1 << 7) contiguous pages, or 512K. It’s not a surprise that this allocation failed on a system that’s been running for a while.

In this case we’re just generating a core dump and there’s no reason we need contiguous memory. Change the allocation to vmalloc(). We’ll change the free in binfmt_elf to kvfree() which works regardless of how the memory was allocated.
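
A sketch of the shape of the fix (simplified; the real code lives in regset_get_alloc() and binfmt_elf):

    static void *core_note_buf_alloc(size_t size)
    {
        /* was: kzalloc(size, GFP_KERNEL); an order-7 request can fail
         * on a fragmented system even with plenty of free memory. */
        return vzalloc(size);
    }

    /* ...and at the free site, kvfree() handles both kmalloc() and
     * vmalloc() memory, so the caller need not know which was used. */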

v1: mempool: kvmalloc pool

Add mempool_init_kvmalloc_pool() and mempool_create_kvmalloc_pool(), which wrap kvmalloc() (kmalloc() with a vmalloc() fallback) instead of plain kmalloc().
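
Usage would look roughly like this sketch; the helpers come from this series, so the names could still shift before merging:

    #include <linux/mempool.h>

    static mempool_t big_buf_pool;    /* hypothetical pool of 64 KiB buffers */

    static int big_buf_pool_init(void)
    {
        /* Keep at least 4 elements on hand; each allocation tries
         * kmalloc() first and falls back to vmalloc() when memory is
         * fragmented. */
        return mempool_init_kvmalloc_pool(&big_buf_pool, 4, 64 * 1024);
    }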

File Systems

v2: Restore data lifetime support

UFS devices are widely used in mobile applications, e.g. in smartphones. UFS vendors need data lifetime information to achieve good performance. Providing data lifetime information to UFS devices can result in up to 40% lower write amplification. Hence this patch series that restores the bi_write_hint member in struct bio. After this patch series has been merged, patches that implement data lifetime support in the SCSI disk (sd) driver will be sent to the Linux kernel SCSI maintainer.

v9: io_uring: add support for ftruncate

This patch adds support for doing truncate through io_uring, eliminating the need for applications to roll their own thread pool or offload mechanism to be able to do non-blocking truncates.
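
From userspace it would look roughly like this liburing sketch; io_uring_prep_ftruncate() is the helper that pairs with this series and needs a liburing release that ships it:

    #include <liburing.h>

    static int truncate_async(struct io_uring *ring, int fd, loff_t len)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;
        int ret;

        if (!sqe)
            return -EBUSY;
        io_uring_prep_ftruncate(sqe, fd, len);
        io_uring_submit(ring);
        if (io_uring_wait_cqe(ring, &cqe))
            return -EIO;
        ret = cqe->res;    /* 0 on success, -errno on failure */
        io_uring_cqe_seen(ring, cqe);
        return ret;
    }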

v1: Decomplicate file_dentry()

Miklos,

When posting the patches for file_user_path(), I wrote [1]:

“This change already makes file_dentry() moot, but for now we did not change this helper, just added a WARN_ON() in ovl_d_real() to catch if we have made any wrong assumptions.”

v1: remap_range: merge do_clone_file_range() into vfs_clone_file_range()

commit dfad37051ade (“remap_range: move permission hooks out of do_clone_file_range()”) moved the permission hooks from do_clone_file_range() out to its caller vfs_clone_file_range(), but left all the fast sanity checks in do_clone_file_range().

v1: fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

In struct address_space, there is a 32-byte gap between i_mmap and i_mmap_rwsem. Due to the 8-byte alignment of the struct’s members, in certain situations i_mmap and i_mmap_rwsem may end up in the same cache line.
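
In outline, the series rearranges the struct so the frequently written semaphore no longer false-shares a hot cache line with i_mmap; a sketch of the relevant members (not the final layout):

    struct address_space {
        /* ... */
        struct rb_root_cached   i_mmap;
        struct rw_semaphore     i_mmap_rwsem;    /* relocated by the patch */
        /* ... */
    };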

v1: __fs_parse: Correct a documentation comment

Commit 7f5d38141e30 (“new primitive: __fs_parse()”) changed the function to take p_log instead of fs_context.

So, update the documentation comment to refer to p_log instead.

v1: JFS folio conversion

This patchset removes uses of struct page from the I/O paths of JFS. write_begin and write_end are still passed a struct page, but they convert it to a folio as the first thing they do. The logmgr still uses a struct page, but I think that’s one we actually don’t want to convert, since it’s never inserted into the page cache.

v1: blk: optimization for classic polling

This removes the dependency on interrupts to wake up the polling task. Set the task state to TASK_RUNNING if need_resched() returns true while polling for IO completion. Previously, the polling task would sleep, relying on an interrupt to wake it up; this made some IO take very long when interrupt coalescing is enabled in NVMe.
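
The loop change can be sketched like this (bio_poll_sketch() stands in for the real polling helper):

    /* Sketch: while spinning on completions, never sit in a sleeping
     * state that only an IRQ would clear. */
    static int poll_for_completion(struct request_queue *q)
    {
        for (;;) {
            int found = bio_poll_sketch(q);    /* hypothetical helper */

            if (found)
                return found;
            if (need_resched()) {
                __set_current_state(TASK_RUNNING);
                cond_resched();
            }
        }
    }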


