RISC-V Linux Kernel and Related Technology News, Issue 83

Created by 呀呀呀 on 2024/03/17

Date: 20240317
Editor: 晓怡
Repository: RISC-V Linux Kernel Technology Research Activity
Sponsor: PLCT Lab, ISCAS

Kernel News

RISC-V Architecture Support

v1: riscv: Apply Zawrs when available

Zawrs provides two instructions (wrs.nto and wrs.sto), where both are meant to allow the hart to enter a low-power state while waiting on a store to a memory location.

v4: riscv: Optimize crc32 with Zbc extension

As suggested by the B-ext spec, the Zbc (carry-less multiplication) instructions can be used to accelerate CRC calculations.
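
To make the term concrete, the following is a minimal portable C sketch of what carry-less multiplication computes. The helper name `clmul_sw` is hypothetical and is not taken from the patch; the Zbc `clmul` instruction produces the same low 64 bits in hardware.

```c
#include <stdint.h>

/* Software model of carry-less multiplication (low 64 bits of the product):
 * partial products are combined with XOR instead of addition, which is
 * polynomial multiplication over GF(2), the arithmetic CRCs are built on. */
uint64_t clmul_sw(uint64_t a, uint64_t b)
{
    uint64_t r = 0;

    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1)
            r ^= a << i;
    }
    return r;
}
```

For example, `clmul_sw(3, 3)` corresponds to (x + 1)² = x² + 1 over GF(2).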

v3: riscv: sophgo: add dmamux support for Sophgo CV1800/SG2000 SoCs

Add dma multiplexer support for the Sophgo CV1800/SG2000 SoCs.

v2: Add board-id support for multiple DT selection

The software packages are shipped with multiple device tree blobs supporting multiple boards. For instance, suppose we have 3 SoCs, each with 4 supported boards, and 2 PMIC options for each case; this leads to a total of 3 × 4 × 2 = 24 DTB files.

v1: clk: starfive: jh7100: Use provided clocks instead of hardcoded names

The Starfive JH7100 clock driver does not use the DT “clocks” property to find its external input clocks, but instead relies on the names of the actual external clock providers.

v5: riscv: pwm: sophgo: add pwm support for CV1800

The Sophgo CV1800 chip provides a set of four independent PWM channel outputs. This series adds PWM controller support for Sophgo cv1800.

v1: Add StarFive’s StarLink-500 Cache Controller

StarFive’s StarLink-500 Cache Controller flushes and invalidates the cache using a non-conventional CMO method. This driver provides cache handling on StarFive RISC-V SoCs.

v1: riscv: Define TASK_SIZE_MAX for __access_ok()

TASK_SIZE_MAX should be set to the largest userspace address under any runtime configuration. This optimizes the check in __access_ok(), which no longer needs to compute the current value of TASK_SIZE.
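
The benefit of a constant upper bound can be sketched in plain C. This is only an illustration, not the kernel's code: `DEMO_TASK_SIZE_MAX` and `demo_access_ok()` are hypothetical names, and the limit value is an assumption chosen for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical compile-time upper bound on user addresses (assumed value). */
#define DEMO_TASK_SIZE_MAX (UINT64_C(1) << 47)

/* Overflow-safe range check: true if [addr, addr + size) lies below the
 * limit. Because the limit is a constant, no runtime lookup is needed. */
bool demo_access_ok(uint64_t addr, uint64_t size)
{
    return size <= DEMO_TASK_SIZE_MAX && addr <= DEMO_TASK_SIZE_MAX - size;
}
```

Writing the comparison as `addr <= limit - size` avoids the wrap-around that `addr + size <= limit` would suffer for very large inputs.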

v1: riscv: uaccess: Allow the last potential unrolled copy

When the dst buffer pointer points to the last accessible aligned addr, we could still run another iteration of unrolled copy.

v1: riscv: uaccess: Relax the threshold for fast path

The bytes copy for unaligned head would cover at most SZREG-1 bytes, so it’s better to set the threshold as >= (SZREG-1 + word_copy stride size) which equals to 9*SZREG-1.
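
The arithmetic can be checked with a small C helper. The function name is hypothetical, and the unroll factor of 8 register-sized words is inferred from the 9*SZREG-1 figure quoted above rather than read from the patch.

```c
#include <stddef.h>

/* Fast-path threshold as described: the worst-case unaligned head
 * (szreg - 1 bytes) plus one unrolled word-copy stride. */
size_t demo_fast_path_threshold(size_t szreg)
{
    const size_t unroll = 8; /* assumption: 8 register-sized words per iteration */

    return (szreg - 1) + unroll * szreg;
}
```

With SZREG = 8 (a 64-bit system) this gives 7 + 64 = 71 bytes, which equals 9*8-1.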

v1: riscv: hwprobe: do not produce ftrace relocation

Such relocation causes crash of android linker similar to one described in commit e05d57dcb8c7 (“riscv: Fixup __vdso_gettimeofday broke dynamic ftrace”).

v4: riscv: hwprobe: export highest virtual userspace address

Some userspace applications (OpenJDK, for instance) use the free MSBs in pointers to store additional information for their own logic, and need to learn from somewhere which bits are free.
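
As a rough sketch of how an application might use such a value, the helper below derives the number of always-zero high pointer bits from a highest userspace address. The function name is hypothetical; how the address is actually obtained from hwprobe is not shown here.

```c
#include <stdint.h>

/* Number of high-order pointer bits that are always zero for user
 * addresses, given the highest usable virtual address. */
int demo_free_pointer_msbs(uint64_t highest_user_addr)
{
    int used = 0;

    while (highest_user_addr) {
        used++;
        highest_user_addr >>= 1;
    }
    return 64 - used;
}
```

For an illustrative highest address of 0x3fffffffff (38 significant bits), 26 high bits would be available for tagging.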

v13: riscv: Create and document PR_RISCV_SET_ICACHE_FLUSH_CTX prctl

Improve the performance of icache flushing by creating a new prctl flag PR_RISCV_SET_ICACHE_FLUSH_CTX. The interface is left generic to allow for future expansions such as with the proposed J extension [1].

v1: riscv: Add tracepoints for SBI calls and returns

These are useful for measuring the latency of SBI calls. The SBI HSM extension is excluded because those functions are called from contexts such as cpuidle where instrumentation is not allowed.

v1: riscv: Do not save the scratch CSR during suspend

While the processor is executing kernel code, the value of the scratch CSR is always zero, so there is no need to save the value. Continue to write the CSR during the resume flow, so we do not rely on firmware to initialize it.

v12: riscv: Create and document PR_RISCV_SET_ICACHE_FLUSH_CTX prctl

Improve the performance of icache flushing by creating a new prctl flag PR_RISCV_SET_ICACHE_FLUSH_CTX. The interface is left generic to allow for future expansions such as with the proposed J extension [1].

v1: clocksource/drivers/timer-sun4i: Partially convert to a platform driver

Fix this by wrapping the timer initialization in a platform driver. builtin_platform_driver_probe() must be used because the driver uses timer_of_init(), which is marked as __init.

v2: drivers/perf: riscv: Disable PERF_SAMPLE_BRANCH_* while not supported

RISC-V perf driver does not yet support branch sampling. Although the specification is in the works [0], it is best to disable such events until support is available, otherwise we will get unexpected results.

v2: bpf-next: bpf: make tracing program support multi-link

For now, a BPF program of type BPF_PROG_TYPE_TRACING cannot be attached to multiple hooks, so we have to create a BPF program for each kernel function we want to trace, even though all the programs have the same (or similar) logic.

Process Scheduling

v2: sched/fair: simplify __calc_delta()

Commit 5e963f2bd4654a202a8a05aa3a86cb0300b10e6c (“sched/fair: Commit to EEVDF”) removed __calc_delta()’s use case where the input weight is not equal to NICE_0_LOAD.

v1: sched: Improve rq selection for a blocked task when its affinity changes

We observed that select_idle_sibling() is likely to return the target cpu early, which is likely to be the previous cpu this task was running on.

v1: sched/headers: do not set last_queued to 0 in arrive

This change leaves enqueue/dequeue on last_queued only, and corrects the pcount accounting.

Memory Management

v1: mm: memcg: add NULL check to obj_cgroup_put()

9 out of 16 callers perform a NULL check before calling obj_cgroup_put(). Move the NULL check in the function, similar to mem_cgroup_put().
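
The pattern is easy to show outside the kernel. The sketch below uses a hypothetical `struct demo_obj` and `demo_obj_put()`; it only illustrates moving the NULL check into the callee and is not the memcg code.

```c
#include <stddef.h>

struct demo_obj {
    int refcount;
};

/* NULL-tolerant put: callers no longer need their own NULL check. */
void demo_obj_put(struct demo_obj *obj)
{
    if (obj)
        obj->refcount--;
}
```

Callers can then write `demo_obj_put(ptr);` unconditionally, which removes the repeated `if (ptr)` guards at each call site.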

v1: sysctl: treewide: constify ctl_table argument of sysctl handlers

Patch 1 is a bugfix for the stack_erasing sysctl handler
Patches 2-10 change various helper functions throughout the kernel to be able to handle ‘const ctl_table’.

v1: mm: Increase folio batch size

On a 104 thread, 2 socket Skylake system, Intel report a 4.7% performance reduction with will-it-scale page_fault2. This was due to reducing the size of the batch from 32 to 15.

v2: mm: support multi-size THP numa balancing

Now the anonymous page allocation already supports multi-size THP (mTHP), but the numa balancing still prohibits mTHP migration even though it is an exclusive mapping, which is unreasonable.

v2: Revert “mm: skip CMA pages when they are not available”

This reverts commit b7108d66318a (“Multi-gen LRU: skip CMA pages when they are not eligible”) and commit 5da226dbfce3 (“mm: skip CMA pages when they are not available”).

**[v1: mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly](http://lore.kernel.org/linux-mm/20240314161300.382526-1-david@redhat.com/)**

Derrick reports that in some cases where pread() would fail with -EIO and mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ / MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT.

v2: vmstat: Keep count of the maximum page reached by the kernel stack

CONFIG_DEBUG_STACK_USAGE provides a mechanism to determine the minimum amount of memory left in a stack. Every time a new low-memory record is reached, a message is printed to the console.

v1: cpufreq: dt: always allocate zeroed cpumask

Commit 0499a78369ad (“ARM64: Dynamically allocate cpumasks and increase supported CPUs to 512”) changed the handling of cpumasks on ARM 64bit.

v1: selftests/mm: virtual_address_range: Switch to ksft_exit_fail_msg

mmap() must not succeed in validate_lower_address_hint(), for if it does, it is a bug in mmap() itself. Reflect this behaviour with ksft_exit_fail_msg(). While at it, do some formatting changes.

v1: Add XSAVE layout description to Core files for debuggers to support varying XSAVE layouts

This patch proposes to add an extra .note section in the corefile to dump the CPUID information of a machine. This is being done to solve the issue of tools like the debuggers having to deal with coredumps from machines with varying XSAVE layouts in spite of having the same XCR0 bits.

v1: selftests: mm: restore settings from only parent process

The atexit() is called from parent process as well as forked processes. Hence the child restores the settings at exit while the parent is still executing. Fix this by checking pid of atexit() calling process and only restore THP number from parent process.

v1: Reliable testing for hugetlb

We need to handle pages allocated by hugetlbfs differently in a number of places. If we have a reference to the folio, there are many methods which will work. If we don’t, most of our previous attempts can misidentify a page which never belonged to hugetlb as belonging to hugetlb.

v3: enable bs > ps in XFS

This is the third version of the series that enables block size > page size (Large Block Size) in XFS. The context and motivation can be seen in cover letter of the RFC v1[1]. We also recorded a talk about this effort at LPC [3], if someone would like more context on this effort.

v10: Support page table check PowerPC

Support page table check on all PowerPC platforms. This works by serialising assignments, reassignments and clears of page table entries at each level in order to ensure that anonymous mappings have at most one writable consumer, and likewise that file-backed mappings are not simultaneously also anonymous mappings.

v1: filemap: replace pte_offset_map() with pte_offset_map_nolock()

The vmf->ptl in filemap_fault_recheck_pte_none() is still set from handle_pte_fault(). But at the same time, we did a pte_unmap(vmf->pte). After a pte_unmap(vmf->pte) unmap and rcu_read_unlock(), the page table may be racily changed and vmf->ptl maybe fails to protect the actual page table. Fix this by replacing pte_offset_map() with pte_offset_map_nolock().

v3: Cover a guard gap corner case

For v3, the change is in the struct vm_unmapped_area_info zeroing patches. Per discussion[0], they are switched to a method of initializing the struct at the callers that also doesn’t leave useless statements as cleanup, but is a bit easier to manually inspect for bugs. The arch’s that acked the old versions are left separate. What’s left after that happens in a treewide change.

v6: zswap: replace RB tree with xarray

A very deep RB tree requires rebalancing at times, which contributes to zswap fault latencies. An xarray does not need to perform tree rebalancing. Replacing the RB tree with an xarray can yield a small performance gain.

v4: Fast kernel headers: split linux/mm.h

This patch set aims to clean up the linux/mm.h header and reduce dependencies on it by moving parts out.

v1: memtest: use {READ,WRITE}_ONCE in memory scanning

memtest failed to find bad memory when compiled with clang. So use {WRITE,READ}_ONCE to access memory to avoid compiler over optimization.
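
A userspace approximation shows why the accessors matter. `DEMO_WRITE_ONCE` and `DEMO_READ_ONCE` below are hypothetical stand-ins for the kernel macros, built on volatile accesses; without them, a compiler may assume the value just written is still there and optimize the read-back comparison away.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel macros: volatile accesses that the
 * compiler may not elide or merge. */
#define DEMO_WRITE_ONCE(x, v) (*(volatile uint64_t *)&(x) = (v))
#define DEMO_READ_ONCE(x)     (*(volatile uint64_t *)&(x))

/* Write a pattern, read it back, and count mismatching words. */
size_t demo_memtest(uint64_t *buf, size_t n, uint64_t pattern)
{
    size_t bad = 0;

    for (size_t i = 0; i < n; i++)
        DEMO_WRITE_ONCE(buf[i], pattern);
    for (size_t i = 0; i < n; i++)
        if (DEMO_READ_ONCE(buf[i]) != pattern)
            bad++;
    return bad;
}
```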

v2: Improved Memory Tier Creation for CPUless NUMA Nodes

When a memory device, such as CXL1.1 type3 memory, is emulated as normal memory (E820_TYPE_RAM), the memory device is indistinguishable from normal DRAM in terms of memory tiering with the current implementation.

v1: lib/test_hmm.c: handle src_pfns and dst_pfns allocation failure

The kcalloc() in dmirror_device_evict_chunk() will return null if the physical memory has run out. As a result, if src_pfns or dst_pfns is dereferenced, the null pointer dereference bug will happen.

v2: mm/migrate: put dest folio on deferred split list if source was there.

Commit 616b8371539a6 (“mm: thp: enable thp migration in generic path”) did not check if a THP is on deferred split list before migration, thus, the destination THP is never put on deferred split list even if the source THP might be.

v1: Dynamic Kernel Stacks

This feature allows the kernel stack to grow dynamically, from 4KiB up to THREAD_SIZE. The intent is to save memory on fleet machines. Initial experiments show average savings of 70-75% of kernel stack memory.

v4: Swap-out mTHP without splitting

This series adds support for swapping out multi-size THP (mTHP) without needing to first split the large folio via split_huge_page_to_list_to_order(). It closely follows the approach already used to swap-out PMD-sized THP.

v1: mm/slub: Simplify get_partial_node()

Remove the check of !kmem_cache_has_cpu_partial() because it is always false, we’ve known this by calling kmem_cache_debug() before calling remove_partial(), so we can remove the check.

v3: Memory management patches needed by Rust Binder

This patchset contains some abstractions needed by the Rust implementation of the Binder driver for passing data between userspace, kernelspace, and directly into other processes.

File Systems

v3: sysctl: treewide: prepare ctl_table_root for ctl_table constification

The two patches were previously submitted on their own. In commit f9436a5d0497 (“sysctl: allow to change limits for posix messages queues”) a code dependency was introduced between the two callbacks. This code dependency results in a dependency between the two patches, so now they are submitted as a series.

v1: cifs: Move some extern decls from .c files to .h

Move the following:

    extern mempool_t *cifs_sm_req_poolp;
    extern mempool_t *cifs_req_poolp;
    extern mempool_t *cifs_mid_poolp;
    extern bool disable_legacy_dialects;

from various .c files to cifsglob.h.

v1: fs,block: get holder during claim

Now that we open block devices as files we need to deal with the realities that closing is a deferred operation. An operation on the block device such as e.g., freeze, thaw, or removal that runs concurrently with umount, tries to acquire a stable reference on the holder.

v1: -next: fs: Add kernel-doc comments to cuse_process_init_reply()

This commit adds kernel-doc style comments with complete parameter descriptions for the function cuse_process_init_reply.

v1: -next: fs: Add kernel-doc comments to proc_create_net_data_write()

This commit adds kernel-doc style comments with complete parameter descriptions for the function proc_create_net_data_write.

v1: bit more FS_IOC_GETFSUUID, FS_IOC_GETFSSYSFSPATH

implement FS_IOC_GETFSUUID, FS_IOC_GETFSSYSFSPATH a bit more
also: https://evilpiepirate.org/git/bcachefs.git/commit/?h=bcachefs-sysfs-ioctls

Fwd:GIT PULL: vfs uuid

Do you have sample programs for these programs (or even better mini-xfstest programs) that we can use to make sure this e.g. works for cifs.ko (which has similar concept to FS UUID for most remote filesystems etc.)?

v2: xfs: allow cross-linking special files without project quota

There’s an issue: if a special file is created before project quota is enabled, then it’s not possible to link this file. This works fine for normal files. This happens because xfs_quota skips special files (no ioctls to set necessary flags). The check for having the same project ID for source and destination then fails as the source file doesn’t have any ID.

v1: afs: Revert “afs: Hide silly-rename files from userspace”

This also reverts commit 5f7a07646655fb4108da527565dcdc80124b14c4 (“afs: Fix endless loop in directory parsing”) as that’s a bugfix for the above.

GIT PULL: zonefs changes for 6.9-rc1

The following changes since commit 841c35169323cd833294798e58b9bf63fa4fa1de:
Linux 6.8-rc4 (2024-02-11 12:18:13 -0800)
are available in the Git repository at:
ssh://git@gitolite.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs tags/zonefs-6.9-rc1

v1: blk: optimization for classic polling

This removes the dependency on interrupts to wake up task. Set task state as TASK_RUNNING, if need_resched() returns true, while polling for IO completion. Earlier, polling task used to sleep, relying on interrupt to wake it up. This made some IO take very long when interrupt-coalescing is enabled in NVMe.

Networking

v1: caif: Use UTILITY_NAME_LENGTH instead of hard-coding 16

UTILITY_NAME_LENGTH is 16. So better use the former when defining the ‘utility_name’ array. This makes the intent clearer when it is used around line 260.

v1: iproute2-next: arpd: create /var/lib/arpd on first use

The motivation is to build distribution packages without /var to move towards stateless systems, see link below (TL;DR: provisioning anything outside of /usr on boot).

v1: net: tcp: Clear req->syncookie in reqsk_alloc().

syzkaller reported a read of uninit req->syncookie. [0]
Originally, req->syncookie was used only in tcp_conn_request() to indicate if we need to encode SYN cookie in SYN+ACK, so the field remains uninitialised in other places.

v1: NXP S32G3 SoC initial bring-up

This series brings up initial support for the NXP S32G3 SoC (8 x cortex-a53), used on the S32G-VNP-RDB3 board [1].

v11: net-next/bnx2x: refactor common code to bnx2x_stop_nic()

Refactor common code which disables and releases HW interrupts, deletes NAPI objects, into a new bnx2x_stop_nic() function.

v11: net/bnx2x: Prevent access to a freed page in page_pool

Fix race condition leading to system crash during EEH error handling
During EEH error recovery, the bnx2x driver’s transmit timeout logic could cause a race condition when handling reset tasks. The bnx2x_tx_timeout() schedules reset tasks via bnx2x_sp_rtnl_task(), which ultimately leads to bnx2x_nic_unload().

v4: net: Report RCU QS for busy network kthreads

This changeset fixes a common problem for busy networking kthreads. These threads, e.g. NAPI threads, typically will do:
polling a batch of packets
if there is more work, call cond_resched to allow scheduling
continue to poll more packets when rx queue is not empty

v1: vfs, nfsd, nfs: implement directory delegations

NFSv4.1 adds a new GET_DIR_DELEGATION operation, to allow clients to request a delegation on a directory. If the client holds a directory delegation, then it knows that nothing will change the dentries in it until it has been recalled.

v1: net: always initialize sysctl ownership

The sysctl core does not initialize these fields when the set_ownership callback is present. So always do it in the callback.

v1: net: mlxbf_gige: open() should call request_irq() after NAPI init

This patch fixes an exception that occurs during open() when kdump is enabled.

v1: bpf-next: Enhancing selftests/xsk Framework: Maximum and Minimum Ring Configurations

Please find enclosed a patch set that introduces enhancements and new test cases to the selftests/xsk framework. These test the robustness and reliability of AF_XDP across both minimal and maximal ring size configurations.

v2: net: hsr: Handle failures in module init

A failure during registration of the netdev notifier was not handled at all. A failure during netlink initialization did not unregister the netdev notifier.

v1: net: Do not break out of sk_stream_wait_memory() with TIF_NOTIFY_SIGNAL

It can happen that a socket sends the remaining data at close() time. With io_uring and KTLS it can happen that sk_stream_wait_memory() bails out with -512 (-ERESTARTSYS) because TIF_NOTIFY_SIGNAL is set for the current task. This flag has been set in io_req_normal_work_add() by calling task_work_add().

v2: net: rds: introduce acquire/release ordering in acquire/release_in_xmit()

acquire/release_in_xmit() work as bit lock in rds_send_xmit(), so they are expected to ensure acquire/release memory ordering semantics. However, test_and_set_bit/clear_bit() don’t imply such semantics, on top of this, following smp_mb__after_atomic() does not guarantee release ordering (memory barrier actually should be placed before clear_bit()).
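
The intended ordering can be sketched with C11 atomics. This is a userspace analogy, not the RDS code: the function names and the `DEMO_IN_XMIT` bit are hypothetical, and C11 memory orders stand in for the kernel's bit-lock primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_IN_XMIT 1u /* hypothetical flag bit */

/* Try to take the bit lock. Acquire ordering keeps the critical section
 * from being reordered before the point where the bit is taken. */
bool demo_acquire_in_xmit(atomic_uint *flags)
{
    unsigned int old = atomic_fetch_or_explicit(flags, DEMO_IN_XMIT,
                                                memory_order_acquire);

    return !(old & DEMO_IN_XMIT);
}

/* Drop the bit lock. Release ordering makes earlier writes visible
 * before the bit is observed as clear. */
void demo_release_in_xmit(atomic_uint *flags)
{
    atomic_fetch_and_explicit(flags, ~DEMO_IN_XMIT, memory_order_release);
}
```

The key point mirrored here is that the barrier belongs to the clearing operation itself (or before it), not after it.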

v3: net: i40e: Enforce software interrupt during busy-poll exit

As with the ice bug fixed by commit b7306b42beaf (“ice: manage interrupts during poll exit”) followed by commit 23be7075b318 (“ice: fix software generating extra interrupts”), I’m seeing a similar issue with the i40e driver.

v1: net-next: net: phy: aquantia: add support for AQR114C PHY ID

Add support for AQR114C PHY ID. This PHY advertises 10G speed but supports only up to 5G speed.

v2: net-next: net: phy: don’t resume device not in use

In the case when an MDIO bus contains PHY device not attached to any netdev or is attached to the external netdev, controlled by another driver and the driver is disabled, the bus, when PM resume occurs, is trying to resume also the unattached phydev.

v1: net: tools: ynl: add header guards for nlctrl

I “extracted” YNL C into a GitHub repo to make it easier to use in other projects: https://github.com/linux-netdev/ynl-c

v1: net: move dev->state into net_device_read_txrx group

dev->state can be read in rx and tx fast paths.
netif_running() which needs dev->state is called from
enqueue_to_backlog() [RX path]
__dev_direct_xmit() [TX path]

v2: iproute2-next: Support for nexthop group statistics

Next hop group stats allow verification of balancedness of a next hop group. The feature was merged in kernel commit 7cf497e5a122 (“Merge branch ‘nexthop-group-stats’”). This patchset adds to ip the corresponding support.

v1: net: packet: annotate data-races around ignore_outgoing

ignore_outgoing is read locklessly from dev_queue_xmit_nit() and packet_getsockopt()
Add appropriate READ_ONCE()/WRITE_ONCE() annotations.

v1: net: provides dim profile fine-tuning channels

The NetDIM library provides excellent acceleration for many modern network cards. However, the default profiles of DIM limits its maximum capabilities for different NICs, so providing a channel through which the NIC can be custom configured is necessary.

v1: iproute2: ifstat: handle strdup return value

get_nlmsg_extended() is missing the strdup() return value check that get_nlmsg() already performs.

v1: net/sched: Forbid assigning mirred action to a filter attached to the egress

As we all know, the mirred action is used for mirroring or redirecting the packets it receives.

v2: net: dsa: mt7530: prevent possible incorrect XTAL frequency selection

On MT7530, the HT_XTAL_FSEL field of the HWTRAP register stores a 2-bit value that represents the frequency of the crystal oscillator connected to the switch IC. The field is populated by the state of the ESW_P4_LED_0 and ESW_P4_LED_0 pins, which is done right after reset is deasserted.

v1: atm: Convert sprintf/snprintf to sysfs_emit

Per filesystems/sysfs.rst, show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space.

v1: net: dsa: add return value check of genphy_read_status()

Need to check return value of genphy_read_status(), because higher in the call hierarchy is the dsa_register_switch() function, which is used in various drivers.

v1: CAPI: return -ENOMEM when kmalloc failed

The driver is using -1 instead of the -ENOMEM defined macro to specify that a buffer allocation failed.
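
A minimal userspace sketch of the convention follows. `demo_alloc_buffer()` is a hypothetical helper, with `malloc()` standing in for `kmalloc()`.

```c
#include <errno.h>
#include <stdlib.h>

/* Allocate a buffer and report failure with a descriptive errno-style
 * code instead of a bare -1. */
int demo_alloc_buffer(size_t len, void **out)
{
    void *p = malloc(len);

    if (!p)
        return -ENOMEM;
    *out = p;
    return 0;
}
```

Returning `-ENOMEM` lets callers propagate the error unchanged and tells them why the call failed.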

v1: iproute2: ematch: support JSON output

The ematch in filter was missing support JSON output and therefore would generate bogus output.

v1: Documentation: networking: document CAN ISO-TP

While the in-kernel ISO-TP stack is fully functional and easy to use, no documentation exists for it. This patch adds such documentation, containing the very basics of the protocol, the APIs and a basic example.

v4: ipsec-next: xfrm: Add Direction to the SA in or out

This patch introduces the ‘dir’ attribute, ‘in’ or ‘out’, to the xfrm_state, SA, enhancing usability by delineating the scope of values based on direction. An input SA will now exclusively encompass values pertinent to input, effectively segregating them from output-related values. This change aims to streamline the configuration process and improve the overall clarity of SA attributes.

v3: net: veth: ability to toggle GRO and XDP independently

It is rather confusing that GRO is automatically enabled, when an XDP program is attached to a veth interface. Moreover, it is not possible to disable GRO on a veth, if an XDP program is attached (which might be desirable in some use cases).

v1: cxgb4: unnecessary check for 0 in the free_sge_txq_uld() function

The free_sge_txq_uld() function has an unnecessary txq check of 0. This check is not necessary, since the txq pointer is initialized by the uldtxq[i] address from the operation &txq_info->uldtxq[i], which ensures that txq is not equal to 0.

v1: bpf-next: arm64: bpf: zero upper bits after rev32

Commit d63903bbc30c7 (“arm64: bpf: fix endianness conversion bugs”) added upper bits zeroing to byteswap operations, but it assumes they will be already zeroed after rev32.

v1: netpoll: support sending over raw IP interfaces

Currently, netpoll only supports interfaces with an ethernet-compatible link layer. Certain interfaces like SLIP do not have a link layer on the network interface level at all and expect raw IP packets, and could benefit from being supported by netpoll.

v4: iwl-net: i40e: Prevent setting MTU if greater than MFS

Commit 6871a7de705 (“[intelxl] Use admin queue to set port MAC address and maximum frame size”) from iPXE project set the MFS to 0x600 = 1536. See https://github.com/ipxe/ipxe/commit/6871a7de705

[v2 net PATCH] octeontx2-pf: Disable HW TSO for seg size < 16B

Current NIX hardware does not support TSO for segment sizes of less than 16 bytes. This patch disables HW TSO for such packets.

v1: iproute2: constify tc XXX_util structures

Constify the pointers to tc util structs. The only place they need to be mutable is when discovering and linking in new util structs.

Security Hardening

v3: checkpatch: add check for snprintf to scnprintf

I am going to quote Lee Jones who has been doing some snprintf -> scnprintf refactorings:
“There is a general misunderstanding amongst engineers that {v}snprintf() returns the length of the data actually encoded into the destination array.
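
The difference is easy to demonstrate in userspace. `demo_scnprintf()` below is a hypothetical approximation of the kernel's scnprintf() built on standard `vsnprintf()`; it is a sketch, not the kernel implementation.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace approximation of scnprintf(): returns the number of characters
 * actually stored in buf, excluding the trailing NUL. */
int demo_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    int n;

    if (size == 0)
        return 0;
    va_start(ap, fmt);
    n = vsnprintf(buf, size, fmt, ap);
    va_end(ap);
    if (n < 0)
        return 0;
    /* vsnprintf() reports the length the full output would have had. */
    return (size_t)n >= size ? (int)(size - 1) : n;
}
```

With a 4-byte buffer and the string "hello", `snprintf()` returns 5 (the untruncated length), while this helper returns 3 (the characters actually written). Using the former as an offset into the buffer is how overruns creep in.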

v1: ubsan: Disable signed integer overflow sanitizer on GCC < 8

For opting functions out of sanitizer coverage, the “no_sanitize” attribute is used, but in GCC this wasn’t introduced until GCC 8. Disable the sanitizer unless we’re not using GCC, or it is GCC version 8 or higher.

v1: hrtimer:Add get_hrtimer_cpu_base()

On the Arm platform, arch_timer may experience an IRQ storm. By using the next_timer of hrtimer_cpu_base, it is possible to quickly locate abnormal timers. As it is used by an out-of-tree module, the function needs to be exported.

v1: gcc-plugins: disable plugins when gmp.h is unavailable

Without gmp.h the plugin build fails.

v2: bcachefs: Prefer struct_size over open coded arithmetic

This is an effort to get rid of all multiplications from allocation functions in order to prevent integer overflows [1][2].
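
What an overflow-checked size helper does can be sketched in userspace. `struct demo_msg` and `demo_msg_size()` are hypothetical; the sketch assumes GCC or Clang for the `__builtin_*_overflow` builtins and mimics the saturating behaviour of the kernel's struct_size().

```c
#include <stddef.h>
#include <stdint.h>

struct demo_msg {
    size_t count;
    uint32_t items[]; /* flexible array member */
};

/* Overflow-checked size of a demo_msg carrying n items. Saturates to
 * SIZE_MAX on overflow so that a subsequent allocation fails instead of
 * silently returning an undersized buffer. */
size_t demo_msg_size(size_t n)
{
    size_t bytes;

    if (__builtin_mul_overflow(n, sizeof(uint32_t), &bytes) ||
        __builtin_add_overflow(bytes, sizeof(struct demo_msg), &bytes))
        return SIZE_MAX;
    return bytes;
}
```

The open-coded form `sizeof(*msg) + n * sizeof(uint32_t)` would wrap around for a huge `n` and yield a small, wrong size.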

Asynchronous IO

v2: io_uring/net: ensure async prep handlers always initialize ->done_io

If we get a request with IOSQE_ASYNC set, then we first run the prep async handlers. But if we then fail setting it up and want to post a CQE with -EINVAL, we use ->done_io.

v1: io_uring/rw: return IOU_ISSUE_SKIP_COMPLETE for multishot retry

If read multishot is being invoked from the poll retry handler, then we should return IOU_ISSUE_SKIP_COMPLETE rather than -EAGAIN. If not, then a CQE will be posted with -EAGAIN rather than triggering the retry when the file is flagged as readable again.

Rust For Linux

v1: Rust block device driver API and null block driver

This is the second version of the Rust block device driver API and the Rust null block driver. The context and motivation can be seen in cover letter of the RFC v1 [1]. If more context is required, a talk about this effort was recorded at LPC [2]. I hope to be able to discuss this series at LSF this year [3].

v3: Arc methods for linked list

This patchset contains two useful methods for the Arc type. They will be used in my Rust linked list implementation, which Rust Binder uses. See the Rust Binder RFC [1] for more information. Both these commits and the linked list that uses them are present in the branch referenced by the RFC.

BPF

v3: bpf-next: selftests/bpf: scale benchmark counting by using per-CPU counters

When benchmarking with multiple threads (-pN, where N>1), we start contending on single atomic counter that both BPF trigger benchmarks are using, as well as “baseline” tests in user space (trig-base and trig-uprobe-base benchmarks). As such, we start bottlenecking on something completely irrelevant to benchmark at hand.

v1: bpf-next: BPF raw tracepoint support for BPF cookie

Add ability to specify and retrieve BPF cookie for raw tracepoint programs. Both BTF-aware (SEC(“tp_btf”)) and non-BTF-aware (SEC(“raw_tp”)) are supported, as they are exactly the same at runtime.

v5: dwarves: pahole: Inject kfunc decl tags into BTF

This patchset teaches pahole to parse symbols in .BTF_ids section in vmlinux and discover exported kfuncs. Pahole then takes the list of kfuncs and injects a BTF_KIND_DECL_TAG for each kfunc.

v2: bpf-next: bpf: Allow helper bpf_get_[ns_]current_pid_tgid() for all prog types

Currently bpf_get_current_pid_tgid() is allowed in tracing, cgroup and sk_msg progs while bpf_get_ns_current_pid_tgid() is only allowed in tracing progs.

v4: Extend HID-BPF kfuncs (was: allow HID-BPF to do device IOs)

New version of the sleepable bpf_timer code, without BPF changes, as they can now go through the HID tree independently:
https://lore.kernel.org/all/20240315-hid-bpf-sleepable-v4-0-5658f2540564@kernel.org/
For reference, the use cases I have in mind:

v4: bpf-next: sleepable bpf_timer (was: allow HID-BPF to do device IOs)

New version of the sleepable bpf_timer code, without the HID changes, as they can now go through the HID tree independently.

v1: Introduce capable_any()

This is especially helpful with regard to SELinux, where each audit message about a not allowed capability request will create a denial message. Using this new wrapper with the least invasive capability as left most argument (e.g. CAP_SYS_NICE before CAP_SYS_ADMIN) enables policy writers to only grant the least invasive one for the particular subject instead of both.

v1: bpf: bpf_doc: use silent mode when exec make cmd

This will distort the reStructuredText output and make the later rst2man step fail with an error like: bpf-helpers.rst:20: (WARNING/2) Field list ends without a blank line; unexpected unindent.

v1: bpf: arena followups.

A set of follow ups to clean up bpf_arena and adjust to the latest LLVM.

v1: bpf: Temporarily disable atomic operations in BPF arena

Currently, the x86 JIT handling PROBE_MEM32 tagged accesses is not equipped to handle atomic accesses into PTR_TO_ARENA, as no PROBE_MEM32 tagging is performed and no handling is enabled for them.

v2: bpf-next: libbpf: Prevent null-pointer dereference when prog to load has no BTF

In bpf_object_load_prog(), there’s no guarantee that obj->btf is non-NULL when passing it to btf__fd(), and this function does not perform any check before dereferencing its argument (as bpf_object__btf_fd() used to do). As a consequence, we get segmentation fault errors in bpftool (for example) when trying to load programs that come without BTF information.

v2: tools/testing/selftests/bpf/test_tc_tunnel.sh: Prevent client connect before server bind

On some systems, the netcat server can be slow to start listening. When this happens, the test can randomly fail at various points.

v1: bpf-next: bpf: preserve sleepable bit in subprog info

Copy over main program’s sleepable bit into subprog’s info. This might be important for, e.g., freplace cases.

v1: bpf-next: Revert “libbpf: make uniform use of btf__fd() accessor inside libbpf”

There’s no guarantee that obj->btf is non-NULL when passing it to btf__fd(), and this function doesn’t perform any check before dereferencing its argument.

v1: bpf-next: uprobes: two common case speed ups

This patch set implements two speed ups for uprobe/uretprobe runtime execution path for some common scenarios: BPF-only uprobes (patches #1 and #2) and system-wide (non-PID-specific) uprobes (patch #3). Please see individual patches for details.

v4: Add minimal XDP support to TI AM65 CPSW Ethernet driver

This patch adds XDP support to TI AM65 CPSW Ethernet driver.

GIT PULL: Networking for v6.9

I get what looks like blk-iocost deadlock when I try to run your current tree on real Meta servers :( So tested the PR merged with your tree only on QEMU and on real HW pure net-next without pulling in your tree.

v1: bpf-next: selftests/bpf: Add kprobe multi triggering benchmarks

Adding kprobe multi triggering benchmarks. It’s useful now to bench new fprobe implementation and might be useful later as well.

v2: bpf-next: bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes()

On some architectures like ARM64, PMD_SIZE can be really large in some configurations. Like with CONFIG_ARM64_64K_PAGES=y the PMD_SIZE is 512MB.
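
The 512MB figure follows from the page-table geometry. The helper below is a hypothetical sketch assuming 8-byte page-table entries, so that one last-level table fills exactly one page.

```c
#include <stdint.h>

/* PMD_SIZE for a layout with 8-byte entries: one PMD entry maps a full
 * last-level table, i.e. (page_size / 8) pages of page_size bytes each. */
uint64_t demo_pmd_size(uint64_t page_size)
{
    return (page_size / 8) * page_size;
}
```

With 4K pages this gives 512 × 4KB = 2MB; with 64K pages it gives 8192 × 64KB = 512MB, which is why a pack size derived from PMD_SIZE becomes impractically large there.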

v1: libbpf: ringbuf: allow to partially consume items

Instead of always consuming all items from a ring buffer in a greedy way, allow to stop when the callback returns a value > 0.
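
The control flow can be sketched independently of libbpf. Everything below is hypothetical (a plain array stands in for the ring buffer), and counting the item that stopped the loop as consumed is an assumption of this sketch.

```c
#include <stddef.h>

typedef int (*demo_item_cb)(int item, void *ctx);

/* Consume items in order; stop early when the callback returns a value > 0.
 * Returns the number of items consumed, including the one that stopped
 * the loop. */
size_t demo_consume(const int *items, size_t n, demo_item_cb cb, void *ctx)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (cb(items[i], ctx) > 0)
            return i + 1;
    }
    return n;
}

/* Example callback: stop once the running sum held in ctx reaches 10. */
int demo_stop_at_ten(int item, void *ctx)
{
    int *sum = ctx;

    *sum += item;
    return *sum >= 10;
}
```

A consumer with a fixed work budget can use this shape to process a bounded batch and leave the remaining items for a later call.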

Related Technology News

Qemu

v1: Add support for RISC-V ACPI tests

Currently, bios-table-test doesn’t support RISC-V. This series enables the framework changes required and basic testing. Things like NUMA related test cases will be added later.

v4: target/riscv: Implement dynamic establishment of custom decoder

In this patch, we modify the decoder to be a freely composable data structure instead of a hardcoded one. It can be dynamically built up according to the extensions.

v14: for-9.0: riscv: set vstart_eq_zero on vector insns

In this version we’re fixing a redundant check in the vmvr_v helper that was pointed out in v13.

v13: for-9.0: riscv: set vstart_eq_zero on vector insns

In this new version I added a new patch (patch 4) to handle the case pointed out by LIU Zhiwei in v12. I decided to do it separately since it’s a distinct case from what we’re dealing with in patch 5.

v3: target/riscv: raise an exception when CSRRS/CSRRC writes a read-only CSR

Both CSRRS and CSRRC always read the addressed CSR and cause any read side effects regardless of rs1 and rd fields. Note that if rs1 specifies a register holding a zero value other than x0, the instruction will still attempt to write the unmodified value back to the CSR and will cause any attendant side effects.
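
The rule can be captured as a small decision function. The names below are hypothetical and this is not QEMU's implementation; it only models the condition described above, where the rs1 register index, not the value it holds, decides whether a write occurs.

```c
#include <stdbool.h>

/* CSRRS/CSRRC perform a CSR write unless the rs1 field is x0 (register
 * index 0). The value held in the register does not matter. */
bool demo_csrrs_writes(unsigned int rs1_index)
{
    return rs1_index != 0;
}

/* An access must raise an exception when it would write a read-only CSR. */
bool demo_csrrs_traps(unsigned int rs1_index, bool csr_read_only)
{
    return csr_read_only && demo_csrrs_writes(rs1_index);
}
```

So `csrrs a0, <read-only csr>, x0` is a pure read, while the same instruction with any other source register traps even if that register happens to contain zero.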

v1: for-9.0: target/riscv: do not enable all named features by default

Commit 3b8022269c added the capability of named features/profile extensions to be added in riscv,isa. To do that we had to assign priv versions for each one of them in isa_edata_arr[].

v2: Add RISC-V Server Platform Reference Board

The RISC-V Server Platform specification[1] defines a standardized set of hardware and software capabilities, that portable system software, such as OS and hypervisors can rely on being present in a RISC-V server platform.

v12: riscv: set vstart_eq_zero on vector insns

In this new version we reworked the commit message of patch 4, as suggested by Richard, since the solution we went for in patch 3 trivialized the removal of ‘brcond’ since we’re doing an early exit if vstart >= vl.

v11: riscv: set vstart_eq_zero on vector insns

In this new version, after the comments from LIU Zhiwei in v9 [1], I decided to ditch all the patches that were trying to integrate the tail update process in a single function. Handling the right value for NF for every single function is out of the scope for this bug fix. The patches might be useful in the future if we decide that such integration adds value, but for now it’s too much.

v10: riscv: set vstart_eq_zero on mark_vs_dirty

This version has changes in the wording on patch 9 subject and commit msg. The previous subject, “target/riscv: Clear vstart_qe_zero flag”, isn’t accurate. We’re not clearing (i.e. setting to false/zero) the flag, we’re setting the flag to ‘true’ in the end of each insns.

U-Boot

v1: board: sophgo: milkv_duo: Add ethernet support for Milk-V Duo board

This series adds init code for the cv1800b ethernet PHY and enables ethernet support for the Sophgo Milk-V Duo board.
