This is some quick docs to allow the use of `rr` to troubleshoot and trace difficult cluster api provider problems. Assisted-by: cursor
3.8 KiB
RR (Record and Replay) Debugging
This document describes the implementation of RR (Record and Replay) debugging capabilities in the OpenShift installer's Cluster API system. RR is a powerful debugging tool that allows you to record program execution and then replay it deterministically for debugging purposes.
Overview
The implementation includes modifications to enable RR debugging for Cluster API controllers. This allows developers to:
- Record the execution of Cluster API controllers with deterministic replay
- Debug complex timing-dependent issues that are difficult to reproduce
- Step through execution multiple times with identical behavior
- Analyze race conditions and concurrency issues
Implementation Details
Key Changes
The implementation modifies two main components:
- Process Management (
pkg/clusterapi/internal/process/process.go) - System Controller (
pkg/clusterapi/system.go)
Process Management Modifications
The process management system has been modified to:
- Use process group signaling (
syscall.Kill(-ps.Cmd.Process.Pid, syscall.SIGTERM)) instead of direct process signaling - Add enhanced logging for process exit states
- Temporarily disable timeout-based process termination for debugging purposes
Controller Execution Modifications
The controller execution system has been modified to:
- Replace capi provider with
rrfor CPU binding and execution recording - Add RR-specific flags for optimal debugging:
--wait: Ensures RR waits for the recorded process--disable-avx-512: Disables AVX-512 instructions for compatibility--bind-to-cpu=0: Binds execution to CPU 0 for deterministic behavior (or CPUs with P and E cores)
Setup
Installing RR
# On Fedora/RHEL/CentOS
sudo dnf install rr
# On Ubuntu/Debian
sudo apt install rr
Installing Delve
go install github.com/go-delve/delve/cmd/dlv@latest
Applying the Patch
Apply the RR debugging patch to enable RR recording:
# Apply the patch (patch file location: docs/dev/rr-debugging.patch)
git apply docs/dev/rr-debugging.patch
Building the Installer
Build the installer with RR debugging enabled:
MODE=dev TAGS="release" ./hack/build.sh
Usage
Recording Execution
rr requires these kernel parameters to trace:
sudo sysctl kernel.perf_event_paranoid=-1;
sudo sysctl kernel.kptr_restrict=0
When running the installer with RR debugging enabled, Cluster API controllers will automatically be recorded using RR. The recording process:
- Captures all system calls, memory accesses, and timing information
- Stores the trace in
~/.local/share/rr/latest-trace - Maintains deterministic replay capability
Replaying with Delve
To replay a recorded trace using Delve (dlv), use the following command:
dlv replay --listen=:2345 --headless=true --api-version=2 --accept-multiclient ~/.local/share/rr/latest-trace
Command Options:
--listen=:2345: Listens on port 2345 for debugger connections--headless=true: Runs in headless mode without requiring a terminal--api-version=2: Uses Delve API version 2--accept-multiclient: Allows multiple debugger clients to connect~/.local/share/rr/latest-trace: Path to the recorded RR trace
Connecting a Debugger
After starting the replay session, you can connect your preferred debugger:
- VS Code: Configure launch.json to connect to
localhost:2345 - GoLand/IntelliJ: Use the remote debugging configuration
- Command line: Use
dlv connect localhost:2345