mirror of https://github.com/openshift/installer.git synced 2026-02-05 15:47:14 +01:00
installer/docs/dev/rr-debugging.md
Joseph Callen fc462861f8 dev docs: add rr capi debugging
This is some quick docs to allow the
use of `rr` to troubleshoot and trace
difficult cluster api provider problems.

Assisted-by: cursor
2025-08-14 15:39:29 -04:00

RR (Record and Replay) Debugging

This document describes how RR (Record and Replay) debugging is enabled for the OpenShift installer's Cluster API system. RR records a program's execution and then replays it deterministically, so the same run can be debugged as many times as needed.

Overview

The implementation includes modifications to enable RR debugging for Cluster API controllers. This allows developers to:

  1. Record the execution of Cluster API controllers with deterministic replay
  2. Debug complex timing-dependent issues that are difficult to reproduce
  3. Step through execution multiple times with identical behavior
  4. Analyze race conditions and concurrency issues

Implementation Details

Key Changes

The implementation modifies two main components:

  1. Process Management (pkg/clusterapi/internal/process/process.go)
  2. System Controller (pkg/clusterapi/system.go)

Process Management Modifications

The process management system has been modified to:

  • Use process group signaling (syscall.Kill(-ps.Cmd.Process.Pid, syscall.SIGTERM)) instead of direct process signaling
  • Add enhanced logging for process exit states
  • Temporarily disable timeout-based process termination for debugging purposes
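The difference between signaling one PID and signaling a whole process group can be tried from a shell. This is a minimal sketch, not the patch itself: it assumes a Linux host with setsid available, and the sleep command is only a stand-in for a recorder and the child process it launches.

```shell
# Start a stand-in parent and child in their own session and process group.
# In a non-interactive script, setsid makes the new shell the group leader,
# so its PID doubles as the process group ID.
setsid sh -c 'sleep 30 & wait' &
leader=$!
sleep 1   # give setsid time to create the new group

# A negative PID targets every process in the group, which mirrors
# syscall.Kill(-pid, syscall.SIGTERM) on the Go side.
kill -s TERM -- "-$leader"

# Collect the leader's exit status (128 + signal number when killed).
group_status=0
wait "$leader" 2>/dev/null || group_status=$?
echo "group leader exited with status $group_status"
```

Signaling only the leader's PID would leave the background child running, which is the situation group signaling avoids.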

Controller Execution Modifications

The controller execution system has been modified to:

  • Launch the CAPI provider under rr instead of running it directly, so that rr handles CPU binding and execution recording
  • Add RR-specific flags for optimal debugging:
    • --wait: Makes RR wait for all child processes to exit, not just the initial process
    • --disable-avx-512: Disables AVX-512 instructions for compatibility
    • --bind-to-cpu=0: Binds execution to CPU 0 for deterministic behavior (this also keeps the recording on one core type on CPUs with P and E cores)
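Put together, the flags above correspond to an rr invocation along the following lines. This is a sketch only: the exact command line the patch builds is not shown in this document, and rr_record_cmd, the binary path, and --example-flag are hypothetical placeholders.

```shell
# Hypothetical helper: prints the rr command line for a given controller
# binary and its arguments, using the three flags listed above.
rr_record_cmd() {
  printf 'rr record --wait --disable-avx-512 --bind-to-cpu=0 %s\n' "$*"
}

# Placeholder binary path and flag, for illustration only.
rr_record_cmd /path/to/controller-binary --example-flag
```

Printing the command rather than running it makes it easy to compare against what appears in the installer logs.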

Setup

Installing RR

# On Fedora/RHEL/CentOS
sudo dnf install rr

# On Ubuntu/Debian
sudo apt install rr

Installing Delve

go install github.com/go-delve/delve/cmd/dlv@latest

Applying the Patch

Apply the RR debugging patch to enable RR recording:

# Apply the patch (patch file location: docs/dev/rr-debugging.patch)
git apply docs/dev/rr-debugging.patch

Building the Installer

Build the installer with RR debugging enabled:

MODE=dev TAGS="release" ./hack/build.sh

Usage

Recording Execution

rr needs the following kernel parameters set before it can record (values set this way last until the next reboot):

sudo sysctl kernel.perf_event_paranoid=-1
sudo sysctl kernel.kptr_restrict=0
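Before recording, it can help to confirm that the live setting matches the value used in this document. The sketch below assumes a Linux host with procfs mounted; check_paranoid is a hypothetical helper name, and the -1 threshold simply reflects the value set above.

```shell
# Hypothetical helper: reports whether a perf_event_paranoid value is at
# or below the -1 that this document sets.
check_paranoid() {
  if [ "$1" -le -1 ] 2>/dev/null; then
    echo "ok"
  else
    echo "too restrictive"
  fi
}

# Read the live value from procfs (Linux only) and check it.
if [ -r /proc/sys/kernel/perf_event_paranoid ]; then
  check_paranoid "$(cat /proc/sys/kernel/perf_event_paranoid)"
fi
```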

When running the installer with RR debugging enabled, Cluster API controllers will automatically be recorded using RR. The recording process:

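  1. Captures the nondeterministic inputs needed for replay, such as system call results, signals, and scheduling decisions
  2. Stores each trace under ~/.local/share/rr, with latest-trace pointing to the most recent recording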
  3. Maintains deterministic replay capability
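After a run, a quick check can confirm that a recording exists and show what it contains. This sketch assumes rr's default trace location (overridable through the _RR_TRACE_DIR environment variable) and only calls rr when it is installed.

```shell
# Default rr trace directory; rr honors _RR_TRACE_DIR when it is set.
trace_dir="${_RR_TRACE_DIR:-$HOME/.local/share/rr}"
latest="$trace_dir/latest-trace"

if [ -L "$latest" ]; then
  # latest-trace is a symlink to the newest recording.
  echo "latest trace: $(readlink -f "$latest")"
  # List the processes captured in that recording (needs rr installed).
  if command -v rr >/dev/null 2>&1; then
    rr ps "$latest"
  fi
else
  echo "no trace found under $trace_dir"
fi
```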

Replaying with Delve

To replay a recorded trace using Delve (dlv), use the following command:

dlv replay --listen=:2345 --headless=true --api-version=2 --accept-multiclient ~/.local/share/rr/latest-trace

Command Options:

  • --listen=:2345: Listens on port 2345 for debugger connections
  • --headless=true: Runs only the debug server, without an interactive terminal client
  • --api-version=2: Uses Delve API version 2
  • --accept-multiclient: Allows multiple debugger clients to connect
  • ~/.local/share/rr/latest-trace: Path to the recorded RR trace

Connecting a Debugger

After starting the replay session, connect your preferred debugger to the Delve server listening on port 2345.
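One option is Delve's own command-line client. The sketch below assumes the headless server from the previous step is running on the same machine; DLV_ADDR is a hypothetical variable name used here for illustration.

```shell
# Address must match the --listen value passed to dlv replay.
DLV_ADDR="${DLV_ADDR:-127.0.0.1:2345}"

# Attach an interactive Delve client to the headless server.
if command -v dlv >/dev/null 2>&1; then
  dlv connect "$DLV_ADDR" || echo "could not connect to $DLV_ADDR" >&2
else
  echo "dlv not found in PATH" >&2
fi
```

Once connected to a replay session, Delve's rewind command runs backward to the previous breakpoint or the start of the recording, in addition to the usual break and continue.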

References