Bundle-Adjustment

Reference: MVG (Hartley and Zisserman, Multiple View Geometry), Chapter 18: N-View Computational Methods. See it for a deeper analysis.

If Kalman Filters are the "quick and dirty" recursive solution, Bundle Adjustment (BA) is the "slow and thorough" batch solution: it re-optimizes all camera poses and landmarks jointly over the whole dataset.
It is the de facto standard for offline SLAM and 3D reconstruction.

Definition: Simultaneous refinement of the 3D coordinates of scene geometry ( $X_{j}$ ) and the parameters of the relative motion ( $R_{i}, t_{i}$ ) and optical characteristics ( $K$ ) to minimize the reprojection error.

1. The Visual Objective Function

Unlike ICP-SLAM (which minimizes 3D-3D distance), Visual BA minimizes 2D-3D Reprojection Error.

$$\min_{T_{i}, X_{j}} \sum_{i = 1}^{m} \sum_{j = 1}^{n} v_{i j} {‖ z_{i j} - π (T_{i}, X_{j}) ‖}_{Σ}^{2}$$

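$z_{i j}$ : Observed pixel coordinates $(u, v)$ of point $j$ in frame $i$ .
$π (\cdot)$ : The Camera Projection function (World $\to$ Camera $\to$ Image).
$T_{i}$ : Camera Pose ( $R_{i}, t_{i}$ ) of frame $i$ .
$X_{j}$ : 3D Landmark position.
$v_{i j}$ : Visibility flag, 1 if point $j$ is observed in frame $i$ and 0 otherwise.
$Σ$ : Covariance of the pixel measurement; ${‖ r ‖}_{Σ}^{2} = r^{T} Σ^{- 1} r$ is the squared Mahalanobis norm.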
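
The objective can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `reprojection_cost`, the dictionary layout of `observations`, and the `project` callable are all assumptions made for the example. Storing only the observed (i, j) pairs stands in for the visibility weight $v_{i j}$ .

```python
import numpy as np

def reprojection_cost(observations, poses, points, project, sigma_inv=None):
    """Sum of squared (Mahalanobis) reprojection errors.

    observations: dict mapping (i, j) -> observed pixel z_ij (length-2 array).
                  Only visible pairs are stored, which plays the role of v_ij = 1.
    poses:        list of (R, t) tuples, one per frame i.
    points:       (n, 3) array of landmarks X_j.
    project:      callable (R, t, X) -> predicted pixel, i.e. pi(T_i, X_j).
    sigma_inv:    2x2 inverse measurement covariance; identity if None.
    """
    if sigma_inv is None:
        sigma_inv = np.eye(2)
    cost = 0.0
    for (i, j), z in observations.items():
        R, t = poses[i]
        r = np.asarray(z, dtype=float) - project(R, t, points[j])  # residual
        cost += float(r @ sigma_inv @ r)                           # r^T Sigma^-1 r
    return cost
```

Passing the projection as a callable keeps the cost independent of the camera model, so a distortion model could be swapped in without touching the sum.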

2. The Projection Model $π$

This is where it differs from ICP. We have to project 3D points to the image plane.

Transform: $P^{'} = R X_{w} + t = [X^{'}, Y^{'}, Z^{'}]^{T}$
Normalize: $p_{n} = [X^{'} / Z^{'}, Y^{'} / Z^{'}]^{T}$ (Perspective Division)
Pixelate: $[\begin{matrix} u \\ v \end{matrix}] = [\begin{matrix} f_{x} & 0 & c_{x} \\ 0 & f_{y} & c_{y} \end{matrix}] [\begin{matrix} X^{'} / Z^{'} \\ Y^{'} / Z^{'} \\ 1 \end{matrix}]$
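
The three steps map directly onto code. The sketch below assumes an ideal pinhole camera with no lens distortion; the function name `project` and its argument order are chosen for illustration only.

```python
import numpy as np

def project(R, t, X_w, fx, fy, cx, cy):
    """Pinhole projection of a world point to pixel coordinates (no distortion)."""
    P = R @ X_w + t                                  # Transform: world -> camera frame
    if P[2] <= 0:
        raise ValueError("point is behind the camera (Z' <= 0)")
    xn, yn = P[0] / P[2], P[1] / P[2]                # Normalize: perspective division
    return np.array([fx * xn + cx, fy * yn + cy])    # Pixelate: apply intrinsics
```

The explicit check on $Z^{'}$ is a design choice for the example; a real pipeline would more likely drop such observations than raise.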

3. The Jacobian (The Chain Rule)

The residual is $r = z - π (T, X)$ . We need $\frac{\partial r}{\partial δ ξ}$ , where $δ ξ \in \mathbb{R}^{6}$ is a small perturbation of the camera pose.
Using the Chain Rule:

$$J_{p o s e} = \frac{\partial r}{\partial P^{'}} \cdot \frac{\partial P^{'}}{\partial δ ξ}$$

Part A: The Projection Derivative ( $\partial r / \partial P^{'}$ )
How pixel $u, v$ changes when the camera-frame point $X^{'}, Y^{'}, Z^{'}$ moves. Because $r = z - π$ , we have $\partial r / \partial P^{'} = - \partial π / \partial P^{'}$ , with

$$\frac{\partial π}{\partial P^{'}} = [\begin{matrix} f_{x} / Z^{'} & 0 & - f_{x} X^{'} / (Z^{'})^{2} \\ 0 & f_{y} / Z^{'} & - f_{y} Y^{'} / (Z^{'})^{2} \end{matrix}]$$

Note the $1 / (Z^{'})^{2}$ term. As $Z^{'} \to 0$ the entries blow up, which is why points very close to the camera produce massive gradients and can destabilize the optimization.

Part B: The Manifold Derivative ( $\partial P^{'} / \partial δ ξ$ )
Same as ICP-SLAM. For a left perturbation $δ ξ = [δ ρ, δ φ]$ (translation first, then rotation):

$$\frac{\partial P^{'}}{\partial δ ξ} = [I_{3} ∣ - [P^{'}]_{\times}]$$

Multiply A and B to get the $2 \times 6$ Jacobian for one observation.
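
A minimal sketch of this product follows. It assumes the residual convention $r = z - π$ used above (hence the leading minus sign) and a left perturbation ordered translation first, then rotation; the helper names `skew`, `projection_jacobian` and `pose_jacobian` are made up for the example.

```python
import numpy as np

def skew(p):
    """3x3 cross-product matrix [p]_x, so that skew(p) @ q == np.cross(p, q)."""
    return np.array([[0.0, -p[2], p[1]],
                     [p[2], 0.0, -p[0]],
                     [-p[1], p[0], 0.0]])

def projection_jacobian(P, fx, fy):
    """Part A: 2x3 derivative of the pixel w.r.t. the camera-frame point P'."""
    X, Y, Z = P
    return np.array([[fx / Z, 0.0, -fx * X / Z**2],
                     [0.0, fy / Z, -fy * Y / Z**2]])

def pose_jacobian(P, fx, fy):
    """2x6 Jacobian of r = z - pi(P') w.r.t. the pose perturbation [rho, phi]."""
    dpi_dP = projection_jacobian(P, fx, fy)       # Part A
    dP_dxi = np.hstack([np.eye(3), -skew(P)])     # Part B: [I_3 | -[P']_x]
    return -dpi_dP @ dP_dxi                       # minus sign from r = z - pi
```

A quick way to gain confidence in a Jacobian like this is to compare it against central finite differences of the residual for a small perturbation of each of the six pose components.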

4. The Schur Complement (The Computational Trick)

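Each Gauss-Newton iteration requires solving the linear system $H δ = b$ .
$H$ is of size $(6 m + 3 n) \times (6 m + 3 n)$ . For $n = 1000$ points and $m = 100$ cameras, $6 \cdot 100 + 3 \cdot 1000 = 3600$ , so $H$ is $3600 \times 3600$ . A dense solve is $O (N^{3})$ with $N = 6 m + 3 n$ , which quickly becomes too slow as the map grows.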

The Structure:

$$H = [\begin{matrix} B & E \\ E^{T} & C \end{matrix}]$$

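$B$ : Camera-Camera block (Sparse; block-diagonal with $6 \times 6$ blocks when every residual involves a single camera).
$C$ : Point-Point block (Block-diagonal, one $3 \times 3$ block per point! Because points don't see each other, no residual couples two points).
$E$ : Camera-Point interaction.

Since $C$ is block-diagonal, it is trivial to invert: just invert $n$ independent $3 \times 3$ blocks. We can "marginalize out" the points to solve for cameras first: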

Solve for Cameras (Reduced Camera System):

$$(B - E C^{- 1} E^{T}) δ_{camera} = b_{camera} - E C^{- 1} b_{point}$$

The matrix $S = B - E C^{- 1} E^{T}$ is the Schur Complement. It is much smaller ( $6 m \times 6 m$ ).

Back-Substitute for Points:

$$δ_{point} = C^{- 1} (b_{point} - E^{T} δ_{camera})$$

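The expensive solve is now cubic in the number of cameras ( $6 m$ unknowns) instead of cubic in $6 m + 3 n$ , while the per-point work (a $3 \times 3$ inverse and a back-substitution) grows only linearly with the number of points. This is why we can run BA on thousands of landmarks.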
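
The two-stage solve can be sketched with dense NumPy arrays. This is an illustration under stated assumptions, not how a production solver is organized: the function name `solve_schur` is made up, $E$ is stored densely, and the point block $C$ is passed as an `(n, 3, 3)` stack of its $3 \times 3$ diagonal blocks (one per point) so that it is never formed or inverted as a full matrix.

```python
import numpy as np

def solve_schur(B, E, C_blocks, b_cam, b_pt):
    """Solve [[B, E], [E^T, C]] [d_cam; d_pt] = [b_cam; b_pt] via the Schur complement.

    B:        (6m, 6m) camera-camera block.
    E:        (6m, 3n) camera-point block.
    C_blocks: (n, 3, 3) stack of the 3x3 diagonal blocks of C.
    """
    n = C_blocks.shape[0]
    C_inv = np.linalg.inv(C_blocks)                 # n independent 3x3 inverses
    # Form E C^{-1} block by block, without building the dense 3n x 3n matrix.
    E_blocks = E.reshape(E.shape[0], n, 3)
    E_Cinv = np.einsum('anj,njk->ank', E_blocks, C_inv).reshape(E.shape)
    S = B - E_Cinv @ E.T                            # reduced camera system (6m x 6m)
    rhs = b_cam - E_Cinv @ b_pt
    d_cam = np.linalg.solve(S, rhs)                 # solve for cameras first
    resid = (b_pt - E.T @ d_cam).reshape(n, 3)
    d_pt = np.einsum('njk,nk->nj', C_inv, resid).reshape(-1)  # back-substitute points
    return d_cam, d_pt
```

On a small random system the result should agree with a direct `np.linalg.solve` on the full $H$ ; real BA libraries additionally exploit the sparsity of $E$ and $S$ , which this sketch ignores.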
