Image Projection Basics
The core equations show how 3D points collapse to 2D:
x' = f·(x/z)
y' = f·(y/z)
where (x, y, z) is the 3D point in the camera frame and (x', y') is its 2D image position.
What this really means:
When you take a photo, the camera maps the 3D world onto a 2D image of pixels (after the world-to-camera, W2C, transform has expressed points in the camera frame). The division by z is the key step: it's why objects farther away appear smaller. The focal length f acts like a "zoom factor".
Practical implications:
- A tree 10m away at (x=2, z=10) projects to the same position as a toy car at (x=0.2, z=1)
- This non-linearity causes perspective distortion (think railroad tracks converging)
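To make the projection concrete, here's a minimal NumPy sketch; the focal length and the example points are just illustrative values:

```python
import numpy as np

def project(point, f=1.0):
    """Perspective projection: divide by depth z, then scale by focal length f."""
    x, y, z = point
    return np.array([f * x / z, f * y / z])

# The tree at 10 m and the toy car at 1 m land on the same image point.
print(project([2.0, 0.0, 10.0]))   # [0.2 0. ]
print(project([0.2, 0.0, 1.0]))    # [0.2 0. ]
```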
Coordinate Systems
- Camera Frame: Raw 3D coordinates relative to the lens
- Normalized Image Frame: "Pure" coordinates before pixel conversion
- Pixel Coordinates: Where it actually lands on your sensor (e.g., (255,255))
The conversion magic happens via the intrinsic matrix K:
[ f 0 p_x ]
[ 0 f p_y ]
[ 0 0 1 ]
Here (p_x, p_y) is the principal point: where the optical axis hits the sensor (usually near the image center). It is the offset that moves the origin from the image corner to the point where the optical axis lands. The zero in the top row assumes no skew (perpendicular pixel axes), and using the same f in both rows assumes square pixels; real cameras can deviate slightly from both.
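A tiny sketch of how K maps normalized image coordinates to pixels; the focal length and principal point below are made-up values:

```python
import numpy as np

# Hypothetical intrinsics: focal length in pixels, principal point (p_x, p_y).
f, px, py = 800.0, 320.0, 240.0
K = np.array([[f, 0.0, px],
              [0.0, f, py],
              [0.0, 0.0, 1.0]])

# Normalized coordinates (x/z, y/z, 1) -> pixel coordinates.
x_norm = np.array([0.2, -0.1, 1.0])
u, v, w = K @ x_norm
print(u / w, v / w)   # 480.0 160.0
```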
Triangulation
The core idea:
If you see the same object in two photos taken from different positions, you can reconstruct its 3D location like your eyes do with stereo vision.
Mathematically:
For each camera, we:
- Backproject a ray using K⁻¹ to undo the intrinsic mapping (pixels back to normalized coordinates)
- Convert the ray to world coordinates using the pose [R|t]
- Find where the rays from all cameras intersect (in practice, their point of closest approach, due to noise; see the sketch below)
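A rough NumPy sketch of these steps for two cameras, assuming the pose convention X_cam = R·X_world + t; the midpoint of closest approach stands in for the "intersection":

```python
import numpy as np

def backproject_ray(x_pix, K, R, t):
    """Ray through pixel x_pix, expressed in world coordinates."""
    d_cam = np.linalg.inv(K) @ np.array([x_pix[0], x_pix[1], 1.0])  # undo intrinsics
    center = -R.T @ t                 # camera center in world coordinates
    d_world = R.T @ d_cam             # rotate ray direction into the world frame
    return center, d_world / np.linalg.norm(d_world)

def midpoint_triangulation(o1, d1, o2, d2):
    """Closest point between two (usually non-intersecting) rays."""
    # Minimize ||(o1 + s*d1) - (o2 + r*d2)|| over the ray parameters s and r.
    A = np.column_stack([d1, -d2])                        # 3x2 system
    s, r = np.linalg.lstsq(A, o2 - o1, rcond=None)[0]
    return 0.5 * ((o1 + s * d1) + (o2 + r * d2))
```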
Why least squares?
Because real measurements are noisy, the back-projected rays generally don't intersect exactly. So we minimize the total reprojection error:
∑||πᵢ(X) - xᵢ||²
where πᵢ projects the candidate point X into camera i and xᵢ is the observed point. This averages out errors across all observations.
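A minimal sketch of that minimization using SciPy, assuming each camera is given as a 3×4 projection matrix P = K[R|t] and a rough initial guess X0 (e.g. from the midpoint method above):

```python
import numpy as np
from scipy.optimize import least_squares

def refine_triangulation(Ps, xs, X0):
    """Refine 3D point X by minimizing sum_i ||pi_i(X) - x_i||^2."""
    def residuals(X):
        res = []
        for P, x in zip(Ps, xs):
            proj = P @ np.append(X, 1.0)           # project the candidate point
            res.extend(proj[:2] / proj[2] - x)     # pi_i(X) - x_i
        return res
    return least_squares(residuals, X0).x
```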
Epipolar Geometry
For any point in Image 1, its match in Image 2 must lie along a specific line (the epipolar line). This reduces stereo matching from a 2D search to 1D.
Key components:
- Epipole: Where the other camera's center appears in your image
- Fundamental Matrix (F): The algebraic embodiment of this relationship
Why F matters:
- Fx₁ directly gives you the epipolar search line in Image 2
- It encodes the cameras' relative pose compactly
- It enables depth estimation from parallax
Derivation intuition:
The coplanarity condition x₂ᵀ(t × Rx₁) = 0 says: the ray from camera 1 to the 3D point X, the baseline t between the camera centers, and the ray from camera 2 to X must all lie in the same plane. Written as a matrix, [t]ₓR is the essential matrix E (see the sketch below).
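A small synthetic check of that constraint, with made-up R, t, intrinsics K, and a 3D point; E = [t]ₓR is the essential matrix in normalized coordinates, and F = K⁻ᵀEK⁻¹ lifts it to pixel coordinates:

```python
import numpy as np

def skew(v):
    """Cross-product matrix: skew(v) @ u == np.cross(v, u)."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

# Made-up relative pose (camera-2 frame: X2 = R @ X1 + t) and shared intrinsics K.
theta = np.deg2rad(10)
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([1.0, 0.0, 0.0])
K = np.array([[800.0, 0, 320.0], [0, 800.0, 240.0], [0, 0, 1.0]])

E = skew(t) @ R                                   # essential matrix (normalized coords)
F = np.linalg.inv(K).T @ E @ np.linalg.inv(K)     # fundamental matrix (pixel coords)

# Project a 3D point into both views and check the epipolar constraint.
X1 = np.array([0.3, -0.2, 4.0])                   # point in camera-1 coordinates
X2 = R @ X1 + t
x1 = K @ (X1 / X1[2])                             # homogeneous pixel coords, image 1
x2 = K @ (X2 / X2[2])
print(x2 @ F @ x1)                                # ~0: x2 lies on the line F @ x1
```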
Implementation Notes
For triangulation:
- Wider camera baselines improve depth accuracy but make matching harder
- Very small angles between the rays (nearly parallel viewing directions) reduce precision
For fundamental matrix:
- Normalize coordinates first (subtract mean, scale by 1/std dev)
- Use RANSAC to handle mismatched points
- Refine with Levenberg-Marquardt non-linear optimization
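For reference, a bare-bones normalized 8-point estimate in NumPy; RANSAC and the non-linear refinement are left out, pts1 and pts2 would be N×2 arrays of matched pixel coordinates (N ≥ 8), and this sketch uses the common "mean distance √2" normalization:

```python
import numpy as np

def normalize(pts):
    """Translate to zero mean, scale so the mean distance from the origin is sqrt(2)."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2) / np.mean(np.linalg.norm(pts - mean, axis=1))
    T = np.array([[scale, 0, -scale * mean[0]],
                  [0, scale, -scale * mean[1]],
                  [0, 0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ pts_h.T).T, T

def eight_point(pts1, pts2):
    """Normalized 8-point estimate of the fundamental matrix F."""
    x1, T1 = normalize(pts1)
    x2, T2 = normalize(pts2)
    # Each correspondence contributes one row of the linear system A f = 0.
    A = np.column_stack([
        x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
        x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
        x1[:, 0], x1[:, 1], np.ones(len(x1)),
    ])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)     # null-space direction of A
    U, S, Vt = np.linalg.svd(F)                   # enforce rank 2
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    F = T2.T @ F @ T1                             # undo the normalization
    return F / F[2, 2]
```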
Common pitfalls:
- Forgetting to correct lens distortion before computing F (see the sketch below)
- Assuming perfect calibration (always estimate K if possible)
- Ignoring degenerate configurations (e.g., all points on a plane)
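If OpenCV is available, undistorting matched points before estimating F is a one-liner; K and dist would come from calibration (e.g. cv2.calibrateCamera), and the values below are placeholders:

```python
import cv2
import numpy as np

K = np.array([[800.0, 0, 320.0], [0, 800.0, 240.0], [0, 0, 1.0]])
dist = np.array([-0.2, 0.05, 0.0, 0.0, 0.0])      # k1, k2, p1, p2, k3 placeholders
pts = np.array([[[100.0, 120.0]], [[400.0, 300.0]]], dtype=np.float32)  # N x 1 x 2

pts_undist = cv2.undistortPoints(pts, K, dist, P=K)   # undistorted, back in pixel coords
```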
These foundations power:
- Structure from Motion (SfM): Your phone's 3D scanning
- Visual SLAM: Autonomous car navigation
- NeRF: Novel view synthesis via implicit neural representations
The math looks dense, but it's literally how your phone creates portrait-mode bokeh: estimating depth from multiple viewpoints.