Before working in 3D, we start with a 2D example: a Neural Field that maps 2D pixel coordinates (u, v) to RGB colors (r, g, b). This involves a couple of steps: the (u, v) coordinates are passed through a sinusoidal positional encoding with L frequency levels, and the encoded features are fed to an MLP that predicts the color.
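The first step can be sketched as follows. This is a minimal illustration of a sinusoidal positional encoding, not the exact implementation; the function name and the normalization of (u, v) to [0, 1] are assumptions, and L matches the hyperparameter tuned below.

```python
import numpy as np

def positional_encoding(x, L):
    """Lift each input to [x, sin(2^0*pi*x), cos(2^0*pi*x), ..., sin(2^(L-1)*pi*x), cos(2^(L-1)*pi*x)]."""
    features = [x]
    for i in range(L):
        features.append(np.sin(2.0**i * np.pi * x))
        features.append(np.cos(2.0**i * np.pi * x))
    return np.concatenate(features, axis=-1)

uv = np.array([[0.25, 0.5]])        # one (u, v) coordinate, assumed normalized to [0, 1]
enc = positional_encoding(uv, L=10)
print(enc.shape)                    # (1, 42): 2 coordinates x (1 + 2 * 10) features each
```

Raising L adds higher-frequency bands, which is why it can trade off against channel size in the experiments below.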
The following sequence visualizes the training process by plotting the predicted images across iterations.
*(Figure: predicted images across training iterations.)*
One hyperparameter tuning experiment increased L from 10 to 15 while decreasing the channel size from 256 to 128. This change barely affected the network's performance, as the increase in L seemed to compensate for the smaller channel size.
*(Figure: predicted images with L = 15 and channel size 128.)*
The same optimization was performed on another image. Here the learning rate was increased to 2e-2 because training did not appear to have converged.
*(Figure: training progression on the second image.)*
This time, I tried the opposite for hyperparameter tuning: decreasing L from 10 to 5 and increasing the channel size from 256 to 400. As seen in the predicted outputs, this change produces smoother images that capture less high-frequency positional detail.
*(Figure: predicted outputs with L = 5 and channel size 400.)*
To begin, we define a few functions to transform among image, camera, and world coordinates:

- The `transform(c2w, x_c)` function multiplies camera coordinates by a camera-to-world transformation (extrinsic) matrix, which contains a rotation matrix and a translation vector, to obtain the corresponding world coordinates.
- The `pixel_to_camera(K, uv, s)` function transforms pixel coordinates into the camera coordinate system, scaled by depth s, using the pinhole camera's intrinsic matrix K defined by its focal length and principal point.
- The `ray_o, ray_d = pixel_to_ray(K, c2w, uv)` function uses the previous functions to convert pixel coordinates to world coordinates, then creates rays with an origin and a normalized direction.
- The `sample_rays(N, M)` function randomly samples M images, then N rays from each image.
- The `sample_along_rays(rays_o, rays_d, perturb)` function samples points along these rays; the points can be perturbed to help prevent overfitting during training.

The above steps are then integrated into the dataloading process, which randomly samples pixels from a dataset of images, and is visualized below.
Now, the Neural Radiance Field can be learned with a network that takes in a 3D world coordinate x and a 3D ray direction vector r_d, then outputs a predicted RGB color and a 1D density. This network is a deeper, more powerful MLP than the one used in 2D.
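A minimal sketch of such a network's forward pass, shown with only two hidden layers for brevity and with illustrative sizes (63 = 3·(1 + 2·10) encoded position features, 27 = 3·(1 + 2·4) encoded direction features, width 256) and random untrained weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(d_in, d_out):
    return rng.normal(0.0, 0.1, (d_in, d_out)), np.zeros(d_out)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

PE_X, PE_D, WIDTH = 63, 27, 256   # encoded position, encoded direction, channel size
W1 = layer(PE_X, WIDTH)
W2 = layer(WIDTH, WIDTH)
W_sigma = layer(WIDTH, 1)         # density head
W_rgb = layer(WIDTH + PE_D, 3)    # color head, conditioned on view direction

def nerf_forward(pe_x, pe_d):
    h = relu(pe_x @ W1[0] + W1[1])
    h = relu(h @ W2[0] + W2[1])
    sigma = relu(h @ W_sigma[0] + W_sigma[1])  # 1D density, non-negative
    rgb = sigmoid(np.concatenate([h, pe_d], axis=-1) @ W_rgb[0] + W_rgb[1])  # RGB in (0, 1)
    return rgb, sigma

rgb, sigma = nerf_forward(rng.normal(size=(4, PE_X)), rng.normal(size=(4, PE_D)))
```

Note the design choice this mirrors: density depends only on position, while color is additionally conditioned on view direction, so view-dependent effects cannot change geometry.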
To generate rendered colors, the volume rendering equation aggregates the batch of samples along each ray.
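The standard discrete form of this equation is C = Σᵢ Tᵢ(1 − exp(−σᵢδᵢ))cᵢ with transmittance Tᵢ = exp(−Σ_{j<i} σⱼδⱼ). A minimal sketch, assuming a uniform step size between samples:

```python
import numpy as np

def volrend(sigmas, rgbs, step_size):
    """Discrete volume rendering over samples along each ray:
    alpha_i = 1 - exp(-sigma_i * delta), T_i = prod_{j<i}(1 - alpha_j),
    C = sum_i T_i * alpha_i * c_i."""
    alphas = 1.0 - np.exp(-sigmas * step_size)        # (rays, samples, 1)
    trans = np.cumprod(1.0 - alphas + 1e-10, axis=1)  # running transmittance
    trans = np.concatenate([np.ones_like(trans[:, :1]), trans[:, :-1]], axis=1)  # T_1 = 1
    weights = trans * alphas
    return (weights * rgbs).sum(axis=1)               # (rays, 3) rendered colors

sigmas = np.array([[[1e9], [1e9]]])                   # two samples; the first is fully opaque
rgbs = np.array([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])
color = volrend(sigmas, rgbs, step_size=1.0)          # the red sample occludes the green one
```

The usage example checks the key behavior: a fully opaque first sample absorbs all transmittance, so later samples contribute nothing.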
The training process is visualized below, along with the PSNR curve on the validation set, recorded every 10 iterations.
*(Figure: training visualization and validation PSNR curve.)*
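PSNR itself is computed from the mean squared error; a small sketch, assuming images scaled to [0, 1]:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val**2 / mse)

# A uniform error of 0.1 on a [0, 1] image gives MSE = 0.01, i.e. 20 dB.
print(round(psnr(np.full((8, 8, 3), 0.5), np.full((8, 8, 3), 0.6)), 2))  # 20.0
```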
Finally, the network can be used to render a novel view of the scene from an arbitrary camera extrinsic.
A background color can be injected at the bottom of the rays in the volume rendering equation: any transmittance remaining after the final sample along a ray is assigned the background color.
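A hedged sketch of this injection, reusing the discrete rendering weights (function and variable names are illustrative): the transmittance left over after the final sample is multiplied by the background color.

```python
import numpy as np

def volrend_with_bg(sigmas, rgbs, step_size, bg_color):
    """Volume rendering with a background color: transmittance remaining
    after the final sample along each ray is assigned bg_color."""
    alphas = 1.0 - np.exp(-sigmas * step_size)
    trans = np.cumprod(1.0 - alphas + 1e-10, axis=1)
    shifted = np.concatenate([np.ones_like(trans[:, :1]), trans[:, :-1]], axis=1)
    fg = (shifted * alphas * rgbs).sum(axis=1)
    return fg + trans[:, -1] * bg_color   # trans[:, -1]: light that reaches the background

white = np.array([1.0, 1.0, 1.0])
color = volrend_with_bg(np.zeros((1, 4, 1)), np.zeros((1, 4, 3)), 1.0, white)  # empty ray
```

An entirely empty ray (zero density everywhere) renders as pure background, which is the intended limiting behavior.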