Matrices

The gooey guts of transformations.

Matrices for 3D Graphics

A matrix is, roughly speaking, a grid of numbers that can be added to and multiplied by another matrix. Advanced OpenGL use is aided by a solid understanding of them, as they are what we use to transform our 3D coordinates and project them onto a 2D screen.

Row-major vs Column-major order

First, let’s get one thing out of the way. Some coders like to use row-major order to make the data in their matrices easier to read in code form. For a matrix representation to be row-major means that the following code:

mat4 m(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16);

creates a matrix by filling the rows first:

$\begin{bmatrix} 1&2&3&4 \\\ 5&6&7&8 \\\ 9&10&11&12 \\\ 13&14&15&16 \end{bmatrix}$

Using this convention helps when creating matrices since the layout of the code matches the representation:

mat4 m( 1,  2,  3,  4,
        5,  6,  7,  8,
        9,  10, 11, 12,
        13, 14, 15, 16 );

This is all well and good; however, OpenGL and GLSL (the OpenGL Shading Language) specify their matrices in column-major order. So in a GLSL shader the equivalent constructor:

mat4 m = mat4(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16);

creates a matrix by filling the columns first:

$\begin{bmatrix} 1&5&9&13 \\\ 2&6&10&14 \\\ 3&7&11&15 \\\ 4&8&12&16 \end{bmatrix}$

Many graphics mathematics libraries such as three.js and glm (and MAT’s own allosystem) use column-major order in the code to avoid needing to transpose their matrices when copying them over to the GPU. Transposition turns rows into columns:

$\begin{bmatrix} 1&2&3&4 \\\ 5&6&7&8 \\\ 9&10&11&12 \\\ 13&14&15&16 \end{bmatrix}^T = \begin{bmatrix} 1&5&9&13 \\\ 2&6&10&14 \\\ 3&7&11&15 \\\ 4&8&12&16 \end{bmatrix}$

The code that accompanies this course will be strictly row-major, and so we will transpose our matrices when sending them over to the GPU. This is just my own design decision. In this case the clarity of using row-major representations outweighs the cost of transposition. You will make your own decision in your code; just don’t forget which one you chose! Also, when looking at other libraries, make sure you can identify which order they use to encode their matrices.
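As a minimal sketch of that transposition step, assuming a hypothetical row-major `Mat4` stored as 16 contiguous floats (a stand-in for illustration, not the course's actual `mat4` class), the conversion to the column-major layout that GLSL expects might look like this:

```cpp
#include <array>
#include <cstddef>

// Hypothetical 4x4 matrix stored row-major: element (row, col) is m[4 * row + col].
using Mat4 = std::array<float, 16>;

// Returns the transpose. Reading the result in memory order is the same as
// reading the original matrix column by column, which is what GLSL expects.
Mat4 transposed(const Mat4& m) {
    Mat4 t{};
    for (std::size_t row = 0; row < 4; ++row) {
        for (std::size_t col = 0; col < 4; ++col) {
            t[4 * col + row] = m[4 * row + col];
        }
    }
    return t;
}
```

On desktop OpenGL the `transpose` argument of `glUniformMatrix4fv` can perform the same conversion during upload. OpenGL ES 2.0 requires that argument to be `GL_FALSE`, so an explicit transpose like this one is the more portable route.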

Transformations

Let’s say we have built a mesh that is composed of a bunch of vectors $\mathbf{v}$ in 3D space. We say these vectors are in local space.

In order for the mesh to be presented on your 2D screen, each of these 3D vectors must undergo a transformation $f$ that maps it into screen space. Modern graphics cards are designed to optimize these transformations.

$f: \mathbf{v_{local}} \mapsto \mathbf{v_{screen}}$

The transformation function $f$ is composed of a series of matrix multiplications acting on a homogenized vector, followed by a viewport transform:

$f(\mathbf{v_{local}})= T_{viewport}[A_{projection} * A_{view} * A_{model} * h(\mathbf{v_{local}})]$

where $h: \{x,y,z\} \mapsto \{x,y,z,1\}$ is a function that adds a fourth component $w=1$ to the 3D vector. The viewport transform is handled by the graphics card, so all we need to worry about are those three $A$ matrices.

Each $A$ variable in the equation above is a 4x4 matrix. When these are multiplied together we get what is called the Model-View-Projection matrix.

$A_{modelViewProjection}=A_{projection}*A_{view} * A_{model}$

Let’s take a closer look at each of the components of the model-view-projection matrix.

The model matrix converts vectors in local space to vectors in object space
The view matrix converts vectors in object space to vectors in eye space
The projection matrix converts vectors in eye space to vectors in clip space

(Finally, the viewport transform, handled automatically by OpenGL, converts the vectors in clip space to vectors in screen space.)

Composing Transformation Sequences

One thing you may immediately notice is that our matrices are multiplied together in the reverse of the order in which they transform the input vector. A matrix that first transforms $\mathbf{v}$ by $A_1$ and then transforms that result by $A_2$ is encoded as $A_2*A_1$. It may be helpful to think of the equation itself as a box that takes an input vector on the far right and spits out an output on the left.

$\mbox{Output} \leftarrow \mbox{Second Transformation} \leftarrow \mbox{First Transformation} \leftarrow \mbox{Input}$

By the way, we call this multiplying together of matrices into a single matrix concatenation. Remember, when concatenating transformations, the first operation we want to apply is the last one to be multiplied.
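The ordering rule can be checked numerically. The sketch below assumes a hypothetical row-major `Mat4` and a `Vec4` treated as a column vector; the names `multiply`, `transform`, and `homogenize` are illustrative and not taken from the course code:

```cpp
#include <array>
#include <cstddef>

using Vec4 = std::array<double, 4>;
using Mat4 = std::array<double, 16>;  // row-major: element (r, c) is m[4 * r + c]

// Matrix product a * b.
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 out{};
    for (std::size_t r = 0; r < 4; ++r)
        for (std::size_t c = 0; c < 4; ++c)
            for (std::size_t k = 0; k < 4; ++k)
                out[4 * r + c] += a[4 * r + k] * b[4 * k + c];
    return out;
}

// Matrix-vector product m * v, with v treated as a column vector.
Vec4 transform(const Mat4& m, const Vec4& v) {
    Vec4 out{};
    for (std::size_t r = 0; r < 4; ++r)
        for (std::size_t k = 0; k < 4; ++k)
            out[r] += m[4 * r + k] * v[k];
    return out;
}

// The function h from the text: append w = 1 to a 3D vector.
Vec4 homogenize(double x, double y, double z) { return {x, y, z, 1.0}; }
```

With a uniform scale by 2 as $A_1$ and a translation by 5 along $x$ as $A_2$, the point $(1,0,0)$ goes to $x=7$ under $A_2*A_1$ (scale first, then translate) but to $x=12$ under $A_1*A_2$, so swapping the order of concatenation really does change the result.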

Model Matrix

Given a set of vectors in local space representing a mesh, the model matrix is used to transform those vectors into object space.

The Model Matrix can be thought of as being composed of three matrices, each of which can be broken down further into particular transformations like rotate, translate, and scale:

1 - The first component is the internal properties of the mesh being rendered: position, orientation, and size. These properties can be converted into a matrix by scaling, rotating and translating – in that order!

$\mbox{Output} \leftarrow \mbox{Translate} \leftarrow \mbox{Rotate} \leftarrow \mbox{Scale} \leftarrow \mbox{Input}$

This scaling, rotating, and translating is encoded by the frame matrix:

$A_{frame}= A_{translate} * A_{rotate} * A_{scale}$

2 - The second component consists of transformations applied to an external matrix. These are coordinate system transformations that can be applied in any order.

3 - The third component is the properties (position, orientation, and size) of the “parent” scene in which the mesh is being drawn. These are again computed in the translation * rotation * scaling order.

The final Model matrix can be described as:

$A_{model} = A_{sceneFrame} * A_{sceneMatrix} * A_{meshFrame}$

Scaling

The scaling matrix simply multiplies each $\{x,y,z\}$ component of the input vector by $\{sx,sy,sz\}$ respectively:

$A_{scaling} = \begin{bmatrix} sx & 0 & 0 & 0 \\\ 0 & sy & 0 & 0 \\\ 0 & 0 & sz & 0 \\\ 0 & 0 & 0 & 1 \end{bmatrix}$

Rotation

Rotating is a bit more complicated, but easy if we use quaternions (which we will). For a unit quaternion $q \in \mathbb{H}$ with values $\{w,x,y,z\}$, the corresponding rotation matrix, acting on column vectors as in the rest of these notes, is:

$A_{rotation} = \begin{bmatrix} 1-2(y^2+z^2) & 2(xy-wz) & 2(xz+wy) & 0 \\\ 2(xy+wz) & 1-2(x^2+z^2)& 2(yz-wx) & 0 \\\ 2(xz-wy) & 2(yz+wx) & 1-2(x^2+y^2)& 0 \\\ 0 & 0 & 0 & 1 \end{bmatrix}$

Translation

The translation matrix outputs a new vector that has been translated by $(tx,ty,tz)$:

$A_{translation} = \begin{bmatrix} 1 & 0 & 0 & tx \\\ 0 & 1 & 0 & ty \\\ 0 & 0 & 1 & tz \\\ 0 & 0 & 0 & 1 \end{bmatrix}$

TRS

A frame’s local model matrix (encoding position, orientation, and size) is a translation-rotation-scale matrix. If we do the math, that is, if we multiply these three matrices together, we get:

$A_{frame} = \begin{bmatrix} sx(1-2(y^2+z^2)) & 2sy(xy-wz) & 2sz(xz+wy) & tx \\\ 2sx(xy+wz) & sy(1-2(x^2+z^2))& 2sz(yz-wx) & ty \\\ 2sx(xz-wy) & 2sy(yz+wx) & sz(1-2(x^2+y^2))& tz \\\ 0 & 0 & 0 & 1 \end{bmatrix}$
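A minimal sketch of building the translate * rotate * scale product in one step follows. It assumes a hypothetical row-major `Mat4`, a unit-length quaternion, and the common convention in which the matrix acts on column vectors and a positive angle rotates counterclockwise about the axis. The signs of the off-diagonal $w$ terms differ between sources, so check them against your own quaternion class. The names `Quat`, `Vec3`, and `makeFrame` are illustrative only.

```cpp
#include <array>

using Mat4 = std::array<double, 16>;  // row-major: element (r, c) is m[4 * r + c]

struct Quat { double w, x, y, z; };   // assumed to be unit length
struct Vec3 { double x, y, z; };

// Builds translate * rotate * scale directly: each rotation column is
// multiplied by one scale factor, and the translation fills the last column.
Mat4 makeFrame(const Vec3& t, const Quat& q, const Vec3& s) {
    const double xx = q.x * q.x, yy = q.y * q.y, zz = q.z * q.z;
    const double xy = q.x * q.y, xz = q.x * q.z, yz = q.y * q.z;
    const double wx = q.w * q.x, wy = q.w * q.y, wz = q.w * q.z;
    return {
        s.x * (1 - 2 * (yy + zz)), s.y * 2 * (xy - wz),       s.z * 2 * (xz + wy),       t.x,
        s.x * 2 * (xy + wz),       s.y * (1 - 2 * (xx + zz)), s.z * 2 * (yz - wx),       t.y,
        s.x * 2 * (xz - wy),       s.y * 2 * (yz + wx),       s.z * (1 - 2 * (xx + yy)), t.z,
        0, 0, 0, 1
    };
}
```

Writing the product out this way avoids two 4x4 multiplications per frame, at the cost of hard-coding the scale, rotate, translate order.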

View

The view matrix takes the object-space coordinates of the model and converts them into eye-space coordinates relative to the virtual camera. For stereoscopic rendering, this happens once for each eye.

The view matrix is typically made by identifying three vectors:

$E$: The eye position vector.
$T$: The target coordinate at which to look.
$U$: The up vector specifying which way is up.

Then, given $f=E-T$, $r=U \times f$, and $u=f \times r$, we normalize $f$, $r$, and $u$ and plug them into the matrix:

$A_{view} = \begin{bmatrix} r_x & r_y & r_z & -r \cdot E \\\ u_x & u_y & u_z & -u \cdot E \\\ f_x & f_y & f_z & -f \cdot E \\\ 0 & 0 & 0 & 1 \end{bmatrix}$
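The construction above can be sketched as a small look-at helper. The sketch assumes a hypothetical row-major `Mat4` and a bare `Vec3` struct, and takes $r = U \times f$ and $u = f \times r$ with $f$ pointing from the target back toward the eye. The helper names are illustrative, not the course's API.

```cpp
#include <array>
#include <cmath>

using Mat4 = std::array<double, 16>;  // row-major: element (r, c) is m[4 * r + c]
struct Vec3 { double x, y, z; };

Vec3 subtract(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
double dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
Vec3 normalize(const Vec3& v) {
    const double len = std::sqrt(dot(v, v));
    return {v.x / len, v.y / len, v.z / len};
}

// Builds the view matrix from eye position E, target T, and up vector U.
// Not valid when U is parallel to the viewing direction (the cross product vanishes).
Mat4 lookAt(const Vec3& eye, const Vec3& target, const Vec3& up) {
    const Vec3 f = normalize(subtract(eye, target));  // from the target back to the eye
    const Vec3 r = normalize(cross(up, f));           // camera's right axis
    const Vec3 u = cross(f, r);                       // unit length, since f and r are orthonormal
    return {
        r.x, r.y, r.z, -dot(r, eye),
        u.x, u.y, u.z, -dot(u, eye),
        f.x, f.y, f.z, -dot(f, eye),
        0,   0,   0,   1
    };
}
```

For an eye at $(0,0,5)$ looking at the origin with $U=(0,1,0)$, the rotation part is the identity and the last column is $(0,0,-5)$, which places the target five units in front of the camera along the negative $z$ axis.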

Projection

First, a definition. A frustum is a pyramid formed by drawing lines from the camera position to the four corners of the image plane. To represent one we need to know the distance of the image plane from the camera, and the distance of each edge (left, right, top, bottom) from the center. In symmetric frustums, the left offset is equal to the right offset, and the top offset is equal to the bottom offset. This simplifies the matrix construction. However, when creating stereoscopic content we will need asymmetric frustums, and so the full frustum matrix remains valuable.

We will make a matrix that takes a frustum and returns clip space coordinates that the GPU then uses internally to calculate the final screen space.

For details on the rationale behind the math for constructing a matrix representation from the frustum, see songho’s site.

$A_{projection} = \begin{bmatrix} \frac{2 \cdot near}{right-left} & 0 & \frac{right+left}{right-left} & 0 \\\ 0 & \frac{2 \cdot near}{top-bottom} & \frac{top+bottom}{top-bottom} & 0 \\\ 0 & 0 & -\frac{far+near}{far-near} & -\frac{2 \cdot far \cdot near}{far-near} \\\ 0 & 0 & -1 & 0 \end{bmatrix}$
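A sketch of the general frustum matrix in code, using the same layout as the classic `glFrustum` call. The row-major `Mat4` alias, the function name `frustum`, and the parameter names `nearDist` and `farDist` are assumptions made for this example:

```cpp
#include <array>
#include <cassert>

using Mat4 = std::array<double, 16>;  // row-major: element (r, c) is m[4 * r + c]

// Perspective projection for an arbitrary (possibly asymmetric) frustum.
// left/right/bottom/top are the edges of the image plane at distance nearDist.
Mat4 frustum(double left, double right, double bottom, double top,
             double nearDist, double farDist) {
    // Guard against degenerate frustums that would divide by zero.
    assert(right != left && top != bottom && farDist != nearDist);
    const double w = right - left;
    const double h = top - bottom;
    const double d = farDist - nearDist;
    return {
        2.0 * nearDist / w, 0.0,                (right + left) / w,       0.0,
        0.0,                2.0 * nearDist / h, (top + bottom) / h,       0.0,
        0.0,                0.0,                -(farDist + nearDist) / d, -2.0 * farDist * nearDist / d,
        0.0,                0.0,                -1.0,                      0.0
    };
}
```

The third-column terms vanish when the frustum is symmetric, which is what makes the simplified form possible.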

Typical mono (non-stereo), single-context (non-tiled) perspective matrices can be made with a symmetric frustum, where $left=-right$ and $bottom=-top$. This simplifies the above formula to:

$A_{projection} = \begin{bmatrix} \frac{near}{right} & 0 & 0 & 0 \\\ 0 & \frac{near}{top} & 0 & 0 \\\ 0 & 0 & -\frac{far+near}{far-near} & -\frac{2 \cdot far \cdot near}{far-near} \\\ 0 & 0 & -1 & 0 \end{bmatrix}$

Typically in such situations we do not specify the edges of our frame, but rather feed in a field of view (FOV), which specifies the angle, in degrees, that our lens covers. This angle can refer to the vertical FOVY (from the top of the frame to the bottom) or the horizontal FOVX (from the left of the frame to the right). If we have a vertical FOVY converted to radians, $\theta$, and an aspect ratio $ratio$ (the width of the screen divided by its height), we can calculate the parameters needed by the symmetric projection matrix. The tangent of half the field-of-view angle, $\mbox{tan}\frac{\theta}{2}$, is the ratio $\frac{top}{near}$, so its reciprocal $\frac{1}{\mbox{tan}\frac{\theta}{2}}$ is the ratio $\frac{near}{top}$. Since $right = ratio \cdot top$, we also have $\frac{near}{right} = \frac{1}{ratio \cdot \mbox{tan}\frac{\theta}{2}}$:

$A_{projection} = \begin{bmatrix} \frac{1}{ratio \cdot \mbox{tan} \frac{\theta}{2}} & 0 & 0 & 0 \\\ 0 & \frac{1}{\mbox{tan}\frac{\theta}{2}} & 0 & 0 \\\ 0 & 0 & -\frac{far+near}{far-near} & -\frac{2 \cdot far \cdot near}{far-near} \\\ 0 & 0 & -1 & 0 \end{bmatrix}$
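The field-of-view form can be sketched as follows. The function name `perspective`, its parameter names, and the row-major `Mat4` alias are assumptions for this example. The vertical field of view is taken in degrees and converted to radians inside.

```cpp
#include <array>
#include <cmath>

using Mat4 = std::array<double, 16>;  // row-major: element (r, c) is m[4 * r + c]

// Symmetric perspective projection from a vertical field of view (degrees)
// and an aspect ratio (width divided by height).
Mat4 perspective(double fovyDegrees, double aspect, double nearDist, double farDist) {
    const double pi = 3.14159265358979323846;
    const double theta = fovyDegrees * pi / 180.0;    // convert to radians
    const double cot = 1.0 / std::tan(theta / 2.0);   // equals near / top
    const double d = farDist - nearDist;
    return {
        cot / aspect, 0.0, 0.0,                       0.0,
        0.0,          cot, 0.0,                       0.0,
        0.0,          0.0, -(farDist + nearDist) / d, -2.0 * farDist * nearDist / d,
        0.0,          0.0, -1.0,                      0.0
    };
}
```

A 90 degree vertical field of view gives $\mbox{tan}\frac{\theta}{2}=1$, so the $y$ scale is approximately 1 and the $x$ scale is approximately $\frac{1}{ratio}$.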

Asymmetric Frustums (for Virtual Environments)

If we want to make multi-tiled displays or stereoscopic scenes for VR, we need to calculate all kinds of different frustums. For details on the rationale for making asymmetric frustums for the left and right eyes in stereoscopic rendering, see Paul Bourke’s series of papers and take a look at some of his slides.

The construction is similar to that of generic frustums, with special care taken to calculate appropriate right and left border values. An additional parameter, the focal distance, is also introduced. The lens shift is proportional to the ratio of the near clipping distance to this focal distance.

$A_{projection} = \begin{bmatrix} \frac{1}{ratio \cdot \mbox{tan} \frac{\theta}{2}} & 0 & \frac{right+left}{right-left} & 0 \\\ 0 & \frac{1}{\mbox{tan}\frac{\theta}{2}} & 0 & 0 \\\ 0 & 0 & -\frac{far+near}{far-near} & -\frac{2 \cdot far \cdot near}{far-near} \\\ 0 & 0 & -1 & 0 \end{bmatrix}$
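One way to obtain the $left$ and $right$ values for each eye is the off-axis construction described in Paul Bourke's stereo notes. The sketch below is an interpretation of that approach under stated assumptions, not code from the course. The names `FrustumBounds`, `stereoBounds`, `eyeSeparation`, and `eyeSign` are hypothetical, and the eye is assumed to be displaced horizontally by half the eye separation.

```cpp
#include <cmath>

struct FrustumBounds { double left, right, bottom, top; };

// Image-plane edges at the near plane for one eye of a stereo pair.
// eyeSign is -1 for the left eye and +1 for the right eye.
FrustumBounds stereoBounds(double fovyRadians, double aspect, double nearDist,
                           double focalDistance, double eyeSeparation, int eyeSign) {
    const double top = nearDist * std::tan(fovyRadians / 2.0);
    const double halfWidth = aspect * top;
    // The eye offset is projected from the focal plane onto the near plane,
    // so the shift scales with nearDist / focalDistance.
    const double shift = eyeSign * 0.5 * eyeSeparation * nearDist / focalDistance;
    return {-halfWidth - shift, halfWidth - shift, -top, top};
}
```

Because both edges move by the same amount, $right-left$ is unchanged and only the $\frac{right+left}{right-left}$ term differs between the two eyes, with opposite signs.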