For the course on game engines I teach in the IT University of Copenhagen's MSc in Games, one project is to implement a 3d wireframe software renderer, without using OpenGL or other existing APIs, in order to fully understand what's going on behind the scenes in 3d rendering.
This document gives a concise overview of the series of mathematical transforms needed to do that: to take triangles, given by their coordinates in 3d space, and turn them into a wireframe view on a 2d screen, as seen from a movable camera. I aim to provide all the details needed to implement a renderer from scratch, without using existing libraries, while being much more concise than the lengthy mathematical treatments given in a typical graphics textbook (some of those are linked at the end of this document, for those wanting further information).
(I do apologize for the lack of illustrative diagrams, as I haven't had a chance to make any yet.)
A geometric transform is simply an operation that moves a point or set of points: translates them, rotates them about an axis, etc. Conceptually, geometric transforms are just applications of trigonometry. We could use trigonometric formulas directly. For example, if we want to take a point $(x,y,z)$, and rotate it by $\theta$ degrees around the $z$ axis (i.e., rotate it in its $x$–$y$ plane), the resulting point is located at: \begin{align*} x' &= x \cos \theta - y \sin \theta\\ y' &= x \sin \theta + y \cos \theta\\ z' &= z \end{align*}
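To make the formula concrete, here is a minimal sketch in Python (the language, the function name rotate_point_z, and the use of the standard math module are my own choices for illustration, not part of the course material):

import math

def rotate_point_z(point, theta):
    """Rotate a 3d point (x, y, z) by theta radians around the z axis."""
    x, y, z = point
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta),
            z)

# Rotating (1, 0, 0) by 90 degrees should give (approximately) (0, 1, 0).
print(rotate_point_z((1, 0, 0), math.radians(90)))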
It's convenient to represent these in matrix form, however, because then they can be easily stored and chained. In addition, chains of transforms can be accumulated into a single precomputed matrix for future use: multiplying two transform matrices gives a new matrix that applies both of the original transforms.
There are two conventions for matrix-based transforms: we can represent our point as a row vector, or as a column vector. Either way results in a transform identical to the direct trigonometric formulas above.
With row vectors: \begin{equation*} \begin{bmatrix}x' & y' & z'\end{bmatrix} = \begin{bmatrix}x & y & z\end{bmatrix} \begin{bmatrix} \cos \theta & \sin \theta & 0 \\ -\sin \theta & \cos \theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \end{equation*}
With column vectors: \begin{equation*} \begin{bmatrix}x' \\ y' \\ z'\end{bmatrix} = \begin{bmatrix} \cos \theta & -\sin \theta & 0 \\ \sin \theta & \cos \theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix}x \\ y \\ z\end{bmatrix} \end{equation*}
If you multiply them out, you can verify that both forms are equivalent to the three equations at the top. The transformation matrix for a column vector is the transpose of the transformation matrix for a row vector (and vice versa). Which convention you choose is arbitrary, but make sure you use the right transformation matrices for the convention you're using! In this document, we'll use the column-vector convention.
The choice of convention also impacts the direction of chaining. As can be seen above, the transformation matrix for row vectors is multiplied on the right, and in general chains towards the right. For column vectors, the transformation matrix is multiplied on the left, and transforms chain towards the left. This can be important to remember when order of application makes a difference.
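As a small illustration of the two conventions and the chaining direction, here is a sketch assuming Python with numpy (the variable names are mine):

import numpy as np

theta = np.radians(30)
c, s = np.cos(theta), np.sin(theta)

# Column-vector convention: the matrix multiplies the point from the left.
R_col = np.array([[c, -s, 0],
                  [s,  c, 0],
                  [0,  0, 1]])
# Row-vector convention: the transpose of the same matrix, multiplied from the right.
R_row = R_col.T

p = np.array([1.0, 2.0, 3.0])
print(R_col @ p)   # column-vector form
print(p @ R_row)   # row-vector form; same result

# Chaining with column vectors: the transform applied first sits rightmost.
phi = np.radians(60)
R60 = np.array([[np.cos(phi), -np.sin(phi), 0],
                [np.sin(phi),  np.cos(phi), 0],
                [0,            0,           1]])
combined = R60 @ R_col   # rotate by 30 degrees first, then by 60 degrees
print(combined @ p)      # same as rotating p by 90 degrees in one go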
A number of 3d transforms can be represented as $3 \times 3$ matrices like the rotation example given in the previous section. These are called the linear transforms (in the same sense as linear algebra). In computer graphics, these almost cover the common space of transforms, with one important one missing: translation, when we simply move a point by an offset. The set of linear transforms, plus translation, is called the affine transforms.
In equation form, translation of a point $(x,y,z)$ by an offset $(t_x,t_y,t_z)$ produces a new point: \begin{align*} x' &= x + t_x\\ y' &= y + t_y\\ z' &= z + t_z \end{align*}
You can verify that there's no way to write a $3 \times 3$ transformation matrix that is equivalent to those equations. We would really like to be able to encode all affine transforms in transformation matrices, however, so that we have a uniform representation for transforms.
The solution is to work in an augmented space called homogeneous coordinates. We augment our points to 4 dimensions, by adding a "dummy" coordinate, $w$; for our purposes, this is always normalized to 1. These new 4-dimensional points therefore have $4 \times 4$ transformation matrices instead of $3 \times 3$ ones. In this augmented space, it turns out to be possible to encode some kinds of transforms of 3-dimensional points that can't be encoded directly in 3d vectors with $3 \times 3$ transformation matrices.
Translation can now be defined: \begin{equation*} \begin{bmatrix}x' \\ y' \\ z' \\ w'\end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix}x \\ y \\ z \\ w\end{bmatrix} \end{equation*}
In addition, we can easily "upgrade" all $3 \times 3$ linear transformation matrices to $4 \times 4$ homogeneous transformation matrices, by adding $\begin{bmatrix}0 & 0 & 0 & 1\end{bmatrix}$ as the last row and column (that just sets $w'=w$, and keeps the definitions of $x',y',z'$ the same as before, ignoring $w$). The next section shows the homogeneous-coordinate forms of all the major affine transforms.
Note that, while we initially set $w$ to 1, some kinds of transforms might change it (affine transforms don't, but the perspective transform does). Therefore, after a series of transforms, and before the resulting point is used, we need to divide through by $w'$ to normalize it back to 1—we'll mention this again in the section on the perspective transform and clip space.
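As an illustration, here is a minimal sketch, assuming Python with numpy, of building the translation matrix and of "upgrading" a $3 \times 3$ matrix to homogeneous form (the helper names translation and to_homogeneous are my own):

import numpy as np

def translation(tx, ty, tz):
    """4x4 homogeneous translation matrix."""
    m = np.eye(4)
    m[:3, 3] = [tx, ty, tz]
    return m

def to_homogeneous(m3):
    """Upgrade a 3x3 linear transform to a 4x4 homogeneous transform."""
    m4 = np.eye(4)
    m4[:3, :3] = m3
    return m4

# A point augmented with the dummy coordinate w = 1.
p = np.array([1.0, 2.0, 3.0, 1.0])
print(translation(10, 0, -5) @ p)   # -> [11.  2. -2.  1.]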
Transformation matrices are usually built up out of several base transforms, which can then be chained together to produce more complex transforms. This section lists the most common ones.
Rotate by $\theta$ around the $x$ axis: \begin{equation*} \begin{bmatrix}x' \\ y' \\ z' \\ w'\end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos \theta & -\sin \theta & 0 \\ 0 & \sin \theta & \cos \theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix}x \\ y \\ z \\ w\end{bmatrix} \end{equation*}
Rotate by $\theta$ around the $y$ axis: \begin{equation*} \begin{bmatrix}x' \\ y' \\ z' \\ w'\end{bmatrix} = \begin{bmatrix} \cos \theta & 0 & \sin \theta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin \theta & 0 & \cos \theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix}x \\ y \\ z \\ w\end{bmatrix} \end{equation*}
Rotate by $\theta$ around the $z$ axis: \begin{equation*} \begin{bmatrix}x' \\ y' \\ z' \\ w'\end{bmatrix} = \begin{bmatrix} \cos \theta & -\sin \theta & 0 & 0 \\ \sin \theta & \cos \theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix}x \\ y \\ z \\ w\end{bmatrix} \end{equation*}
Translate by an offset $(t_x,t_y,t_z)$: \begin{equation*} \begin{bmatrix}x' \\ y' \\ z' \\ w'\end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix}x \\ y \\ z \\ w\end{bmatrix} \end{equation*}
Scale by a factor $(s_x,s_y,s_z)$: \begin{equation*} \begin{bmatrix}x' \\ y' \\ z' \\ w'\end{bmatrix} = \begin{bmatrix} s_x & 0 & 0 & 0 \\ 0 & s_y & 0 & 0 \\ 0 & 0 & s_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix}x \\ y \\ z \\ w\end{bmatrix} \end{equation*}
For scaling by a uniform factor $s$, just set $s_x=s_y=s_z=s$. It's common for game engines to support only uniform scaling, because it simplifies some things. For example, if an engine only supports uniform scaling, spheres will always stay spherical, so we can safely assume spherical colliders will continue to work. Note that scaling happens around the origin, so objects not centered at the origin will have their centers move. They'll move towards the origin when scaled by factors less than 1, or away from the origin when scaled by factors greater than 1. To avoid this, either scale the object when it's located at the origin, or perform a translation afterwards to move the object back to where it should be.
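The base transforms above map directly to small helper functions. Here is one possible sketch in Python with numpy, with angles in radians (the function names are my own, not prescribed by anything in particular):

import numpy as np

def rotate_x(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, c,   -s,  0.0],
                     [0.0, s,    c,  0.0],
                     [0.0, 0.0, 0.0, 1.0]])

def rotate_y(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c,   0.0, s,   0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [-s,  0.0, c,   0.0],
                     [0.0, 0.0, 0.0, 1.0]])

def rotate_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c,   -s,  0.0, 0.0],
                     [s,    c,  0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])

def translate(tx, ty, tz):
    m = np.eye(4)
    m[:3, 3] = [tx, ty, tz]
    return m

def scale(sx, sy, sz):
    return np.diag([sx, sy, sz, 1.0])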
Models typically start in model space, also called object space. Each model sits in its own model space, usually centered at the origin, as exported from a 3d modeling program.
A scene has a single shared coordinate space, the world space, which may contain many objects. Objects are inserted into the scene, but if they were just inserted directly, they would all end up at the origin, and possibly at strange sizes (depending on what units were used in the modeling program). An object is transformed from model space to world space by scaling, translating, and rotating it, so it reaches the size, location, and orientation it should have in the world.
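In matrix terms, the model-to-world step is just a chain of the base transforms. Here is a sketch of one possible composition (Python with numpy; the particular scale, rotation, and offset values are arbitrary examples of mine):

import numpy as np

def translate(tx, ty, tz):
    m = np.eye(4)
    m[:3, 3] = [tx, ty, tz]
    return m

def rotate_y(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c,   0.0, s,   0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [-s,  0.0, c,   0.0],
                     [0.0, 0.0, 0.0, 1.0]])

def scale_uniform(s):
    return np.diag([s, s, s, 1.0])

# Scale first, then rotate, then move into place. With column vectors the
# rightmost matrix is applied first.
model_to_world = translate(4, 0, -10) @ rotate_y(np.radians(45)) @ scale_uniform(2.0)

vertex_model = np.array([1.0, 1.0, 1.0, 1.0])   # a vertex in model space
vertex_world = model_to_world @ vertex_model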
To render a scene, we need to view the world from somewhere, which we think of as the "camera". The camera is located at some position in the world (in world coordinates), and has an orientation. There are different ways an engine can use to specify the orientation. Here, we'll use the look-at approach, which is a common one. In that approach, the camera is oriented by specifying two things. First, a look point (also in world coordinates), specifies where the camera is pointing. Second, an up direction specifies how to rotate the scene so the correct side is at the top of the screen. In many games, the up direction is just $(0,1,0)$, with the $y$ axis in world space being the up direction (exceptions include things like flight simulators, where which way is up might rotate when the plane banks).
The world as it's seen from the camera's vantage point is the view space. In view space, the camera is located at the origin and, by convention, faces perpendicular to the $x$–$y$ plane. That way, $x$ and $y$ coordinates will eventually map onto screen $x$ and $y$ coordinates, while $z$ will represent depth into the scene (distance from the camera).
To transform objects in our scene from world space into view space, we apply a series of transformations to all elements of the scene. The elements all start out in world space, and the series of transformations converts their coordinates into view-space coordinates.
The first transformation is easy. The camera is located at position $\vec{c} = (c_x,c_y,c_z)$ in world space. In view space, we want it to be at the origin, $(0,0,0)$. This simply requires translating by an offset of $-\vec{c} = (-c_x,-c_y,-c_z)$, using the translation matrix given earlier.
Next, we need to point the camera at the look point, and orient it so the correct side is up. To calculate the transformation matrix for this step requires a little vector mathematics.
In world-space coordinates, the camera is at $\vec{c}$, and should be looking at the look point, $\vec{\ell}$. That means it should be looking in direction $\vec{d} = \vec{\ell} - \vec{c}$, in world space. If we normalize $\vec{n} = \frac{\vec{d}}{|\vec{d}|}$ so it has length 1, this gives us a direction vector, called the look direction, which is what we want to align with the $z$ axis. It's also called the view plane normal, because the direction we're looking is perpendicular to ("normal to") the desired view plane.
At this point, we could begin applying rotations: find the $x$-axis and $y$-axis angles between $\vec{n}$ and the $z$ axis, and rotate so that the look direction coincides with the $z$ axis. However, instead we'll fix the rest of the camera's orientation and do the transformation all at once. The idea is to think of the view transformation as an overall change of coordinates: we start out with $x$–$y$–$z$ coordinates, and we want to transform into a new axis system. We already have one of the three axes, with $\vec{n}$ representing the new $z$, so we just need the other two.
How do we find these axes? The other piece of information we have is the up vector. It is not necessarily orthogonal to the look direction, so it is not one of the new axes itself. However, a version of the up vector, one that lies within the view plane, should become our new $y$ axis. We can get that in two steps, using cross products, and in the process get the third axis as well. First, we compute the (again normalized) cross product $$\vec{u} = \frac{\vec{up} \times \vec{n}}{|\vec{up} \times \vec{n}|}$$
By the definition of cross products, $\vec{u}$ is orthogonal to both $\vec{up}$ and $\vec{n}$. Since it's orthogonal to the view plane's normal, it must be in the view plane. And since it's orthogonal to the up direction, it must point to the side of the view plane. Therefore, it's the new $x$ axis. Finally, from this we can get the new $y$ axis by taking another cross product, $\vec{v} = \vec{n} \times \vec{u}$. Since this is orthogonal to both the view plane's normal, and to a vector pointing to the side of the view plane, it must point up in the view plane.
Now we have a new three-axis coordinate system, defined by the $\vec{u}$–$\vec{v}$–$\vec{n}$ direction vectors. To transform the scene into view space, we need to transform the objects from world-space coordinates to this new coordinate system. This could be done by a series of rotations, but it can also be done by constructing a single change-of-coordinate matrix from these three direction vectors which represent the new axes: \begin{equation*} \begin{bmatrix}x' \\ y' \\ z' \\ w'\end{bmatrix} = \begin{bmatrix} u_x & u_y & u_z & 0 \\ v_x & v_y & v_z & 0 \\ n_x & n_y & n_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix}x \\ y \\ z \\ w\end{bmatrix} \end{equation*}
After multiplying an object by this transformation matrix, its coordinates are now in view space.
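As a sketch of how the two view-space steps might look in code (Python with numpy assumed; the function names are mine, and anticipate the cameraLocationTransform and cameraLookTransform matrices used in the pseudocode at the end of this document):

import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def camera_location_transform(camera):
    """Translate the world so the camera sits at the origin."""
    m = np.eye(4)
    m[:3, 3] = -np.asarray(camera, dtype=float)
    return m

def camera_look_transform(camera, look_at, up):
    """Change of basis into the u-v-n axes derived above."""
    n = normalize(np.asarray(look_at, dtype=float) - np.asarray(camera, dtype=float))
    u = normalize(np.cross(up, n))
    v = np.cross(n, u)   # already unit length, since n and u are orthogonal unit vectors
    m = np.eye(4)
    m[0, :3] = u
    m[1, :3] = v
    m[2, :3] = n
    return m

# World-to-view transform: translate first, then change basis (column-vector
# convention, so the rightmost matrix is applied first).
# view = camera_look_transform(c, l, up) @ camera_location_transform(c)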
There is a final transform, a projection into normalized device coordinates (NDC). In NDC, all coordinates within the viewing space are mapped to a cube, spanning from $-1$ to $1$ along each axis. This requires transforming the viewing frustum, which is a chopped-pyramid shape defining the 3d region of space that the camera sees, into a cube.
We'll restrict ourselves to the common case of a symmetric viewing frustum. When symmetric, it is defined by four values: the near and far distances specify the near and far planes, between which objects are viewable. Then, the width and height specify the dimensions of the near viewing plane.
How do we choose those? The near and far distances are chosen freely, based on what range of depths we want to be viewable. They are given as negative numbers, because in a right-handed coordinate system (the default used by most modeling software and engines), the $z$ axis going into the screen is negative. (Note that some APIs hide this step. For example, you give OpenGL positive values for the near and far planes, and it internally negates them.) The width and height control the aspect ratio and the field of view. We can also go the other way around, and calculate width and height from a desired aspect ratio and a desired field of view. For example, it's common to use a horizontal field of view of 75 or 90 degrees, and an aspect ratio equal to the aspect ratio of your screen. If you have a desired horizontal field of view of $fov$ and a width-to-height aspect ratio of $r$, then by basic geometry, \begin{align*} width &= -2 \cdot near \cdot \tan(\frac{fov}{2})\\ height &= \frac{width}{r} \end{align*}
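In code, that calculation might look like this (a small Python sketch; the negative near value follows the sign convention above, and the function name is my own):

import math

def near_plane_size(near, fov_degrees, aspect_ratio):
    """Width and height of the near viewing plane, from a horizontal field of
    view (in degrees) and a width-to-height aspect ratio. near is negative,
    following the convention in the text."""
    width = -2.0 * near * math.tan(math.radians(fov_degrees) / 2.0)
    height = width / aspect_ratio
    return width, height

# Example: 90 degree horizontal fov, 16:9 aspect ratio, near plane at -0.1.
print(near_plane_size(-0.1, 90.0, 16.0 / 9.0))   # roughly (0.2, 0.1125)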
Now, we have near, far, width, and height. Given those four values, we need to do four things, and an optional but common fifth. They are: 1) scale the $x$ axis to $[-1,1]$ according to the width and distance of the near plane; 2) scale the $y$ axis to $[-1,1]$ according to the height and distance of the near plane; 3) center the $z$ axis between the near and far planes and scale it to $[-1,1]$; and finally 4) apply perspective scaling so objects further from the viewer look smaller. In addition, we usually reflect the scene so that positive $z$ goes into the screen: this converts it to a left-handed coordinate system where $z$ depths are positive, which is more convenient for rendering.
We won't go over the mathematical derivation of the projection matrix that performs these transformations (the references at the end of this document do go over it), but the end result is: \begin{equation*} \begin{bmatrix}x' \\ y' \\ z' \\ w'\end{bmatrix} = \begin{bmatrix} \frac{2 \cdot near}{width} & 0 & 0 & 0\\ 0 & \frac{2 \cdot near}{height} & 0 & 0 \\ 0 & 0 & \frac{-(far+near)}{far-near} & \frac{-2 \cdot far \cdot near}{far-near} \\ 0 & 0 & -1 & 0 \end{bmatrix} \begin{bmatrix}x \\ y \\ z \\ w\end{bmatrix} \end{equation*}
Note that the perspective transform, unlike the previous transforms we discussed, changes $w$, so the new $w'$ will not be 1 anymore. This requires normalizing, by dividing through by $w'$.
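Here is a sketch of the projection matrix above and the divide by $w'$, again in Python with numpy (the function names are my own):

import numpy as np

def perspective_transform(near, far, width, height):
    """The projection matrix shown above; near and far are negative, as in the text."""
    return np.array([
        [2 * near / width, 0.0,               0.0,                          0.0],
        [0.0,              2 * near / height, 0.0,                          0.0],
        [0.0,              0.0,               -(far + near) / (far - near), -2 * far * near / (far - near)],
        [0.0,              0.0,               -1.0,                         0.0],
    ])

def to_ndc(clip):
    """Divide a clip-space point (x', y', z', w') through by w'."""
    return clip / clip[3]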
An optional optimization: we can actually remove points that are outside the viewable region (called frustum culling, or clipping) before normalizing. The space of un-normalized homogeneous coordinates that we get after multiplying by the projection matrix, but before normalizing, is called the clip space. A point should be culled if its clip-space coordinate on any axis is outside the range $[-w',w']$. This saves us from wasting division operations on points that are just going to be culled anyway. The reason we can do this is simple. In normalized device coordinates, the viewable space has been mapped to the cube that spans $[-1,1]$ on each axis. Therefore, before the normalization by $w'$, this is equivalent to a cube spanning $[-w',w']$ on each axis. So frustum culling becomes a simple comparison with $w'$. Then we normalize as usual, but only the points that weren't culled.
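The culling test itself is just a few comparisons against $w'$. A sketch (the function name is mine):

def outside_frustum(clip):
    """True if a clip-space point (x', y', z', w') lies outside the viewable
    region on any axis, i.e. outside the range [-w', w']."""
    x, y, z, w = clip
    return (x < -w or x > w or
            y < -w or y > w or
            z < -w or z > w)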
Finally, we have normalized device coordinates, which are almost the final result. All that's left is to draw them to the screen. The $x$ and $y$ axes are mapped to screen pixels according to the width and height of the screen: $-1$ maps to the left edge and $+1$ to the right edge in $x$, and $-1$ to the bottom edge and $+1$ to the top edge in $y$ (screen coordinates usually have $y$ growing downward, so the $y$ axis is flipped in the process). To avoid distortion, these should be at the same aspect ratio that was used to construct the projection matrix. And, the $z$ axis provides the depth values indicating how far from the camera each point is. That would be needed if we were going beyond a wireframe renderer, and drawing opaque surfaces where it would be necessary to know which surfaces are behind others.
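A sketch of that mapping in Python (matching the screen-coordinate convention used in the pseudocode below, where $(0,0)$ is the top-left pixel; the function name is my own):

def ndc_to_screen(ndc_x, ndc_y, screen_width, screen_height):
    """Map NDC x and y in [-1, 1] to pixel coordinates, with (0, 0) at the
    top-left corner of the screen."""
    screen_x = (ndc_x + 1.0) * screen_width / 2.0
    screen_y = (1.0 - ndc_y) * screen_height / 2.0
    return screen_x, screen_y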
Tying it all together, to render a scene, we need four things: a list of triangles to render (in 3d world coordinates), the location of the camera in the world, a point that the camera is looking at, and the up direction of the camera.
We need to transform the triangles' 3d points to 2d screen coordinates, in order to draw them on screen (or to an image). The transformation from 3d coordinates to 2d points is carried out via the view and perspective transforms discussed above.
We can build the overall rendering transform out of three intermediate transformation matrices:
cameraLocationTransform: Sets the camera at the origin. This is just a translation matrix.

cameraLookTransform: Points the camera towards the look point, with the correct up direction. This one requires some vector math to build the transformation matrix, as explained in the "View space" section above.

perspectiveTransform: Adds perspective to the view and converts to normalized device coordinates. This transformation matrix is the one shown in the "Perspective projection" section above.
Once those matrices are computed, we can do wireframe rendering with this pseudocode:
foreach triangle in triangles {
    foreach vertex in triangle.vertices {
        // apply the view and perspective transforms
        vertexClipSpace = perspectiveTransform * cameraLookTransform * cameraLocationTransform * vertex

        // normalize by dividing through by the homogeneous coordinate w,
        // which gives normalized device coordinates
        vertexClipSpace.x = vertexClipSpace.x / vertexClipSpace.w
        vertexClipSpace.y = vertexClipSpace.y / vertexClipSpace.w
        vertexClipSpace.z = vertexClipSpace.z / vertexClipSpace.w

        if any of .x/.y/.z are outside the range [-1,1]
            skip to next vertex

        // now map [-1,1] into the screen coordinates (0,width) and (0,height),
        // where (0,0) is the top-left corner of the screen
        screenCoordinate.x = vertexClipSpace.x * (width/2.0) + (width/2.0)
        screenCoordinate.y = -vertexClipSpace.y * (height/2.0) + (height/2.0)

        drawPoint(screenCoordinate)
    }
    draw three lines connecting the three vertices' 2d coordinates
}
And that, which turns out not to be all that much code once you understand it, is a fully self-contained, wireframe 3d renderer.
This article is, by design, a bit of a whirlwind tour of the entire process. If you'd like a more detailed exposition, including mathematical derivations of each step, there are a number of good sources:
Game Engine Architecture by Jason Gregory (AK Peters, 2009), Chapter 3, "3d Math for Games", provides an overview of matrix mathematics and an introduction to transformation matrices, homogeneous coordinates, and the affine transforms.
Interactive Computer Graphics: A Top-Down Approach with Shader-Based OpenGL by Edward Angel and Dave Shreiner (Addison-Wesley, 2011), Chapter 3, "Geometric Objects and Transformations", also provides an overview of matrix mathematics and geometric transforms, with a somewhat more mathematics-heavy approach grounded in linear algebra.
In the same Angel & Shreiner book, Chapter 4, "Viewing", provides an extensive introduction to cameras, viewing, perspective transforms, and related topics.
Song Ho Ahn's OpenGL tutorials provide a description of what goes on inside OpenGL's implementation of 3d rendering. The section on "Transformation", in particular, traces the series of transformations from object coordinates through to screen coordinates.