Matrix Notation
In this post, the matrix notation used throughout this website will be explained in detail. Learning to read the matrix representation matters because it makes it possible to follow, step by step, what the various formulas mean.
- The Benefit of Matrix Representation
- The Style of Notation Used on this Website
- Extensions from Univariate to Multivariate
- Applications of the Matrix Format
The Benefit of Matrix Representation
The usefulness of matrix representation can be stated briefly. Introductory statistics courses explore univariate data (data with one variable, e.g., height) and briefly cover different types of statistical inference. To extend these analyses to multivariate data (data with more than one variable, e.g., height and weight), the same tools of analysis can be retained, with the exception that the data and calculations are expressed using matrices.
It is also important to understand the style of notation being used whenever it is necessary to look closely at a formula. Different authors use different styles of matrix notation, and any of them works as long as the same style is applied consistently throughout a formula.
The Style of Notation Used on this Website
For the remainder of the page, it will be assumed that basic linear algebra has already been learned (if you have not learned or need to review, please visit Khan Academy).
In multivariate statistics, matrices and vectors are written as follows. Consider the \(n \times p\) data matrix \(\textbf{X}\) (bold-faced, upper-case, non-italicized, Roman letter X), which can be visualized below,
\[\begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{bmatrix}.\]Each row of the matrix \(\textbf{X}\) can be thought of as an observation from a dataset, and each column corresponds to a variable of that observation. The matrix therefore contains \(n\) observations and \(p\) variables.
Let us then look at the first observation from the data matrix, \(\vec{x}_1\) (lower-case, italicized, Roman letter x, with a subscript 1, and a right arrow above). It is a \(p\times 1\) vector and can be seen below,
\[\begin{bmatrix} x_{1,1}\\ x_{1,2}\\ \vdots\\ x_{1,p} \end{bmatrix}.\]Notice that the vector is written vertically (as a column), which is an important nuance. To write the same vector horizontally, its transpose must first be taken; such a vector is denoted \(\vec{x}_1^\top\). It then follows that the data matrix \(\textbf{X}\) can also be represented as,
\[\begin{bmatrix} \vec{x}_1^\top\\ \vec{x}_2^\top\\ \vdots\\ \vec{x}_n^\top \end{bmatrix}.\]Now, let us look at a single entry, \(x_{i,j}\) (lower-case, italicized, Roman letter x with subscripts \(i\) and \(j\)). The subscripts indicate the \(i\)th row and \(j\)th column, so the second variable of the first observation is referenced by \(x_{1,2}\). The letters \(i\) and \(j\) are dummy indices, meaning they can be replaced by other letters such as \(k\) and \(l\).
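To make the notation concrete, here is a small sketch in `R` (the numbers are made up purely for illustration) showing how the data matrix, a single observation, its transpose, and a single entry line up with the symbols above.

```r
# A made-up 4 x 3 data matrix X: n = 4 observations, p = 3 variables
X <- matrix(c(1.2, 0.5, 3.1,
              2.4, 1.1, 2.8,
              0.9, 0.7, 3.5,
              1.8, 0.4, 2.9),
            nrow = 4, ncol = 3, byrow = TRUE)

x_1 <- X[1, ]   # the first observation, a vector of length p
t(x_1)          # its transpose, written horizontally as a 1 x p row
X[1, 2]         # the entry x_{1,2}: first observation, second variable
```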
Extensions from Univariate to Multivariate
In the univariate case, we are all familiar with the variance of a set of numbers being denoted \(\sigma^2\). Take, for example, the vector \(\vec{x}^\top = [1, 2, 3, 4, 5]\); its variance can be written as \(Var(\vec{x}) = 2.5\). For more complicated data, such as a data matrix \(\textbf{X}\) with \(n\) rows and \(p\) columns, the variance of \(\textbf{X}\) is written \(Var(\textbf{X})\). This is also denoted \(\Sigma\) (upper-case Greek letter Sigma) and is shown below,
\[\begin{bmatrix} \sigma_{1,1} & \sigma_{1,2} & \cdots & \sigma_{1,p} \\ \sigma_{2,1} & \sigma_{2,2} & \cdots & \sigma_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p,1} & \sigma_{p,2} & \cdots & \sigma_{p,p} \end{bmatrix}.\]This matrix is known as the variance-covariance matrix, and it is what makes it possible to extend analyses from univariate data to multivariate data. The notation \(\sigma_{1,2}\) denotes the covariance between the first and second variables, while the diagonal entries \(\sigma_{j,j}\) are the variances of the individual variables. Admittedly, notation can become confusing at times, and getting lost is often a sign that some step of the calculation has been misunderstood. For example, with the Mahalanobis distance, it is easy to forget that it is computed for one observation at a time rather than all at once for every row of the data matrix.
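In `R`, the univariate and multivariate versions of this idea sit side by side: `var()` on a vector returns the single number \(\sigma^2\), while `cov()` on a data matrix returns the \(p \times p\) variance-covariance matrix. The data below are made up for illustration.

```r
x <- c(1, 2, 3, 4, 5)
var(x)          # univariate variance: 2.5

# A made-up 5 x 2 data matrix: 5 observations of 2 variables
X <- matrix(c(1, 2,
              2, 4,
              3, 5,
              4, 4,
              5, 7),
            nrow = 5, ncol = 2, byrow = TRUE)
cov(X)          # the 2 x 2 variance-covariance matrix (Sigma)
```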
Applications of the Matrix Format
All of this may seem interesting at first, but you may begin to wonder: what is the benefit? Perhaps you are not a statistician, and you can rely on programming libraries to perform many of these calculations, which lets you skip over much of this material.
It is worth noting, though, that once you understand how to properly use the matrix representation of data, it will then be possible to perform faster and more efficient vectorized calculations in your programming. Keeping such notation in mind helps you become a more efficient programmer and analyst.
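As a small illustration of what "vectorized" means in practice, here is a sketch in `R` (with made-up data) that centers every row of a matrix, first with an explicit loop and then in a single vectorized step; both give the same result, but the vectorized form is shorter and typically faster.

```r
set.seed(1)
X <- matrix(rnorm(12), nrow = 4, ncol = 3)   # made-up 4 x 3 data matrix

# Loop version: subtract the column means one row at a time
centered_loop <- X
for (i in 1:nrow(X)) {
  centered_loop[i, ] <- X[i, ] - colMeans(X)
}

# Vectorized version: subtract the column means from all rows at once
centered_vectorized <- sweep(X, 2, colMeans(X))

all.equal(centered_loop, centered_vectorized)
```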
Let us take a look at the Mahalanobis distance that was mentioned earlier. Without getting deep into the details of what it is, we can look at how it’s utilized in the multivariate normal (Gaussian) distribution. The formula for the probability density function (pdf) of the multivariate normal distribution is as follows,
\[\mathcal{N}(\vec{x};\vec{\mu},\Sigma) = \frac{1}{\sqrt{(2\pi)^p |\Sigma|}}\exp\left[-\frac{1}{2}(\vec{x}-\vec{\mu})^\top\Sigma^{-1}(\vec{x}-\vec{\mu})\right].\]Within the exponential portion of the pdf is the following,
\[(\vec{x}-\vec{\mu})^\top\Sigma^{-1}(\vec{x}-\vec{\mu}),\]which is the square of the Mahalanobis distance. I have written some code in `R` to calculate the Mahalanobis distance portion of a multivariate normal pdf.
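A minimal sketch of such a function, using only base `R`, is shown below. The function and variable names follow the description that comes next, while the exact body (for example, using `solve()` to invert \(\Sigma\)) is just one reasonable way to implement it. Note that it returns the squared distance, which is exactly the quantity inside the exponential of the pdf.

```r
# Sketch of a vectorized (squared) Mahalanobis distance calculation in base R.
mahalanobis_distance <- function(X_matrix, mean_vector, variance_covariance_matrix) {

  # Helper variables
  n <- nrow(X_matrix)                               # number of observations
  one_vector <- matrix(1, nrow = n, ncol = 1)       # 1_n: an n x 1 vector of ones
  mean_matrix <- one_vector %*% t(mean_vector)      # each row is mu^T, repeated n times
  mean_subtracted_matrix <- X_matrix - mean_matrix  # (x - mu) for every row at once

  # Invert the variance-covariance matrix once, outside of any loop
  sigma_inverse <- solve(variance_covariance_matrix)

  # Finish (x - mu)^T Sigma^{-1} (x - mu) one row (observation) at a time
  apply(mean_subtracted_matrix, 1, function(centered_row) {
    drop(t(centered_row) %*% sigma_inverse %*% centered_row)
  })
}
```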
Let’s look closely at this function to understand what’s going on. The function is called `mahalanobis_distance()` and has the following arguments: `X_matrix`, `mean_vector`, and `variance_covariance_matrix`. The first argument, `X_matrix`, corresponds to the data matrix \(\textbf{X}\). The second argument, `mean_vector`, is a vector containing the column means of the variables, \(\vec{\mu}\). The last argument, `variance_covariance_matrix`, is the variance-covariance matrix of the variables, \(\Sigma\).
The goal of the function is to ‘vectorize’ the calculation so that it runs quickly and efficiently, leaving less room for error in the process. Now let us take a look at what is happening in the first few lines.
The `# Helper variables` section is where some transformation is done to \(\vec{\mu}\). Originally, it is just a \(p \times 1\) vector of all the column means in the dataset. The `one_vector` variable is a vector of \(n\) ones, denoted \(1_n\), where \(n\) is the number of observations in \(\textbf{X}\). The product \(1_n \vec{\mu}^\top\) is a matrix of means, in which the same row, \(\vec{\mu}^\top = [\mu_1, \mu_2, \cdots, \mu_p]\), is repeated \(n\) times. Subtracting this matrix of means from \(\textbf{X}\) performs the \((\vec{x} - \vec{\mu})\) calculation for all of the rows at once. This is preferable to using a for loop to run the same calculation \(n\) times, which is slower and leads to less efficient code. Once this mean-subtracted matrix (`mean_subtracted_matrix`) is created, the `apply` function in `R` can be used to loop through each row (an eventually necessary step) and finish calculating \((\vec{x}-\vec{\mu})^\top\Sigma^{-1}(\vec{x}-\vec{\mu})\) for each observation.
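As a quick check on a sketch like the one above, base `R` also ships a built-in `mahalanobis()` function in the `stats` package that returns the same squared distances, so the two can be compared on some made-up data:

```r
set.seed(42)
X  <- matrix(rnorm(50), nrow = 25, ncol = 2)   # made-up data: 25 observations, 2 variables
mu <- colMeans(X)                              # column means, mu
S  <- cov(X)                                   # variance-covariance matrix, Sigma

d2_manual  <- mahalanobis_distance(X, mu, S)   # the vectorized sketch from above
d2_builtin <- mahalanobis(X, center = mu, cov = S)

all.equal(as.numeric(d2_manual), as.numeric(d2_builtin))
```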
I hope this post has made the notation clear and that you are now able to follow along with the math that appears in the different lessons. If there is any confusion, don’t hesitate to contact me through email.