Bayesian multivariate linear regression

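In statistics, Bayesian multivariate linear regression is a Bayesian approach to multivariate linear regression, i.e. linear regression where the predicted outcome is a vector of correlated random variables rather than a single scalar random variable. A more general treatment of this approach can be found in the article on the MMSE estimator.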

Details

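Consider a regression problem where the dependent variable to be predicted is not a single real-valued scalar but an m-length vector of correlated real numbers. As in the standard regression setup, there are n observations, where each observation i consists of k − 1 explanatory variables, grouped into a vector $\mathbf {x} _{i}$ of length k (where a dummy variable with a value of 1 has been added to allow for an intercept coefficient). This can be viewed as a set of m related regression problems for each observation i: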

{\begin{aligned}y_{i,1}&=\mathbf {x} _{i}^{\mathsf {T}}{\boldsymbol {\beta }}_{1}+\epsilon _{i,1}\\&\;\;\vdots \\y_{i,m}&=\mathbf {x} _{i}^{\mathsf {T}}{\boldsymbol {\beta }}_{m}+\epsilon _{i,m}\end{aligned}}

where the errors $\{\epsilon _{i,1},\ldots ,\epsilon _{i,m}\}$ are all correlated. Equivalently, it can be viewed as a single regression problem where the outcome is a row vector $\mathbf {y} _{i}^{\mathsf {T}}$ and the regression coefficient vectors are stacked next to each other, as follows:

\mathbf {y} _{i}^{\mathsf {T}}=\mathbf {x} _{i}^{\mathsf {T}}\mathbf {B} +{\boldsymbol {\epsilon }}_{i}^{\mathsf {T}}.

The coefficient matrix B is a $k\times m$ matrix where the coefficient vectors ${\boldsymbol {\beta }}_{1},\ldots ,{\boldsymbol {\beta }}_{m}$ for each regression problem are stacked horizontally:

\mathbf {B} ={\begin{bmatrix}{\begin{pmatrix}\\{\boldsymbol {\beta }}_{1}\\\\\end{pmatrix}}\cdots {\begin{pmatrix}\\{\boldsymbol {\beta }}_{m}\\\\\end{pmatrix}}\end{bmatrix}}={\begin{bmatrix}{\begin{pmatrix}\beta _{1,1}\\\vdots \\\beta _{k,1}\end{pmatrix}}\cdots {\begin{pmatrix}\beta _{1,m}\\\vdots \\\beta _{k,m}\end{pmatrix}}\end{bmatrix}}.

The noise vector ${\boldsymbol {\epsilon }}_{i}$ for each observation i is jointly normal, so that the outcomes for a given observation are correlated:

{\boldsymbol {\epsilon }}_{i}\sim N(0,{\boldsymbol {\Sigma }}_{\epsilon }).

We can write the entire regression problem in matrix form as:

\mathbf {Y} =\mathbf {X} \mathbf {B} +\mathbf {E} ,

where $\mathbf {Y}$ and $\mathbf {E}$ are $n\times m$ matrices. The design matrix $\mathbf {X}$ is an $n\times k$ matrix with the observations stacked vertically, as in the standard linear regression setup:

\mathbf {X} ={\begin{bmatrix}\mathbf {x} _{1}^{\mathsf {T}}\\\mathbf {x} _{2}^{\mathsf {T}}\\\vdots \\\mathbf {x} _{n}^{\mathsf {T}}\end{bmatrix}}={\begin{bmatrix}x_{1,1}&\cdots &x_{1,k}\\x_{2,1}&\cdots &x_{2,k}\\\vdots &\ddots &\vdots \\x_{n,1}&\cdots &x_{n,k}\end{bmatrix}}.
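To make the dimensions concrete, the following is a minimal sketch that simulates data from the matrix-form model using NumPy. The function name `simulate`, the sizes, the seed, and the particular way the error covariance is constructed are all illustrative assumptions, not part of the model definition.

```python
import numpy as np

def simulate(n=200, k=3, m=2, seed=0):
    """Draw (X, B, Sigma, E, Y) from Y = X B + E (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Design matrix: the first column is the dummy variable fixed at 1,
    # the remaining k - 1 columns are explanatory variables.
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
    # Coefficient matrix: column j holds beta_j for the j-th outcome.
    B = rng.normal(size=(k, m))
    # An arbitrary symmetric positive-definite error covariance (assumption).
    A = rng.normal(size=(m, m))
    Sigma = A @ A.T + m * np.eye(m)
    # Each row of E is one draw of epsilon_i ~ N(0, Sigma).
    E = rng.multivariate_normal(np.zeros(m), Sigma, size=n)
    Y = X @ B + E
    return X, B, Sigma, E, Y

X, B, Sigma, E, Y = simulate()
print(X.shape, B.shape, E.shape, Y.shape)
```

Rows of `E` are independent across observations, while the m entries within a row are correlated through `Sigma`, matching the setup described above.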

The classical, frequentist linear least squares solution is to simply estimate the matrix of regression coefficients ${\hat {\mathbf {B} }}$ using the Moore-Penrose pseudoinverse:

{\hat {\mathbf {B} }}=(\mathbf {X} ^{\mathsf {T}}\mathbf {X} )^{-1}\mathbf {X} ^{\mathsf {T}}\mathbf {Y} .
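As a rough illustration, the least squares estimate can be computed in NumPy in several equivalent ways when the design matrix has full column rank. The data below are synthetic and, for simplicity, use independent noise; the variable names are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, m = 100, 3, 2

# Synthetic data (illustrative only): intercept column plus k - 1 regressors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
B_true = rng.normal(size=(k, m))
Y = X @ B_true + 0.1 * rng.normal(size=(n, m))

# Normal equations: solve (X^T X) B = X^T Y without forming the inverse.
B_hat_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# The same estimate via the Moore-Penrose pseudoinverse of X.
B_hat_pinv = np.linalg.pinv(X) @ Y

# The same estimate via a least squares solver, which handles all
# m outcome columns at once.
B_hat_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(B_hat_normal, B_hat_pinv), np.allclose(B_hat_normal, B_hat_lstsq))
```

In practice a solver such as `np.linalg.lstsq` is usually preferred over explicitly inverting $\mathbf {X} ^{\mathsf {T}}\mathbf {X}$, since it is better behaved when the columns of the design matrix are nearly collinear.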

To obtain the Bayesian solution, we need to specify the conditional likelihood and then find the appropriate conjugate prior. As with the univariate case of linear Bayesian regression, we will find that we can specify a natural conditional conjugate prior (which is scale dependent).

Let us write our conditional likelihood as^[1]

\rho (\mathbf {E} |{\boldsymbol {\Sigma }}_{\epsilon })\propto |{\boldsymbol {\Sigma }}_{\epsilon }|^{-n/2}\exp \left(-{\tfrac {1}{2}}\operatorname {tr} \left(\mathbf {E} ^{\mathsf {T}}\mathbf {E} {\boldsymbol {\Sigma }}_{\epsilon }^{-1}\right)\right).

Writing the error $\mathbf {E}$ in terms of $\mathbf {Y}$, $\mathbf {X}$, and $\mathbf {B}$ yields

\rho (\mathbf {Y} |\mathbf {X} ,\mathbf {B} ,{\boldsymbol {\Sigma }}_{\epsilon })\propto |{\boldsymbol {\Sigma }}_{\epsilon }|^{-n/2}\exp \left(-{\tfrac {1}{2}}\operatorname {tr} \left((\mathbf {Y} -\mathbf {X} \mathbf {B} )^{\mathsf {T}}(\mathbf {Y} -\mathbf {X} \mathbf {B} ){\boldsymbol {\Sigma }}_{\epsilon }^{-1}\right)\right).
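The trace form of the likelihood is a compact way of summing one multivariate normal quadratic form per observation. The sketch below evaluates the log of the likelihood up to its additive constant in both ways and compares them; the function names and the example covariance are assumptions made for illustration.

```python
import numpy as np

def log_likelihood_kernel(Y, X, B, Sigma):
    """Trace form: -n/2 log|Sigma| - 1/2 tr(E^T E Sigma^{-1}), constant dropped."""
    n = Y.shape[0]
    E = Y - X @ B
    _, logdet = np.linalg.slogdet(Sigma)
    # tr(Sigma^{-1} E^T E) equals tr(E^T E Sigma^{-1}) by cyclicity of the trace.
    trace_term = np.trace(np.linalg.solve(Sigma, E.T @ E))
    return -0.5 * n * logdet - 0.5 * trace_term

def log_likelihood_rowwise(Y, X, B, Sigma):
    """Same quantity as a sum over observations of per-row normal kernels."""
    E = Y - X @ B
    _, logdet = np.linalg.slogdet(Sigma)
    # quad[i] = eps_i^T Sigma^{-1} eps_i for each observation i.
    quad = np.einsum("ij,ij->i", E, np.linalg.solve(Sigma, E.T).T)
    return np.sum(-0.5 * logdet - 0.5 * quad)

rng = np.random.default_rng(2)
n, k, m = 50, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
B = rng.normal(size=(k, m))
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])  # example covariance (assumption)
Y = X @ B + rng.multivariate_normal(np.zeros(m), Sigma, size=n)

ll_trace = log_likelihood_kernel(Y, X, B, Sigma)
ll_rows = log_likelihood_rowwise(Y, X, B, Sigma)
print(np.isclose(ll_trace, ll_rows))
```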

We seek a natural conjugate prior: a joint density $\rho (\mathbf {B} ,{\boldsymbol {\Sigma }}_{\epsilon })$ which is of the same functional form as the likelihood. Since the likelihood is quadratic in $\mathbf {B}$, we rewrite the likelihood so it is normal in $(\mathbf {B} -{\hat {\mathbf {B} }})$ (the deviation from the classical sample estimate).
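Rewriting the likelihood in terms of $(\mathbf {B} -{\hat {\mathbf {B} }})$ rests on the fact that the least squares residuals are orthogonal to the columns of $\mathbf {X}$, so the cross terms vanish when the quadratic form is expanded around ${\hat {\mathbf {B} }}$. The following sketch checks the resulting identity numerically on random data; the symbol `S` for the residual matrix $\mathbf {Y} -\mathbf {X} {\hat {\mathbf {B} }}$ is notation assumed here for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, m = 40, 3, 2

X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
Y = rng.normal(size=(n, m))
B = rng.normal(size=(k, m))          # an arbitrary coefficient matrix

# Classical least squares estimate and its residual matrix.
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)
S = Y - X @ B_hat

# Left-hand side: the quadratic form that appears inside the trace.
lhs = (Y - X @ B).T @ (Y - X @ B)

# Right-hand side: residual part plus a part quadratic in (B - B_hat).
D = B - B_hat
rhs = S.T @ S + D.T @ (X.T @ X) @ D

print(np.allclose(lhs, rhs))
```

The check passes for any choice of `B` because `X.T @ S` is zero up to rounding error, which is exactly the orthogonality property mentioned above.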

Using the same technique as with Bayesian linear regression, we decompose the exponential term using a matrix form of the sum-of-squares technique. Here, however, we will also need tools from matrix differential calculus (the Kronecker product and vectorization transformations).
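For readers less familiar with these tools, the sketch below numerically checks two standard identities linking vectorization and the Kronecker product: $\operatorname {vec} (\mathbf {X} \mathbf {B} )=(\mathbf {I} _{m}\otimes \mathbf {X} )\operatorname {vec} (\mathbf {B} )$, and the rewriting of a trace of a matrix quadratic form as an ordinary quadratic form in a vectorized matrix. The helper `vec`, the matrix `D`, and the example covariance are assumptions made for illustration; `vec` stacks columns, which is why column-major order is used.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into a single vector (column-major order)."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(4)
n, k, m = 30, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
B = rng.normal(size=(k, m))
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])  # example covariance (assumption)
Sigma_inv = np.linalg.inv(Sigma)

# Identity 1: vec(X B) = (I_m kron X) vec(B).
lhs1 = vec(X @ B)
rhs1 = np.kron(np.eye(m), X) @ vec(B)

# Identity 2: tr(D^T X^T X D Sigma^{-1})
#           = vec(D)^T (Sigma^{-1} kron X^T X) vec(D),
# which holds because Sigma^{-1} is symmetric.
D = rng.normal(size=(k, m))
lhs2 = np.trace(D.T @ X.T @ X @ D @ Sigma_inv)
rhs2 = vec(D) @ np.kron(Sigma_inv, X.T @ X) @ vec(D)

print(np.allclose(lhs1, rhs1), np.isclose(lhs2, rhs2))
```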

First, let us apply the sum-of-squares technique to obtain a new expression for the likelihood: