I'm studying backpropagation and am trying to wrap my head around the idea of a derivative with respect to a matrix. Suppose we have a function $f: \mathbb{R}^m \to \mathbb{R}$. Then we can define $$\frac{\partial f}{\partial \mathbf{x}} = \nabla f^\top = \begin{pmatrix} \frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_m} \end{pmatrix} \in \mathbb{R}^{1 \times m}$$ and we can write the linearization of $f$ at $\mathbf{x} = \mathbf{x}_0$ as $$ L_{\mathbf{x}_0}(\mathbf{x}) = f(\mathbf{x}_0) + \frac{\partial f}{\partial \mathbf{x}}(\mathbf{x} - \mathbf{x}_0),$$ where $(\mathbf{x} - \mathbf{x}_0)$ is an $m\times1$ column vector.

I would expect that we could construct a similar linearization for a matrix function $g : \mathbb{R}^{p \times q} \to \mathbb{R}$. We have $$ {\frac {\partial g}{\partial \mathbf {X} }}={\begin{bmatrix}{\frac {\partial g}{\partial x_{11}}}&{\frac {\partial g}{\partial x_{21}}}&\cdots &{\frac {\partial g}{\partial x_{p1}}}\\{\frac {\partial g}{\partial x_{12}}}&{\frac {\partial g}{\partial x_{22}}}&\cdots &{\frac {\partial g}{\partial x_{p2}}}\\\vdots &\vdots &\ddots &\vdots \\{\frac {\partial g}{\partial x_{1q}}}&{\frac {\partial g}{\partial x_{2q}}}&\cdots &{\frac {\partial g}{\partial x_{pq}}}\\\end{bmatrix}} \in \mathbb{R}^{q \times p},$$ but if we attempt to write a linearization for $g$ around $\mathbf{X}_0$, we get $$L_{\mathbf{X}_0}(\mathbf{X}) = g(\mathbf{X}_0) + \frac {\partial g}{\partial \mathbf {X}}(\mathbf{X} - \mathbf{X}_0),$$ which doesn't make sense, since $\frac {\partial g}{\partial \mathbf {X} }(\mathbf{X} - \mathbf{X}_0)$ is a $q \times q$ matrix rather than a scalar.

Why is the derivative with respect to a matrix defined this way? I'm struggling to find a good resource that explains this clearly.
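For concreteness, here is a quick numerical check of the vector case, which does behave the way I expect (a minimal sketch in NumPy; the particular $f$, $\mathbf{x}_0$, and displacement are made up for illustration):

```python
import numpy as np

# A made-up scalar-valued function of a vector, f: R^3 -> R.
def f(x):
    return np.sin(x[0]) + x[1] * x[2]

def grad_f(x):
    # Row of partials (df/dx1, df/dx2, df/dx3), computed by hand.
    return np.array([np.cos(x[0]), x[2], x[1]])

x0 = np.array([0.5, 1.0, -2.0])
dx = 1e-4 * np.array([1.0, -1.0, 2.0])  # small displacement
x = x0 + dx

# Linearization: L(x) = f(x0) + (df/dx)(x - x0), a (1 x m) row times an (m x 1) column.
L = f(x0) + grad_f(x0) @ (x - x0)
print(f(x), L)  # agree to first order in dx
```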
- I'm not an expert on this matrix stuff, but your $\frac{\partial g}{\partial\mathbf{X}}$ is actually a $q\times p$ matrix. Maybe it makes more sense then. – Stefan, Nov 22, 2024
- A function $f:\mathbb R^{p\times q}\to\mathbb R$ has as its derivative the gradient. A matrix derivative arises for a function $g:\mathbb R^p\to\mathbb R^q$: you then have $q$ real-valued functions, each of $p$ variables. – John Douma, Nov 23, 2024
- @Stefan You are right, we should get a $q \times q$ matrix, but that still doesn't make sense in the linearization. – John Hippisley, Nov 23, 2024
- @JohnHippisley You are formatting a vector with $pq$ elements as a matrix. In this case, when you multiply $\frac{dg}{dX}$ by $(X-X_0)$ you are multiplying two $p\times q$ matrices component by component, because your function is equivalent to a function $f:\mathbb R^{pq}\to\mathbb R$. Using matrix multiplication doesn't make sense in your case; you are taking the dot product of two vectors, formatted as matrices. – John Douma, Nov 24, 2024
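A minimal numerical sketch of the point in the comment above, assuming NumPy and a made-up $g(X) = \sum_{ij} x_{ij}^2$: flattening both matrices and taking an ordinary dot product recovers the first-order change in $g$.

```python
import numpy as np

p, q = 2, 3
rng = np.random.default_rng(0)
X0 = rng.standard_normal((p, q))
dX = 1e-4 * rng.standard_normal((p, q))

# Made-up g: R^{p x q} -> R, equivalent to a function of pq variables.
def g(X):
    return np.sum(X ** 2)

dgdX = 2 * X0  # matrix of partials dg/dx_ij, same p x q shape as X

# The comment's point: flatten both matrices, then take an ordinary dot product.
dg_flat = dgdX.ravel() @ dX.ravel()
print(g(X0 + dX) - g(X0), dg_flat)  # agree to first order
```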
2 Answers
For a vector function $\def\R{\mathbb{R}}f:\R^n \to \R$, the linearization of $f$ at $x=x_0$ uses $$\def\qty#1{\left(#1\right)} \qty{\frac{\partial}{\partial x}}^\top = \begin{bmatrix} \frac{\partial}{\partial x^1} \\ \vdots \\ \frac{\partial}{\partial x^n} \end{bmatrix}^\top = \begin{bmatrix} \frac{\partial}{\partial x^1} & \cdots & \frac{\partial}{\partial x^n} \end{bmatrix}, $$ which is an operator that acts as a row vector. Then $$ df = \underbrace{\qty{\frac{\partial}{\partial x}}^\top f}_{1 \times n} \;\underbrace{dx}_{n \times 1}. $$

For a matrix function $g:\R^{m\times n} \to \R$, the linearization of $g$ at $X=X_0$ uses $$ \qty{\frac{\partial}{\partial X}}^\top = \begin{bmatrix} \frac{\partial}{\partial {X^1}_1} & \cdots & \frac{\partial}{\partial {X^1}_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial {X^m}_1} & \cdots & \frac{\partial}{\partial {X^m}_n} \end{bmatrix}^\top = \begin{bmatrix} \frac{\partial}{\partial {X^1}_1} & \cdots & \frac{\partial}{\partial {X^m}_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial {X^1}_n} & \cdots & \frac{\partial}{\partial {X^m}_n} \end{bmatrix}, $$ which is an operator that acts as an $n \times m$ matrix (recall $X$ is an $m \times n$ matrix). Then $$ \def\tr{\operatorname{tr}} dg = \tr\qty{ \underbrace{ \begin{bmatrix} \frac{\partial}{\partial {X^1}_1} & \cdots & \frac{\partial}{\partial {X^m}_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial {X^1}_n} & \cdots & \frac{\partial}{\partial {X^m}_n} \end{bmatrix} g}_{n \times m} \; \; \; \underbrace{dX}_{m\times n}}. $$
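As a sanity check of the trace formula, here is a small numerical sketch (the concrete choice $g(X) = \sum_{ij} x_{ij}^2$ is my own, for illustration only):

```python
import numpy as np

m, n = 3, 2
rng = np.random.default_rng(1)
X0 = rng.standard_normal((m, n))
dX = 1e-4 * rng.standard_normal((m, n))

# Made-up g: R^{m x n} -> R with easy partials: dg/dx_ij = 2 x_ij.
def g(X):
    return np.sum(X ** 2)

# In the transposed layout above, (d/dX)^T g is the n x m matrix whose
# (j, i) entry is dg/dX^i_j; for this g that is 2 * X0.T.
grad_T = 2 * X0.T

dg = np.trace(grad_T @ dX)     # tr( (n x m) @ (m x n) ), a scalar
print(g(X0 + dX) - g(X0), dg)  # agree to first order
```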
- What is the motivation to define it with the trace? It seems like it'd be handy, since it's linear, but I don't understand why it pops up. – John Hippisley, Nov 23, 2024
- For $\def\R{\mathbb R} f:\R^n \to \R$ we need a linear map that maps a vector to a scalar. For $g:\R^{m \times n } \to \R$ we need a linear map that maps a matrix to a scalar. So in both cases the linearization is a map in the dual space. For $\R^n$ this is the vector space of linear functionals that map vectors to scalars; the inner product between two vectors can play this role. For $\R^{m \times n}$ this is the vector space of linear functionals that map matrices to scalars, and the inner product between two matrices $X,Y$ is ${\rm tr} (X^\top Y)$. – Ted Black, Nov 23, 2024
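A one-line check of the matrix inner product mentioned in the comment above (a sketch assuming NumPy; the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 2))
Y = rng.standard_normal((3, 2))

# Frobenius inner product two ways: trace form and elementwise sum.
print(np.trace(X.T @ Y))  # tr(X^T Y)
print(np.sum(X * Y))      # sum_ij X_ij * Y_ij  (same value)
```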
In fact, if $f : \Bbb R^n \to \Bbb R$ is a function, then its derivative at a point is itself a linear map from $\Bbb R^n$ to $\Bbb R$; it just happens that you can write this map in the form of a dot product, $v \mapsto v \cdot (\partial f)$.
So in the case $g : \Bbb R^{p\times q} \to \Bbb R$, the derivative is likewise a linear map $\partial g : \Bbb R^{p \times q} \to \Bbb R$, which sends $(v_{ij})_{1\leq i \leq p,\, 1 \leq j \leq q}$ to $\sum_{i,j} \frac{\partial g}{\partial x_{ij}}v_{ij}$. But in this case you can't express it as a single ordinary matrix product; it is the entrywise pairing of the matrix of partials with $v$.
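Here is a small numerical sketch of this linear functional (assuming NumPy; the particular $g(X)=\sum_{ij}\sin(x_{ij})$ is made up so the partials are easy to write down):

```python
import numpy as np

p, q = 2, 3
rng = np.random.default_rng(3)
X = rng.standard_normal((p, q))
V = rng.standard_normal((p, q))  # direction v = (v_ij)

# Made-up g: R^{p x q} -> R with partials dg/dx_ij = cos(x_ij).
def g(X):
    return np.sum(np.sin(X))

partials = np.cos(X)

# The derivative as a linear functional: V |-> sum_ij (dg/dx_ij) v_ij ...
dg_of_V = np.sum(partials * V)

# ... which matches the directional derivative from finite differences.
eps = 1e-6
fd = (g(X + eps * V) - g(X)) / eps
print(dg_of_V, fd)
```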