
I'm studying backpropagation and am trying to wrap my head around the idea of a derivative with respect to a matrix. Suppose we have a scalar-valued function $f: \mathbb{R}^m \to \mathbb{R}$. Then we can define $$\frac{\partial f}{\partial \mathbf{x}} = \nabla f^T = \begin{pmatrix} \frac{\partial f}{\partial {x_1}} & \ldots & \frac{\partial f}{\partial {x_m}} \end{pmatrix} \in \mathbb{R}^{1 \times m},$$ and we can write the linearization of $f$ at $\mathbf{x} = \mathbf{x_0}$ as $$ L_{\mathbf{x_0}}(\mathbf{x}) = f(\mathbf{x_0}) + \frac{\partial f}{\partial \mathbf{x}}(\mathbf{x} - \mathbf{x_0}),$$ where $(\mathbf{x} - \mathbf{x_0})$ is an $m\times1$ column vector. I would expect that we could construct a similar linearization for a matrix function $g : \mathbb{R}^{p \times q} \to \mathbb{R}$. We have $$ { {\frac {\partial g}{\partial \mathbf {X} }}={\begin{bmatrix}{\frac {\partial g}{\partial x_{11}}}&{\frac {\partial g}{\partial x_{21}}}&\cdots &{\frac {\partial g}{\partial x_{p1}}}\\{\frac {\partial g}{\partial x_{12}}}&{\frac {\partial g}{\partial x_{22}}}&\cdots &{\frac {\partial g}{\partial x_{p2}}}\\\vdots &\vdots &\ddots &\vdots \\{\frac {\partial g}{\partial x_{1q}}}&{\frac {\partial g}{\partial x_{2q}}}&\cdots &{\frac {\partial g}{\partial x_{pq}}}\\\end{bmatrix}}} \in \mathbb{R}^{q \times p},$$ but if we attempt to write a linearization for $g$ around $\mathbf{X_0}$, we get $$L_{\mathbf{X_0}}(\mathbf{X}) = g(\mathbf{X_0}) + \frac {\partial g}{\partial \mathbf {X}}(\mathbf{X} - \mathbf{X_0}),$$ which doesn't make sense, since $\frac {\partial g}{\partial \mathbf {X} }(\mathbf{X} - \mathbf{X_0})$ is a $q \times q$ matrix rather than a scalar. Why is the derivative with respect to a matrix defined this way? I'm struggling to find a good resource that explains this clearly.
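A quick numerical sketch of where the matrix-product formula goes wrong (my own example, using an assumed test function $g(X) = \sum_{ij} x_{ij}^2$, not anything from a reference): the first-order change in $g$ is the entrywise sum $\sum_{ij} \frac{\partial g}{\partial x_{ij}}\,\Delta x_{ij}$, not an ordinary matrix product of the derivative with $\Delta X$.

```python
import numpy as np

# Assumed test function: g(X) = sum of squares of the entries of X,
# so dg/dx_ij = 2 * x_ij and the gradient has the same shape as X.
def g(X):
    return np.sum(X ** 2)

rng = np.random.default_rng(0)
p, q = 3, 2
X0 = rng.standard_normal((p, q))
dX = 1e-6 * rng.standard_normal((p, q))

grad = 2 * X0                    # p x q, entry (i, j) = dg/dx_ij at X0

# The first-order change is the entrywise sum, i.e. a dot product of the
# flattened arrays -- equivalently a trace -- NOT a matrix product grad @ dX.
linear_term = np.sum(grad * dX)  # equals np.trace(grad.T @ dX)

actual_change = g(X0 + dX) - g(X0)
# The gap is second order in dX, so it is tiny for a small perturbation.
print(abs(actual_change - linear_term))
```

The point is that the linearization pairs the gradient with the increment entry by entry; any matrix layout of the partials is just bookkeeping for that pairing.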

4 Comments

  • I'm not an expert on this matrix stuff, but your $\frac{\partial g}{\partial\mathbf{X}}$ is actually a $q\times p$ matrix. Maybe it makes more sense then. – Commented Nov 22, 2024 at 23:47
  • A function $f:\mathbb R^{p\times q}\to\mathbb R$ has as its derivative the gradient. A matrix derivative arises for a function $g:\mathbb R^p\to\mathbb R^q$: you then have $q$ real-valued functions, each of $p$ variables. – Commented Nov 23, 2024 at 0:01
  • @Stefan You are right, we should get a $q \times q$ matrix, but that still doesn't make sense in the linearization. – Commented Nov 23, 2024 at 14:44
  • @JohnHippisley You are formatting a vector with $pq$ elements as a matrix. In this case, when you multiply $\frac{dg}{dX}$ by $(X-X_0)$, you should multiply the two $p\times q$ arrays component by component, because your function is equivalent to a function $f:\mathbb R^{pq}\to\mathbb R$: you are taking the dot product of two vectors formatted as matrices. Ordinary matrix multiplication doesn't make sense in your case. – Commented Nov 24, 2024 at 12:17

2 Answers

For a function $\def\R{\mathbb{R}}f:\R^n \to \R$ of a vector, the linearization of $f$ at $x=x_0$ uses $$\def\qty#1{\left(#1\right)} \qty{\frac{\partial}{\partial x}}^\top = \begin{bmatrix} \frac{\partial}{\partial x^1} \\ \vdots \\ \frac{\partial}{\partial x^n} \end{bmatrix}^\top = \begin{bmatrix} \frac{\partial}{\partial x^1} & \cdots & \frac{\partial}{\partial x^n} \end{bmatrix}, $$ which is an operator that acts as a row vector. Then $$ df = \underbrace{\qty{\frac{\partial}{\partial x}}^\top f}_{1 \times n} \;\underbrace{dx}_{n \times 1} . $$ For a function $g:\R^{m\times n} \to \R$ of a matrix, the linearization of $g$ at $X=X_0$ uses $$ \qty{\frac{\partial}{\partial X}}^\top = \begin{bmatrix} \frac{\partial}{\partial {X^1}_1} & \cdots & \frac{\partial}{\partial {X^1}_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial {X^m}_1} & \cdots & \frac{\partial}{\partial {X^m}_n} \end{bmatrix}^\top = \begin{bmatrix} \frac{\partial}{\partial {X^1}_1} & \cdots & \frac{\partial}{\partial {X^m}_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial {X^1}_n} & \cdots & \frac{\partial}{\partial {X^m}_n} \end{bmatrix}, $$ which is an operator that acts as an $n \times m$ matrix (since $X$ is an $m \times n$ matrix). Then $$ \def\tr{\operatorname{tr}} dg = \tr\qty{ \underbrace{ \begin{bmatrix} \frac{\partial}{\partial {X^1}_1} & \cdots & \frac{\partial}{\partial {X^m}_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial {X^1}_n} & \cdots & \frac{\partial}{\partial {X^m}_n} \end{bmatrix} g}_{n \times m} \; \; \; \underbrace{dX}_{m\times n}} . $$
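As a numerical sanity check of the trace formula above (my own sketch, with an assumed linear test function $g(X) = \operatorname{tr}(AX)$, whose gradient matrix $\partial g/\partial X_{ij} = A_{ji}$, i.e. $A^\top$, is known in closed form):

```python
import numpy as np

# Assumed test function: g(X) = tr(A X) with X of shape m x n and A of
# shape n x m. Since g is linear in X, its own linearization is exact,
# so dg = tr((dg/dX)^T dX) must match g(X0 + dX) - g(X0) to rounding error.
rng = np.random.default_rng(1)
m, n = 4, 3
A = rng.standard_normal((n, m))
X0 = rng.standard_normal((m, n))
dX = rng.standard_normal((m, n))   # need not be small: g is linear

def g(X):
    return np.trace(A @ X)

grad = A.T                          # m x n, entry (i, j) = dg/dX_ij
dg = np.trace(grad.T @ dX)          # the n x m operator acting on the m x n dX

print(np.isclose(g(X0 + dX) - g(X0), dg))   # True
```

The trace contracts the $n \times m$ operator against the $m \times n$ increment down to a scalar, which is exactly what a linearization of a scalar-valued function must produce.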

2 Comments

  • What is the motivation to define it with the trace? It seems like it'd be handy, since it's linear, but I don't understand why it pops up. – Commented Nov 23, 2024 at 14:43
  • For $\def\R{\mathbb R} f:\R^n \to \R$ we need a linear map that takes a vector to a scalar; for $g:\R^{m \times n } \to \R$ we need a linear map that takes a matrix to a scalar. So in both cases the linearization is a map in the dual space. For $\R^n$ this is the vector space of linear functionals that map vectors to scalars, and the inner product with a fixed vector can play this role. For $\R^{m \times n}$ it is the vector space of linear functionals that map matrices to scalars, and the inner product between two matrices $X,Y$ is ${\rm tr} (X^\top Y)$. – Commented Nov 23, 2024 at 15:03

In fact, if $f : \Bbb R^n \to \Bbb R$ is a function, then its derivative is a linear function, also from $\Bbb R^n$ to $\Bbb R$. It's just that you can write it in the form of a dot product, $v \mapsto v \cdot (\partial f)$.

So in the case $g : \Bbb R^{p\times q} \to \Bbb R$, the derivative is again a linear function $\partial g : \Bbb R^{p \times q} \to \Bbb R$, mapping $(v_{ij})_{1\leq i \leq p, 1 \leq j \leq q}$ to $\sum_{i,j} \frac{\partial g}{\partial x_{ij}}v_{ij}$. But in this case you can't express it as a single ordinary matrix product (though, as in the other answer, you can write it with a trace, $\sum_{i,j} \frac{\partial g}{\partial x_{ij}}v_{ij} = \operatorname{tr}\big((\partial g)^\top V\big)$).
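This linear-functional view can be checked numerically (my own sketch, using an assumed test function $g(X) = \sum_{ij}\sin x_{ij}$, so $\partial g/\partial x_{ij} = \cos x_{ij}$): the value of the functional on a direction $V$ should match the directional derivative $\frac{d}{dt}\,g(X_0 + tV)\big|_{t=0}$.

```python
import numpy as np

# Assumed test function: g(X) = sum of sin of every entry of X,
# so dg/dx_ij = cos(x_ij) at X0.
def g(X):
    return np.sum(np.sin(X))

rng = np.random.default_rng(2)
X0 = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))    # an arbitrary direction in R^{3x4}

grad = np.cos(X0)                  # dg/dx_ij at X0, same shape as X0

# The derivative as a linear functional: sum_ij (dg/dx_ij) v_ij,
# i.e. a dot product of the flattened gradient and direction.
functional = grad.ravel() @ V.ravel()

# Compare to a central finite-difference directional derivative.
t = 1e-6
directional = (g(X0 + t * V) - g(X0 - t * V)) / (2 * t)
print(abs(functional - directional))   # small (second order in t)
```

So the derivative is simply a dot product in $\Bbb R^{pq}$; the $p \times q$ matrix shape is only a way of indexing the $pq$ coordinates.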

