chain rules matrix derivatives


Postby >-) » Sun Dec 25, 2016 11:27 pm UTC

I have trouble taking derivatives, gradients, and Jacobians of matrix expressions. I don't have any "formal education" in this area, and when I search, I find a lot of identities, which don't help that much.

My (naive) approach to finding the gradient ∇_x f(x) of an expression is:
1. take the partial derivative with respect to a single component x_i
2. rewrite f(x), which was previously in matrix form, as a bunch of sums
3. take the derivative normally
4. take an educated guess as to what all the other components will be and write the gradient down

I feel like I don't have any intuition or "sense" for how to take these derivatives, so I end up breaking them down into something I know and doing that. It's quite inefficient and prone to mistakes. It also becomes impossible with more complex expressions, so I'd like to use some sort of chain rule or product rule, but I can't seem to find anything on this. Consider the following problem:

∇_x ||Ax||^2
I know this is
∇_x (Ax)^T (Ax)
One approach to this is to expand the expression, but I want to use the fact that ∇_w w^T w = 2w, so I have
(2Ax) ∇_x (Ax)
The ∇_x (Ax) comes from the "chain rule". However, as it's written right now, it's not correct. I had to rearrange it as
(∇_x (Ax))^T (2Ax)
before the math worked out.

Now there is an identity on Wikipedia's matrix calculus page that generalizes this (∇_x u(x)^T v(x)), and it does show that the ∇_x (Ax) term needs to go in front and be transposed, but I have no idea why. Can anyone explain this and/or point me to some resources?


Postby DavidSh » Mon Dec 26, 2016 2:44 pm UTC

As the Wikipedia page says, there are several competing notations. You will need to pick one and stick with it. You also need to keep the distinction between row vectors and column vectors clear.

If you use the notation where the derivative of Ax + b with respect to x is A, where x is a column vector, then it turns out that the derivative of x^T A with respect to x is A^T.

The derivative of a function from R^n to R^m in this notation is always an m by n matrix (m rows, n columns). This lets us write the chain rule. If f() maps from R^n to R^m, and g() maps from R^m to R^k, then ∇f is an m by n matrix, ∇g is a k by m matrix, g(f(x)) maps from R^n to R^k, and its derivative is the product ∇g ∇f (with ∇g evaluated at f(x)), a k by n matrix.
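The shape bookkeeping above can be checked numerically. What follows is a minimal sketch in plain Python, assuming two made-up example maps `f` (R^3 to R^2) and `g` (R^2 to R^1) that do not come from the thread, and a hypothetical helper `jacobian` that approximates the m by n derivative with central differences.

```python
# Numerical check of the chain rule in the "m by n" (Jacobian) convention.
# f and g are made-up example maps, chosen only for illustration.

def jacobian(func, x, h=1e-6):
    """Central-difference Jacobian: rows index outputs, columns index inputs."""
    m = len(func(x))
    J = [[0.0] * len(x) for _ in range(m)]
    for j in range(len(x)):
        xp = list(x); xp[j] += h
        xm = list(x); xm[j] -= h
        fp, fm = func(xp), func(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J

def matmul(P, Q):
    """Plain matrix product of two lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def f(x):  # R^3 -> R^2
    return [x[0] * x[1], x[1] + x[2] ** 2]

def g(w):  # R^2 -> R^1
    return [w[0] ** 2 + 3.0 * w[1]]

x0 = [1.0, 2.0, 3.0]
Jf = jacobian(f, x0)         # 2 by 3
Jg = jacobian(g, f(x0))      # 1 by 2, evaluated at f(x0)
J_chain = matmul(Jg, Jf)     # 1 by 3, the chain-rule product
J_direct = jacobian(lambda x: g(f(x)), x0)  # 1 by 3, differentiated directly

max_err = max(abs(a - b)
              for ra, rb in zip(J_chain, J_direct)
              for a, b in zip(ra, rb))
print(len(J_chain), len(J_chain[0]), max_err)
```

The order of the product matters: `matmul(Jg, Jf)` has compatible inner dimensions (1 by 2 times 2 by 3), while the reverse order does not, which is one way to see where each factor has to go.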

In this notation, the derivative of x^T x with respect to x is 2 x^T. You missed the transposition. (This is a function from R^n to R^1, so the derivative is a 1 by n matrix, i.e., a row vector.) This can be derived from the formulas above together with the product rule.
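One way to spell out that product-rule step, as a sketch in the same m by n convention: the symbols u, v, and the identity matrix I are introduced here for illustration and do not appear in the post.

```latex
% Product rule for a scalar u(x)^T v(x), with u, v : R^n -> R^m.
% Component j of the 1-by-n derivative:
\frac{\partial}{\partial x_j} \sum_i u_i v_i
  = \sum_i \left( u_i \frac{\partial v_i}{\partial x_j}
                + v_i \frac{\partial u_i}{\partial x_j} \right)
% Collected into matrices (each term is 1-by-m times m-by-n):
\nabla_x \left( u^T v \right) = u^T \, \nabla_x v + v^T \, \nabla_x u
% Special case u = v = x, where \nabla_x x = I:
\nabla_x \left( x^T x \right) = x^T I + x^T I = 2 x^T
```

Transposing the middle line gives the column-vector form (∇_x v)^T u + (∇_x u)^T v, which would explain why the derivative factor appears in front and transposed in the gradient-as-column convention.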

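Now you should be able to derive ∇_x (Ax)^T (Ax) as (∇_Ax (Ax)^T (Ax)) (∇_x Ax) = (2 (Ax)^T) (A) = 2 x^T A^T A. Transposing this row vector gives the column vector 2 A^T A x, which is the (∇_x (Ax))^T (2Ax) you arrived at.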
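A quick sanity check of this result, as a sketch in plain Python: the matrix `A` and vector `x` below are arbitrary example values, not taken from the thread, and the comparison uses central differences on ||Ax||^2.

```python
# Finite-difference check that the derivative of ||Ax||^2 is 2 x^T A^T A.
# A and x are arbitrary example values chosen for illustration.

A = [[1.0, 2.0, 0.0],
     [0.0, -1.0, 3.0]]   # 2 by 3
x = [0.5, -1.0, 2.0]

def matvec(M, v):
    """Matrix-vector product for a list of rows."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def sq_norm_Ax(v):
    """The scalar function ||Av||^2."""
    return sum(c * c for c in matvec(A, v))

# Closed form: the row vector 2 x^T A^T A has the same entries as the
# column vector 2 A^T (A x); only the orientation differs.
closed_form = [2.0 * c for c in matvec(transpose(A), matvec(A, x))]

# Central differences, one component of x at a time.
h = 1e-6
numeric = []
for j in range(len(x)):
    xp = list(x); xp[j] += h
    xm = list(x); xm[j] -= h
    numeric.append((sq_norm_Ax(xp) - sq_norm_Ax(xm)) / (2 * h))

max_err = max(abs(a - b) for a, b in zip(closed_form, numeric))
print(closed_form, max_err)
```

Since a flat Python list carries no row or column orientation, the check only confirms the entries; which way the vector is written down is exactly the notational choice discussed above.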

Second derivatives get more complicated.


Postby doogly » Tue Dec 27, 2016 4:44 am UTC

Praise be to Einstein notation.
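For readers who have not seen it, here is a sketch of how the same example might look in index (Einstein summation) notation, where repeated indices are summed; the index names i, j, k, l and the Kronecker delta are introduced here for illustration.

```latex
% ||Ax||^2 written with indices (sum over repeated i, j, k):
\|Ax\|^2 = A_{ij} x_j \, A_{ik} x_k
% Differentiate with respect to x_l, using \partial x_j / \partial x_l = \delta_{jl}:
\frac{\partial}{\partial x_l} \left( A_{ij} x_j \, A_{ik} x_k \right)
  = A_{ij} \delta_{jl} \, A_{ik} x_k + A_{ij} x_j \, A_{ik} \delta_{kl}
  = 2 \, A_{il} A_{ik} x_k
% Reading the indices back as matrices:
2 \, A_{il} A_{ik} x_k = 2 \, (A^T A x)_l
```

Every factor is an ordinary number in this form, so the usual product rule applies directly and the transposes only appear when the indices are read back as matrix products.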