Why does the function of a random variable X have a different probability than the inverse of the function in the PDF of X?

Question

There is probably a very easy answer to this but I'm just not getting it. Suppose $X$ is a random variable with PDF $f_X(x)$.

If $Y$ is a function of $X$ such that:

$$Y=g(X)$$ and $$g^{-1}(Y) = X$$ where $g(x)$ is differentiable and monotonic. The PDF of $Y$, $f_Y(y)$, is then

$$f_Y(y)= f_X(x) / g'(g^{-1}(y))$$

While this formula makes sense, can anyone provide some intuition as to why $$f_Y(y)\neq f_X(g^{-1}(y))?$$

Since $Y=y$ iff $X = g^{-1}(y)$, why would the probability of $X=g^{-1}(y)$ differ from the probability of $Y=y$?

For example, if $g(x) = x^2$ and $$f_X(x)=\frac{e^{-x^2/2}}{\sqrt{2\pi}},\forall x\in\Bbb R$$ then, for $y>0$, $$f_Y(y)=\frac{e^{-y/2}}{\sqrt{2y\pi}}$$

Why is it that $f_Y(Y=4) \neq f_X(X=2)+f_X(X=-2)$?

First time poster so any comments on question format and content would also be appreciated. Thanks.

yeah sorry i was having a little trouble with the exponents in mathjax — phntm
– phntm, Commented Feb 9, 2021 at 7:03
Fixed your Mathjax. Have a look. For getting $a^{b}$ in Mathjax, use the syntax `$a^{b}$`. — Shubham Johri
– Shubham Johri, Commented Feb 9, 2021 at 7:07
$g(X)=X^2$ is not bijective. $f_Y(y)=f_X(g^{-1}(y))$ is true when $X,Y$ are discrete random variables ($f_X(x)$ is PMF and denotes the probability that $X=x$ and similarly for $f_Y(y)$) and of course, $g$ bijective. But PDF $f_X(x)$ doesn't denote the probability of the event $X=x$ (which is zero). — Shubham Johri
– Shubham Johri, Commented Feb 9, 2021 at 7:08
Thanks so much! you've made it much more clear. Sorry fixed the bijective issue. I suppose I should've used a better example, but in general don't understand the intuition behind why a simple substitution into the original PDF wouldn't always work. — phntm
– phntm, Commented Feb 9, 2021 at 7:13

Mark D · Accepted Answer · 2021-02-09 09:15:46Z

I suppose the pithy answer to your question is that this is a consequence of change-of-variables in integral calculus: $$\int_{\color{gray}{a}}^{\color{gray}{b}} f(g(t))\,g^\prime(t)\,dt = \int_{\color{gray}{g(a)}}^{\color{gray}{g(b)}} f(u)\,du,$$ for a monotonically increasing function $g$, and continuous $f$ over $[a,b]$.
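To connect that identity to densities, here is a sketch for an increasing, differentiable $g$ (the endpoints $a<b$ are just illustrative symbols). The event $a<X<b$ is the same event as $g(a)<Y<g(b)$, so substituting $x=g^{-1}(y)$ gives

$$P(a<X<b)=\int_a^b f_X(x)\,dx=\int_{g(a)}^{g(b)} f_X(g^{-1}(y))\,\frac{d}{dy}g^{-1}(y)\,dy = P(g(a)<Y<g(b)).$$

Since this holds for every such interval, the integrand on the right, derivative factor included, is what plays the role of the density of $Y$.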

If you would like an appreciation for the geometry of what this represents though...

In short, the factor you refer to is needed to preserve total probability, i.e. the area under the density (it does this by rescaling the original density function).

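[Dividing by $g^\prime(g^{-1}(y))$ is not wrong for increasing $g$: by the inverse function theorem it equals multiplying by the derivative of the inverse transformation, $\frac{d}{dy}g^{-1}(y)=1/g^\prime(g^{-1}(y))$. The form below states that derivative directly and adds the absolute value needed when $g$ is decreasing.]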

The density function of $Y$ is better presented as $$f_Y(y)=f_X(g^{-1}(y)) \left\lvert\frac{d}{dy}g^{-1}(y)\right\rvert,$$ where $g$ is a strictly monotonic, differentiable function (and hence so is $g^{-1}$) over the support of $X$. Students love to ignore the $\frac{d}{dy}g^{-1}(y)$ factor, but it is where all the action happens for continuous $X$, as I hope to suggest to you!

(Side note: be careful with your notation: you have written $f_Y(\color{blue}{y})= f_X(\color{red}{x}) / g'(g^{-1}(\color{blue}{y}))$, which is a function of $y$ on the left hand side but still has an $x$ on the right hand side, which may contribute to confusion - the $x$ has been (and should be) replaced by $g^{-1}(y)$, a function of $y$.)

As Shubham Johri pointed out, when $X$ is discrete then (effectively) $\frac{d}{dy}g^{-1}(y)$ "falls away". However, "falling away" doesn't give you much of a feel for what is really going on (geometrically).

A good starting point is to accept that probability masses (associated with discrete variables) and densities (continuous variables) are "incompressible". When you perform a statistical transformation ($Y=g(X)$) you are shifting probability mass or density around, but preserving the total probability ($\sum_{\text{all } y} f_Y(y)=1$ or $\int_\mathbb{R} f_Y(y)\,dy=1$).

For a physical analogy for a continuous random variable, think of taking an inflated balloon: you can reshape the balloon, without popping it, in many different ways (by squashing it, for example), and the total volume in the balloon remains the same (yes, real gases do compress/expand but that is not the point here - fill the balloon with water in your mind if it helps you get past the limitations of my physical analogy).

Now, for a monotonic transformation of a continuous random variable, consider how the supports map (from $X$ to $Y$). To make it concrete, let's suppose that the support of $X$ is $(0,1)$ and we are going to transform this variable to $Y=e^X-1=g(X)$ (notice that this "$g$" is monotonic). The transformed support of $Y$ is then $(e^0-1,e^1-1)=(0,e-1)$ (we can do this because $e^x-1$ is monotonic in $x$; if this were not the case, we would need to be far more careful in determining the new support). The point is that geometrically the support (not the density directly!) has been "stretched" by this transformation, from a range of $1$ for $X$ to a range of $e-1$ for $Y$. The density simply follows this stretching.

In order to preserve the total probability (which is incompressible), the density of $Y$ will need to be rescaled. How much to rescale by? That depends on how much local "stretching" has occurred. (Think of the graph of the transformation, $y=e^x-1$, over the support of $X$, i.e. $(0,1)$.) The amount of local "stretching" that must be compensated for is given precisely by the factor $\frac{d}{dy}g^{-1}(y)$! In this example, $g^{-1}(y)=\ln (y+1)$, and $\frac{d}{dy}g^{-1}(y)=\frac{1}{y+1}$. You will notice that (for this transformation) for values of $y>0$, this factor is less than $1$, i.e. the straight substitution you enquired about, $f_X(g^{-1}(y))$, is decreased by this factor. And it needs to be decreased more the larger $y$ is. Why is that (geometrically)? The transformation we are using ($e^X-1$) stretches (the $x$-axis) more, the larger $x$ is. The height of the density above this needs to be decreased then, in order to preserve the total probability in a local region when it goes through the transformation. In non-rigorous terms, the density of $X$ is "smeared" out by this particular transformation.
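A quick numerical check of this example may help. The sketch below assumes $X$ is uniform on $(0,1)$ (the argument above only fixes the support, so the uniform density is my assumption), and the helper names `g_inv`, `dg_inv` and `integrate` are made up for illustration. It compares the straight substitution with the rescaled density.

```python
import math

def g_inv(y):
    # Inverse of g(x) = e^x - 1
    return math.log(y + 1.0)

def dg_inv(y):
    # d/dy g^{-1}(y) = 1 / (y + 1): the local "stretching" correction
    return 1.0 / (y + 1.0)

def f_X(x):
    # ASSUMPTION: X ~ Uniform(0, 1); only the support (0, 1) comes from the text
    return 1.0 if 0.0 < x < 1.0 else 0.0

def f_Y_naive(y):
    # Straight substitution, without the derivative factor
    return f_X(g_inv(y))

def f_Y(y):
    # Substitution rescaled by |d/dy g^{-1}(y)|
    return f_X(g_inv(y)) * abs(dg_inv(y))

def integrate(f, lo, hi, n=100_000):
    # Plain midpoint rule; accurate enough for these smooth integrands
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

Y_LO, Y_HI = 0.0, math.e - 1.0          # transformed support of Y
naive_total = integrate(f_Y_naive, Y_LO, Y_HI)
scaled_total = integrate(f_Y, Y_LO, Y_HI)

# Probabilities of matching intervals agree even though the densities differ
y1, y2 = 0.5, 1.0
p_Y = integrate(f_Y, y1, y2)
p_X = integrate(f_X, g_inv(y1), g_inv(y2))

print(f"naive total  = {naive_total:.5f}")
print(f"scaled total = {scaled_total:.5f}")
print(f"P(y1<Y<y2) = {p_Y:.5f}, P(g_inv(y1)<X<g_inv(y2)) = {p_X:.5f}")
```

Under that uniform assumption the naive substitution integrates to the stretched length $e-1$ rather than $1$, while the rescaled density integrates to $1$ and assigns the same probability to an interval of $Y$ as $f_X$ assigns to the corresponding interval of $X$.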

Since point masses can't be "smeared", there is no need for this scaling factor when transforming the mass function of a discrete random variable.
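For contrast, here is a minimal sketch of the discrete case. The fair die and the map $g(x)=x^3$ are hypothetical choices of mine, picked because the cube stretches the spacing between support points unevenly.

```python
from fractions import Fraction

# Hypothetical example: X is a fair six-sided die
pmf_X = {x: Fraction(1, 6) for x in range(1, 7)}

def g(x):
    # Hypothetical one-to-one map on the support of X
    return x ** 3

# Each point mass is simply carried to its new location; no rescaling
pmf_Y = {g(x): p for x, p in pmf_X.items()}
total = sum(pmf_Y.values())

print(sorted(pmf_Y))   # support points are now unevenly spaced
print(total)           # total probability is unchanged
```

The support points move apart, but each mass is relocated whole, so $f_Y(y)=f_X(g^{-1}(y))$ holds as it stands.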

[If the non-linear transformation above is confusing at first, try the same argument using a linear transformation, $Y=g(X)=a+b X$. In this case the scaling factor, $\lvert1/b\rvert$, is constant over the support. And, in the really trivial case where $b=1$, no scaling is required at all...because the transformation is only a shift, $Y=a+X$, of the distribution of $X$.]
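Returning to the $Y=X^2$ example in the question: that map is not monotonic on $\mathbb{R}$, so the formula above does not apply directly. One common way to handle it, which goes slightly beyond what I have argued above, is to apply the formula on each monotonic branch ($x=\sqrt{y}$ and $x=-\sqrt{y}$) and add the results; both branches carry the same factor $\left\lvert\frac{d}{dy}(\pm\sqrt{y})\right\rvert=\frac{1}{2\sqrt{y}}$. The sketch below checks this at $y=4$ against the $f_Y$ given in the question.

```python
import math

def f_X(x):
    # Standard normal density, as given in the question
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def f_Y(y):
    # Density of Y = X^2 as stated in the question (valid for y > 0)
    return math.exp(-y / 2.0) / math.sqrt(2.0 * math.pi * y)

y = 4.0
root = math.sqrt(y)

# What the question proposed: add the two preimage densities, no rescaling
naive_sum = f_X(root) + f_X(-root)

# Per-branch rescaling: |d/dy (+-sqrt(y))| = 1 / (2 sqrt(y)) on both branches
scale = 1.0 / (2.0 * root)
branch_sum = naive_sum * scale

print(f"f_X(2) + f_X(-2)       = {naive_sum:.6f}")
print(f"rescaled branch sum    = {branch_sum:.6f}")
print(f"f_Y(4) from question   = {f_Y(y):.6f}")
```

At $y=4$ the scale factor is $1/4$, which is exactly the gap between $f_X(2)+f_X(-2)$ and $f_Y(4)$.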

Thank you so much for the patient and well crafted answer!! Unfortunately I am a rather coarse student so it took me a few re-reads and some thinking to develop even a superficial intuition, which is as follows: Probability is represented by the area under the PDF; if you adjust the support you have to adjust the height to preserve the area. In this manner $P(y_1<Y<y_2) = P(g^{-1}(y_1)<X<g^{-1}(y_2))$ even though their PDFs are not equal; in fact, their PDFs have to differ in order to preserve this equality? — phntm
– phntm, Commented Feb 11, 2021 at 6:34
i think what's confusing is also the fact that as $dy \to 0$, you're just left with a PDF. But I guess what that means is you can't really draw conclusions by comparing two PDFs to each other? Because two events can have two very different PDFs and still have the same probability? — phntm
– phntm, Commented Feb 11, 2021 at 6:37
To your first comment: quite so - your summary is succinct (for monotonic increasing $g$). And, in fact, viewing it in terms of probability instead of density makes it easier. You have effectively answered your own question in a way. For monotonic increasing $g$ you have cumulative distribution function (probability) $$F_Y(y)=P(Y \leq y) = P(X \leq g^{-1}(y)) = F_X(g^{-1}(y)).$$ Taking derivatives with respect to $y$ obtains the density function, and the need to use the chain rule produces the factor that you asked about. — Mark D
– Mark D, Commented Feb 11, 2021 at 12:01
Sorry, I meant that for a continuous random variable, probability is only defined over some interval. So $P(Y=y) = f_Y(y)$ is shorthand for $$P(y \leq Y \leq y+\Delta) = \int_y^{y+\Delta}f_Y(t)\,dt.$$ As $\Delta \to 0$, the area under the PDF, and hence the probability, goes to $0$. But if a continuous random variable cannot take on a single value, what exactly is $f_Y(y)$ telling us when $y$ is not a range, but a constant? — phntm
– phntm, Commented Feb 11, 2021 at 16:06
