Probability distributions for sorted observations of a random variable, such as the nth-highest roll from k dice.

Question

This seems to be a bit of an unorthodox question since I haven't been able to find any questions quite like it on this site. The closest I can find are those pertaining specifically to the scenarios of rolling dice and ignoring some number of either only the lowest or only the highest. So, I'll first describe a more general case of the problem for anyone else who might need to know how the distribution of ordered observations of a random variable works.

Note on notation: As per Wikipedia, I'm using $\left[ a .. b \right]$ to mean the set of integers between $a$ and $b$—that is, $\left[ a .. b \right] = \left[ a,b \right] \cap \mathbb{N}$

General Case

Take $k$ observations, $a_1$ through $a_k$, of some random variable $A$ with distribution $\mathcal{A}$. Put them in a list $b$, ordered from smallest to largest, so that $b_n$ is the $n$th-smallest value of $\{ a_i \mid i \in [1 .. k]\}$. What is the probability distribution of $b_n$? I will refer to this distribution as $\Omega(\mathcal{A}, k, n)$.

Example

Let's say $k$ = 5 and $A$ is a standard uniform random variable. That is, $A \sim \mathcal{U}(0,1)$.

We obtain five observations of $\mathcal{U}(0,1)$:

$a = ( .376, .531, .826, .896, .166 )$

and then we sort them:

$b = ( .166, .376, .531, .826, .896 )$

Each $b_n$ is an observation of $\Omega(\mathcal{U}(0,1), 5, n)$. For example, 0.531 is an observation of $\Omega(\mathcal{U}(0,1), 5, 3)$.

Specific Case

The particular reason I came across this problem is for tabletop role-playing games, which typically use dice for determining random outcomes. A common method for modifying the shape of probability distributions without changing the minimum and maximum possible values is to have players perform one or more extra rolls and then discard the highest or lowest results before summing the remaining values. For example, roll a 20-sided dice twice and discard the highest, or roll a 6-sided dice four times and discard the lowest. When designing new game mechanics that use this type of random selection, it would be good to know what the actual probability distributions are. To determine the probability distribution of a sum of certain rolls in a sorted set, you need to know the probability distribution of each die after sorting.

So, my specific case is a subset of the general case for a discrete uniform distribution with domain $[1..s]$, where s is the number of sides on the die. I will call this distribution $D(s)$. It has the following probability mass function and cumulative distribution function:

$p(x; s) = \left\{\begin{array}{ll}\frac{1}{s} & \quad x \in [1 .. s] \\ 0 & \quad \text{otherwise} \end{array}\right.$

$F(x; s) = \left\{\begin{array}{ll} 0 & \quad x \lt 1 \\ \frac{\lfloor x \rfloor}{s} & \quad 1 \leq x \leq s \\ 1 & \quad x \gt s \end{array}\right.$

What I'm looking for is a way to get the probability distribution for the $n$th-smallest value from rolling an $s$-sided die $k$ times. Formally, I want to find the probability mass function $p(x; s, k, n)$ and/or the cumulative distribution function $F(x; s, k, n)$ of $\Omega(D(s), k, n)$.

Example

Let's say $k$ = 4 and $A$ is a random variable that represents the outcome of rolling a 20-sided die. That is, $A \sim D(20)$.

We obtain four observations of $D(20)$:

$a = ( 19, 10, 10, 13)$

and then we sort them:

$b = ( 10, 10, 13, 19 )$

Each $b_n$ is an observation of $\Omega(D(20), 4, n)$. For example, the second 10 is an observation of $\Omega(D(20), 4, 2)$.

What I am personally looking for is just a solution to the dice problem. However, out of curiosity as to whether it is even possible, I do wonder if it is possible to come up with a more general composite function that handles the general case for any random variable. I'm sure there are use cases for needing to know the distribution of things like the middle half of observations from a normal distribution, or the tenth-highest outcome out of a hundred from a geometric distribution.

The trouble with starting from individual order statistics is that they are not independent. To get the distribution of the sum, you could compute the joint distribution of the individual order statistics, but this gets messy quickly. Fortunately, there are a variety of other effective approaches. — HighDiceRoller
– HighDiceRoller, Commented Sep 12, 2023 at 1:57

Henry · Accepted Answer · 2023-09-12 00:57:11Z

You want the probability that the $n$th-smallest value from rolling an $s$-sided die $k$ times is $x$.

This is the probability that at least $n$ of the $k$ rolls give $x$ or less minus the probability that at least $n$ of the $k$ rolls give $x-1$ or less

which is the same as the probability that no more than $n-1$ of the $k$ rolls give $x-1$ or less minus the probability that no more than $n-1$ of the $k$ rolls give $x$ or less

which you can write as $$\sum\limits_{j=0}^{n-1} {k \choose j}\frac{(x-1)^j(s-x+1)^{k-j}-x^j(s-x)^{k-j}}{s^n}$$

but if you have a binomial CDF function to hand, such as in R, it might be easier to write:

pbinom(n-1, k, (x-1)/s) - pbinom(n-1, k, x/s)

making the probability that the second smallest value of four $D(20)$ rolls is $10$ equal to $0.07848125$ (the value of $7$ is more likely, with a probability of $0.08871875$, the median is $8$ and the expected value is $8.5000125$)

The CDF is $1$ minus the probability that no more than $n-1$ of the $k$ rolls give $x$ or less so

$$1- \sum\limits_{j=0}^{n-1} {k \choose j}\frac{x^j(s-x)^{k-j}}{s^n}$$

or in R

 1 - pbinom(n-1, k, x/s)

HighDiceRoller · Accepted Answer · 2023-09-12 18:02:30Z

This class of question was also asked about here and here.

Multiset enumeration

I would be remiss not to mention AnyDice, which can conveniently compute many problems of this type and is the currently the most popular resource of its type in the tabletop RPG area.

output [highest 3 of 4d6] output [middle 1 of 3d20] output {1,3,5}@5d20

(The @ operator here is used to select the highest, third highest, and fifth highest (i.e. lowest) die out of five, which are then summed.)

However, AnyDice does enumerate all $\binom{n + k - 1}{n}$ possible multisets (equivalently, sorted rolls) that can come out of a pool, which are weighted according to the multinomial distribution. Its 5-second limit tops out at around eight to nine d20s, or between 4.5 and 7 million possibilities.

Fast Fourier Transform

That the sum of a set of dice can be computed via convolution is a classical result in discrete statistics. Convolution (sometimes also framed in terms of generating functions) allows us to use the fantastically efficient fast Fourier transform (FFT). With some more work, FFTs can also be used the case where only some of the dice are kept. Approaches of this type have been proposed on StackExchange, with examples including:

Dynamic programming

However, my preferred approach is dynamic programming (sometimes also framed in terms of recurrence relations). Examples include:

The short of it is that we compute the answer to the problem to "roll $n$ d20, keep $m$" by considering how many dice $k = 0 \ldots n$ might possibly roll a 20, and then use the solutions to "roll $n - k$ d19, keep $m - k$" -- namely, the remaining pool after the $k$ dice that rolled a 20 are removed -- to complete the solution. In turn, the solutions for d19s use the solutions for d18s, and so forth until we reach d1s and all of the remaining dice must roll 1. While not as fast as FFTs, the dynamic programming approach is much more flexible, covering general "single-pass" functions over order statistics: in addition to summing the dice, it can also be used to produce efficient solutions for matching sets, straights, RISK-like mechanics, and so forth. Selecting from the middle or even arbitrary indexes is no problem either.

I've implemented this in my Icepool Python package. Here is an example script for summing the highest 10 out of 30d20:

from icepool import d output(d(20).highest(30, 10))

You can also take from the middle:

output(d(20).middle(30, 10))

In pool mode, you can select arbitrary indexes, e.g. every other from a pool of 5:

output(d(20).pool(5)[1,0,1,0,1].sum())

Or operations other than summing:

output(d(20).pool(30).middle(10).largest_straight())

You can try it online here.

If you are interested in learning more, you can read my paper on the subject.

@inproceedings{liu2022icepool, title={Icepool: Efficient Computation of Dice Pool Probabilities}, author={Albert Julius Liu}, booktitle={Eighteenth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment}, volume={18}, number={1}, pages={258-265}, year={2022}, month={Oct.}, eventdate={2022-10-24/2022-10-28}, venue={Pomona, California}, url={https://ojs.aaai.org/index.php/AIIDE/article/view/21971}, doi={10.1609/aiide.v18i1.21971} }

This is really impressive! I wasn't expecting to stumble upon a veritable genius of dice statistics. I'm a big fan of AnyDice, but it lacks the ability to analyze a "middle" selection of order statistics—it can only take the highest or lowest n dice. That is the original reason I decided to pose this question. Seems like your site should really be the most popular resource. — Ethan Maness
– Ethan Maness, Commented Sep 12, 2023 at 17:22
Thank you for the kind words! Though to be fair, AnyDice is also capable of selection using the middle function or the @ operator. I've added some more AnyDice examples. — HighDiceRoller
– HighDiceRoller, Commented Sep 12, 2023 at 18:06

Stack Exchange Network

Probability distributions for sorted observations of a random variable, such as the nth-highest roll from k dice.

General Case

Example

Specific Case

Example

2 Answers 2

Multiset enumeration

Fast Fourier Transform

Dynamic programming

You must log in to answer this question.

Linked

Hot Network Questions

Probability distributions for sorted observations of a random variable, such as the nth-highest roll from k dice.

General Case

Example

Specific Case

Example

2 Answers 2

Multiset enumeration

Fast Fourier Transform

Dynamic programming

You must log in to answer this question.

Linked

Related

Hot Network Questions