I recently learned of a new correlation coefficient, introduced by Sourav Chatterjee, with some really cool properties. As stated in the abstract of the paper (Reference 1 below), this coefficient…

- … is as simple as the classical coefficients of correlation (e.g. Pearson correlation and Spearman correlation),
- … consistently estimates a quantity that is 0 iff the variables are independent and 1 iff one is a measurable function of the other, and
- … has a simple asymptotic theory under the hypothesis of independence (like the classical coefficients).

What was surprising to me was Point 2: *this is the first known correlation coefficient that measures the degree of functional dependence, equaling 0 iff the variables are independent and 1 iff one is a measurable function of the other*. (The “iff”s are not typos: they are short for “if and only if”.) This can be viewed as a generalization of the Pearson correlation coefficient, which measures the degree of *linear* dependence between $X$ and $Y$. (The author points out in point 5 of the Introduction and in Section 6 that the maximal information coefficient and the maximal correlation coefficient do not have this property, even though they are sometimes thought to have it.)

**Defining the sample correlation coefficient**

Let $X$ and $Y$ be real-valued random variables such that $Y$ is not a constant, and let $(X_1, Y_1), \dots, (X_n, Y_n)$ be i.i.d. pairs of these random variables. The new correlation coefficient can be computed as follows:

- Rearrange the data as $(X_{(1)}, Y_{(1)}), \dots, (X_{(n)}, Y_{(n)})$ so that the $X$ values are in increasing order, i.e. $X_{(1)} \le \dots \le X_{(n)}$. If there are ties among the $X$'s, break them uniformly at random.
- For each index $i$, let $r_i$ be the rank of $Y_{(i)}$, i.e. the number of $j$ such that $Y_{(j)} \le Y_{(i)}$, and let $l_i$ be the number of $j$ such that $Y_{(j)} \ge Y_{(i)}$.
- Define the new correlation coefficient $\xi_n(X, Y)$ as

$$\xi_n(X, Y) = 1 - \frac{n \sum_{i=1}^{n-1} |r_{i+1} - r_i|}{2 \sum_{i=1}^{n} l_i (n - l_i)}.$$

If there are no ties among the $Y$'s, the denominator simplifies and we don't have to compute the $l_i$'s:

$$\xi_n(X, Y) = 1 - \frac{3 \sum_{i=1}^{n-1} |r_{i+1} - r_i|}{n^2 - 1}.$$
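
Assuming there are no ties among the $Y$'s, the recipe above fits in a few lines of Python. This is a sketch, not the author's code, and the function name `xi_n` is mine:

```python
import numpy as np

def xi_n(x, y, rng=None):
    """Sketch of the new correlation coefficient (no-ties formula).

    Assumes the y values are distinct, so the simplified
    tie-free denominator applies."""
    rng = np.random.default_rng() if rng is None else rng
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    # Sort the pairs by x, breaking ties among the x's uniformly at random.
    order = np.lexsort((rng.random(n), x))
    y_sorted = y[order]
    # r_i = rank of y_(i), i.e. the number of j with y_(j) <= y_(i).
    r = np.argsort(np.argsort(y_sorted)) + 1
    # Simplified formula: 1 - 3 * sum|r_{i+1} - r_i| / (n^2 - 1).
    return 1 - 3 * np.abs(np.diff(r)).sum() / (n**2 - 1)
```

For a noiseless monotone relationship such as $y = x$, this returns $1 - 3/(n+1)$, which tends to 1 as $n$ grows.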

**What is the sample correlation coefficient estimating?**

The following theorem tells us what $\xi_n$ is trying to estimate:

**Theorem:** As $n \to \infty$, $\xi_n(X, Y)$ converges almost surely to the deterministic limit

$$\xi(X, Y) = \frac{\int \mathrm{Var}\left(\mathbb{E}[1\{Y \ge t\} \mid X]\right) \, d\mu(t)}{\int \mathrm{Var}\left(1\{Y \ge t\}\right) \, d\mu(t)},$$

where $\mu$ is the law of $Y$.

$\xi(X, Y)$ seems like a nasty quantity, but it has some nice properties:

- $\xi(X, Y)$ always belongs to $[0, 1]$. (This follows immediately from the law of total variance.)
- $\xi(X, Y) = 0$ iff $X$ and $Y$ are independent, and $\xi(X, Y) = 1$ iff there is a measurable function $f$ such that $Y = f(X)$ almost surely.
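
Writing $\xi$ as a ratio of integrated variances, the easy directions of the second bullet follow quickly; the converses are the substantive part of the theorem. A sketch:

```latex
\xi(X, Y) \;=\; \frac{\int \mathrm{Var}\left(\mathbb{E}[1\{Y \ge t\} \mid X]\right) \, d\mu(t)}
                     {\int \mathrm{Var}\left(1\{Y \ge t\}\right) \, d\mu(t)}

% If X and Y are independent, E[1{Y >= t} | X] = P(Y >= t) is constant:
X \perp\!\!\!\perp Y
\;\Rightarrow\; \mathrm{Var}\left(\mathbb{E}[1\{Y \ge t\} \mid X]\right) = 0 \text{ for all } t
\;\Rightarrow\; \xi(X, Y) = 0

% If Y = f(X) a.s., then 1{Y >= t} is already a function of X,
% so conditioning on X changes nothing:
Y = f(X) \text{ a.s.}
\;\Rightarrow\; \mathbb{E}[1\{Y \ge t\} \mid X] = 1\{Y \ge t\}
\;\Rightarrow\; \xi(X, Y) = 1
```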

This later paper by Mona Azadkia and Chatterjee extends $\xi$ to capture degrees of conditional dependence.

**Some properties of $\xi_n$ and $\xi$**

Here are some other key properties of this new correlation coefficient:

- $\xi_n$ is not symmetric in $X$ and $Y$, i.e. often we will have $\xi_n(X, Y) \neq \xi_n(Y, X)$. It can be symmetrized by taking $\max\{\xi_n(X, Y), \xi_n(Y, X)\}$ as the correlation coefficient instead. Chatterjee notes that we might actually want this asymmetry in certain cases: “we may want to understand if $Y$ is a function of $X$, and not just if one of the variables is a function of the other”.
- $\xi_n$ remains unchanged if we apply strictly increasing transformations to $X$ and $Y$, as it is based only on the ranks of the data.
- Since $\xi_n$ is based only on ranks, it can be computed in $O(n \log n)$ time.
- We have some asymptotic theory for $\xi_n$ under the assumption of independence:

**Theorem:** Suppose $X$ and $Y$ are independent and $Y$ is continuous. Then $\sqrt{n} \, \xi_n(X, Y) \to N(0, 2/5)$ in distribution as $n \to \infty$. (The paper has a corresponding result for the case where $Y$ is not continuous.) This theorem allows us to construct a hypothesis test of independence based on this correlation coefficient.
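
A minimal sketch of the resulting test, under the same no-ties assumption as before (the function name is mine, not the paper's):

```python
import numpy as np
from math import erfc, sqrt

def independence_test(x, y, rng=None):
    """One-sided test of H0: X and Y are independent.

    Uses the asymptotic result sqrt(n) * xi_n -> N(0, 2/5) under H0.
    Sketch only; assumes no ties among the y values."""
    rng = np.random.default_rng() if rng is None else rng
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    order = np.lexsort((rng.random(n), x))     # sort by x, random tie-breaking
    r = np.argsort(np.argsort(y[order])) + 1   # ranks of y in x-sorted order
    xi = 1 - 3 * np.abs(np.diff(r)).sum() / (n**2 - 1)
    # Under H0, z is approximately standard normal; large xi is evidence
    # against independence, so we report the upper-tail p-value.
    z = sqrt(n) * xi / sqrt(2 / 5)
    p_value = 0.5 * erfc(z / sqrt(2))
    return xi, p_value
```

For a clearly dependent pair (e.g. $y = \sin(4x)$) the p-value is essentially 0, while for independent draws it is typically large.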

References:

- Chatterjee, S. (2019). A new coefficient of correlation.