16. Random Matrices

A matrix is interesting only in those ways it fails to look like a random matrix.

A random matrix is a matrix whose entries are random, i.e., random draws from some distribution(s). More precisely, the entries of a random matrix are random variables, which means their values are determined probabilistically.

You would expect that a random matrix must no covariance structure, because it is random. In fact, random matrices have a covariance structure described by the Marchenko-Pastur distribution.

In this lecture, we start with the maths, and then we look at a random matrix example that arises in finance. Finally, we illustrate a denoising technique based on a medical imaging example.

A useful reference on the topic and the importance of random matrix in machine learning are:

For a connection between random matrices results in spin glasses and machine learning see:

For insights from both physics and finance, see:

16.1. Wigner Semicircle Law

Consider a matrix \(M\) belonging to the Gaussian Orthogonal Ensemble (GOE).

Such a matrix \(M\) can be a real symmetric \(n \times n\) matrix with i.i.d. entries \(M_{ij}\) drawn from a normal distribution with mean 0 and variance 1.

Let us compute such a matrix.

[5]:
import numpy as np
import matplotlib.pyplot as plt
n = 50
A = np.random.randn(n, n)
M = (A + A.T) / np.sqrt(2 * n)

plt.figure(figsize=(5,5))
plt.imshow(M, cmap='viridis')
plt.colorbar(label='Value', shrink=0.8)
plt.title('Random GOE Matrix')
plt.xlabel('Column Index')
plt.ylabel('Row Index')
plt.show()



../../_images/material_part16_notebook_2_0.png

We compute the eigenvalues of \(M\), i.e., \(\lambda_1, \ldots, \lambda_n\) such that \(M v_i = \lambda_i v_i\) for \(i=1,\ldots, n\), where \(v_1, \ldots, v_n\) are the eigenvectors of \(M\).

[9]:
lambdas = np.linalg.eigvalsh(M)
len(lambdas)
[9]:
50
[11]:
lambdas
[11]:
array([-1.78831596, -1.66285061, -1.60735989, -1.478767  , -1.37076632,
       -1.28266936, -1.26073123, -1.17522307, -1.15078943, -1.0505597 ,
       -0.87587328, -0.82994073, -0.81216264, -0.69627856, -0.63556949,
       -0.60146074, -0.5174084 , -0.49149304, -0.42805175, -0.34783377,
       -0.30842303, -0.18600724, -0.16910437, -0.0798577 , -0.06882297,
       -0.03037273,  0.10134173,  0.12347385,  0.19636835,  0.33194403,
        0.39814771,  0.40461385,  0.48102878,  0.5181245 ,  0.57489408,
        0.60581731,  0.74318659,  0.74941497,  0.83245446,  0.99281742,
        1.04225235,  1.06837886,  1.18588169,  1.2323374 ,  1.40538206,
        1.46164366,  1.54461226,  1.65577622,  1.73020455,  1.8206828 ])

lambdas is an array of length n containing the eigenvalues of \(M\). We can present the eigenvalues in a histogram. This histogram is sometimes referred to as the empirical eigenvalue distribution.

[12]:
plt.figure(figsize=(5,5))
plt.hist(lambdas, bins=20, density=True, alpha=0.7, color='black', label='Eigenvalues')
plt.title('Histogram of GOE Matrix Eigenvalues')
plt.xlabel(r'Eigenvalue $\lambda$')
plt.ylabel('Density')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

../../_images/material_part16_notebook_7_0.png

Let us now consider the limiting case as \(n \to \infty\).

For instance, pick \(n=1000\).

[13]:
%%time
n = 1000
A = np.random.randn(n, n)
M = (A + A.T) / np.sqrt(2 * n)
lambdas = np.linalg.eigvalsh(M)
CPU times: user 161 ms, sys: 38.9 ms, total: 200 ms
Wall time: 164 ms
[14]:
plt.figure(figsize=(5,5))
plt.hist(lambdas, bins=100, density=True, alpha=0.7, color='black', label='Eigenvalues')
plt.title('Histogram of GOE Matrix Eigenvalues')
plt.xlabel(r'Eigenvalue $\lambda$')
plt.ylabel('Density')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
../../_images/material_part16_notebook_10_0.png

n = 10000, so the matrix is 10000 by 10000

[4]:
%%time
n = 10000
A = np.random.randn(n, n)
M = (A + A.T) / np.sqrt(2 * n)
lambdas = np.linalg.eigvalsh(M)

CPU times: user 6min 38s, sys: 42.3 s, total: 7min 21s
Wall time: 4min 18s
[6]:
plt.figure(figsize=(5,5))
plt.hist(lambdas, bins=100, density=True, alpha=0.7, color='black', label='Eigenvalues')
plt.title('Histogram of GOE Matrix Eigenvalues')
plt.xlabel(r'Eigenvalue $\lambda$')
plt.ylabel('Density')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

../../_images/material_part16_notebook_13_0.png

The histogram converges to a semicircle. The eigenvalues are distributed according to the Wigner semicircle distribution:

\[\rho(\lambda) = \frac{2}{\pi R^2} \sqrt{ R^2 - \lambda^2} \quad \text{for} \quad \lambda \in [-R, R].\]

Here with \(R=2\).

Note that if the standard deviation of the normal distribution is \(\sigma\), then the semicircle radius is \(R = 2 \sigma\).

[7]:
plt.figure(figsize=(5,5))
plt.hist(lambdas, bins=100, density=True, alpha=0.7, color='black', label='Eigenvalues')

x = np.linspace(-2, 2, 1000)
y = 2/(np.pi*4) * np.sqrt(4 - x**2)  # R=2 case
plt.plot(x, y, 'r-', label='Wigner Semicircle', lw=2)

plt.title('Histogram of GOE Matrix Eigenvalues')
plt.xlabel(r'Eigenvalue $\lambda$')
plt.ylabel('Density')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
../../_images/material_part16_notebook_15_0.png

This property was discovered by Eugene Wigner in the 1950s, working on nuclear physics (more precisely the spaces between the energy levels of heavy nuclei). See Wigner surmise for more details.

Wigner’s insight was as follows:

Perhaps I am now too courageous when I try to guess the distribution of the distances between successive levels (of energies of heavy nuclei). Theoretically, the situation is quite simple if one attacks the problem in a simpleminded fashion. The question is simply what are the distances of the characteristic values of a symmetric matrix with random coefficients. (Wigner, 1956)

16.2. Wishart Matrices

This ensemble of matrices is called the Wishart Ensemble and named after the British statistician John Wishart.

16.2.1. Sample Covariance Matrices

Consider \(T\) observations \(\mathbf{x}_1, \ldots, \mathbf{x}_T\), with \(\mathbf{x}_t \in \mathbb{R}^N\).

The sample covariance matrix (SCM) is defined as:

\[\mathbf{E} = \frac{1}{T} \sum_{t=1}^T \mathbf{x}_t \mathbf{x}_t^T.\]

It is of size \(N \times N\). With \(\mathbf{H}\) the matrix of observations (the data matrix), i.e., \(\mathbf{H} = [\mathbf{x}_1, \ldots, \mathbf{x}_T]\) of size \(N \times T\), we can write:

\[\mathbf{E} = \frac{1}{T} \mathbf{H} \mathbf{H}^T.\]

You can show that \(\mathbf{E}\) is symmetric and positive semidefinite, and therefore all its eigenvalues are real and non-negative: \(\lambda_1, \ldots, \lambda_N \geq 0\).

16.2.2. Wishart Ensemble

When the entries of \(\mathbf{H}\) are i.i.d. normal random variables, then \(\mathbf{E}\) is a Wishart matrix belonging to the Wishart Ensemble. Moreover, the probability distribution defined over the space of \(\mathbf{E}\)’s is known as the Wishart distribution.

An important parameter to characterize these matrices is the ratio \(q = N/T\).

When \(q \to 0\), the number of observations \(T\) is much larger than the number of features \(N\), and one can faithfully reconstruct the true covariance matrix from the SCM.

But when \(q=\mathcal{O}(1)\), the number of observations is of the same order as the number of features, and the SCM is very distorted version of the true covariance matrix.

Let us fix \(q=1/2\) and look at the histogram of the eigenvalues of a Wishart matrix, as we increase the number of observations \(T\).

[1]:
import numpy as np
import matplotlib.pyplot as plt

q = 1 / 2
fig, ax = plt.subplots(1, 3, figsize=(15, 5))  # Larger figure size for better spacing

# Define the different values of T to loop over
Ts = [500, 1000, 10000]

for i, T in enumerate(Ts):
    N = int(q * T)
    H = np.random.randn(N, T)
    E = H @ H.T / T
    lambdas = np.linalg.eigvalsh(E)

    # Plot the histogram of eigenvalues
    ax[i].hist(lambdas, bins=100, density=True, alpha=0.7, color='black')
    ax[i].set_title(f'T = {T}', fontsize=12)
    ax[i].set_xlabel('Eigenvalue', fontsize=10)
    ax[i].set_ylabel('Density', fontsize=10)

# Adjust layout
plt.tight_layout()
plt.show()

../../_images/material_part16_notebook_18_0.png

We see that for large \(T\) the histogram converges to a seemingly continuous distribution.

16.3. Marchenko Pastur Distribution

The Marchenko Pastur distribution is a distribution of eigenvalues. The formula is:

\[\rho(\lambda) = \frac{1}{2 \pi q \lambda} \sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)} \quad \text{for} \quad \lambda \in [\lambda_+, \lambda_-],\]

where \(\lambda_+, \lambda_-\) are given by:

\[\lambda_\pm = (1 \pm \sqrt{q})^2.\]

16.3.1. Marchenko Pastur Law

The Marchenko Pastur law implies that when both \(N\) and \(T\) are large, the empirical eigenvalue distribution of Wishart matrices converges to the Marchenko Pastur distribution with \(q=N/T\).

We can verify this with our previous example where we fixed \(q=1/2\).

[3]:
import numpy as np
import matplotlib.pyplot as plt

# Parameters
q = 1 / 2
T = 10000
N = int(q * T)

# Generate the random matrix and compute eigenvalues
H = np.random.randn(N, T)
E = H @ H.T / T
lambdas = np.linalg.eigvalsh(E)

# Calculate Marchenko-Pastur distribution
lambda_minus = (1 - np.sqrt(q))**2
lambda_plus = (1 + np.sqrt(q))**2
x = np.linspace(lambda_minus, lambda_plus, 500)
mp_density = (1 / (2 * np.pi * q * x)) * np.sqrt((lambda_plus - x) * (x - lambda_minus))
mp_density = np.where((x >= lambda_minus) & (x <= lambda_plus), mp_density, 0)  # Set density to 0 outside the range

[6]:
# Plotting
plt.figure(figsize=(5, 5))
plt.hist(lambdas, bins=100, density=True, alpha=0.7, color='black', label='Empirical Eigenvalues')
plt.plot(x, mp_density, color='red', linewidth=2, label='Marchenko-Pastur Distribution')
plt.axvline(lambda_minus, color='blue', linestyle='--', label=r'$\lambda_-$')
plt.axvline(lambda_plus, color='green', linestyle='--', label=r'$\lambda_+$')

# Labels and legend
# plt.title(f'Eigenvalue Distribution and Marchenko-Pastur Law (T={T})', fontsize=14)
plt.xlabel('Eigenvalue', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.legend(fontsize=12)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
../../_images/material_part16_notebook_23_0.png

Let us now look at different values of \(q\).

Case q=2:

[7]:
import numpy as np
import matplotlib.pyplot as plt

# Parameters
q = 2
T = 5000
N = int(q * T)

# Generate the random matrix and compute eigenvalues
H = np.random.randn(N, T)
E = H @ H.T / T
lambdas = np.linalg.eigvalsh(E)

# Calculate Marchenko-Pastur distribution
lambda_minus = (1 - np.sqrt(q))**2
lambda_plus = (1 + np.sqrt(q))**2
x = np.linspace(lambda_minus, lambda_plus, 500)
mp_density = (1 / (2 * np.pi * q * x)) * np.sqrt((lambda_plus - x) * (x - lambda_minus))
mp_density = np.where((x >= lambda_minus) & (x <= lambda_plus), mp_density, 0)  # Set density to 0 outside the range

[23]:

# Plotting
plt.figure(figsize=(5, 5))
hist_vals, bins, _ = plt.hist(lambdas, bins=100, density=True, alpha=0.7, color='black', label='Empirical Eigenvalues')
plt.plot(x, mp_density, color='red', linewidth=2, label='Marchenko-Pastur Distribution')
plt.axvline(lambda_minus, color='blue', linestyle='--', label=r'$\lambda_-$')
plt.axvline(lambda_plus, color='green', linestyle='--', label=r'$\lambda_+$')


zero_mass = (q - 1) / q  # Amplitude of delta function at zero
bin_width = bins[1] - bins[0]
plt.bar(0, zero_mass/bin_width, width=bin_width, color='red', alpha=0.7, label=r'Mass at origin')

plt.yscale('log')
# Labels and legend
# plt.title(f'Eigenvalue Distribution and Marchenko-Pastur Law (T={T}, q={q})', fontsize=14)
plt.xlabel('Eigenvalue', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.legend(fontsize=12)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()


../../_images/material_part16_notebook_26_0.png

When \(q>1\), the Marchenko Pastur distribution has a mass at the origin, its amplitude is given by \((q-1)/q\) (times a Dirac delta function).

Case q=1:

[24]:
import numpy as np
import matplotlib.pyplot as plt

# Parameters
q = 1
T = 5000
N = int(q * T)

# Generate the random matrix and compute eigenvalues
H = np.random.randn(N, T)
E = H @ H.T / T
lambdas = np.linalg.eigvalsh(E)

# Calculate Marchenko-Pastur distribution
lambda_minus = (1 - np.sqrt(q))**2
lambda_plus = (1 + np.sqrt(q))**2
x = np.linspace(lambda_minus, lambda_plus, 500)
mp_density = (1 / (2 * np.pi * q * x)) * np.sqrt((lambda_plus - x) * (x - lambda_minus))
mp_density = np.where((x >= lambda_minus) & (x <= lambda_plus), mp_density, 0)  # Set density to 0 outside the range
/var/folders/h0/4_tf3pcn1h32ks9grh325v400000gn/T/ipykernel_61003/3342109622.py:18: RuntimeWarning: divide by zero encountered in divide
  mp_density = (1 / (2 * np.pi * q * x)) * np.sqrt((lambda_plus - x) * (x - lambda_minus))
/var/folders/h0/4_tf3pcn1h32ks9grh325v400000gn/T/ipykernel_61003/3342109622.py:18: RuntimeWarning: invalid value encountered in multiply
  mp_density = (1 / (2 * np.pi * q * x)) * np.sqrt((lambda_plus - x) * (x - lambda_minus))
[25]:

# Plotting
plt.figure(figsize=(5, 5))
hist_vals, bins, _ = plt.hist(lambdas, bins=100, density=True, alpha=0.7, color='black', label='Empirical Eigenvalues')
plt.plot(x, mp_density, color='red', linewidth=2, label='Marchenko-Pastur Distribution')
plt.axvline(lambda_minus, color='blue', linestyle='--', label=r'$\lambda_-$')
plt.axvline(lambda_plus, color='green', linestyle='--', label=r'$\lambda_+$')


plt.yscale('log')
# Labels and legend
# plt.title(f'Eigenvalue Distribution and Marchenko-Pastur Law (T={T}, q={q})', fontsize=14)
plt.xlabel('Eigenvalue', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.legend(fontsize=12)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

../../_images/material_part16_notebook_30_0.png

When \(q=1\), the Marchenko Pastur distribution has a singularity at the origin. It diverges as \(\sim 1/\sqrt{\lambda}\). In fact, in this case, the matrix \(\mathbf{H}\) is a Wigner matrix with its eigenvalues distributed according to the Wigner semicircle distribution, and the corresponding distribution of the square of these eigenvalues (i.e., the eigenvalues of \(\mathbf{E}\)) is distributed according to the distribution illustrated above.

16.3.2. Non-unit variance

When the variance of the entries of \(\mathbf{H}\) is not 1, the Marchenko Pastur distribution is given by:

\[\rho(\lambda) = \frac{1}{2 \pi q \lambda \sigma^2} \sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)} \quad \text{for} \quad \lambda \in [\lambda_+, \lambda_-],\]

where \(\lambda_\pm = (1 \pm \sqrt{q})^2 \sigma^2\).

16.3.3. Universality

In fact, the Marchenko Pastur distribution is universal in the sense that it does not depend on the specific distribution of the entries of \(\mathbf{H}\).

The theorem is due to Marchenko and Pastur (1967) and it states that:

Let \(\mathbf{H}\) be an \(N \times T\) random matrix with i.i.d. entries \(x_{ij}\) that verify:

\[\mathbb{E}[x_{ij}] = 0, \quad \mathbb{E}[x_{ij}^2] = 1 \quad \text{and} \quad \mathbb{E}[x_{ij}^4] < \infty.\]

Then when \(N, T \to \infty\), the empirical eigenvalue distribution of \(\mathbf{E} = \frac{1}{T} \mathbf{H} \mathbf{H}^T\) converges to the Marchenko Pastur distribution with \(q=N/T\).

Thus, the entries do not need to be Gaussian, as long as the above moment conditions are verified.

The most important feature predicted by the Marchenko Pastur law is that the eigenvalues live in a finite, well defined interval.

The Marchenko Pastur law is an extremely important result, with applications that continue to be discovered. With this result, random matrix theory took a new dimension.

16.4. Application to Finance

Understanding correlations between stocks is maybe the most important problem in finance. Thus, measuring or estimating correlations between stock prices is crucial. This generally underlies risk management and portfolio allocation strategies.

The question every analyst wants to answer is: Given a set of financial assets (say a set of stocks), characterized by their average returns and risk (i.e., volatility), what is the optimal weight for each asset such that the portfolio provides the best return for a fixed level of risk?

This question can not be answered without a reliable determination of the stock correlation matrix.

For \(N\) stocks, the correlation matrix is of size \(N \times N\) and has \(N(N-1)/2\) free parameters. It has to be determined from \(N\) time series of length \(T\).

For instance, for \(N=400\) stocks of the S&P 500 index this means \(\approx 80,000\) parameters to estimate. Say we have 4 years of data, i.e., \(N\) time series of length \(T=252 \times 4 = 1008\) trading days. In this case, \(T\) is not very large compared to \(N\), and the estimation of the correlation matrix ought to be noisy, i.e, random to a large extent. In this case, the structure of the correlation matrix is actually dominated by noise.

In principle, the eigenvectors of the correlation matrix that correspond to the smallest eigenvalues are the ones that we should use to construct the portfolio that maximizes the return-to-risk ratio. However, these eigenvectors are also the most affected by noise. Therefore, it is important to separate the signal (eigenvectors that have real information) from the noise.

16.4.1. Correlation of S&P 500 components

Let us load the daily returns of the S&P 500 components.

[2]:
import pandas as pd
link = (
    "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies#S&P_500_component_stocks"
)
df = pd.read_html(link, header=0)[0]
[3]:
df.head()
[3]:
Symbol Security GICS Sector GICS Sub-Industry Headquarters Location Date added CIK Founded
0 MMM 3M Industrials Industrial Conglomerates Saint Paul, Minnesota 1957-03-04 66740 1902
1 AOS A. O. Smith Industrials Building Products Milwaukee, Wisconsin 2017-07-26 91142 1916
2 ABT Abbott Laboratories Health Care Health Care Equipment North Chicago, Illinois 1957-03-04 1800 1888
3 ABBV AbbVie Health Care Biotechnology North Chicago, Illinois 2012-12-31 1551152 2013 (1888)
4 ACN Accenture Information Technology IT Consulting & Other Services Dublin, Ireland 2011-07-06 1467373 1989
[4]:
sym_str = []
for sym in df.Symbol:
    # if sym == 'BRK.B': ## some errors with this ticker
    #     continue
    sym_str.append(sym)
[5]:
# np.where(np.array(sym_str) == 'BRK.B')
[6]:
%%time
import yfinance as yf
df = yf.download(sym_str,
                 start="2000-01-01",
                 end="2005-11-30")
[*********************100%***********************]  503 of 503 completed

102 Failed downloads:
['OTIS', 'PLTR', 'DAL', 'CZR', 'LULU', 'UBER', 'PODD', 'KVUE', 'NXPI', 'DG', 'MPC', 'MRNA', 'V', 'FTV', 'VICI', 'VLTO', 'NCLH', 'QRVO', 'DELL', 'FTNT', 'CTVA', 'ABNB', 'ENPH', 'SW', 'PAYC', 'LDOS', 'PYPL', 'GEHC', 'FOXA', 'HCA', 'KEYS', 'IR', 'APTV', 'FOX', 'VRSK', 'KHC', 'GEV', 'EPAM', 'CARR', 'CHTR', 'KMI', 'KDP', 'CPAY', 'FANG', 'TEL', 'BX', 'CEG', 'AMCR', 'ABBV', 'CFG', 'LYV', 'VST', 'HLT', 'IQV', 'ZTS', 'MA', 'PSX', 'LW', 'NWSA', 'HPE', 'GNRC', 'PARA', 'TSLA', 'CMG', 'CBOE', 'AMTM', 'DAY', 'PM', 'XYL', 'MSCI', 'AVGO', 'CTLT', 'HWM', 'KKR', 'GM', 'LYB', 'AWK', 'DFS', 'INVH', 'PANW', 'META', 'FSLR', 'CDW', 'DOW', 'BR', 'TRGP', 'SYF', 'CRWD', 'GDDY', 'TDG', 'UAL', 'HII', 'NWS', 'TMUS', 'ALLE', 'SOLV', 'NOW', 'ANET', 'ULTA', 'SMCI']: YFPricesMissingError('$%ticker%: possibly delisted; no price data found  (1d 2000-01-01 -> 2005-11-30) (Yahoo error = "Data doesn\'t exist for startDate = 946702800, endDate = 1133326800")')
['BF.B']: YFPricesMissingError('$%ticker%: possibly delisted; no price data found  (1d 2000-01-01 -> 2005-11-30)')
['BRK.B']: YFTzMissingError('$%ticker%: possibly delisted; no timezone found')
CPU times: user 10.9 s, sys: 1.43 s, total: 12.3 s
Wall time: 45.3 s
[7]:
df
[7]:
Price Adj Close ... Volume
Ticker A AAPL ABBV ABNB ABT ACGL ACN ADBE ADI ADM ... WTW WY WYNN XEL XOM XYL YUM ZBH ZBRA ZTS
Date
2000-01-03 00:00:00+00:00 43.463020 0.843077 NaN NaN 8.288178 1.215037 NaN 16.274673 28.214649 6.248421 ... NaN 973700 NaN 2738600 13458200 NaN 3033493 NaN 1055700 NaN
2000-01-04 00:00:00+00:00 40.142929 0.771997 NaN NaN 8.051378 1.208433 NaN 14.909400 26.787292 6.183335 ... NaN 1201700 NaN 425200 14510800 NaN 3315031 NaN 522450 NaN
2000-01-05 00:00:00+00:00 37.652863 0.783293 NaN NaN 8.036577 1.320692 NaN 15.204174 27.178352 6.085704 ... NaN 1184600 NaN 500200 17485000 NaN 4642602 NaN 612225 NaN
2000-01-06 00:00:00+00:00 36.219181 0.715509 NaN NaN 8.317783 1.307485 NaN 15.328291 26.435350 6.118246 ... NaN 1307700 NaN 344100 19461600 NaN 3947658 NaN 263925 NaN
2000-01-07 00:00:00+00:00 39.237442 0.749401 NaN NaN 8.406582 1.380124 NaN 16.072981 27.178352 6.215880 ... NaN 1728000 NaN 469500 16603800 NaN 6063647 NaN 333900 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2005-11-22 00:00:00+00:00 21.471949 2.004027 NaN NaN 12.146412 5.756105 19.911514 33.459999 24.157951 15.514473 ... 408757.0 1194300 1844900.0 1165200 17084900 NaN 4531322 2415350.0 460300 NaN
2005-11-23 00:00:00+00:00 21.520235 2.021802 NaN NaN 12.044474 5.811047 19.681894 33.860001 23.992666 15.584385 ... 269573.0 583400 1241600.0 956500 12535100 NaN 4187466 1782003.0 346600 NaN
2005-11-25 00:00:00+00:00 21.544378 2.088985 NaN NaN 12.041378 5.782519 20.019140 33.910000 24.170670 15.584385 ... 71385.0 387000 248400.0 278900 6692600 NaN 1553191 806284.0 78300 NaN
2005-11-28 00:00:00+00:00 21.520235 2.098625 NaN NaN 12.032114 5.689543 19.875635 33.209999 23.642996 15.419132 ... 249716.0 866800 1201400.0 961600 15008400 NaN 3960455 2082145.0 369200 NaN
2005-11-29 00:00:00+00:00 21.423653 2.051627 NaN NaN 11.788074 5.592340 20.263098 32.400002 23.687513 15.406424 ... 484219.0 1447800 1319400.0 1213900 18592400 NaN 3747632 2305346.0 375500 NaN

1486 rows × 3018 columns

[8]:
df = df.Close

[10]:
df = df.dropna(axis=1)
[11]:
df.describe()
[11]:
Ticker A AAPL ABT ACGL ADBE ADI ADM ADP ADSK AEE ... WM WMB WMT WRB WST WY XEL XOM YUM ZBRA
count 1486.000000 1486.000000 1486.000000 1486.000000 1486.000000 1486.000000 1486.000000 1486.000000 1486.000000 1486.000000 ... 1486.000000 1486.000000 1486.000000 1486.000000 1486.000000 1486.000000 1486.000000 1486.000000 1486.000000 1486.000000
mean 23.861300 0.664605 20.344236 3.293209 22.208247 44.042129 14.695453 36.782113 14.772824 43.275805 ... 25.615674 16.373156 17.748793 3.963122 8.042779 56.940855 19.694273 43.131734 10.797933 32.594770
std 14.722275 0.475175 2.607677 1.213181 7.128621 15.386976 4.234437 6.614726 11.025208 5.389743 ... 4.202009 10.380676 1.424192 1.900309 2.778387 7.526926 5.790209 7.630408 4.161873 12.469408
min 7.761087 0.234286 13.298689 1.270833 8.350000 18.290001 7.916667 21.610996 4.984375 27.812500 ... 13.062500 0.717376 14.163333 0.855967 4.137500 36.687500 5.660000 30.270000 4.268512 15.194444
25% 15.901288 0.340223 18.642854 1.851111 16.902500 35.717500 12.032500 32.202368 7.605000 40.459844 ... 23.812500 7.803502 16.697500 2.473251 6.016250 50.782499 16.497500 37.857499 7.712527 22.313195
50% 19.041488 0.420268 20.186284 3.368334 20.655001 40.254999 13.610000 35.152567 9.265313 42.535000 ... 26.710000 14.498332 17.751666 3.462058 6.803750 57.750000 18.195001 41.258905 10.334292 27.413333
75% 25.041129 0.969029 22.353882 4.308056 28.640000 47.857500 16.665000 41.784565 18.675000 46.054999 ... 28.900000 26.710028 18.686458 5.410370 9.635625 63.500000 25.590625 45.097499 13.588246 43.784167
max 113.733910 2.487857 25.901581 6.255556 41.625000 101.375000 25.320000 54.622288 47.599998 56.290001 ... 32.380001 36.435467 22.833332 9.392593 14.915000 74.125000 31.639999 64.980003 19.166067 61.889999

8 rows × 352 columns

[101]:
# Compute daily returns
daily_returns = df.pct_change().dropna()

# Compute cumulative returns
cumulative_returns =  daily_returns.cumsum()

# Plot cumulative returns
plt.figure(figsize=(5, 5))
plt.plot(cumulative_returns)
plt.show()
../../_images/material_part16_notebook_44_0.png
[78]:
daily_returns.values[-1].shape
[78]:
(352,)
[84]:
plt.figure(figsize=(5,5))
plt.hist(daily_returns.values[-1],bins=50,density=True,alpha=0.7,color='black');
plt.xlabel("final returns")
[84]:
Text(0.5, 0, 'final returns')
../../_images/material_part16_notebook_46_1.png
[12]:
from scipy.stats import mode

tickers = sorted(list(set(df.columns.values)))                   # sorted list of unique tickers
tkrlens = [len(df[tkr][(df.index>df.index[0]) & (df.index<df.index[-1])]) for tkr in tickers]  # find lengths of times series for each ticker
tkrmode = mode(tkrlens)[0]

[13]:
# tkrmode,tkrlens
int(tkrmode)

[13]:
1484
[14]:
# idenfity tickers whose lenghts equal the mode and call these good tickers
good_tickers = [tickers[i] for i,tkr in enumerate(tkrlens) if tkrlens[i]==int(tkrmode)]


[15]:
len(good_tickers)
[15]:
352
[34]:
import numpy as np
import pandas as pd

rtndf = pd.DataFrame()  # initialize a return dataframe
std_lst = [] # list of standard deviations of return time series


# Initialize containers for results
std_lst = []
normalized_returns = []

# Loop through tickers
for tkr in good_tickers:
    # Select relevant data for the current ticker
    tmpdf = df[tkr][(df.index > df.index[0]) & (df.index < df.index[-1])]

    # Compute returns
    tmprtndf = ((tmpdf - tmpdf.shift(1)) / tmpdf).dropna()

    # Compute and store standard deviation of returns
    std_lst.append(np.std(tmprtndf))

    # Normalize returns to have mean 0 and std 1
    rsdf = (tmprtndf - tmprtndf.mean()) / tmprtndf.std()

    # Collect normalized returns
    normalized_returns.append(rsdf)

# Combine all normalized returns into a single DataFrame
rtndf = pd.concat(normalized_returns, axis=1)

# Assign column names for clarity
rtndf.columns = good_tickers


[35]:
rtndf = rtndf.dropna(axis=1)
[36]:
rtndf
[36]:
A AAPL ABT ACGL ADBE ADI ADM ADP ADSK AEE ... WM WMB WMT WRB WST WY XEL XOM YUM ZBRA
Date
2000-01-05 00:00:00+00:00 -1.652114 0.339111 -0.094890 4.533024 0.494402 0.391044 -0.863298 -0.469242 -2.154160 2.964119 ... 0.171534 0.939150 -1.043349 -0.959997 -0.222766 2.522511 1.329485 3.275063 0.217536 0.656975
2000-01-06 00:00:00+00:00 -0.976707 -2.213384 1.784607 -0.590885 0.209925 -0.698878 0.260417 0.668884 -2.079548 -0.320116 ... -0.371835 0.354974 0.568648 2.964085 -0.550886 2.241927 -0.318732 3.113565 -0.420512 -2.002760
2000-01-07 00:00:00+00:00 1.987224 1.059422 0.559123 2.789052 1.173289 0.723180 0.806858 1.120238 3.345925 1.160135 ... -2.088760 0.433407 3.599164 -0.924117 0.529835 -1.686104 0.013379 -0.197399 -1.084342 -0.392793
2000-01-10 00:00:00+00:00 1.483355 -0.416753 -0.371708 1.699401 0.942561 2.075901 -0.019386 1.207366 -0.405292 -0.615250 ... 0.179665 -0.267385 -0.928873 -0.586748 1.045775 -0.207362 0.013379 -0.911440 1.788797 1.139295
2000-01-11 00:00:00+00:00 -0.320440 -1.258685 -0.756382 0.447644 -1.854867 -1.075436 -0.858814 -0.971833 -0.877784 -0.169490 ... -0.580316 -0.151340 -0.750856 -0.945843 -1.525405 -0.518909 0.013379 0.178791 -0.337250 0.238076
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2005-11-21 00:00:00+00:00 0.081205 0.145864 -0.689956 0.268425 -0.122601 0.307816 -0.105377 -0.269489 1.524217 0.196095 ... 0.317202 0.226646 0.141722 -0.009257 -0.231146 0.531092 -0.314097 1.188517 0.274958 0.759514
2005-11-22 00:00:00+00:00 0.452270 0.550249 -1.405676 -0.284010 0.103800 0.231241 -0.148685 -0.494847 0.940656 0.318441 ... -0.419457 0.374999 0.607321 0.819670 -1.212837 0.224354 0.224016 0.298565 0.197704 0.075627
2005-11-23 00:00:00+00:00 0.087384 0.207454 -0.444021 0.462745 0.303559 -0.154691 0.216604 -0.199132 -0.604925 0.606260 ... -0.173555 -0.370975 0.391365 0.245149 -0.258263 -0.488602 0.560124 0.212561 0.374536 0.067912
2005-11-25 00:00:00+00:00 0.058808 0.753891 -0.011320 -0.312457 0.043060 0.210892 -0.019386 0.309780 -0.315012 0.736627 ... -0.008719 0.293798 -0.062294 -0.173453 0.406501 -0.058102 0.200910 0.243391 -0.543512 0.007384
2005-11-28 00:00:00+00:00 0.001752 0.109287 -0.038407 -0.927121 -0.525170 -0.550410 -0.583155 -0.748739 0.196496 0.145696 ... -0.340571 -0.552984 -0.481010 -0.526534 -0.806800 0.080393 0.050847 -1.492546 0.027047 -0.592475

1483 rows × 352 columns

[102]:
rtndf.plot(figsize=(5,5),alpha=0.5,legend=False);
plt.grid(alpha=0.3);
plt.show();
../../_images/material_part16_notebook_54_0.png
[20]:
rtndf.columns = good_tickers
cmat = rtndf.corr()                   # compute correlation matrix
evls, evcs = np.linalg.eig(cmat)      # compute eigenvalue/vector decomposition of matrix

[21]:
N = len(evls)
T = len(rtndf)

q = N/T
Q = 1/q
print(N,T,q,Q)
352 1483 0.23735670937289277 4.213068181818182
[33]:
import matplotlib.pyplot as plt
# plt.hist([val for val in evls if val < 1.5],bins=50,density=True);
# Plot Marchenko-Pastur distribution
sig2 = 0.55
lambda_plus = (1 + np.sqrt(q))**2*sig2
lambda_minus = (1 - np.sqrt(q))**2*sig2
x = np.linspace(lambda_minus, lambda_plus, 1000)
mp_density = (1 / (2 * np.pi * q * x * sig2)) * np.sqrt((lambda_plus - x) * (x - lambda_minus))


plt.figure(figsize=(5,5))
plt.plot(x, mp_density, 'r-', lw=2, label='Marchenko-Pastur')

plt.legend()

_ = plt.hist( evls[evls<3] ,bins=100,density=True,color='black',alpha=0.7)
# plt.yscale('log')
plt.xlabel(r"$\lambda$",fontsize=16)
plt.ylabel(r"$\rho(\lambda)$",fontsize=16)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

../../_images/material_part16_notebook_57_0.png

16.4.2. Market simulation

Simulate daily return time series drawn from a heavy tailed distribution. We chose the Student’s t distribution with degree of freedom parameter 5.

Let us simulate \(N=500\) stocks over \(T=2000\) trading days.

[100]:

import matplotlib.pyplot as plt
import numpy as np
deg_freedom = 5  # Student's t distribution degree of freedom parameter
num_srs = 500    # number of time series to generate
num_pts = 2000   # number of points per time series
rpaths = np.random.standard_t(deg_freedom,size=(num_pts,num_srs))  # generate paths
evalplot = np.linalg.eig(pd.DataFrame(rpaths).corr())[0]                  # compute eigenvalues

plt.figure(figsize=(5,5))
_ = plt.plot(rpaths.cumsum(axis=0),alpha=0.2)





../../_images/material_part16_notebook_59_0.png
[75]:
rpaths[-1].shape
[75]:
(500,)
[80]:
plt.figure(figsize=(5,5))
plt.hist(rpaths[-1]/100,bins=50,density=True,alpha=0.7,color='black');
plt.xlabel("final returns")

[80]:
Text(0.5, 0, 'final returns')
../../_images/material_part16_notebook_61_1.png
[90]:
plt.figure(figsize=(5,5))
plt.hist(evalplot,bins=75,density=True,color='black',alpha=0.7);
# Marchenko-Pastur parameters
Q = num_pts / num_srs  # Aspect ratio
print(Q)
lambda_plus = (1 + np.sqrt(1 / Q))**2
lambda_minus = (1 - np.sqrt(1 / Q))**2
x = np.linspace(lambda_minus, lambda_plus, 1000)

def marchenko_pastur_pdf(x, q):
    """Marchenko-Pastur probability density function."""
    return (Q / (2 * np.pi)) * np.sqrt((lambda_plus - x) * (x - lambda_minus)) / x
# Compute Marchenko-Pastur density
mp_pdf = marchenko_pastur_pdf(x, q)
plt.plot(x, mp_pdf, color='red', lw=2, label='Marchenko-Pastur')

plt.xlabel(r"$\lambda$",fontsize=16)
plt.ylabel(r"$\rho(\lambda)$",fontsize=16)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
4.0
../../_images/material_part16_notebook_62_1.png

Simulation with market mode

[93]:
import matplotlib.pyplot as plt
import numpy as np
deg_freedom = 5  # Student's t distribution degree of freedom parameter
num_srs = 500    # number of time series to generate
num_pts = 2000   # number of points per time series
# Generate random paths
rpaths = np.random.standard_t(deg_freedom, size=(num_pts, num_srs))

market_influence = 0.1  # Strength of the market influence

# Generate a market factor (common mode)
market_factor = np.random.normal(0, 1, size=(num_pts, 1))

# Add market influence to each path
rpaths_with_market = rpaths + market_influence * market_factor

# Compute the correlation matrix and its eigenvalues
correlation_matrix = pd.DataFrame(rpaths_with_market).corr()
evalplot = np.linalg.eig(correlation_matrix)[0]  # Eigenvalues

[98]:
plt.figure(figsize=(5,5))
_ =plt.plot(rpaths_with_market.cumsum(axis=0),alpha=0.2)
../../_images/material_part16_notebook_65_0.png
[95]:
plt.figure(figsize=(5,5))
plt.hist(evalplot,bins=75,density=True,color='black',alpha=0.7);
# Marchenko-Pastur parameters
Q = num_pts / num_srs  # Aspect ratio
print(Q)
lambda_plus = (1 + np.sqrt(1 / Q))**2
lambda_minus = (1 - np.sqrt(1 / Q))**2
x = np.linspace(lambda_minus, lambda_plus, 1000)

def marchenko_pastur_pdf(x, q):
    """Marchenko-Pastur probability density function."""
    return (Q / (2 * np.pi)) * np.sqrt((lambda_plus - x) * (x - lambda_minus)) / x
# Compute Marchenko-Pastur density
mp_pdf = marchenko_pastur_pdf(x, q)
plt.plot(x, mp_pdf, color='red', lw=2, label='Marchenko-Pastur')

plt.xlabel(r"$\lambda$",fontsize=16)
plt.ylabel(r"$\rho(\lambda)$",fontsize=16)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
4.0
../../_images/material_part16_notebook_66_1.png

The SP500 correlation looks very much like a Marchenko Pastur distribution, plus some information (or signal). To extract the non-noisy part of the stock correlation, we could fit the Marchenko Pastur distribution to the empirical eigenvalues and remove the eigenvalues that are compatible with the Marchenko Pastur distribution.

With this idea, we can denoise the correlation matrix, and hope to build robust portfolios. To learn more about this you can read A First Course in Random Matrix Theory, Potters & Bouchaud (2020).

16.5. Application to Medical Imaging

In vivo imaging of the brain tissue has been revolutionized by the development of diffusion Magnetic Resonance Imaging (dMRI) and tractography. Thermal noise corrupts dMRI measurements and propagates to the quality of the reconstructed images, hampering it.

In 2016, Veraart et al. proposed a denoising method based on the Marchenko Pastur distribution, which fast and optimal in the sense that it provides an optimal compromise between noise suppression and loss of anatomical information.

The method is called the Marchenko Pastur Principal Component Analysis denoising algorithm and is implemented in the DIPY package. See here.

The idea is to construct a data matrix \(\tilde{\mathbf{H}}\) based on the original matrix after nullifying all eigenvalues that are smaller than the \(\lambda_+\) predicted by the Marchenko Pastur distribution (i.e., the pure-noise eigenvalues). This is a prametric denoising method, where both the noise level \(\sigma^2\) and the number of principal components \(P\) are estimated from the data.

The reduction in noise is given by

\[\sigma_{\text{denoised}}^2 = \frac{P}{N}\sigma^2\]

where \(P\) is the number of principal components that are retained and \(N\) the dimension of the data vector. (The number eigenvalues described by the Marchenko Pastur distribution is \(N-P\), they correspond to those that are nulled.)

See Denoising of diffusion MRI using random matrix theory, Veraart et al. (2016) for details.

16.6. Application to Learning Algorithms

Refer to Random Matrix Methods for Machine Learning, Couillet & Liao (2023) and the online ICML tutorial: Random Matrix Theory and Machine Learning.