Seminar Series
Mondays 2-3pm (GMT) (unless otherwise specified)
Participants can join our mailing list to receive notifications about the seminar series.
Events are held in a hybrid format, allowing participants on-site at the Oxford Statistics Department to connect with off-site participants through Zoom.
The seminar series covers a wide range of topics at the intersection of machine learning and finance. Topics include network analysis, limit order books and analysis of order flows, time series forecasting, synthetic data generation, asset pricing, microstructure, news sentiment, portfolio management, high-dimensional statistics and model selection.
Please explore projects from our research group.
Talks (2023-2024)
Abstract
TBD
Abstract
The recent high-profile failures of a number of crypto firms have reignited the debate on the appropriate policy response to address the risks in crypto. The “shadow financial” functions enabled by crypto markets share many of the vulnerabilities of traditional finance and risks are often exacerbated by specific features of crypto. Authorities may consider different, and not mutually exclusive, lines of action to tackle the risks in crypto. These include (i) bans, which could tackle specific aspects of the crypto ecosystem, (ii) containment so that the real economy is insulated from crypto risks, and (iii) the regulation of the crypto sector. The paper highlights the pros and cons of the different approaches and proposes a framework to choose when bans, containment and regulation are most appropriate. In any case, central banks and public authorities could also work to make traditional finance more attractive, thereby allowing responsible innovation to thrive.
Abstract
TBD
Abstract
TBD
Abstract
TBD
Abstract
TBD
Abstract
TBD
Past Talks:
Abstract
The extant literature predicts market returns with “simple” models that use only a few parameters. Contrary to conventional wisdom, we theoretically prove that simple models severely understate return predictability compared to “complex” models in which the number of parameters exceeds the number of observations. We empirically document the virtue of complexity in US equity market return prediction. Our findings establish the rationale for modeling expected returns through machine learning.
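For a concrete feel of the "complex model" setting, the toy sketch below (not the authors' code; all data is simulated) fits a ridge regression on random Fourier features whose parameter count far exceeds the number of observations.

```python
# Toy sketch of an overparameterized ("complex") return prediction model:
# ridge regression on random Fourier features with P >> T. Simulated data only.
import numpy as np

rng = np.random.default_rng(0)
T, d, P = 120, 15, 2000                  # observations, raw signals, parameters (P >> T)
b_true = rng.standard_normal(d) * 0.05   # weak "true" predictive coefficients

def simulate(n):
    X = rng.standard_normal((n, d))
    return X, X @ b_true + rng.standard_normal(n)   # noisy "returns"

X_tr, y_tr = simulate(T)
X_te, y_te = simulate(5 * T)

W = rng.standard_normal((d, P)) / np.sqrt(d)        # random feature weights
phi = lambda X: np.sqrt(2.0 / P) * np.cos(X @ W)    # random Fourier features

lam = 1e-1
Z = phi(X_tr)
beta = np.linalg.solve(Z.T @ Z + lam * np.eye(P), Z.T @ y_tr)   # ridge fit with P > T

resid = y_te - phi(X_te) @ beta
print("out-of-sample R^2:", 1 - resid.var() / y_te.var())
```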
Abstract
We propose that investment strategies should be evaluated based on their net-of-trading-cost return for each level of risk, which we term the "implementable efficient frontier." While numerous studies use machine learning return forecasts to generate portfolios, their agnosticism toward trading costs leads to excessive reliance on fleeting small-scale characteristics, resulting in poor net returns. We develop a framework that produces a superior frontier by integrating trading-cost-aware portfolio optimization with machine learning. The superior net-of-cost performance is achieved by learning directly about portfolio weights using an economic objective. Further, our model gives rise to a new measure of "economic feature importance".
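As a minimal stand-in for the idea of choosing portfolio weights directly against a cost-aware economic objective (this is not the paper's machine learning framework), the sketch below maximizes a single-period net-of-cost mean-variance objective with a quadratic trading-cost penalty by gradient ascent on simulated inputs.

```python
# Minimal stand-in: maximize mu'w - 0.5*gamma*w'Sigma*w - kappa*||w - w_prev||^2
# over portfolio weights w by gradient ascent. All inputs are simulated placeholders.
import numpy as np

rng = np.random.default_rng(1)
N = 50
mu = rng.normal(0.0005, 0.001, N)        # hypothetical expected returns
A = rng.standard_normal((N, N))
Sigma = A @ A.T / N + 1e-4 * np.eye(N)   # hypothetical return covariance
w_prev = np.ones(N) / N                  # current holdings
gamma, kappa, lr = 5.0, 10.0, 0.01       # risk aversion, cost scale, step size

w = w_prev.copy()
for _ in range(2000):
    grad = mu - gamma * Sigma @ w - 2 * kappa * (w - w_prev)   # gradient of the objective
    w += lr * grad

net_obj = w @ mu - 0.5 * gamma * w @ Sigma @ w - kappa * np.sum((w - w_prev) ** 2)
print("net-of-cost objective:", net_obj)
```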
Abstract
This paper develops a novel method to estimate a latent factor model for a large target panel with missing observations by optimally using the information from auxiliary panel data sets. We refer to our estimator as target-PCA. Transfer learning from auxiliary panel data allows us to deal with a large fraction of missing observations and weak signals in the target panel. We show that our estimator is more efficient and can consistently estimate weak factors, which are not identifiable with conventional methods. We provide the asymptotic inferential theory for target-PCA under very general assumptions on the approximate factor model and missing patterns. In an empirical study of imputing data in a mixed-frequency macroeconomic panel, we demonstrate that target-PCA significantly outperforms all benchmark methods.
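The sketch below illustrates the simpler building block of factor-based imputation on a single simulated panel (iterative rank-r PCA); target-PCA itself additionally pools information from auxiliary panels, which is not reproduced here.

```python
# Simplified illustration (not target-PCA): impute missing panel entries with an
# iterative rank-r PCA fit on simulated factor data.
import numpy as np

rng = np.random.default_rng(2)
T, N, r = 200, 60, 3
F, L = rng.standard_normal((T, r)), rng.standard_normal((N, r))
X = F @ L.T + 0.3 * rng.standard_normal((T, N))
mask = rng.random((T, N)) < 0.7                     # ~30% of entries missing

X_hat = np.where(mask, X, 0.0)
for _ in range(50):
    U, s, Vt = np.linalg.svd(X_hat, full_matrices=False)
    low_rank = U[:, :r] * s[:r] @ Vt[:r]            # rank-r reconstruction
    X_hat = np.where(mask, X, low_rank)             # keep observed entries fixed

err = np.sqrt(np.mean((low_rank - F @ L.T)[~mask] ** 2))
print("RMSE on missing entries:", err)
```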
Abstract
In this presentation we summarize different modelling and predictive strategies for cryptocurrencies. We begin with univariate models that feature time-varying moments up to the fourth order. Then we extend the forecasting to a multivariate setting using time-varying Vector Autoregressive models. Finally, we introduce a new study that combines multivariate models and time-varying higher moments in portfolio allocation.
Abstract
We study a multi-factor block model for variable clustering and connect it to the regularized subspace clustering by formulating a distributionally robust version of the nodewise regression. To solve the latter problem, we derive a convex relaxation, provide guidance on selecting the size of the robust region, and hence the regularization weighting parameter, based on the data, and propose an ADMM algorithm for implementation. We validate our method in an extensive simulation study. Finally, we propose and apply a variant of our method to stock return data, obtain interpretable clusters that facilitate portfolio selection and compare its out-of-sample performance with other clustering methods in an empirical study. This talk is based on joint work with Xunyu Zhou and Xiao Xu.
Abstract
Supply chain business interruption has been identified as a key risk factor in recent years, with high-impact disruptions due to disease outbreaks and logistical issues such as the recent Suez Canal blockage showing how disruptions can propagate across complex emergent networks. Researchers have highlighted the importance of gaining visibility into procurement interdependencies between suppliers to develop more informed business contingency plans. However, extant methods such as supplier surveys rely on the willingness or ability of suppliers to share data and are not easily verifiable. In this article, we pose the supply chain visibility problem as a link prediction problem from the field of Machine Learning (ML) and propose the use of an automated method to detect potential links that are unknown to the buyer with Graph Neural Networks (GNN). Using a real automotive network as a test case, we show that our method performs better than existing algorithms. Additionally, we use Integrated Gradients to improve the explainability of our approach by highlighting input features that influence the GNN’s decisions. We also discuss the advantages and limitations of using GNN for link prediction, outlining future research directions.
Abstract
Overnight material news events and stock illiquidity can be potentially important sources of jumps in stock returns. We find that for the average firm in the cross-section, stock illiquidity is more likely to drive a stock return jump than either day or overnight news flow frequency and content; for larger firms, however, there is a higher likelihood that the stock return jump is driven by overnight news flow frequency. Yet our results find a larger idiosyncratic jump size associated with a higher number of day news articles than with stock illiquidity for the average and large firms. Our results show how day and overnight news flow, stock illiquidity, and order flow are reflected in stock return jumps and idiosyncratic jump risk.
Abstract
The presence of time series momentum has been widely documented in financial markets across asset classes and countries. In this study, we find a predictable pattern of the realized semivariance estimators for the returns of commodity futures, particularly during the reversals of time series momentum. Based on this finding, we propose a rule-based time series momentum strategy that has a statistically significantly higher Sharpe ratio compared to the benchmark of the original time series momentum strategy in the out-of-sample data. The results are robust to different subsamples, lookback windows, volatility scaling, execution lag, and transaction cost.
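The snippet below shows, on simulated returns, how realized semivariances and a plain time series momentum signal are computed; the paper's specific semivariance-based reversal rule is not reproduced.

```python
# Illustrative sketch only: realized semivariances and a basic time series
# momentum (TSMOM) signal on simulated daily futures returns.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
ret = pd.Series(rng.normal(0, 0.01, 1000))          # hypothetical daily futures returns

lookback = 252
rs_down = ret.where(ret < 0, 0.0).pow(2).rolling(lookback).sum()   # downside semivariance
rs_up = ret.where(ret > 0, 0.0).pow(2).rolling(lookback).sum()     # upside semivariance
tsmom_signal = np.sign(ret.rolling(lookback).sum())                # sign of trailing return

strat = tsmom_signal.shift(1) * ret                  # trade yesterday's signal on today's return
print("annualized Sharpe:", strat.mean() / strat.std() * np.sqrt(252))
```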
Abstract
We develop spectral volume models to systematically estimate, explain, and exploit the high-frequency periodicity in intraday trading activities using Fourier analysis. The framework consistently recovers periodicities at specific frequencies in three steps, despite their low signal-to-noise ratios. This reveals important and universal high-frequency periodicities across 2,573 stocks in the United States (US) and Chinese markets over a full year. The dominant frequencies are at 10 seconds, 15 seconds, 20 seconds, 30 seconds, 1 minute, and 5 minutes for the US market and 1 minute, 2.5 minutes, 5 minutes, and 10 minutes for the Chinese market. They each explain from 1.5 to 10 percent of the variance of de-trended intraday volumes on average. Through three different perspectives, we provide statistically significant evidence that this phenomenon is driven by trading algorithms that rely on periodic information arrivals, rather than trading cost considerations. Finally, we demonstrate the practical value of uncovering these high-frequency periodicities in improving intraday volume predictions, which leads to potential economic gains in intraday execution strategies.
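A minimal illustration of detecting intraday volume periodicity with an FFT (not the paper's three-step framework), run on a simulated 1-second volume series with a planted 1-minute cycle:

```python
# Minimal sketch: FFT of de-trended intraday volume to surface dominant periods.
# The volume series is simulated with a planted 60-second cycle.
import numpy as np

rng = np.random.default_rng(4)
n, dt = 6 * 3600, 1.0                            # 6 hours of 1-second volume bars
t = np.arange(n)
volume = 100 + 10 * np.sin(2 * np.pi * t / 60) + rng.normal(0, 20, n)

detrended = volume - volume.mean()
spec = np.abs(np.fft.rfft(detrended)) ** 2       # periodogram
freqs = np.fft.rfftfreq(n, d=dt)                 # cycles per second

top = np.argsort(spec[1:])[-3:] + 1              # three largest peaks, skipping frequency zero
print("dominant periods (seconds):", np.round(1 / freqs[top], 1))
```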
Abstract
Understanding stock market instability is a key question in financial management as practitioners seek to forecast breakdowns in asset comovements which expose portfolios to rapid and devastating collapses in value. The structure of these comovements can be described as a graph where companies are represented by nodes and edges capture correlations between their price movements. Learning a timely indicator of comovement breakdowns (manifested as modifications in the graph structure) is central in understanding both financial stability and volatility forecasting. We propose to use the edge reconstruction accuracy of a graph autoencoder (GAE) as an indicator for how spatially homogeneous connections between assets are, which, based on financial network literature, we use as a proxy to infer market volatility. Our experiments on the S&P 500 over the 2015–2022 period show that higher GAE reconstruction error values are correlated with higher volatility. We also show that out-of-sample autoregressive modeling of volatility is improved by the addition of the proposed measure. Our paper contributes to the literature of machine learning in finance, particularly in the context of understanding stock market instability.
Abstract
Neural networks that are able to reliably execute algorithmic computation may hold transformative potential to both machine learning and theoretical computer science. On one hand, they could enable the kind of extrapolative generalisation scarcely seen with deep learning models. On another, they may allow for running classical algorithms on inputs previously considered inaccessible to them. Over the past few years, the pace of development in this area has gradually become intense. As someone who has been very active in its latest incarnation, I have witnessed these concepts grow from isolated toy experiments, through NeurIPS spotlights, all the way to helping detect patterns in complicated mathematical objects (published on the cover of Nature) and supporting the development of generalist reasoning agents. In this talk, I will give my personal account of this journey, and especially how our own interpretation of this methodology, and understanding of its potential, changed with time. It should be of interest to a general audience interested in graphs, (classical) algorithms, reasoning, and building intelligent systems.
Abstract
We consider the estimation of causal effects in panel data settings. During a given time period, one observes units of interest and stores the realized outcomes into a matrix. At a fixed point in time, a subset of the units is exposed to an irreversible treatment, i.e., the data matrix of treated units has a block structure. The objective is to design an estimator for the counterfactual outcomes of the block of treated units. For large sample sizes and under typical statistical settings, we show that the use of matrix completion (MC) estimators for counterfactual recovery yields phase transition (PT) phenomena, where it is possible to distinguish regions of the parameter space where a perfect estimation of the counterfactual is possible from those where it is not. We determine the separating line (the so-called phase transition (PT) curve) between the regions, and show that it admits a closed form expression that directly relates time series and cross-sectional heterogeneity among units to the number of untreated units and the initial time of the treatment. Our methodology is designed to handle settings where the starting time of the treatment and the number of control (untreated) units are not necessarily identical, i.e., where the block of counterfactuals in the matrix of control outcomes is a rectangular matrix. We support our theoretical analysis with numerical simulations, which further indicate that an exact counterfactual recovery is attainable even for fairly small sample sizes. (joint work with Mijailo Stojnic).
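As a rough illustration of counterfactual recovery via matrix completion (the paper's phase-transition analysis is not reproduced), the sketch below imputes a treated block of a simulated panel with iterative singular value thresholding.

```python
# Illustrative sketch only: recover the counterfactual block of a treated panel
# with a simple soft-impute routine (iterative singular value thresholding).
import numpy as np

rng = np.random.default_rng(5)
T, N, r = 100, 40, 2
M = rng.standard_normal((T, r)) @ rng.standard_normal((r, N))   # untreated outcomes
obs = np.ones((T, N), dtype=bool)
obs[60:, 30:] = False                      # treated block: last 40 periods, last 10 units

X = np.where(obs, M + 0.1 * rng.standard_normal((T, N)), 0.0)
Z, tau = X.copy(), 1.0
for _ in range(200):
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    Z_low = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt           # shrink singular values
    Z = np.where(obs, X, Z_low)                                  # keep observed entries

print("counterfactual RMSE:", np.sqrt(np.mean((Z_low[~obs] - M[~obs]) ** 2)))
```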
Abstract
This paper proposes a method for model determination in ultra-high dimensional cointegrated systems where the cross-section dimension m can even largely exceed the sample size T. For such ultra-high dimensional cases, we require an adequate non-standard pre-screening step which we develop for the nonstationary cointegration vector but also for the stationary loading matrix. We prove that identified sets for the non-zero loadings and the cointegration space contain the respective true sets with high probability. A feasible algorithm is provided, making the technique easily accessible for practitioners. In a second step, we employ reduced rank regression based on the pre-selected set of variables, and show the cointegration rank selection consistency of the overall procedure. In order to achieve consistent rank selection, we propose a tailored information criterion which is also of general interest for factor models when both strong and weak factors are present. Results of the simulation study demonstrate competitive performance of the proposed methodology. In an empirical study with 1045 NASDAQ stocks, the proposed methodology allows for large-scale multivariate predictive regression for the entire system.
Abstract
Any lead-lag effect in an asset pair implies the future returns on the lagging asset have the potential to be predicted from past and present prices of the leader, thus creating statistical arbitrage opportunities. We utilize robust lead-lag indicators to uncover the origin of price discovery and we propose an econometric model exploiting that effect with level 1 data of limit order books (LOB). We also develop a high-frequency trading strategy based on the model predictions to capture arbitrage opportunities. The framework is then evaluated on six months of DAX 30 cross-listed stocks' LOB data obtained from three European exchanges in 2013 -- Xetra, Chi-X, and BATS. We show that a high-frequency trader can profit from lead-lag relationships because of predictability, even when trading costs, latency, and execution-related risks are considered.
Keywords: Lead-lag relationship, High-frequency trading, Statistical arbitrage, Limit order book, Cross-listed stocks, Econometric models.
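A toy version of lead-lag detection (not the paper's robust indicators): scan lagged correlations between two simulated return series, one of which is built to lag the other by two ticks.

```python
# Toy lead-lag scan on simulated data: series B follows series A with a 2-tick delay.
import numpy as np

rng = np.random.default_rng(6)
n, true_lag = 5000, 2
a = rng.normal(0, 1, n + true_lag)
b = 0.4 * a[:-true_lag] + rng.normal(0, 1, n)    # B depends on A two ticks earlier
a = a[true_lag:]                                 # align both series to length n

max_lag = 10
lags = list(range(-max_lag, max_lag + 1))

def corr_at_lag(k):
    """corr(A_t, B_{t+k}); positive k means A leads B."""
    if k >= 0:
        return np.corrcoef(a[: n - k], b[k:])[0, 1]
    return np.corrcoef(a[-k:], b[: n + k])[0, 1]

cc = [corr_at_lag(k) for k in lags]
print("best lag (positive means A leads B):", lags[int(np.argmax(np.abs(cc)))])
```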
Abstract
In Cong, Feng, He, and He (2022), we develop a new class of tree-based models (P-Tree) for analyzing (unbalanced) panel data utilizing global (instead of local) split criteria that incorporate economic guidance to guard against overfitting while preserving interpretability. We grow a P-Tree top-down to split the cross section of asset returns to construct stochastic discount factors and test assets, generalizing sequential security sorting and visualizing (asymmetric) nonlinear interactions among firm characteristics and macroeconomic states. Data-driven P-Tree models reveal that idiosyncratic volatility and earnings-to-price ratio interact to drive cross-sectional return variations in U.S. equities; market volatility and inflation constitute the most critical regime-switching that asymmetrically interacts with characteristics. P-Trees outperform most known observable and latent factor models in pricing individual stocks and test portfolios, while delivering transparent trading strategies and risk-adjusted investment outcomes (e.g., out-of-sample annualized Sharpe ratios of about 3 and monthly alpha around 0.8%). Time-permitting, I will briefly discuss Cong, Feng, He, and Li (2022) --- a further development of the panel tree framework for jointly clustering asset returns and modeling heterogeneous factor pricing under a Bayesian framework.
Abstract
Recent studies suggest that networks among firms (sectors) play an essential role in asset pricing. However, it is challenging to capture and investigate the implications of networks due to the continuous evolution of networks in response to market micro and macro changes. This paper combines two state-of-the-art machine learning techniques to develop an end-to-end graph neural network model and shows its applicability in asset pricing. First, we apply the graph attention mechanism to learn dynamic network structures of the equity market over time and then use a recurrent convolutional neural network to diffuse and propagate firms’ fundamental information into the learned networks. Our model is efficient in both return prediction and portfolio performance. The result persists in different sensitivity tests and simulated data. We also show that the dynamic network learned from our model is able to capture major market events over time.
Abstract
In the last decade, the arrival of new forms of social media has drastically increased the amount of personal data generated online. The massive amount of data available has created many opportunities for industry and research. In particular, increasing numbers of quantitative investors have started to rely on alternative data to adapt their positions in the market. However, it is still unclear whether aggregated online data could generate excess returns in active investing and allow refining positions on the stock market. In the present talk, we propose to tackle the question by focusing on three underlying themes. First, we will introduce one of the first viable approaches to the estimation of individual-level ideological positions derived from social media content. Second, we will show how a consensus model can be used to predict opinion evolution in online collective behaviour and how the "wisdom of the crowd" relates to group influence. Finally, we will explore whether aggregated opinion signals have the potential to predict financial fundamentals and build an edge on the market.
Abstract
Estimating high dimensional covariance matrices for portfolio optimization is challenging because the number of parameters to be estimated grows quadratically in the number of assets. When the matrix dimension exceeds the sample size, the sample covariance matrix becomes singular. A possible solution is to impose a (latent) factor structure for the cross-section of asset returns as in the popular capital asset pricing model. Recent research suggests dimension reduction techniques to estimate the factors in a data-driven fashion. We present an asymmetric autoencoder neural network-based estimator that incorporates the factor structure in its architecture and jointly estimates the factors and their loadings. We test our method against well established dimension reduction techniques from the literature and compare them to observable factors as benchmark in an empirical experiment using stock returns of the past five decades. Results show that the proposed estimator is very competitive, as it significantly outperforms the benchmark across most scenarios. Analyzing the loadings, we find that the constructed factors are related to the stocks’ sector classification.
Abstract
We propose a Degree-Corrected Block Model with Dependent Multivariate Poisson edges (DCBM-DMP) to study stock co-jump dependency. To estimate the community structure, we extend the SCORE algorithm in Jin (2015) and develop a Spectral Clustering On Ratios-of-Eigenvectors for networks with Dependent Multivariate Poisson edges (SCORE-DMP) algorithm. We prove that SCORE-DMP enjoys strong consistency in community detection. Empirically, using high-frequency data of S&P 500 constituents, we construct two co-jump networks according to whether the market jumps and find that they exhibit different community features than GICS. We further show that the co-jump networks help in stock return prediction.
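For intuition, the sketch below runs vanilla SCORE (Jin, 2015), not the SCORE-DMP extension, on a simulated degree-corrected block model with Poisson edge counts; it requires scikit-learn for the k-means step.

```python
# Sketch of vanilla SCORE: cluster nodes of a simulated degree-corrected block
# model with Poisson edge counts using ratios of leading eigenvectors.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
n, K = 300, 3
labels = rng.integers(0, K, n)
theta = rng.uniform(0.5, 1.5, n)                       # degree heterogeneity
B = np.full((K, K), 0.3) + 0.7 * np.eye(K)             # block connectivity
rates = np.outer(theta, theta) * B[labels][:, labels]  # expected edge counts
A = rng.poisson(rates)
A = np.triu(A, 1)
A = A + A.T                                            # symmetric count matrix

vals, vecs = np.linalg.eigh(A)
lead = vecs[:, np.argsort(np.abs(vals))[-K:]]          # K leading eigenvectors
first = lead[:, -1]                                    # eigenvector of largest |eigenvalue|
safe_first = np.where(np.abs(first) < 1e-8, 1e-8, first)
ratios = lead[:, :-1] / safe_first[:, None]            # SCORE: entrywise ratios
est = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(ratios)
print("estimated community sizes:", np.bincount(est))
```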
Abstract
We learn from data that volatility is mostly path-dependent. Up to 90% of the variance of the implied volatility of equity indexes is explained endogenously by past index returns, and up to 65% for (noisy estimates of) future daily realized volatility. The path-dependency that we uncover is remarkably simple: a linear combination of a weighted sum of past daily returns and the square root of a weighted sum of past daily squared returns, with different time-shifted power-law weights capturing both short and long memory. This simple model, which is homogeneous in volatility, is shown to consistently outperform existing models across equity indexes and train/test sets for both implied and realized volatility. It suggests a simple continuous-time path-dependent volatility (PDV) model that may be fed historical or risk-neutral parameters. The weights can be approximated by superpositions of exponential kernels to produce Markovian models. In particular, we propose a 4-factor Markovian PDV model which captures all the important stylized facts of volatility, produces very realistic price and volatility paths, and jointly fits SPX and VIX smiles remarkably well. We thus show, for the first time, that a continuous-time Markovian parametric stochastic volatility (actually, PDV) model can practically solve the joint S&P 500/VIX smile calibration problem.
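The two path-dependent features described above are straightforward to compute. The sketch below builds R1 (a power-law weighted sum of past daily returns) and sqrt(R2) (the square root of a similarly weighted sum of past squared returns) on simulated data and fits a linear model; the kernel parameters and volatility proxy are arbitrary choices for illustration, not the paper's calibrated values.

```python
# Illustrative construction of the R1 and sqrt(R2) features with time-shifted
# power-law weights, followed by a linear fit of a crude volatility proxy.
import numpy as np

rng = np.random.default_rng(8)
T, lookback = 2000, 500
ret = rng.normal(0, 0.01, T)                 # placeholder daily index returns
vol = np.abs(ret) * np.sqrt(252)             # crude volatility proxy for the target

lags = np.arange(1, lookback + 1)
def power_law_weights(alpha, delta):
    w = (lags + delta) ** (-alpha)           # time-shifted power-law kernel
    return w / w.sum()

w1, w2 = power_law_weights(1.8, 10.0), power_law_weights(1.4, 10.0)   # arbitrary parameters

def features(t):
    past = ret[t - lookback : t][::-1]       # past[k-1] is the return k days ago
    return w1 @ past, np.sqrt(w2 @ past ** 2)

R1, sqrtR2 = np.array([features(t) for t in range(lookback, T)]).T
X = np.column_stack([np.ones_like(R1), R1, sqrtR2])
beta, *_ = np.linalg.lstsq(X, vol[lookback:], rcond=None)
print("fitted coefficients (const, R1, sqrt(R2)):", beta)
```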
Abstract
Industry classification schemes provide a taxonomy for segmenting companies based on their business activities. They are relied upon in industry and academia as an integral component of many types of financial and economic analysis. However, even modern classification schemes have failed to embrace the era of big data and remain a largely subjective undertaking prone to inconsistency and misclassification. To address this, we propose a multimodal neural model for training company embeddings, which harnesses the dynamics of both historical pricing data and financial news to learn objective company representations that capture nuanced relationships. We explain our approach in detail and highlight the utility of the embeddings through several case studies and application to the downstream task of industry classification.
Abstract
Financial economics and econometrics literature demonstrate that limit order book data is useful in predicting short-term volatility in stock markets. In this paper, we are interested in forecasting short-term realized volatility in a multivariate approach based on limit order book data and relational stock market networks. To achieve this goal, we introduce a Graph Transformer Network for Volatility Forecasting. The model allows combining limit order book features and a large number of temporal and cross-sectional relations from different sources. Through experiments based on about 500 stocks from the S&P 500 index, we find better performance for our model than for other benchmarks.
Abstract
In light of micro-scale inefficiencies induced by the high degree of fragmentation of the Bitcoin trading landscape, we utilize a granular data set comprised of orderbook and trades data from the most liquid Bitcoin markets, in order to understand the price formation process at sub-1 second time scales. To achieve this goal, we construct a set of features that encapsulate relevant microstructural information over short lookback windows. These features are subsequently leveraged first to generate a leader-lagger network that quantifies how markets impact one another, and then to train linear models capable of explaining between 10% and 37% of total variation in 500ms future returns (depending on which market is the prediction target). The results are then compared with those of various PnL calculations that take trading realities, such as transaction costs, into account. The PnL calculations are based on natural taker strategies (meaning they employ market orders) that we associate to each model. Our findings emphasize the role of a market’s fee regime in determining its propensity to being a leader or a lagger, as well as the profitability of our taker strategy. Taking our analysis further, we also derive a natural maker strategy (i.e., one that uses only passive limit orders), which, due to the difficulties associated with backtesting maker strategies, we test in a real-world live trading experiment, in which we turned over 1.5 million USD in notional volume. Lending additional confidence to our models, and by extension to the features they are based on, the results indicate a significant improvement over a naive benchmark strategy, which we also deploy in a live trading environment with real capital, for the sake of comparison.
Abstract
Many high-dimensional problems involve reconstruction of a low-rank matrix from incomplete and corrupted observations. Despite substantial progress in designing efficient estimation algorithms, it remains largely unclear how to assess the uncertainty of the obtained low-rank estimates, and how to construct valid yet short confidence intervals for the unknown low-rank matrix. In this talk, I will discuss how to perform inference and uncertainty quantification for two examples of low-rank models, (1) heteroskedastic PCA with missing data, and (2) noisy matrix completion. For both problems, we identify statistically efficient estimators that admit non-asymptotic distributional characterizations, which in turn enable optimal construction of confidence intervals for, say, the unseen entries of the low-rank matrix of interest. All this is accomplished by a powerful leave-one-out analysis framework that originated from probability and random matrix theory. This is based on joint work with Yuling Yan, Cong Ma, and Jianqing Fan.
Biography: Yuxin Chen is currently an associate professor in the Department of Statistics and Data Science at the University of Pennsylvania. Before joining UPenn, he was an assistant professor of electrical and computer engineering at Princeton University. He completed his Ph.D. in Electrical Engineering at Stanford University, and was also a postdoc scholar at Stanford Statistics. His current research interests include high-dimensional statistics, nonconvex optimization, and reinforcement learning. He has received the Alfred P. Sloan Research Fellowship, the ICCM best paper award (gold medal), the AFOSR and ARO Young Investigator Awards, the Google Research Scholar Award, and was selected as a finalist for the Best Paper Prize for Young Researchers in Continuous Optimization. He has also received the Princeton Graduate Mentoring Award.
Abstract
(Volatility forecasting) We apply machine learning models to forecast intraday realized volatility (RV), by exploiting commonality in intraday volatility via pooling stock data together, and by incorporating a proxy for the market volatility. Neural networks dominate linear regressions and tree models in terms of performance, due to their ability to uncover and model complex latent interactions among variables. Our findings remain robust when we apply trained models to new stocks that have not been included in the training set, thus providing new empirical evidence for a universal volatility mechanism among stocks. Finally, we propose a new approach to forecasting one-day-ahead RVs using past intraday RVs as predictors, and highlight interesting diurnal effects that aid the forecasting mechanism. The results demonstrate that the proposed methodology yields superior out-of-sample forecasts over a strong set of traditional baselines that only rely on past daily RVs.
Abstract
Spectral methods are simple but powerful approaches for extracting information from noisy data and have been widely used in various applications. In this talk, we demystify the success of spectral methods by establishing sharp theoretical guarantees for their performance in clustering and synchronization. (1) The first part of the talk is about a novel singular subspace perturbation analysis for spectral clustering. We consider two arbitrary matrices where one is a leave-one-column-out submatrix of the other one and establish a new perturbation upper bound for the distance between their corresponding singular subspaces. Powered by this tool, we obtain an explicit exponential error rate for the performance of spectral clustering in sub-Gaussian mixture models. (2) The second part of the talk is about the exact minimax optimality of a spectral method in the phase synchronization problem with additive Gaussian noises and incomplete data. We prove that it achieves the minimax lower bound of the problem with a matching leading constant under a squared l2 loss. This shows that the spectral method has the same performance as more sophisticated procedures including maximum likelihood estimation, generalized power method, and semidefinite programming, when consistent parameter estimation is possible.
Biography: Anderson Ye Zhang is an assistant professor in the Department of Statistics and Data Science at the University of Pennsylvania. Before joining Penn, he was a William H. Kruskal Instructor in the Department of Statistics at the University of Chicago. He completed his Ph.D. in Statistics and Data Science at Yale University. His research includes spectral analysis, network analysis, clustering, ranking, and synchronization.
Abstract
We propose a general modeling and algorithmic framework for discrete structure recovery that can be applied to a wide range of problems. Under this framework, we are able to study the recovery of clustering labels, ranks of players, signs of regression coefficients, cyclic shifts, and even group elements from a unified perspective. A simple iterative algorithm is proposed for discrete structure recovery, which generalizes methods including Lloyd's algorithm and the power method. A linear convergence result for the proposed algorithm is established in this paper under appropriate abstract conditions on stochastic errors and initialization. We illustrate our general theory by applying it to several representative problems, (1) clustering in Gaussian mixture model, (2) approximate ranking, (3) sign recovery in compressed sensing, (4) multireference alignment, and (5) group synchronization, and show that the minimax rate is achieved in each case.
Biography: Chao Gao is an Assistant Professor in Statistics at the University of Chicago.
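The simplest instance of the framework described in this abstract is Lloyd's algorithm for a Gaussian mixture; the sketch below runs those iterations on simulated two-cluster data from a crude initialization.

```python
# Lloyd's iterations on a simulated two-component Gaussian mixture, starting
# from a crude one-coordinate initialization.
import numpy as np

rng = np.random.default_rng(9)
n, d = 400, 5
truth = rng.integers(0, 2, n)
centers = np.array([np.full(d, 1.5), np.full(d, -1.5)])
X = centers[truth] + rng.standard_normal((n, d))

labels = (X[:, 0] > 0).astype(int)                  # crude initialization
for _ in range(20):
    mu = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])   # update centers
    dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)    # refine labels
    labels = dists.argmin(axis=1)

acc = max(np.mean(labels == truth), np.mean(labels != truth))      # accuracy up to label swap
print("clustering accuracy:", acc)
```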
Abstract
Globally, capital markets have gone through a paradigm shift towards complete automation through artificial intelligence, turning it into a highly competitive area at the intersection of statistical models from various branches of machine learning. A principled understanding of the interactions between statistical models that operate in a common environment will soon be a key success factor for leaders in the field. In this talk I will first discuss the unique challenges of capital markets through the lens of machine learning and then provide an overview of how Borealis AI addresses them from an atomistic and a holistic point of view. In the second part of the talk I will focus on our recent work on continuous-time modeling of irregular time-series and describe an expressive differential deformation of the Wiener process using neural ordinary differential equations. Finally, we will see how an augmentation of this model with a latent process driven by a stochastic differential equation can further increase the flexibility of this system and allows us to capture non-Markovian dynamics.
Biography: Andreas Lehrmann is a machine learning researcher at Borealis AI. Previously, he held postdoctoral positions at Facebook Reality Labs and Disney Research. He received his Ph.D. at ETH Zurich and the Max-Planck-Institute for Intelligent Systems under a Microsoft Research scholarship.
Abstract
How many samples are needed to accurately learn the covariance matrix, C, of a distribution over d-dimensional vectors? In modern data applications where d is large, the answer is often unacceptably high: the sample complexity of covariance learning inherently depends poorly on dimension. In this talk I will discuss efforts to address this issue by designing data collection methods and learning algorithms which reduce complexity by leveraging a priori knowledge about the covariance matrix. Specifically, I will discuss the setting when C is known to have Toeplitz structure. Toeplitz covariance matrices arise in many applications, from time series analysis, to wireless communications, to medical imaging. In many of these applications, data collection is expensive, so reducing sample complexity is an important goal. We will start by taking a fresh look at classical and widely used algorithms, including methods based on selecting samples according to a sparse ruler. Then, I will introduce a novel sampling and estimation strategy that improves on existing methods in many settings. Our new approach for learning Toeplitz structured covariance utilizes tools from random matrix sketching, non-linear approximation theory, and sparse Fourier transform algorithms. It fits into a broader line of work which seeks to address problems in active learning using tools from theoretical computer science and randomized numerical linear algebra.
Biography: Chris Musco is an Assistant Professor at New York University in the Tandon School of Engineering.
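The classical sparse-ruler idea mentioned in this abstract can be sketched in a few lines (this is not the talk's new estimator): only vector entries at ruler positions are ever read, yet their pairwise index differences cover every lag of the Toeplitz covariance.

```python
# Sparse-ruler estimation of a 10 x 10 Toeplitz covariance: read only entries at
# ruler positions {0,1,2,3,6,9}, whose pairwise differences cover lags 0..9.
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(10)
d, n = 10, 5000
true_acov = 0.7 ** np.arange(d)                    # AR(1)-style autocovariance
C = toeplitz(true_acov)
X = rng.multivariate_normal(np.zeros(d), C, size=n)

ruler = [0, 1, 2, 3, 6, 9]
samples = X[:, ruler]                              # only these entries are "read"

acov_hat = np.zeros(d)
counts = np.zeros(d)
for i, a in enumerate(ruler):
    for j in range(i, len(ruler)):
        lag = ruler[j] - a
        acov_hat[lag] += np.mean(samples[:, i] * samples[:, j])
        counts[lag] += 1
acov_hat /= counts                                 # average the estimates for each lag

print("max abs autocovariance error:", np.max(np.abs(acov_hat - true_acov)))
print("estimated Toeplitz covariance:\n", np.round(toeplitz(acov_hat), 2))
```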
Abstract
Estimated covariance matrices are widely used to construct portfolios with variance-minimizing optimization, yet the embedded sampling error produces portfolios with systematically underestimated variance. This effect is especially severe when the number of securities greatly exceeds the number of observations. In this high dimension low sample size (HL) regime, we show that a dispersion bias in the leading eigenvector of the estimated covariance matrix is a material source of distortion in the minimum variance portfolio. We correct the bias with the data-driven GPS (Global Positioning System) shrinkage estimator, which improves with the size of the market, and which is structurally identical to the James-Stein estimator for a collection of averages. We illustrate the power of the GPS estimator with a numerical example, and conclude with open problems that have emerged from our research.
Biography: Lisa Goldberg is Professor of the Practice of Economics at the University of California, Berkeley. She is the co-director of the Berkeley Consortium for Data Analytics in Risk. She is Head of Research at Aperio Group, now part of BlackRock.
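A generic illustration of the HL-regime problem described in this abstract (this is not the GPS estimator): with far fewer observations than assets, the minimum variance portfolio built from the sample covariance reports a much smaller variance than it actually bears.

```python
# Demonstration of variance underestimation in the HL regime: compare the
# sample-based and true variances of a sample-covariance min-variance portfolio.
import numpy as np

rng = np.random.default_rng(11)
N, T = 200, 60                                        # assets >> observations
beta = rng.normal(1.0, 0.3, N)
idio = rng.uniform(0.1, 0.3, N)
C_true = 0.04 * np.outer(beta, beta) + np.diag(idio ** 2)   # one-factor covariance

X = rng.multivariate_normal(np.zeros(N), C_true, size=T)
C_hat = np.cov(X, rowvar=False)                       # rank-deficient sample covariance

ones = np.ones(N)
w = np.linalg.pinv(C_hat) @ ones
w /= ones @ w                                         # fully-invested min-variance weights

print("estimated variance:", w @ C_hat @ w)           # what the optimizer believes
print("true variance:     ", w @ C_true @ w)          # what the portfolio actually bears
```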
Abstract
What will happen to Y if we do A? A variety of meaningful social and engineering questions can be formulated this way: What will happen to a patient’s health if they are given a new therapy? What will happen to a country’s economy if policy-makers legislate a new tax? What will happen to a data center’s latency if a new congestion control protocol is used? We explore how to answer such counterfactual questions using observational data---which is increasingly available due to digitization and pervasive sensors---and/or very limited experimental data. The two key challenges are: (i) counterfactual prediction in the presence of latent confounders; (ii) estimation with modern datasets which are high-dimensional, noisy, and sparse. The key framework we introduce is connecting causal inference with tensor completion. In particular, we represent the various potential outcomes (i.e., counterfactuals) of interest through an order-3 tensor. The key theoretical results presented are: (i) Formal identification results establishing under what missingness patterns, latent confounding, and structure on the tensor is recovery of unobserved potential outcomes possible. (ii) Introducing novel estimators to recover these unobserved potential outcomes and proving they are finite-sample consistent and asymptotically normal. The efficacy of our framework is shown on high-impact applications. These include working with: (i) TauRx Therapeutics to identify patient sub-populations where their therapy was effective. (ii) Uber Technologies on evaluating the impact of driver engagement policies without running an A/B test. (iii) The Poverty Action Lab at MIT to make personalized policy recommendations to improve childhood immunization rates across villages in Haryana, India. Finally, we discuss connections between causal inference, tensor completion, and offline reinforcement learning.
Biography: Anish is currently a postdoctoral fellow at the Simons Institute at UC Berkeley. He did his PhD at MIT in EECS where he was advised by Alberto Abadie, Munther Dahleh, and Devavrat Shah. His research focuses on designing and analyzing methods for causal machine learning, and applying it to critical problems in social and engineering systems. He currently serves as a technical consultant to TauRx Therapeutics and Uber Technologies on questions related to experiment design and causal inference. Prior to the PhD, he was a management consultant at Boston Consulting Group. He received his BSc and MSc at Caltech.
Recording
About The Seminar
Organizers: Mihai Cucuringu, Christopher Policastro, Chao Zhang
Acknowledgements: Website template from the Stanford MLSys Seminar Series