Research

I have diverse interests in statistics and machine learning. My dissertation and current research focus on problems of classification, variable selection, high dimensional inference and dimension reduction. Beyond these theoretical interests, I am keenly interested in statistical applications in medical settings such as electronic medical records (EMRs) and personalized medicine. Below is a list of selected papers on these topics, each followed by a brief summary.


Selected Papers and Preprints

Property testing in high dimensional Ising models, with H. Liu [arXiv]. This paper explores the information-theoretic limitations of graph property testing in zero-field Ising models. Instead of learning the entire graph structure, it is sometimes more relevant, and more attainable, to test a basic graph property such as connectivity, the presence of a cycle, or the maximum clique size. Understanding the statistical complexity of property testing requires distinguishing between ferromagnetic (i.e., positive interactions only) and general Ising models. Using combinatorial constructs such as graph packing and strong monotonicity, we characterize how the target property affects the corresponding minimax upper and lower bounds for ferromagnets. Conversely, by studying the detection of an antiferromagnetic (i.e., negative interactions only) Curie-Weiss model buried in Rademacher noise, we show that property testing is strictly harder over general Ising models. In terms of methodology, we propose two types of correlation-based tests: computationally efficient screening tests for ferromagnets, and score-type tests for general models, including a fast cycle presence test.
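As a rough illustration of the screening idea, the sketch below computes the maximum absolute pairwise spin correlation and compares it to a threshold on the \sqrt{\log d / n} scale; the threshold constant and the reject-if-large rule are illustrative assumptions, not the exact tests from the paper.

```python
import numpy as np

def correlation_screening_stat(X):
    """Max absolute empirical pairwise correlation over node pairs.

    X : (n, d) array of +/-1 spins sampled from an Ising model.
    """
    n, d = X.shape
    R = (X.T @ X) / n            # empirical correlations E[X_j X_k]
    np.fill_diagonal(R, 0.0)     # ignore the trivial diagonal
    return np.abs(R).max()

def screening_test(X, c=2.0):
    """Reject the 'no structure' null when the max correlation clears a
    sqrt(log d / n)-scale threshold; the constant c is an illustrative choice."""
    n, d = X.shape
    return correlation_screening_stat(X) > c * np.sqrt(np.log(d) / n)
```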

Combinatorial inference for graphical models, with J. Lu and H. Liu [arXiv]. We propose a new family of combinatorial inference problems for graphical models. Unlike classical statistical inference, where the main interest is point estimation or parameter testing, combinatorial inference aims at testing the global structure of the underlying graph. Examples include testing the connectivity of the graph, the presence of a cycle of a certain size, or the maximum degree of the graph. We first develop a unified theory for the fundamental limits of a large family of combinatorial inference problems, proposing new concepts, including structural packing and buffer entropies, to characterize how the complexity of combinatorial graph structures impacts the corresponding minimax lower bounds. We then propose a family of novel, practical structural testing algorithms that match these lower bounds. We provide thorough numerical results on both synthetic graphical models and brain networks to illustrate the usefulness of the proposed methods.

[Figure: illustrations of two tested structures, connectivity and clique]

Adaptive inferential method for monotone graph invariants, with J. Lu and H. Liu [arXiv]. We consider the problem of undirected graphical model inference. In many applications, instead of perfectly recovering the unknown graph structure, a more realistic goal is to infer some graph invariants (e.g., the maximum degree, the number of connected subgraphs, the number of isolated nodes). In this paper, we propose a new inferential framework for testing nested multiple hypotheses and constructing confidence intervals for the unknown graph invariants under undirected graphical models.

On the characterization of a class of Fisher-consistent loss functions and its application to boosting, with J. S. Liu and T. Cai [pdf] [JMLR]. Motivated by classification problems that arise in EMR settings, we propose a generic and robust boosting procedure that can work with non-convex loss functions. One key contribution is relaxing the convexity required by Zou et al., since non-convex loss functions are less susceptible to outliers. We prove that for Fisher consistency it suffices for a loss function \phi to satisfy
\phi(x) - \phi(x') \geq (g(x) - g(x'))\,k(x') \quad \text{for all } x \in \mathbb{R},\; x' \in \{z \in \mathbb{R} : k(z) \leq 0\},
where g and k are increasing and continuous functions satisfying further properties. This condition relaxes convexity, which appears as the special case where g is the identity.
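For intuition, here is a minimal sketch of margin-based functional gradient boosting under one illustrative non-convex loss, \phi(v) = 1 - \tanh(v); the loss, the stump base learner, and the step size are expository assumptions, not the exact procedure studied in the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def phi_grad(margin):
    # derivative of the non-convex margin loss phi(v) = 1 - tanh(v)
    return -(1.0 - np.tanh(margin) ** 2)

def boost(X, y, n_rounds=100, lr=0.1):
    """Functional gradient boosting for labels y in {-1, +1}."""
    F = np.zeros(len(y))                      # current additive fit
    learners = []
    for _ in range(n_rounds):
        resid = -y * phi_grad(y * F)          # negative functional gradient
        stump = DecisionTreeRegressor(max_depth=1).fit(X, resid)
        learners.append(stump)
        F += lr * stump.predict(X)
    return learners

def predict(learners, X, lr=0.1):
    F = sum(lr * h.predict(X) for h in learners)
    return np.sign(F)
```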

Support recovery for single index models in high dimensions, with Q. Lin and J. S. Liu [arXiv] [AMSA]. In this paper we study the support recovery problem for single index models Y=f(\boldsymbol{X}^{\intercal}\boldsymbol{\beta},\varepsilon), where f is an unknown link function, \boldsymbol{X}\sim N_p(0,\mathbb{I}_{p}) and \boldsymbol{\beta} is an s-sparse unit vector with \beta_{i}\in \{\pm\frac{1}{\sqrt{s}},0\}. In particular, we study the performance of two computationally inexpensive algorithms, diagonal thresholding sliced inverse regression and a semidefinite programming (SDP) relaxation, and show that support recovery succeeds as long as the rescaled sample size \frac{n}{s \log p} is sufficiently large.
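A minimal sketch of the diagonal thresholding idea follows: large diagonal entries of the sliced inverse regression (SIR) matrix flag support coordinates. The noise-level threshold and its constant are illustrative assumptions, not the calibration from the paper.

```python
import numpy as np

def dt_sir(X, y, n_slices=10):
    """Diagonal of the SIR matrix Cov(E[X | slice of Y]): slice the data by
    the order of y, average X within each slice, and take coordinatewise
    variances of the slice means."""
    order = np.argsort(y)
    M = np.vstack([X[idx].mean(axis=0)
                   for idx in np.array_split(order, n_slices)])
    return ((M - M.mean(axis=0)) ** 2).mean(axis=0)

def dt_sir_support(X, y, n_slices=10, c=3.0):
    """Keep coordinates whose diagonal SIR entry exceeds a multiple of the
    pure-noise level n_slices / n; the constant c is an illustrative choice."""
    n, p = X.shape
    return np.flatnonzero(dt_sir(X, y, n_slices) > c * n_slices / n)
```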

L_1-regularized least squares for support recovery of high dimensional single index models with Gaussian designs, with J. S. Liu and T. Cai [arXiv] [JMLR]. We consider single index models of the type Y=f(\boldsymbol{X}^{\intercal}\boldsymbol{\beta},\varepsilon), where the predictor comes from a Gaussian design \boldsymbol{X} \sim N_p(0, \boldsymbol{\Sigma}). We show, somewhat surprisingly, that fitting L_1-regularized least squares recovers the support of the single index model at the optimal rescaled sample size, provided that the covariance satisfies the irrepresentable condition and the key condition \mbox{Cov}(Y, \boldsymbol{X}^{\intercal}\boldsymbol{\beta}) \neq 0 holds.
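The phenomenon is easy to simulate: below, the LASSO is fit to data generated with a nonlinear link, yet it still recovers the support. The link, penalty level, and problem sizes are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 500, 1000, 5
beta = np.zeros(p)
beta[:s] = 1.0 / np.sqrt(s)
X = rng.standard_normal((n, p))                 # identity covariance design
y = np.tanh(X @ beta) + 0.1 * rng.standard_normal(n)   # unknown link f

lam = 0.5 * np.sqrt(np.log(p) / n)              # rate-level penalty choice
fit = Lasso(alpha=lam).fit(X, y)
print(sorted(np.flatnonzero(fit.coef_)))        # ideally the indices 0..4
```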

Surrogate aided unsupervised recovery of sparse signals in single index models for binary outcomes, with A. Chakrabortty, R. Carroll and T. Cai [arXiv]. This work is motivated by modern studies involving large databases such as electronic medical records, where the outcome of interest Y is binary (e.g., a disease indicator) and unobserved. In addition to observing a predictor vector \boldsymbol{X}, we assume that a surrogate variable S for the disease status Y is available. Importantly, the surrogate is only assumed to be accurately predictive of the disease status in the extremities of its support. Modeling the relationships of Y and S with the predictors via single index models, we obtain sharp finite sample guarantees on recovering the signal, up to proportionality, by simply fitting a least squares LASSO estimator to a subset of the observed data.
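A heavily simplified sketch of the strategy: keep only observations whose surrogate lies in the extremes of its support, where it is assumed accurate for Y, and fit a least squares LASSO there. The pseudo-labeling rule, quantile cutoff, and penalty are illustrative assumptions, not the paper's procedure or tuning.

```python
import numpy as np
from sklearn.linear_model import Lasso

def surrogate_lasso(X, S, q=0.1, lam=0.05):
    """Fit a LASSO on the subset with extreme surrogate values; the
    coefficient vector is interpreted up to proportionality to beta."""
    lo, hi = np.quantile(S, [q, 1 - q])
    keep = (S <= lo) | (S >= hi)
    y_pseudo = np.where(S[keep] >= hi, 1.0, -1.0)  # proxy for unobserved Y
    return Lasso(alpha=lam).fit(X[keep], y_pseudo).coef_
```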

A unified theory of confidence regions and testing for high dimensional estimating equations, with Y. Ning, J. S. Liu and H. Liu [arXiv]. We propose a new inferential framework for constructing confidence regions and testing hypotheses in statistical models specified by a system of high dimensional estimating equations. We construct an influence function by projecting the fitted estimating equations onto a sparse direction obtained by solving a large-scale linear program. Our main theoretical contribution is a unified Z-estimation theory of confidence regions for high dimensional problems. As a result, our approach provides valid inference for a broad class of high dimensional constrained estimating equation problems that are not covered by existing methods. Examples include noisy compressed sensing, instrumental variable regression, undirected graphical models, discriminant analysis and vector autoregressive models.
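To give a concrete feel for the projection step, here is a CLIME-style linear program standing in for the paper's large-scale LP: it finds a sparse direction w with \| \hat{\Sigma} w - e_j \|_\infty \leq \lambda by minimizing \|w\|_1. The specific constraint matrix and the tuning parameter are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def sparse_projection(Sigma, j, lam):
    """Minimize ||w||_1 subject to ||Sigma w - e_j||_inf <= lam,
    via the standard split w = u - v with u, v >= 0."""
    p = Sigma.shape[0]
    e = np.zeros(p)
    e[j] = 1.0
    c = np.ones(2 * p)                     # objective: sum(u) + sum(v)
    A = np.hstack([Sigma, -Sigma])         # A @ [u; v] = Sigma @ (u - v)
    A_ub = np.vstack([A, -A])
    b_ub = np.concatenate([lam + e, lam - e])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * p))
    u, v = res.x[:p], res.x[p:]
    return u - v
```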

Agnostic estimation for misspecified phase retrieval models, with Z. Wang and H. Liu [pdf] [NIPS 2016]. This article considers a significant semi-parametric generalization of the sparse phase retrieval model Y = (\boldsymbol{X}^{\intercal} \boldsymbol{\beta})^2 + \varepsilon, where \boldsymbol{X}\sim N_d(0,\mathbb{I}_{d}). The so-called misspecified phase retrieval (MPR) models take the form Y=f(\boldsymbol{X}^{\intercal}\boldsymbol{\beta},\varepsilon) and satisfy \mbox{Cov}(Y, (\boldsymbol{X}^{\intercal}\boldsymbol{\beta})^2) > 0. We propose a procedure for agnostic estimation in MPR models that solves a cascade of two convex programs and provably recovers the direction of \boldsymbol{\beta}. Furthermore, we prove that our algorithm is minimax optimal over the class of MPR models. Interestingly, our minimax analysis characterizes the statistical price of misspecifying the link function in phase retrieval models. Below is a schematic illustration of the two steps of our algorithm, showing how the second step estimate \hat{\boldsymbol{\beta}} improves on the first step estimate \hat{\mathbf{v}}.
[Figure: schematic of the two-step algorithm, from the first step estimate \hat{\mathbf{v}} to the second step estimate \hat{\boldsymbol{\beta}}]
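As a rough stand-in for the first step (not the paper's convex programs), the sketch below exploits the same second-moment signal directly: under a Gaussian design and \mbox{Cov}(Y, (\boldsymbol{X}^{\intercal}\boldsymbol{\beta})^2) > 0, the top eigenvector of \frac{1}{n}\sum_i y_i \boldsymbol{x}_i\boldsymbol{x}_i^{\intercal} - \bar{y}\,\mathbb{I}_d is proportional to \boldsymbol{\beta}.

```python
import numpy as np

def direction_estimate(X, y):
    """Spectral heuristic for the direction of beta in an MPR model:
    top eigenvector of (1/n) sum_i y_i x_i x_i^T - ybar * I."""
    n, d = X.shape
    M = (X.T * y) @ X / n - y.mean() * np.eye(d)
    vals, vecs = np.linalg.eigh(M)      # eigenvalues in ascending order
    return vecs[:, -1]                  # eigenvector of the largest one
```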

Kernel machine testing for risk prediction with stratified case cohort studies, with R. Payne, M. K. Jensen and T. Cai [Biometrics]. In this article, we propose inverse probability weighted (IPW) variance component tests for identifying important marker sets through a Cox proportional hazards kernel machine regression framework under a case cohort (CCH) sampling design. The CCH design introduces significant analytical complexity due to outcome-dependent, finite-population sampling. The proposed IPW test statistics have complex null distributions that cannot easily be approximated explicitly, and the correlation induced by CCH sampling renders standard resampling methods such as the bootstrap invalid for approximating the null distribution. We therefore propose a novel perturbation resampling scheme that effectively recovers the induced correlation structure.
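The generic perturbation idea, sketched below under simplifying assumptions: the test statistic is a quadratic form in a sum of per-subject score contributions, and each replicate reweights those contributions by iid N(0,1) multipliers. This is an illustration of the resampling principle, not the paper's exact CCH scheme.

```python
import numpy as np

def perturbation_null(score_contribs, n_perturb=1000, seed=0):
    """score_contribs : (n, q) array of per-subject score contributions
    under the null.  Returns n_perturb replicates of the quadratic-form
    statistic, computed from N(0,1)-reweighted score sums."""
    rng = np.random.default_rng(seed)
    stats = np.empty(n_perturb)
    for b in range(n_perturb):
        w = rng.standard_normal(score_contribs.shape[0])
        perturbed = (w[:, None] * score_contribs).sum(axis=0)
        stats[b] = perturbed @ perturbed
    return stats
# p-value: fraction of replicates exceeding the observed statistic
```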

Kernel machine score test for pathway analysis in the presence of semi-competing risks, with B. Hejblum and J. Sinnott [SMMR] [CRAN]. In cancer studies, patients often experience two different types of events: a non-terminal event such as recurrence or metastasis, and a terminal event such as cancer-specific death. Identifying pathways and networks of genes associated with one or both of these events is an important step in understanding disease development and targeting new biological processes for potential intervention. In this paper we propose a combined testing procedure for a pathway's association with both the cause-specific hazard of recurrence and the marginal hazard of death. The dependency between the two outcomes is accounted for through perturbation resampling to approximate the test's null distribution, without any further assumption on the nature of the dependency.
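As a generic illustration of the kernel machine score statistic underlying such tests (not the package's exact implementation), one forms a quadratic form of null-model residuals in the pathway's kernel matrix:

```python
import numpy as np

def km_score_stat(residuals, K):
    """Quadratic form r^T K r, where r are residuals from the null model
    (no pathway effect) and K is the pathway's gram matrix; the combined
    test aggregates such statistics across the two outcomes and calibrates
    them by perturbation resampling."""
    r = np.asarray(residuals, dtype=float)
    return float(r @ K @ r)
```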

Classification of CT pulmonary angiography reports by presence, chronicity, and location of pulmonary embolism with natural language processing, with S. Yu, K.K. Kumamaru, E. George, R.M. Dunne, A. Bedayat, A.R. Hunsaker, K.E. Dill, T. Cai, and F.J. Rybicki [JBI]. In this paper we describe an efficient tool based on natural language processing (NLP) for classifying the detailed state of pulmonary embolism (PE) recorded in CT pulmonary angiography reports. The classification tasks are: PE present vs. absent, acute PE vs. others, central PE vs. others, and subsegmental PE vs. others. Statistical learning algorithms were trained with features extracted using the NLP tool and gold standard labels obtained via chart review by two radiologists. For all four tasks, the areas under the receiver operating characteristic curve (AUC) of our approach show significant advantages over standard text mining baselines such as bag-of-words Naive Bayes classifiers.
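For reference, a minimal sketch of the kind of bag-of-words Naive Bayes baseline the NLP-feature classifiers are compared against; the toy reports and labels are illustrative stand-ins for the chart-review data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reports = ["acute pulmonary embolism in the right main artery",
           "no evidence of pulmonary embolism"]
labels = [1, 0]  # PE present vs. absent

baseline = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
baseline.fit(reports, labels)
print(baseline.predict_proba(["subsegmental embolism noted"])[:, 1])
```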


Software Packages

Our [CRAN] package implements the kernel machine score test for pathway analysis in semi-competing risks settings.