Physical and digital books, media, journals, archives, and databases.
Results include
  1. Large-scale and high-dimensional statistical learning methods and algorithms

    Qian, Junyang
    [Stanford, California] : [Stanford University], 2020

    In the past two decades, many areas such as genomics, neuroscience, economics, and Internet services have been producing increasingly big datasets that have high dimension, large sample size, or both. This provides unprecedented opportunities to retrieve and infer valuable information from the data. Meanwhile, it also poses new challenges for statistical methodologies and computational algorithms. On the one hand, we want to formulate a reasonable model that captures the desired structures and improves the quality of statistical estimation and inference. On the other hand, in the face of increasingly large datasets, computation can be a major hurdle to arriving at meaningful conclusions. This thesis stands at the intersection of these two topics, proposing statistical methods that capture desired structures in the data and seeking scalable approaches to optimizing the computation for very large datasets. We propose a scalable and flexible framework for solving large-scale sparse regression problems with the lasso/elastic-net, and a scalable framework for solving sparse reduced-rank regression in the presence of multiple correlated responses and other nuances such as missing values. Optimized implementations are developed for genomics data in the PLINK 2.0 format in the R packages snpnet and multiSnpnet, respectively. The two methods are demonstrated on the very large and ultrahigh-dimensional UK Biobank studies and show significant improvement over traditional predictive modeling methods. In addition, we consider a different class of high-dimensional problems: heterogeneous causal effect estimation. Unlike the setting of supervised learning, the main challenge of such problems is that the historical data never reveal the counterfactual outcome, so we have no access to the ground-truth difference among treatments. We propose adaptations of nonparametric statistical learning methods, in particular gradient boosting and multivariate adaptive regression splines, to the estimation of treatment effects based on the available predictors. The implementation is packaged in the R package causalLearning.
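    The lasso core of the sparse-regression framework described above can be illustrated with plain cyclic coordinate descent. This is a minimal sketch under the standard lasso objective, not the optimized batch-screening strategy snpnet itself uses; the `lasso_cd` function and its signature are invented here for illustration:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Minimal lasso via cyclic coordinate descent on the objective
    (1/2)||y - X b||^2 + lam * ||b||_1.  Illustrative only; a scalable
    implementation would add screening rules and warm starts."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta                    # current residual
    col_sq = (X ** 2).sum(axis=0)       # per-column squared norms
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * beta[j]      # remove coordinate j's contribution
            rho = X[:, j] @ r           # partial correlation with residual
            # soft-thresholding update for coordinate j
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * beta[j]      # restore residual with new beta[j]
    return beta
```

    On noiseless synthetic data with a single active predictor, the procedure recovers the sparse coefficient vector up to the small shrinkage induced by the penalty.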

  2. Large-scale genomic inference of multiple phenotypes

    Tanigawa, Yosuke
    [Stanford, California] : [Stanford University], 2021

    Many human diseases and other observable non-disease traits are multifactorial. Some of them have shared genetic bases, yet a systematic analysis of such components across the human phenome has been challenging, limiting our understanding of the genetic factors shared across traits and their influences on disease. The emergence of genotyped cohorts with dense phenotypic information and catalogs of molecular profiles provides unprecedented opportunities. The central claim of this dissertation is that, given a sufficiently large genomic dataset with multiple phenotypes, such as a population-based genotyped cohort with dense phenotypic information or a large-scale functional genomic dataset with rich molecular phenotypes, we can typically learn the genetics of diseases better by jointly analyzing multiple traits. This dissertation consists of six chapters. In chapter 1, I introduce some of the key concepts and methods commonly used in large-scale genomic inference, provide an overview of the rest of the thesis, and summarize my other research contributions during my graduate studies. In chapters 2, 3, and 4, I introduce methods to analyze large-scale population-based genotyped cohorts, such as UK Biobank. Specifically, in chapter 2, I introduce Decomposition of Genetic Associations (DeGAs), a method that characterizes latent genetic components from genome- and phenome-wide association summary statistics. As I describe in the chapter, I analyzed genetic associations across more than 2,000 phenotypes in UK Biobank, systematically characterized latent components and their functional enrichment, and demonstrated the method's application in guiding functional follow-up experiments in adipocytes. This work was published in Nature Communications (Tanigawa*, Li* et al., 2019). Protein-altering variants that are protective against human disease provide in vivo validation of therapeutic targets. In chapter 3, I propose an approach to scan for such protein-altering variants. Using datasets comprising more than 514,000 individuals of European ancestry from two population cohorts in the UK and Finland, as well as multiple phenotyping endpoints, such as intraocular pressure and glaucoma, I report that an allelic series of rare protein-altering variants in ANGPTL7 confers protection against glaucoma. This work was published in PLOS Genetics (Tanigawa et al., 2020). Laboratory tests are often used in clinical practice to guide diagnosis and treatment plans. In chapter 4, I present a comprehensive genetic analysis of 35 blood and urine biomarkers in UK Biobank. I characterized the genetic basis of serum and urine laboratory tests and demonstrated their influences on disease. This work was published in Nature Genetics (Sinnott-Armstrong*, Tanigawa* et al., 2021). In chapter 5, I consider an analysis of molecular phenotypes, specifically focusing on transcription factors (TFs), with the aid of large-scale gene-to-phenotype information characterized from mouse knockout experiments. TFs regulate cellular context-specific functions of the genome, yet finding TFs with cell-type-specific functional importance is challenging. I co-developed WhichTF, a new method designed to address this problem. The method takes experimentally characterized open chromatin regions as input and returns a ranked list of TFs based on an integrative analysis of functional annotations in the Mammalian Phenotype Ontology, sequence conservation, and gene regulatory domain models. A manuscript describing this work (Tanigawa*, Dyer*, and Bejerano, 2019) and its revision have been submitted to a journal. I conclude in chapter 6, where I summarize the advantages of multi-trait analysis in large-scale human genetic studies and delineate future prospects.
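    At its core, the DeGAs decomposition described above is a truncated SVD of a phenome-by-genome matrix of association summary statistics. A minimal sketch, with a random toy matrix standing in for real summary statistics and a simple squared-loading contribution score (the published method's exact preprocessing and normalization may differ):

```python
import numpy as np

# Toy phenotype-by-variant matrix W of (standardized, thresholded) GWAS
# summary statistics.  Random noise here, purely for illustration.
rng = np.random.default_rng(0)
W = rng.standard_normal((20, 100))     # 20 phenotypes x 100 variants

# Truncated SVD: W ~= U_k diag(s_k) V_k^T captures latent components.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 5                                  # number of latent components kept

# Contribution of component c to phenotype i, normalized so that each
# phenotype's scores over the kept components sum to one.
contrib = (U[:, :k] * s[:k]) ** 2
phenotype_scores = contrib / contrib.sum(axis=1, keepdims=True)
```

    The analogous scores over the rows of `Vt` would characterize which variants drive each latent component.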

  3. Statistical learning for large-scale survival data

    Li, Ruilin (Researcher in bioinformatics)
    [Stanford, California] : [Stanford University], 2021

    The constantly growing population biobanks have provided scientists and researchers unprecedented opportunities to understand the genetics of human disease. Survival analysis gives insights into the association between predictors and time-to-event responses and is particularly well suited to such data. On the other hand, millions of genetic variants sequenced from hundreds of thousands of individuals also pose computational challenges. Chapters 1 and 3 of this dissertation present three methods that reduce memory requirements and improve computational speed when analyzing such data. The first method is a variable screening procedure that exploits the sparsity structure in the association between predictors and response in high-dimensional datasets, reducing the frequency of expensive I/O operations for larger-than-RAM data. The second method utilizes a 2-bits-per-entry compact representation specifically for genetic matrices, which further reduces memory requirements and makes our bandwidth-bound optimization algorithm scalable to more CPU cores. The third method combines the compact representation for genetic variants with a simplified version of the compressed sparse block format to represent genetic data with a large number of rare variants. The prediction performance of survival models suffers when the number of censored survival times is large, as happens, for example, when the survival time is defined as the age of onset of a rare disease. In Chapter 2, I provide a group-sparse regression-based algorithm to boost prediction performance on such data. This method is applicable when there are other survival responses that have a large number of observed events and are associated with the same predictors as the rare-event response. Finally, Chapter 4 provides a baseline-adjusted concordance index as a stable evaluation metric for survival models. This metric is particularly useful for evaluating stratified Cox models, as well as for model selection with cross-validation.
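    The 2-bits-per-entry idea above is simple to sketch: biallelic genotypes take values 0, 1, or 2 plus a missing code, so each entry fits in two bits and four entries pack into one byte. A minimal illustration (the `pack`/`unpack` helpers and the specific bit codes are invented here, not the actual PLINK 2.0 encoding):

```python
# 2-bit codes for biallelic genotype dosages; -1 marks a missing call.
CODES = {0: 0b00, 1: 0b01, 2: 0b10, -1: 0b11}
INV = {v: k for k, v in CODES.items()}

def pack(genotypes):
    """Pack a list of genotype codes, four per byte, little-end first."""
    out = bytearray()
    for i in range(0, len(genotypes), 4):
        byte = 0
        for j, g in enumerate(genotypes[i:i + 4]):
            byte |= CODES[g] << (2 * j)
        out.append(byte)
    return bytes(out)

def unpack(packed, n):
    """Recover the first n genotype codes from a packed buffer."""
    return [INV[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)]
```

    A matrix stored this way uses a quarter of the memory of one byte per entry, which is what makes a bandwidth-bound optimization loop over hundreds of thousands of samples feasible.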

