Episodes
Functional data are becoming increasingly common in a variety of fields, and many studies underline the importance of treating the data as functions. This has sparked growing interest in the development of statistical tools adapted to the analysis of such data: functional data analysis (FDA). The aims of FDA are largely the same as in classical statistical analysis, e.g. representing and visualizing the data, studying variability and trends, comparing different...
Published 12/03/14
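The first step in FDA, representing discrete noisy observations as smooth functions, can be sketched with a least-squares basis expansion. This is a minimal illustration only; the Fourier basis, basis size, and simulated curve below are illustrative assumptions, not details from the talk:

```python
import numpy as np

def basis_expand(t, y, n_basis=7):
    """Represent discrete observations (t_i, y_i) as a smooth function via
    least-squares projection onto a Fourier basis on [0, 1]."""
    cols = [np.ones_like(t)]
    for j in range(1, (n_basis + 1) // 2 + 1):
        cols += [np.sin(2 * np.pi * j * t), np.cos(2 * np.pi * j * t)]
    Phi = np.column_stack(cols[:n_basis])     # design matrix of basis functions
    coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return Phi, coef

rng = np.random.default_rng(4)
t = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * t) + rng.normal(0, 0.2, t.size)  # noisy curve samples
Phi, coef = basis_expand(t, y)
smooth = Phi @ coef                            # functional representation of y
```

The coefficient vector, rather than the raw samples, then serves as the finite-dimensional representation of each curve in subsequent analyses.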
The idea of selecting a model by penalizing a log-likelihood-type criterion goes back to the early seventies, with the pioneering works of Mallows and Akaike. Many consistency results for such criteria can be found in the literature. These results are asymptotic in the sense that one deals with a fixed collection of models while the number of observations tends to infinity. A non-asymptotic theory for this type of criterion has been developed in recent years that allows the size as well as the...
Published 12/03/14
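As a concrete illustration of penalized log-likelihood criteria in the Mallows/Akaike tradition, the following sketch compares AIC and BIC across polynomial regression models of increasing degree. The simulated data and the Gaussian-error form of the criteria are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-2, 2, n)
y = 1.0 + 0.5 * x - 0.7 * x**2 + rng.normal(0, 0.3, n)  # true degree: 2

def penalized_criteria(x, y, max_degree=6):
    """AIC and BIC for polynomial models of increasing degree
    (Gaussian errors, so -2 log-lik reduces to n*log(RSS/n) up to constants)."""
    n = len(y)
    scores = []
    for d in range(1, max_degree + 1):
        X = np.vander(x, d + 1)                      # design matrix, d+1 coefficients
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ beta) ** 2)
        k = d + 1                                    # number of free parameters
        aic = n * np.log(rss / n) + 2 * k            # Akaike's penalty: 2 per parameter
        bic = n * np.log(rss / n) + k * np.log(n)    # heavier penalty: log(n) per parameter
        scores.append((d, aic, bic))
    return scores

scores = penalized_criteria(x, y)
best_bic = min(scores, key=lambda s: s[2])[0]
```

The heavier log(n) penalty of BIC is what yields model-selection consistency in the classical fixed-collection asymptotics mentioned above.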
In this communication, we focus on data arriving sequentially, block by block, in a stream. A semiparametric regression model involving a common EDR (Effective Dimension Reduction) direction B is assumed in each block. Our goal is to estimate this direction at each arrival of a new block. A simple direct approach consists in pooling all the observed blocks and estimating the EDR direction by the SIR (Sliced Inverse Regression) method. But some disadvantages appear in practice, such as the storage of the...
Published 12/03/14
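The SIR method referred to above can be sketched in a few lines. This is the standard batch SIR estimator applied to pooled data, not the sequential variant discussed in the talk, and the simulated single-index model is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 2000, 5
beta = np.zeros(p); beta[0] = 1.0              # true EDR direction
X = rng.normal(size=(n, p))
y = (X @ beta) ** 3 + rng.normal(0, 0.1, n)    # single-index model

def sir(X, y, n_slices=10):
    """Sliced Inverse Regression: estimate the leading EDR direction."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    root_inv = np.linalg.inv(np.linalg.cholesky(Sigma)).T
    Z = (X - mu) @ root_inv                    # whitened predictors
    # Slice observations by the order statistics of y, average Z within slices
    slices = np.array_split(np.argsort(y), n_slices)
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)   # weighted covariance of slice means
    # Leading eigenvector of M, mapped back to the original scale
    _, vecs = np.linalg.eigh(M)
    b = root_inv @ vecs[:, -1]
    return b / np.linalg.norm(b)

b_hat = sir(X, y)
```

The storage issue raised in the abstract is visible here: batch SIR needs all past blocks to recompute the slice means, which motivates a recursive update instead.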
Consider the usual regression problem in which we want to study the conditional distribution of a response Y given a set of predictors X. Sufficient dimension reduction (SDR) methods aim at replacing the high-dimensional vector of predictors by a lower-dimensional function R(X) with no loss of information about the dependence of the response variable on the predictors. Almost all SDR methods restrict attention to the class of linear reductions, which can be represented in terms of the...
Published 12/03/14
Information visualization is a research area that focuses on making the structure and content of large, complex data sets visually understandable and interactively analyzable. The goal of information visualization tools and techniques is to increase our ability to gain insight and make decisions across many types of datasets, tasks, and analysis scenarios. As data sets continue to grow in size and complexity, the field of information visualization is steadily gaining in importance...
Published 12/03/14
In the early days of kernel machines research, the "kernel trick" was considered a useful way of constructing nonlinear learning algorithms from linear ones, by applying the linear algorithms to feature space mappings of the original data. Recently, it has become clear that a potentially more far reaching use of kernels is as a linear way of dealing with higher order statistics, by mapping probabilities to a suitable reproducing kernel Hilbert space (i.e., the feature space is an RKHS). I...
Published 12/03/14
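The idea of mapping probability distributions into an RKHS can be made concrete with the empirical maximum mean discrepancy (MMD), the distance between the kernel mean embeddings of two samples. This is a minimal (biased, V-statistic) sketch; the Gaussian kernel bandwidth and the simulated samples are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian RBF kernel matrix between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def mmd2(X, Y, gamma=0.5):
    """Squared MMD: RKHS distance between the mean embeddings of two samples."""
    return (rbf_kernel(X, X, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean()
            - 2 * rbf_kernel(X, Y, gamma).mean())

rng = np.random.default_rng(2)
same = mmd2(rng.normal(size=(500, 2)), rng.normal(size=(500, 2)))
diff = mmd2(rng.normal(size=(500, 2)), rng.normal(2.0, 1.0, size=(500, 2)))
```

Because the kernel is characteristic, the embedding captures all higher-order statistics: `same` is close to zero while `diff` is clearly positive, which is the basis of kernel two-sample tests.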
A recently introduced family of 12 probabilistic models aims to simultaneously cluster and visualize high-dimensional data. It is based on a mixture model which fits the data into a latent discriminative subspace with an intrinsic dimension bounded by the number of clusters. An estimation procedure, named the Fisher-EM algorithm, has also been proposed and turns out to outperform other subspace clustering methods in most situations. Moreover, the convergence properties of the Fisher-EM algorithm...
Published 12/03/14
Cluster analysis is concerned with finding homogeneous groups in a population. Model-based clustering methods provide a framework for developing clustering methods through the use of statistical models. This approach allows uncertainty to be quantified using probability and the properties of a clustering method to be understood on the basis of a well-defined statistical model. Mixture models provide a basis for many model-based clustering methods. Ranking data arise when judges rank...
Published 12/03/14
Cluster analysis is an important tool in a variety of scientific areas including pattern recognition, document clustering, and the analysis of microarray data. Although many clustering procedures such as hierarchical, strict partitioning and overlapping clusterings aim to construct an optimal partition of objects or, sometimes, variables, there are other methods, known as co-clustering or block clustering procedures, which consider the two sets simultaneously. In several situations, compared...
Published 12/03/14
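A minimal sketch of the block-clustering idea, using a simple alternating (double k-means style) scheme rather than any specific procedure from the talk; the planted 2x2 block structure and noise level are illustrative assumptions:

```python
import numpy as np

def coclust(A, k_rows, k_cols, n_iter=20, n_restarts=5, seed=0):
    """Alternating block clustering: rows and columns are reassigned in turn
    to the block-mean profiles that fit them best; keep the best restart."""
    rng = np.random.default_rng(seed)
    best = (np.inf, None, None)
    for _ in range(n_restarts):
        r = rng.integers(k_rows, size=A.shape[0])
        c = rng.integers(k_cols, size=A.shape[1])
        for _ in range(n_iter):
            mu = np.zeros((k_rows, k_cols))       # current block means
            for g in range(k_rows):
                for h in range(k_cols):
                    block = A[np.ix_(r == g, c == h)]
                    mu[g, h] = block.mean() if block.size else 0.0
            # Reassign each row, then each column, to its closest block profile
            r = np.argmin(((A[:, None, :] - mu[:, c][None]) ** 2).sum(-1), 1)
            c = np.argmin(((A[:, :, None] - mu[r][:, None, :]) ** 2).sum(0), 1)
        cost = ((A - mu[r][:, c]) ** 2).sum()     # within-block sum of squares
        if cost < best[0]:
            best = (cost, r, c)
    return best[1], best[2]

rng = np.random.default_rng(5)
means = np.array([[5.0, 0.0], [0.0, 5.0]])
z_r = np.repeat([0, 1], 10)                       # planted row clusters
z_c = np.repeat([0, 1], 15)                       # planted column clusters
A = means[z_r][:, z_c] + rng.normal(0, 1.0, (20, 30))
rows, cols = coclust(A, 2, 2)
```

Unlike clustering rows and columns separately, both partitions are driven by the same block means, which is the point of treating the two sets simultaneously.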
Networks are widely used to represent complex systems as sets of interactions between units of interest. For instance, regulatory networks can describe the regulation of genes by transcription factors, while metabolic networks focus on representing pathways of biochemical reactions. In social sciences, networks are commonly used to represent relational ties between actors. Numerous graph clustering algorithms have been proposed since the early work of Moreno [2]. Most of them partition...
Published 12/03/14
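One standard family of graph clustering methods works spectrally. The generic sketch below (not a specific algorithm from the talk) recovers two planted assortative communities from the adjacency spectrum; the two-block simulation parameters are illustrative assumptions:

```python
import numpy as np

def spectral_partition(A):
    """Two-way graph partition from the adjacency spectrum: the sign pattern
    of the second-leading eigenvector separates two assortative communities."""
    _, vecs = np.linalg.eigh(A)
    v = vecs[:, -2]            # eigh sorts ascending; -2 is second-largest
    return (v > 0).astype(int)

# Planted two-community graph (stochastic block model style)
rng = np.random.default_rng(3)
n = 60
z = np.repeat([0, 1], n // 2)                     # planted communities
P = np.where(z[:, None] == z[None, :], 0.5, 0.05) # within vs between density
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1); A = A + A.T                    # symmetric, no self-loops
labels = spectral_partition(A)
```

The recovered labels match the planted communities up to a global label swap, since eigenvectors are only defined up to sign.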
We consider a classification problem: the goal is to assign class labels to an unlabeled test data set, given several labeled training data sets drawn from different but similar distributions. In essence, the goal is to predict labels from (an estimate of) the marginal distribution (of the unlabeled data) by learning the trends present in related classification tasks that are already known. In this sense, this problem belongs to the category of so-called "transfer learning" in machine...
Published 12/03/14