An introduction to supervised and unsupervised multivariate statistical analysis¶

Manuela Salvucci, RCSI¶

2019-10-09¶

What will we cover in this session?¶

Basic data exploration
A tour on un-supervised and supervised learning algorithms
Practise with the TCGA GBM dataset

How?¶

Mix of theory and hands-on practise using jupyter notebooks (python kernel)

What's a jupyter notebook?¶

jupyter_notebook_logo

"jupyter" + "notebook"
- "jupyter": "acronym" from Julia + python + R (though many more languages are now supported)
- "notebook": collection of text, code, plots for an analysis (literate programming)
Jupyter project: https://jupyter.org/index.html
Gallery of interesting jupyter notebooks: https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks

Anatomy of a jupyter notebook¶

Notebook consists of a series of "cells"
Each cell can have text (markdown) or code (in our case, python).
A cell can be run by selecting it and pressing button at the top (or pressing ctr-enter)
When the cell is run, the output is shown underneath:

Simple expressions

In [1]:

2+2

Out[1]:

Comments

In [2]:

# Lines that start with a # are comments.

Store data in variables

In [3]:

a = 2 + 2
b = 1 + 1
a + b

Out[3]:

Cells can reference variables from prevoius cells (make sure to run the previous cells first!):

In [4]:

a - b

Out[4]:

Pandas tables¶

Variables can also be used to store whole tables with data (using the pandas library), e.g. from a spreadsheet

In [5]:

import os
import pandas
countries = pandas.read_excel(os.path.join('data', 'countries.xls'), index_col='Country')
countries

Out[5]:

	Population	Area	Capital
Country
Ireland	4784000	84421	Dublin
Italy	60590000	301338	Rome
Germany	82790000	357386	Berlin

We can access individual columns from the table

In [6]:

countries['Population']

Out[6]:

Country
Ireland     4784000
Italy      60590000
Germany    82790000
Name: Population, dtype: int64

Perform operations on single columns

In [7]:

countries['Population'].max()

Out[7]:

82790000

Or on multiple columns

In [8]:

countries['Population'] / countries['Area']

Out[8]:

Country
Ireland     56.668365
Italy      201.069895
Germany    231.654290
dtype: float64

Making plots

In [9]:

countries['Population'].plot(kind='bar');

Where can I find this material?¶

Presentation and code materials at https://bitbucket.org/manuela_s/statistical_analysis_workshop/
Binder version at http://tiny.cc/od51dz
Any question, comment, come chat to me or email me at manuelasalvucci@rcsi.ie

Exercise¶

Go to http://tiny.cc/od51dz
Click on "00_tools_and_setup_practise.py"
Play around

Continue in 10 min.

Data exploration¶

Exploratory data analysis (EDA) is a crucial part of any research project
- quality control (missing data, outlier detection, ...)
- descriptive statistics
- visualization
- insights to guide downstream analysis

data_exploration_cartoon Image from http://www.codeheroku.com/post.html?name=Introduction%20to%20Exploratory%20Data%20Analysis%20(EDA)

Data exploration¶

We will use "classical" datasets:
- clean data
- QC-ed data
- no missing values
- balanced cases

The iris dataset¶

The iris dataset (https://en.wikipedia.org/wiki/Iris_flower_data_set) was first described by Fisher in 1936
It includes measurements for 4 flower characteristics (length and width for petals and sepals) for 150 flowers from 3 different iris species

iris_species

Let's explore the iris dataset¶

To load the data, we do:

In [2]:

iris = sklearn.datasets.load_iris()
features = pandas.DataFrame(iris.data, columns=iris.feature_names)
target = pandas.Series(pandas.Categorical.from_codes(iris.target, iris.target_names), name='species')
iris = pandas.concat([features, target], axis=1)

Let's explore the iris dataset¶

Use .head() to show the first few samples.

In [3]:

iris.head()

Out[3]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

iris_petal_sepal Image from https://www.integratedots.com/determine-number-of-iris-species-with-k-means/

Let's explore the iris dataset¶

Each row represents 1 observation (i.e. 1 sample/flower)

Out[4]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5	3.6	1.4	0.2	setosa

Each column represents 1 variable(i.e. 1 type of attribute/characteristic/feature)

Out[5]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5	3.6	1.4	0.2	setosa

Let's explore the iris dataset¶

Here, "species" is our phenotype of interest which we refer to as "target class" or "target"

Out[6]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5	3.6	1.4	0.2	setosa

Summary statistics for the measurements

Out[7]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
count	150	150	150	150
mean	5.84333	3.05733	3.758	1.19933
std	0.828066	0.435866	1.7653	0.762238
min	4.3	2	1	0.1
25%	5.1	2.8	1.6	0.3
50%	5.8	3	4.35	1.3
75%	6.4	3.3	5.1	1.8
max	7.9	4.4	6.9	2.5

Let's explore the iris dataset¶

Breakdown of each species in the dataset

In [8]:

iris.species.value_counts()

Out[8]:

virginica     50
versicolor    50
setosa        50
Name: species, dtype: int64

Let's get plotting the iris dataset¶

Sepal length by species

Let's get plotting the iris dataset¶

In [10]:

fig, axes = matplotlib.pyplot.subplots(nrows=1, ncols=4, figsize=(24, 8))
# Dot plot with no grouping variable
seaborn.swarmplot(y='sepal length (cm)', color='k', data=iris, ax=axes[0])
# Dot plot grouped by species
seaborn.swarmplot(x='species', y='sepal length (cm)', dodge=True, data=iris.reset_index(), ax=axes[1])
# Boxplot grouped by species
seaborn.boxplot(x='species', y='sepal length (cm)', dodge=True, data=iris.reset_index(), ax=axes[2])
# Violin plot grouped by species
seaborn.violinplot(x='species', y='sepal length (cm)', dodge=True, data=iris.reset_index(), ax=axes[3]);

Let's get plotting the iris dataset¶

All features grouped by species

In [11]:

iris.loc[[6, 92, 140]]

Out[11]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	species
6	4.6	3.4	1.4	0.3	setosa
92	5.8	2.6	4.0	1.2	versicolor
140	6.7	3.1	5.6	2.4	virginica

Let's stack the table and add a grouping variable

In [12]:

iris_tall = iris.set_index('species', append=True).stack().to_frame('value')
iris_tall.index.names = ['id', 'species', 'feature']
iris_tall.loc[[6, 92, 140]]

Out[12]:

			value
id	species	feature
6	setosa	sepal length (cm)	4.6
		sepal width (cm)	3.4
		petal length (cm)	1.4
		petal width (cm)	0.3
92	versicolor	sepal length (cm)	5.8
		sepal width (cm)	2.6
		petal length (cm)	4.0
		petal width (cm)	1.2
140	virginica	sepal length (cm)	6.7
		sepal width (cm)	3.1
		petal length (cm)	5.6
		petal width (cm)	2.4

Let's get plotting the iris dataset¶

All features grouped by species

In [13]:

seaborn.catplot(x='species', y='value', col='feature', sharey=False, data=iris_tall.reset_index());

Let's get plotting the iris dataset¶

Bivariate analysis

In [14]:

seaborn.jointplot(x='sepal length (cm)', y='petal width (cm)', kind='reg', data=iris);

Let's get plotting the iris dataset¶

Bivariate analysis

In [15]:

seaborn.scatterplot(x='sepal length (cm)', y='petal width (cm)', hue='species', data=iris);

Let's get plotting the iris dataset¶

Bivariate analysis

In [16]:

g = seaborn.pairplot(iris, hue='species');

Let's get plotting the iris dataset¶

Correlation analysis

Though correlation does not imply causation....¶

xkcd_correlation

The Anscombe quartet: same correlation, different underlying data....¶

https://en.wikipedia.org/wiki/Anscombe%27s_quartet

mean=7.50, std=1.94, r=0.82
mean=7.50, std=1.94, r=0.82
mean=7.50, std=1.94, r=0.82
mean=7.50, std=1.94, r=0.82

Notes¶

When in doubt, make some more plots...
Useful tips on data visualization:
Software for data visualization:
- (A few) of python's plotting libraries:
  - matplotlib
  - seaborn
  - ggplot
  - plotly
  - bokeh
  - ggplot
- Other libraries/packages/toolboxes:
  - ggplot2 (R)
  - GRAMM (MATLAB)
GUI-based tools:
- http://gliovis.bioinfo.cnio.es/
- http://www.instantclue.uni-koeln.de/ (from https://www.ncbi.nlm.nih.gov/pubmed/30140043)

Data modelling¶

Herbert Simon: “Learning is any process by which a system improves performance from experience.”

Data, data & more data needed…¶

data_data_more_data Image from http://web.orionhealth.com/rs/981-HEV-035/images/Introduction_To_Machine_Learning_US.pdf

Data sources for cancer and GBM specifically...¶

TCGA datasets
GlioVis (http://gliovis.bioinfo.cnio.es/)
https://portals.broadinstitute.org/ccle

Fat data rather than big data....¶

In biology, we typically have a lot of data for a relatively small number of samples (cell lines, animal models, patients)
The curse of dimensionality (10-100s of rows with samples x 100K features)

Dimensionality reduction¶

Techniques designed to visualize high-dimensional data in in lower dimensions, typically 2D:
- linear transformation:
  - Principal Component Analysis (PCA)
- non-linear transformation:
  - t-distributed stochastic neighbor embedding (t-SNE, https://lvdmaaten.github.io/tsne/)
  - UMAP (Uniform Manifold Approximation and Projection, https://github.com/lmcinnes/umap)
Aid in exploratory data analysis and modelling
Useful literature:

Principal component analysis (PCA) applied to the iris dataset¶

Let's explore PCA and alternative embeddings using the fashion MNIST dataset¶

Dataset compiled and released by Zelando for 70K images spanning 10 classes of clothing items

PCA vs. t-SNE vs. UMAP¶

Autoencoders¶

Auto-encoders (artificial neural networks) to extract meta-features
While the dimensionality reduction techniques we have seen so far are unsupervised, autoencoders are a special class of supervised learning

Image from https://towardsdatascience.com/deep-autoencoders-using-tensorflow-c68f075fd1a3

Supervised vs. unsupervised learning algorithms¶

supervised_vs_unsupervised Image from https://blog.westerndigital.com/machine-learning-pipeline-object-storage/

Supervised vs. unsupervised learning algorithms¶

supervised_vs_unsupervised_cartoon2 Image from https://mapr.com/blog/apache-spark-machine-learning-tutorial/

Unsupervised learning¶

What is un-supervised learning?¶

Learn a model from an un-labelled dataset
Model can identify patterns/structures inherent in the data
- similar samples (cell lines, animal models, patients) can be grouped together
- anomaly detection (i.e. rare phenotype)

unsupervised_learning Image from https://medium.com/@deepanshugaur1998/scikit-learn-beginners-part-3-6fb05798acb1

Clustering¶

From Wikipedia: "Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters)."
Several algorithms available:
- centroid-based (k-means)
- connectivity-based (hierarchical)
- density-based (DBSCAN, OPTICS, ...)
- ....

Clustering¶

sklearn_clustering_methods

a table summaring use-case for each approach available at https://scikit-learn.org/stable/modules/clustering.html#clustering

K-means clustering applied to the iris dataset¶

How many clusters? Elbow rule¶

Hierarchical clustering applied to the iris dataset¶

Hierarchical clustering - a more familiar view?¶

Without clustering¶

With clustering¶

Consensus clustering (aka ensemble clustering)¶

From Wikipedia: " a number of different (input) clusterings have been obtained for a particular dataset and it is desired to find a single (consensus) clustering which is a better fit in some sense than the existing clusterings."
Clustering algorithm is run multiple times --> assessment of number of clusters and cluster stability
Useful readings:
- Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data (https://link.springer.com/article/10.1023/A:1023949509487)
- Bioconductor (R) Consensus Cluster package (https://bioconductor.org/packages/release/bioc/vignettes/ConsensusClusterPlus/inst/doc/ConsensusClusterPlus.pdf)
- Critical limitations of consensus clustering in class discovery(https://www.ncbi.nlm.nih.gov/pubmed/25158761)
- Avoiding common pitfalls when clustering biological data (https://www.ncbi.nlm.nih.gov/pubmed/27303057)

Clustering is typically combined with other covariates to provide an integrated view¶

"Cars" classicla dataset

cars Image from http://doi.org/10.5281/zenodo.3242593

Clustering is typically combined with phenotypic and other molecular data to provide an integrated view¶

TCGA cutanous melanoma dataset Image from http://doi.org/10.5281/zenodo.3242593

"Famous" clustering from the GBM realm¶

Transcriptomic-based subtyping (from bulk) tissue:

Consensus clustering¶

Let's explore the Spotify dataset with this app http://tiny.cc/w196dz (developed by @Ljfernando, https://towardsdatascience.com/consensus-clustering-f5d25c98eaf2) (5 min)

Supervised learning¶

classification vs regression
regression
- case: linear regression
classification
- case: decision trees
- case: random forest
- metrics of performance
  - confusion matrix
other models (not covered)
- variation of forest like xgboots, extreme trees
- support vector machine (SVM)
- (deep) neural networks
- model stacking

What is supervised learning?¶

Learn a model from a labelled dataset
Model can predict label on previously unseen samples

supervised_learning Image from https://medium.com/@deepanshugaur1998/scikit-learn-beginners-part-2-ca78a51803a8

Generalization¶

How do we know how well what we've learned from one dataset may apply to other datasets?

Simplest technique:

Split our dataset in a "training set" and "testing set" (and sometimes a "validation set").
Learn feature -> label mapping from "training set" only.
Use model to guess labels for "testing set". Check how well it matches the known labels.

This gives us an estimate for how well the model can predict labels for samples it has not seen.

Cost: Requires bigger dataset. Other techniques (like cross-validation) reduce the overhead. For Random Forests there is a "trick" that allows us to train on all the data and yet estimate performance on unseen data.

Generalization¶

supervised_vs_unsupervised_training_testing Image from https://blogs.nvidia.com/blog/2018/08/02/supervised-unsupervised-learning/

Classification vs. regression¶

In classification the output is a class or label (discrete), while in regression the output is number (continous)
Classification can choose between 2 classes (binary) or more (multi-class)

classification_vs_regression Image from https://medium.com/@ali_88273/regression-vs-classification-87c224350d69

Linear regression with the Boston housing dataset¶

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

Linear regression with the Boston housing dataset¶

Out[5]:

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	price
0	0.00632	18	2.31	0.538	6.575	65.2	4.09	1	296	15.3	396.9	4.98	24
1	0.02731	0	7.07	0.469	6.421	78.9	4.9671	2	242	17.8	396.9	9.14	21.6
2	0.02729	0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03	34.7
3	0.03237	0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94	33.4
4	0.06905	0	2.18	0.458	7.147	54.2	6.0622	3	222	18.7	396.9	5.33	36.2

Let's predict the price from LSTAT¶

LSTAT: Percentage of lower status of the population

Let's predict the price from LSTAT¶

LSTAT: Percentage of lower status of the population

Model performance in the training set:
Root mean-squared error (RMSE) = 4.93
Coefficient of determination (R2) = 0.73

Model performance in the testing set:
Root mean-squared error (RMSE) = 3.99
Coefficient of determination (R2) = 0.74

Classification with IRIS dataset¶

Predict species from petal length/width¶

Visualize what the classifier has learned¶

Decision tree performance (training set)¶

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        38
  versicolor       1.00      0.94      0.97        36
   virginica       0.95      1.00      0.97        38

    accuracy                           0.98       112
   macro avg       0.98      0.98      0.98       112
weighted avg       0.98      0.98      0.98       112

Decision tree performance (testing set)¶

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        12
  versicolor       1.00      0.79      0.88        14
   virginica       0.80      1.00      0.89        12

    accuracy                           0.92        38
   macro avg       0.93      0.93      0.92        38
weighted avg       0.94      0.92      0.92        38

Random Forest¶

random_forest_cartoon

Classification Random Forest¶

Train many decision trees.
Each tree is trained on slightly different data; different subsets of samples and of features.
To make a classification, we use all trees and the class with the most votes wins.

Why: Training on subsets of samples and features help avoid overfitting (learning a model that reflects the training data too closely). Using multiple trees help get high accuracy on unseen data. (from https://www.geeksforgeeks.org/regularization-in-machine-learning/)

Classification Random Forest to predict species in the Iris dataset¶

Confusion matrix¶

confusion_matrix_cartoon Image from https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28

Classification Random Forest to predict species in the Iris dataset¶

Confusion matrix¶

OOB Accuracy = 0.95

Behind the scenes¶

Trees inside the forest¶

Classification Random Forest to predict species in the Iris dataset¶

Votes for individual trees in the forest¶

Random Forest as feature selection algorithms¶

Impurity-based variable importance
Permutation-based variable importance
SHAP (SHapley Additive exPlanations)-based variable importance

SHAP feature importance¶

Feature importance for the dataset Image from https://github.com/slundberg/shap

SHAP feature importance¶

Feature importance for the dataset Image from https://github.com/slundberg/shap

SHAP feature importance¶

Feature importance for specific dataset observations Image from https://github.com/slundberg/shap

Supervised learning¶

Classification vs. regression
Regression
- case: linear regression
Classification
- case: decision trees
- case: random forest
- metrics of performance
  - confusion matrix
Other models (not covered)
- Variation of tree-based approaches like xgboots, extreme trees
- Support vector machine (SVM): A hyperplane to separate the groups in high-dimensional space.
- (deep) neural networks: machine learning inspired by animal brains. Many layers of simple neurons that
  collectively can learn complex patterns.
- Model stacking: combining multiple learning algorithms for higher accuracy.

Questions and hands-on practise¶

Quiz 1:¶

Are the transcriptomics-based GBM subtyping from Verhaak (Cancer Cell, 2010; https://www.ncbi.nlm.nih.gov/pubmed/20129251) and Wang (Cancer Cell, 2017; https://www.ncbi.nlm.nih.gov/pubmed/28697342) examples of supervised or unsupervised algorithms?

Answer¶

unsupervised approaches

Quiz 2:¶

We built a random forest classifier with 90% accuracy (i.e. predicted and ground truth labels agree in 90% of cases). Is this accuracy sufficient?

Answer¶

It depends....
Compare a use-case where we want to identify a rare phenotype vs. a use-case where we have a case-control study with a perfectly balanced dataset
The question did not include enough information to evaluate whether 90% accuracy is enough in those settings

How to deal with unbalanced datasets?¶

Down-sampling data for majority class
Up-sampling data for minority class
Generate "realistic" synthetic data (SMOTE and ADASYN methods)

unbalanced_datasets Image from https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets

What if I don't want to code?¶

WEKA (Waikato Environment for Knowledge Analysis): https://waikato.github.io/weka-wiki/
KNIME: https://www.knime.com/
RapidMiner: https://rapidminer.com/products/studio/
Orange: https://orange.biolab.si/#Orange-Features
Obviously: https://www.obviously.ai/

Any questions???¶

Don't be shy
Have a question later on? Email me at manuelasalvucci@rcsi.ie

Hands-on practise on a GBM dataset¶

Go to http://tiny.cc/od51dz to practise on data generated by investigators from the TCGA consortium
Click on "06_TCGA_GBM_practise.py"
~20 min