An introduction to supervised and unsupervised multivariate statistical analysis

Manuela Salvucci, RCSI

2019-10-09

What will we cover in this session?

  • Basic data exploration
  • A tour on un-supervised and supervised learning algorithms
  • Practise with the TCGA GBM dataset

How?

  • Mix of theory and hands-on practise using jupyter notebooks (python kernel)

What's a jupyter notebook?

jupyter_notebook_logo

Anatomy of a jupyter notebook

  • Notebook consists of a series of "cells"
  • Each cell can have text (markdown) or code (in our case, python).
  • A cell can be run by selecting it and pressing Run button at the top (or pressing ctr-enter)
  • When the cell is run, the output is shown underneath: Cell
  • Simple expressions
In [1]:
2+2
Out[1]:
4
  • Comments
In [2]:
# Lines that start with a # are comments. 
  • Store data in variables
In [3]:
a = 2 + 2
b = 1 + 1
a + b
Out[3]:
6
  • Cells can reference variables from prevoius cells (make sure to run the previous cells first!):
In [4]:
a - b
Out[4]:
2

Pandas tables

  • Variables can also be used to store whole tables with data (using the pandas library), e.g. from a spreadsheet
In [5]:
import os
import pandas
countries = pandas.read_excel(os.path.join('data', 'countries.xls'), index_col='Country')
countries
Out[5]:
Population Area Capital
Country
Ireland 4784000 84421 Dublin
Italy 60590000 301338 Rome
Germany 82790000 357386 Berlin
  • We can access individual columns from the table
In [6]:
countries['Population']
Out[6]:
Country
Ireland     4784000
Italy      60590000
Germany    82790000
Name: Population, dtype: int64
  • Perform operations on single columns
In [7]:
countries['Population'].max()
Out[7]:
82790000
  • Or on multiple columns
In [8]:
countries['Population'] / countries['Area']
Out[8]:
Country
Ireland     56.668365
Italy      201.069895
Germany    231.654290
dtype: float64
  • Making plots
In [9]:
countries['Population'].plot(kind='bar');

Where can I find this material?

Exercise

  1. Go to http://tiny.cc/od51dz
  2. Click on "00_tools_and_setup_practise.py"
  3. Play around

Continue in 10 min.

Data exploration

  • Exploratory data analysis (EDA) is a crucial part of any research project
    • quality control (missing data, outlier detection, ...)
    • descriptive statistics
    • visualization
    • insights to guide downstream analysis

data_exploration_cartoon Image from http://www.codeheroku.com/post.html?name=Introduction%20to%20Exploratory%20Data%20Analysis%20(EDA)

Data exploration

  • We will use "classical" datasets:
    • clean data
    • QC-ed data
    • no missing values
    • balanced cases

The iris dataset

  • The iris dataset (https://en.wikipedia.org/wiki/Iris_flower_data_set) was first described by Fisher in 1936
  • It includes measurements for 4 flower characteristics (length and width for petals and sepals) for 150 flowers from 3 different iris species

iris_species

Let's explore the iris dataset

  • To load the data, we do:
In [2]:
iris = sklearn.datasets.load_iris()
features = pandas.DataFrame(iris.data, columns=iris.feature_names)
target = pandas.Series(pandas.Categorical.from_codes(iris.target, iris.target_names), name='species')
iris = pandas.concat([features, target], axis=1)

Let's explore the iris dataset

  • Use .head() to show the first few samples.
In [3]:
iris.head()
Out[3]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Let's explore the iris dataset

  • Each row represents 1 observation (i.e. 1 sample/flower)
Out[4]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5 3.6 1.4 0.2 setosa
  • Each column represents 1 variable(i.e. 1 type of attribute/characteristic/feature)
Out[5]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5 3.6 1.4 0.2 setosa

Let's explore the iris dataset

  • Here, "species" is our phenotype of interest which we refer to as "target class" or "target"
Out[6]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5 3.6 1.4 0.2 setosa
  • Summary statistics for the measurements
Out[7]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
count 150 150 150 150
mean 5.84333 3.05733 3.758 1.19933
std 0.828066 0.435866 1.7653 0.762238
min 4.3 2 1 0.1
25% 5.1 2.8 1.6 0.3
50% 5.8 3 4.35 1.3
75% 6.4 3.3 5.1 1.8
max 7.9 4.4 6.9 2.5

Let's explore the iris dataset

  • Breakdown of each species in the dataset
In [8]:
iris.species.value_counts()
Out[8]:
virginica     50
versicolor    50
setosa        50
Name: species, dtype: int64

Let's get plotting the iris dataset

  • Sepal length by species

Let's get plotting the iris dataset

In [10]:
fig, axes = matplotlib.pyplot.subplots(nrows=1, ncols=4, figsize=(24, 8))
# Dot plot with no grouping variable
seaborn.swarmplot(y='sepal length (cm)', color='k', data=iris, ax=axes[0])
# Dot plot grouped by species
seaborn.swarmplot(x='species', y='sepal length (cm)', dodge=True, data=iris.reset_index(), ax=axes[1])
# Boxplot grouped by species
seaborn.boxplot(x='species', y='sepal length (cm)', dodge=True, data=iris.reset_index(), ax=axes[2])
# Violin plot grouped by species
seaborn.violinplot(x='species', y='sepal length (cm)', dodge=True, data=iris.reset_index(), ax=axes[3]);

Let's get plotting the iris dataset

  • All features grouped by species
In [11]:
iris.loc[[6, 92, 140]]
Out[11]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
6 4.6 3.4 1.4 0.3 setosa
92 5.8 2.6 4.0 1.2 versicolor
140 6.7 3.1 5.6 2.4 virginica
  • Let's stack the table and add a grouping variable
In [12]:
iris_tall = iris.set_index('species', append=True).stack().to_frame('value')
iris_tall.index.names = ['id', 'species', 'feature']
iris_tall.loc[[6, 92, 140]]
Out[12]:
value
id species feature
6 setosa sepal length (cm) 4.6
sepal width (cm) 3.4
petal length (cm) 1.4
petal width (cm) 0.3
92 versicolor sepal length (cm) 5.8
sepal width (cm) 2.6
petal length (cm) 4.0
petal width (cm) 1.2
140 virginica sepal length (cm) 6.7
sepal width (cm) 3.1
petal length (cm) 5.6
petal width (cm) 2.4

Let's get plotting the iris dataset

  • All features grouped by species
In [13]:
seaborn.catplot(x='species', y='value', col='feature', sharey=False, data=iris_tall.reset_index());

Let's get plotting the iris dataset

  • Bivariate analysis
In [14]:
seaborn.jointplot(x='sepal length (cm)', y='petal width (cm)', kind='reg', data=iris);

Let's get plotting the iris dataset

  • Bivariate analysis
In [15]:
seaborn.scatterplot(x='sepal length (cm)', y='petal width (cm)', hue='species', data=iris);

Let's get plotting the iris dataset

  • Bivariate analysis
In [16]:
g = seaborn.pairplot(iris, hue='species');

Let's get plotting the iris dataset

  • Correlation analysis

Though correlation does not imply causation....

Though correlation does not imply causation....

xkcd_correlation

The Anscombe quartet: same correlation, different underlying data....

mean=7.50, std=1.94, r=0.82
mean=7.50, std=1.94, r=0.82
mean=7.50, std=1.94, r=0.82
mean=7.50, std=1.94, r=0.82

Notes

Data modelling

Herbert Simon: “Learning is any process by which a system improves performance from experience.”

Data sources for cancer and GBM specifically...

Fat data rather than big data....

  • In biology, we typically have a lot of data for a relatively small number of samples (cell lines, animal models, patients)
  • The curse of dimensionality (10-100s of rows with samples x 100K features)

Dimensionality reduction

Principal component analysis (PCA) applied to the iris dataset

Let's explore PCA and alternative embeddings using the fashion MNIST dataset

  • Dataset compiled and released by Zelando for 70K images spanning 10 classes of clothing items fashion_mnist

PCA vs. t-SNE vs. UMAP

Autoencoders

  • Auto-encoders (artificial neural networks) to extract meta-features
  • While the dimensionality reduction techniques we have seen so far are unsupervised, autoencoders are a special class of supervised learning

autoencoder Image from https://towardsdatascience.com/deep-autoencoders-using-tensorflow-c68f075fd1a3

Supervised vs. unsupervised learning algorithms

supervised_vs_unsupervised Image from https://blog.westerndigital.com/machine-learning-pipeline-object-storage/

Supervised vs. unsupervised learning algorithms

supervised_vs_unsupervised_cartoon2 Image from https://mapr.com/blog/apache-spark-machine-learning-tutorial/​

Unsupervised learning

What is un-supervised learning?

  • Learn a model from an un-labelled dataset
  • Model can identify patterns/structures inherent in the data
    • similar samples (cell lines, animal models, patients) can be grouped together
    • anomaly detection (i.e. rare phenotype)

unsupervised_learningImage from https://medium.com/@deepanshugaur1998/scikit-learn-beginners-part-3-6fb05798acb1

Clustering

  • From Wikipedia: "Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters)."

  • Several algorithms available:

    • centroid-based (k-means)
    • connectivity-based (hierarchical)
    • density-based (DBSCAN, OPTICS, ...)
    • ....

Clustering

sklearn_clustering_methods

K-means clustering applied to the iris dataset

K-means clustering applied to the iris dataset

How many clusters? Elbow rule

Hierarchical clustering applied to the iris dataset

Hierarchical clustering - a more familiar view?

Without clustering

With clustering

Consensus clustering (aka ensemble clustering)

Clustering is typically combined with other covariates to provide an integrated view

  • "Cars" classicla dataset

cars Image from http://doi.org/10.5281/zenodo.3242593

Clustering is typically combined with phenotypic and other molecular data to provide an integrated view

"Famous" clustering from the GBM realm

  • Transcriptomic-based subtyping (from bulk) tissue: verhaak_subtypes

Consensus clustering

Supervised learning

  • classification vs regression
  • regression
    • case: linear regression
  • classification
    • case: decision trees
    • case: random forest
    • metrics of performance
      • confusion matrix
  • other models (not covered)
    • variation of forest like xgboots, extreme trees
    • support vector machine (SVM)
    • (deep) neural networks
    • model stacking

What is supervised learning?

  • Learn a model from a labelled dataset
  • Model can predict label on previously unseen samples​

supervised_learningImage from https://medium.com/@deepanshugaur1998/scikit-learn-beginners-part-2-ca78a51803a8

Generalization

How do we know how well what we've learned from one dataset may apply to other datasets?

Simplest technique:

  1. Split our dataset in a "training set" and "testing set" (and sometimes a "validation set").
  2. Learn feature -> label mapping from "training set" only.
  3. Use model to guess labels for "testing set". Check how well it matches the known labels.

This gives us an estimate for how well the model can predict labels for samples it has not seen.

Cost: Requires bigger dataset. Other techniques (like cross-validation) reduce the overhead. For Random Forests there is a "trick" that allows us to train on all the data and yet estimate performance on unseen data.

Classification vs. regression

  • In classification the output is a class or label (discrete), while in regression the output is number (continous)
  • Classification can choose between 2 classes (binary) or more (multi-class)

classification_vs_regressionImage from https://medium.com/@ali_88273/regression-vs-classification-87c224350d69

Linear regression with the Boston housing dataset

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

Linear regression with the Boston housing dataset

Out[5]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT price
0 0.00632 18 2.31 0 0.538 6.575 65.2 4.09 1 296 15.3 396.9 4.98 24
1 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.9 9.14 21.6
2 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.9 5.33 36.2

Let's predict the price from LSTAT

  • LSTAT: Percentage of lower status of the population

Let's predict the price from LSTAT

  • LSTAT: Percentage of lower status of the population
Model performance in the training set:
Root mean-squared error (RMSE) = 4.93
Coefficient of determination (R2) = 0.73
Model performance in the testing set:
Root mean-squared error (RMSE) = 3.99
Coefficient of determination (R2) = 0.74

Classification with IRIS dataset

Predict species from petal length/width

Visualize what the classifier has learned

Decision tree performance (training set)

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        38
  versicolor       1.00      0.94      0.97        36
   virginica       0.95      1.00      0.97        38

    accuracy                           0.98       112
   macro avg       0.98      0.98      0.98       112
weighted avg       0.98      0.98      0.98       112

Decision tree performance (testing set)

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        12
  versicolor       1.00      0.79      0.88        14
   virginica       0.80      1.00      0.89        12

    accuracy                           0.92        38
   macro avg       0.93      0.93      0.92        38
weighted avg       0.94      0.92      0.92        38

Random Forest

random_forest_cartoon

Classification Random Forest

  • Train many decision trees.
  • Each tree is trained on slightly different data; different subsets of samples and of features.
  • To make a classification, we use all trees and the class with the most votes wins.

Why: Training on subsets of samples and features help avoid overfitting (learning a model that reflects the training data too closely). Using multiple trees help get high accuracy on unseen data. overfitting (from https://www.geeksforgeeks.org/regularization-in-machine-learning/)

Classification Random Forest to predict species in the Iris dataset

Confusion matrix

confusion_matrix_cartoon Image from https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28

Classification Random Forest to predict species in the Iris dataset

Confusion matrix

OOB Accuracy = 0.95

Behind the scenes

Trees inside the forest

Classification Random Forest to predict species in the Iris dataset

Votes for individual trees in the forest

Random Forest as feature selection algorithms

  • Impurity-based variable importance
  • Permutation-based variable importance
  • SHAP (SHapley Additive exPlanations)-based variable importance

SHAP feature importance

SHAP feature importance

SHAP feature importance

Supervised learning

  • Classification vs. regression
  • Regression
    • case: linear regression
  • Classification
    • case: decision trees
    • case: random forest
    • metrics of performance
      • confusion matrix
  • Other models (not covered)
    • Variation of tree-based approaches like xgboots, extreme trees
    • Support vector machine (SVM): A hyperplane to separate the groups in high-dimensional space.
    • (deep) neural networks: machine learning inspired by animal brains. Many layers of simple neurons that
      collectively can learn complex patterns.
    • Model stacking: combining multiple learning algorithms for higher accuracy.

Questions and hands-on practise

Quiz 1:

Are the transcriptomics-based GBM subtyping from Verhaak (Cancer Cell, 2010; https://www.ncbi.nlm.nih.gov/pubmed/20129251) and Wang (Cancer Cell, 2017; https://www.ncbi.nlm.nih.gov/pubmed/28697342) examples of supervised or unsupervised algorithms?

Answer

  • unsupervised approaches

Quiz 2:

We built a random forest classifier with 90% accuracy (i.e. predicted and ground truth labels agree in 90% of cases). Is this accuracy sufficient?

Answer

  • It depends....
  • Compare a use-case where we want to identify a rare phenotype vs. a use-case where we have a case-control study with a perfectly balanced dataset
  • The question did not include enough information to evaluate whether 90% accuracy is enough in those settings

How to deal with unbalanced datasets?

  • Down-sampling data for majority class
  • Up-sampling data for minority class
  • Generate "realistic" synthetic data (SMOTE and ADASYN methods)

unbalanced_datasets Image from https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets

What if I don't want to code?

Any questions???

  • Don't be shy
  • Have a question later on? Email me at manuelasalvucci@rcsi.ie

Hands-on practise on a GBM dataset

  • Go to http://tiny.cc/od51dz to practise on data generated by investigators from the TCGA consortium
  • Click on "06_TCGA_GBM_practise.py"
  • ~20 min