"jupyter" + "notebook"
Jupyter project: https://jupyter.org/index.html
Gallery of interesting jupyter notebooks: https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks
2+2
4
# Lines that start with a # are comments.
a = 2 + 2
b = 1 + 1
a + b
6
a - b
2
import os
import pandas
countries = pandas.read_excel(os.path.join('data', 'countries.xls'), index_col='Country')
countries
Country | Population | Area | Capital |
---|---|---|---|
Ireland | 4784000 | 84421 | Dublin |
Italy | 60590000 | 301338 | Rome |
Germany | 82790000 | 357386 | Berlin |
countries['Population']
Country
Ireland     4784000
Italy      60590000
Germany    82790000
Name: Population, dtype: int64
countries['Population'].max()
82790000
countries['Population'] / countries['Area']
Country
Ireland     56.668365
Italy      201.069895
Germany    231.654290
dtype: float64
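The same calculation can be tried without the Excel file. The sketch below rebuilds the three-row table inline (values copied from the output above) and repeats the element-wise division. The column name `Density` is an illustrative addition and not part of the original notebook.

```python
import pandas

# Rebuild the small table shown above so that no Excel file is needed.
countries = pandas.DataFrame(
    {
        'Population': [4784000, 60590000, 82790000],
        'Area': [84421, 301338, 357386],
        'Capital': ['Dublin', 'Rome', 'Berlin'],
    },
    index=pandas.Index(['Ireland', 'Italy', 'Germany'], name='Country'),
)

# Arithmetic between two columns is applied row by row, aligned on the index.
# 'Density' is a hypothetical column name chosen for this example.
countries['Density'] = countries['Population'] / countries['Area']

# Sort to list the most densely populated country first.
print(countries.sort_values('Density', ascending=False))
```

Storing the result as a column keeps it next to the inputs it was derived from, which makes later sorting and plotting simpler.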
countries['Population'].plot(kind='bar');
Image from http://www.codeheroku.com/post.html?name=Introduction%20to%20Exploratory%20Data%20Analysis%20(EDA)
import sklearn.datasets
iris = sklearn.datasets.load_iris()
features = pandas.DataFrame(iris.data, columns=iris.feature_names)
target = pandas.Series(pandas.Categorical.from_codes(iris.target, iris.target_names), name='species')
iris = pandas.concat([features, target], axis=1)
iris.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
---|---|---|---|---|
count | 150 | 150 | 150 | 150 |
mean | 5.84333 | 3.05733 | 3.758 | 1.19933 |
std | 0.828066 | 0.435866 | 1.7653 | 0.762238 |
min | 4.3 | 2 | 1 | 0.1 |
25% | 5.1 | 2.8 | 1.6 | 0.3 |
50% | 5.8 | 3 | 4.35 | 1.3 |
75% | 6.4 | 3.3 | 5.1 | 1.8 |
max | 7.9 | 4.4 | 6.9 | 2.5 |
iris.species.value_counts()
virginica     50
versicolor    50
setosa        50
Name: species, dtype: int64
import matplotlib.pyplot
import seaborn
fig, axes = matplotlib.pyplot.subplots(nrows=1, ncols=4, figsize=(24, 8))
# Dot plot with no grouping variable
seaborn.swarmplot(y='sepal length (cm)', color='k', data=iris, ax=axes[0])
# Dot plot grouped by species
seaborn.swarmplot(x='species', y='sepal length (cm)', dodge=True, data=iris.reset_index(), ax=axes[1])
# Boxplot grouped by species
seaborn.boxplot(x='species', y='sepal length (cm)', dodge=True, data=iris.reset_index(), ax=axes[2])
# Violin plot grouped by species
seaborn.violinplot(x='species', y='sepal length (cm)', dodge=True, data=iris.reset_index(), ax=axes[3]);
iris.loc[[6, 92, 140]]
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
---|---|---|---|---|---|
6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
92 | 5.8 | 2.6 | 4.0 | 1.2 | versicolor |
140 | 6.7 | 3.1 | 5.6 | 2.4 | virginica |
iris_tall = iris.set_index('species', append=True).stack().to_frame('value')
iris_tall.index.names = ['id', 'species', 'feature']
iris_tall.loc[[6, 92, 140]]
id | species | feature | value |
---|---|---|---|
6 | setosa | sepal length (cm) | 4.6 |
6 | setosa | sepal width (cm) | 3.4 |
6 | setosa | petal length (cm) | 1.4 |
6 | setosa | petal width (cm) | 0.3 |
92 | versicolor | sepal length (cm) | 5.8 |
92 | versicolor | sepal width (cm) | 2.6 |
92 | versicolor | petal length (cm) | 4.0 |
92 | versicolor | petal width (cm) | 1.2 |
140 | virginica | sepal length (cm) | 6.7 |
140 | virginica | sepal width (cm) | 3.1 |
140 | virginica | petal length (cm) | 5.6 |
140 | virginica | petal width (cm) | 2.4 |
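The wide-to-tall reshape above can also be written with `melt`, which some readers find easier to follow than `set_index` plus `stack`. This is a minimal sketch on the three example rows only; the names `wide` and `tall` are illustrative, and it yields a flat table with ordinary columns rather than the multi-level index shown above.

```python
import pandas

# Three example rows copied from the wide table above.
wide = pandas.DataFrame(
    {
        'sepal length (cm)': [4.6, 5.8, 6.7],
        'sepal width (cm)': [3.4, 2.6, 3.1],
        'petal length (cm)': [1.4, 4.0, 5.6],
        'petal width (cm)': [0.3, 1.2, 2.4],
        'species': ['setosa', 'versicolor', 'virginica'],
    },
    index=pandas.Index([6, 92, 140], name='id'),
)

# melt() keeps the identifier columns and turns every other column
# into a (feature, value) pair, giving one row per measurement.
tall = wide.reset_index().melt(
    id_vars=['id', 'species'], var_name='feature', value_name='value'
)

print(tall.sort_values(['id', 'feature']))
```

The tall layout is the one that plotting functions with `x`, `y` and `col` arguments expect, because each plotted point corresponds to exactly one row.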
seaborn.catplot(x='species', y='value', col='feature', sharey=False, data=iris_tall.reset_index());
seaborn.jointplot(x='sepal length (cm)', y='petal width (cm)', kind='reg', data=iris);
seaborn.scatterplot(x='sepal length (cm)', y='petal width (cm)', hue='species', data=iris);
g = seaborn.pairplot(iris, hue='species');
[Figure: four scatter-plot panels, each titled "mean=7.50, std=1.94, r=0.82"]
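The four identical sets of summary statistics suggest that the figure shows Anscombe's quartet. That identification is an assumption, since the source does not name the data. Under that assumption, the sketch below recomputes the mean and standard deviation of y (population form, `ddof=0`) and the Pearson correlation for the four classic datasets, showing that very different point clouds can share the same numbers.

```python
import numpy

# Anscombe's quartet (assumed to be the data behind the figure).
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x_fourth = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    'I': (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II': (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV': (x_fourth, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

stats = {}
for name, (x, y) in quartet.items():
    x = numpy.asarray(x, dtype=float)
    y = numpy.asarray(y, dtype=float)
    r = numpy.corrcoef(x, y)[0, 1]           # Pearson correlation of x and y
    stats[name] = (y.mean(), y.std(), r)     # y.std() uses ddof=0 by default
    print(f'{name}: mean={y.mean():.2f}, std={y.std():.2f}, r={r:.2f}')
```

Summary statistics alone cannot distinguish these datasets, which is the usual argument for plotting data before modelling it.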
Herbert Simon: “Learning is any process by which a system improves performance from experience.”
Image from http://web.orionhealth.com/rs/981-HEV-035/images/Introduction_To_Machine_Learning_US.pdf
Image from https://towardsdatascience.com/deep-autoencoders-using-tensorflow-c68f075fd1a3
Image from https://blog.westerndigital.com/machine-learning-pipeline-object-storage/
Image from https://mapr.com/blog/apache-spark-machine-learning-tutorial/
Image from https://medium.com/@deepanshugaur1998/scikit-learn-beginners-part-3-6fb05798acb1
From Wikipedia: "Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters)."
Several algorithms available:
Image from http://doi.org/10.5281/zenodo.3242593
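As one concrete illustration of clustering, the sketch below applies k-means to the iris measurements. The choice of k-means, of three clusters and of the parameter values is an assumption made for this example; the source does not say which algorithm is used.

```python
import pandas
import sklearn.cluster
import sklearn.datasets

iris_data = sklearn.datasets.load_iris()

# Cluster on the four measurements only; the species labels are not used.
model = sklearn.cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = model.fit_predict(iris_data.data)

# Compare the discovered clusters with the known species after the fact.
species = iris_data.target_names[iris_data.target]
print(pandas.crosstab(species, cluster_ids, rownames=['species'], colnames=['cluster']))
```

The cluster numbers are arbitrary, so only the grouping matters when reading the cross-tabulation, not which cluster is called 0, 1 or 2.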
Image from https://medium.com/@deepanshugaur1998/scikit-learn-beginners-part-2-ca78a51803a8
How do we know how well what we've learned from one dataset may apply to other datasets?
Simplest technique:
This gives us an estimate for how well the model can predict labels for samples it has not seen.
Cost: Requires bigger dataset. Other techniques (like cross-validation) reduce the overhead. For Random Forests there is a "trick" that allows us to train on all the data and yet estimate performance on unseen data.
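A minimal sketch of both ideas, assuming the simplest technique meant here is a hold-out split into a training set and a testing set. The iris data, the random forest classifier and all parameter values are choices made for this example rather than taken from the source.

```python
import sklearn.datasets
import sklearn.ensemble
import sklearn.model_selection

X, y = sklearn.datasets.load_iris(return_X_y=True)

# Hold-out split: by default 25% of the samples are set aside for testing.
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, stratify=y, random_state=0
)

model = sklearn.ensemble.RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)          # the model never sees the testing set here
print('Hold-out accuracy:', model.score(X_test, y_test))

# 5-fold cross-validation: each sample is used for testing exactly once,
# so no data is permanently withheld from training.
scores = sklearn.model_selection.cross_val_score(model, X, y, cv=5)
print('Cross-validation accuracy:', scores.mean())
```

Cross-validation costs five model fits instead of one, which is the price paid for not setting data aside.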
Image from https://blogs.nvidia.com/blog/2018/08/02/supervised-unsupervised-learning/
Image from https://medium.com/@ali_88273/regression-vs-classification-87c224350d69
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.

.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24 |
1 | 0.02731 | 0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
2 | 0.02729 | 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |
Model performance in the training set:
Root mean-squared error (RMSE) = 4.93
Coefficient of determination (R2) = 0.73
Model performance in the testing set:
Root mean-squared error (RMSE) = 3.99
Coefficient of determination (R2) = 0.74
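For readers unfamiliar with the two metrics, here is a small worked sketch of how RMSE and R2 are computed. The four values in `y_true` and `y_pred` are made up for illustration and have nothing to do with the Boston data.

```python
import numpy
import sklearn.metrics

# Made-up target values and predictions, for illustration only.
y_true = numpy.array([3.0, 5.0, 7.0, 9.0])
y_pred = numpy.array([2.5, 5.0, 7.5, 9.0])

# RMSE: square root of the mean squared prediction error (same units as y).
rmse = numpy.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))

# R2: fraction of the variance in y_true explained by the predictions.
r2 = sklearn.metrics.r2_score(y_true, y_pred)

# The same R2 computed by hand from its definition.
ss_res = numpy.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = numpy.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(f'RMSE = {rmse:.3f}, R2 = {r2:.3f}, R2 (by hand) = {r2_manual:.3f}')
```

RMSE is expressed in the units of the target, so for the house-price model above it would be read in thousands of dollars, while R2 is unitless.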
| | precision | recall | f1-score | support |
---|---|---|---|---|
setosa | 1.00 | 1.00 | 1.00 | 38 |
versicolor | 1.00 | 0.94 | 0.97 | 36 |
virginica | 0.95 | 1.00 | 0.97 | 38 |
accuracy | | | 0.98 | 112 |
macro avg | 0.98 | 0.98 | 0.98 | 112 |
weighted avg | 0.98 | 0.98 | 0.98 | 112 |
| | precision | recall | f1-score | support |
---|---|---|---|---|
setosa | 1.00 | 1.00 | 1.00 | 12 |
versicolor | 1.00 | 0.79 | 0.88 | 14 |
virginica | 0.80 | 1.00 | 0.89 | 12 |
accuracy | | | 0.92 | 38 |
macro avg | 0.93 | 0.93 | 0.92 | 38 |
weighted avg | 0.94 | 0.92 | 0.92 | 38 |
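To make the columns of these reports concrete, the sketch below recomputes precision, recall, F1 and accuracy from a confusion matrix. The matrix is not in the source: it is inferred from the 38-sample report (three versicolor samples predicted as virginica is the only assignment consistent with those numbers), so treat it as an assumption.

```python
import numpy

labels = ['setosa', 'versicolor', 'virginica']

# Inferred confusion matrix: rows are true classes, columns are predictions.
confusion = numpy.array([
    [12, 0, 0],
    [0, 11, 3],
    [0, 0, 12],
])

true_positives = numpy.diag(confusion)
precision = true_positives / confusion.sum(axis=0)   # correct / all predicted as class
recall = true_positives / confusion.sum(axis=1)      # correct / all truly in class
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = true_positives.sum() / confusion.sum()

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f'{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}')
print(f'accuracy={accuracy:.2f}')
```

Reading the matrix this way shows why versicolor has perfect precision but lower recall: nothing was wrongly called versicolor, but some versicolor samples were missed.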
Why: Training on subsets of samples and features helps avoid overfitting (learning a model that reflects the training data too closely). Using multiple trees helps achieve high accuracy on unseen data. (from https://www.geeksforgeeks.org/regularization-in-machine-learning/)
Image from https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28
OOB Accuracy = 0.95
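An out-of-bag (OOB) estimate like the one above can be obtained in scikit-learn by passing `oob_score=True`. This is a sketch on the iris data; the number of trees and the random seed are assumptions, so the printed value need not match the figure reported above.

```python
import sklearn.datasets
import sklearn.ensemble

X, y = sklearn.datasets.load_iris(return_X_y=True)

# Each tree is trained on a bootstrap sample, so some samples are left out
# of every tree. Scoring each sample only with the trees that never saw it
# gives an accuracy estimate without a separate testing set.
model = sklearn.ensemble.RandomForestClassifier(
    n_estimators=200, oob_score=True, random_state=0
)
model.fit(X, y)   # trained on all 150 samples

print(f'OOB Accuracy = {model.oob_score_:.2f}')
```

Because the estimate is a by-product of training, it costs no extra model fits, unlike cross-validation.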
Are the transcriptomics-based GBM subtyping approaches of Verhaak (Cancer Cell, 2010; https://www.ncbi.nlm.nih.gov/pubmed/20129251) and Wang (Cancer Cell, 2017; https://www.ncbi.nlm.nih.gov/pubmed/28697342) examples of supervised or unsupervised algorithms?
We built a random forest classifier with 90% accuracy (i.e. predicted and ground truth labels agree in 90% of cases). Is this accuracy sufficient?
Image from https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
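One way to approach the accuracy question is to compare against a trivial baseline. The sketch below uses a made-up dataset in which 90% of the samples belong to one class, and a "classifier" that always predicts that class; the class proportions are an assumption chosen to match the 90% figure in the question.

```python
import numpy
import sklearn.metrics

# Made-up imbalanced labels: 90 samples of class 0 and 10 of class 1.
y_true = numpy.array([0] * 90 + [1] * 10)

# A baseline that ignores its input and always predicts the majority class.
y_pred = numpy.zeros_like(y_true)

accuracy = sklearn.metrics.accuracy_score(y_true, y_pred)
balanced = sklearn.metrics.balanced_accuracy_score(y_true, y_pred)
minority_recall = sklearn.metrics.recall_score(y_true, y_pred, pos_label=1)

print(f'accuracy={accuracy:.2f}')                 # looks good
print(f'balanced accuracy={balanced:.2f}')        # no better than chance
print(f'minority recall={minority_recall:.2f}')   # the rare class is never found
```

Whether 90% is sufficient therefore depends on the class balance and on the cost of each kind of error, not on the number alone.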