Sort2013 Part III: Machine Learning in Python
I haven't done much with machine learning since graduating from school years ago. Recently, however, there have been a number of projects where machine learning could bring a significant benefit. This lecture was a great refresher and an introduction to how these tasks can be accomplished using Python.
Why should we focus on machine learning now?
The power of a machine learning algorithm is its ability to GENERALIZE from a finite set of examples.
- Database mining - large datasets from growth and automation, web click data, etc.
- For example, web data can give a feel for the total number of people with the flu faster than hospitals can
- Applications too hard to develop by hand - handwriting recognition, computer vision, language processing
- Self-customizing programs - Netflix, Amazon product recommendations, Google ads, etc.
- Prediction - continuous outcomes, categorical outcomes like email spam
Clustering
Clustering is grouping items so that items within a cluster are more similar to each other than to items in other clusters.
There are several types of clustering algorithms, which differ in speed and behavior. Below is a list of the different cluster types:
- MiniBatchKMeans
- AffinityPropagation
- MeanShift
- Spectral
- Ward
- DBSCAN
- K-Means
In the lecture, the speaker specifically covered K-Means clustering.
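All of these names correspond to estimators in scikit-learn, which share a common fitting interface. A minimal sketch comparing three of them, assuming scikit-learn is installed and using synthetic blob data in place of a real dataset:

```python
from sklearn.cluster import DBSCAN, KMeans, MeanShift
from sklearn.datasets import make_blobs

# Synthetic blob data stands in for a real dataset.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# KMeans must be told the number of clusters up front;
# MeanShift and DBSCAN infer it from the data.
for algo in (KMeans(n_clusters=3, n_init=10), MeanShift(), DBSCAN()):
    labels = algo.fit_predict(X)
    n_found = len(set(labels) - {-1})  # -1 is DBSCAN's "noise" label
    print(type(algo).__name__, "found", n_found, "clusters")
```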
K-Means clustering
K-Means takes a set of feature vectors and figures out how the information should group together (correlate).
Given a training data set and a number of clusters k, it finds the positions of the centroids. Its weakness is that you have to specify k up front for it to use when grouping.
It is often used for image color compression (converting a 16-bit image to a 6-bit image), which is accomplished by replacing each pixel color in the original with the color of its nearest k-means centroid.
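A minimal sketch of that idea using scikit-learn's KMeans; the random array below is a hypothetical stand-in for a real image:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for a real image: an (H, W, 3) array of RGB values.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3)).astype(float)

pixels = image.reshape(-1, 3)  # one row per pixel
kmeans = KMeans(n_clusters=64, n_init=10).fit(pixels)  # 64 colors = 6 bits

# Replace each pixel with the color of its nearest k-means centroid.
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)
print(compressed.shape)  # same shape, but only 64 distinct colors remain
```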
Advantages
- Simple to implement
- Usually very fast
- Works well for many applications
Disadvantages
- Have to know the number of clusters in advance
- Linear partitioning
- Outcome can depend on the initial centroid positions (so run it multiple times with different starting centers)
- Isn't good at grouping data that isn't separated into compact blobs (e.g., intertwined curves end up associated with the wrong cluster)
Regression Prediction
Intuition
- Supervised learning
- Generating a graph based on points and finding the best-fit line (2D linear regression)
- This is done by finding the line with the minimal total error over all of the points (see the sketch after this list)
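As a small sketch of the idea, NumPy's polyfit can find that minimal-error line; the points here are made up for the demo:

```python
import numpy as np

# Points that roughly follow a line; polyfit returns the slope and
# intercept that minimize the total squared error over all points.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8, 11.2])

slope, intercept = np.polyfit(x, y, deg=1)  # degree 1 = a straight line
print(f"y = {slope:.2f}x + {intercept:.2f}")
```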
Gradient Descent
- Define a cost function that reflects the total error as a function of the regression parameters
- Find the parameters that minimize the cost function -> start with random parameter values -> adjust the parameters by a step proportional to the gradient of the cost -> repeat until the parameters no longer change (a sketch follows)
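A minimal sketch of that loop, using mean squared error as the cost function for a 2D line fit; the toy data and learning rate are illustrative choices, not from the lecture:

```python
import numpy as np

# Toy data: y is roughly 3x + 2 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 3 * x + 2 + rng.normal(0, 0.1, 100)

def cost(m, b):
    """Total error (mean squared) as a function of the parameters."""
    return np.mean((y - (m * x + b)) ** 2)

m, b = 0.0, 0.0        # start with arbitrary parameter values
learning_rate = 0.1
for _ in range(1000):
    err = (m * x + b) - y
    grad_m = 2 * np.mean(err * x)   # d(cost)/dm
    grad_b = 2 * np.mean(err)       # d(cost)/db
    m -= learning_rate * grad_m     # step downhill on the cost surface
    b -= learning_rate * grad_b

print(f"m = {m:.2f}, b = {b:.2f}, cost = {cost(m, b):.4f}")  # m ~ 3, b ~ 2
```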
Classifiers
- Supervised learning - training set includes “truth”
- Categorical outcomes (Gaussian Mixture Model - GMM - and other probabilistic classifiers)
- Examples: Logistic Regression, Support Vector Machine, Decision Tree (see the sketch after this list)
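In scikit-learn all three follow the same fit/predict pattern; a quick sketch on the bundled iris dataset (assuming scikit-learn is installed):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# All three classifiers share the same interface.
for clf in (LogisticRegression(max_iter=1000), SVC(), DecisionTreeClassifier()):
    clf.fit(X_train, y_train)  # the training set includes the "truth" labels
    print(type(clf).__name__, clf.score(X_test, y_test))
```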
Metrics
Precision: of the objects classified as A, the percentage that really are A
Recall: of all the true A objects, the percentage that we actually classified as A
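scikit-learn's metrics module computes both directly; a toy sketch with made-up labels (1 = class A):

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground truth vs. classifier output.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# Precision: 3 of the 4 objects predicted as A really are A -> 0.75
print("precision:", precision_score(y_true, y_pred))
# Recall: 3 of the 4 true A objects were found -> 0.75
print("recall:", recall_score(y_true, y_pred))
```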
Decision Tree
- The object to be classified has an associated set of properties
- The classifier can be constructed as a set of rules
Fruit & Vegetable Example
- Properties: size, shape, color
- Rules: if size is small AND shape is round AND (color is green OR color is red) -> etc.
The rule set can be constructed by working with a tree, where every path from the root to a leaf becomes a new rule.
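A quick sketch with scikit-learn's DecisionTreeClassifier on a hypothetical, hand-encoded fruit/vegetable dataset; export_text prints the learned tree, where each root-to-leaf path is one rule:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: [size, shape, color] encoded as integers
# (size: 0=small 1=large; shape: 0=round 1=long; color: 0=red 1=green 2=orange).
X = [[0, 0, 0], [0, 0, 1], [1, 1, 2], [1, 0, 1], [0, 1, 1]]
y = ["apple", "apple", "carrot", "melon", "bean"]

tree = DecisionTreeClassifier().fit(X, y)
# Every path from the root to a leaf reads as an if/and rule.
print(export_text(tree, feature_names=["size", "shape", "color"]))
```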
Splits in the tree can be chosen using entropy - a fair coin toss has an entropy of 1 bit. The split with the highest information gain leaves the least entropy (unpredictability).
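For reference, entropy is H = -sum(p * log2(p)); a tiny sketch of the coin-toss example:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]  # treat 0 * log(0) as 0
    return -np.sum(probs * np.log2(probs))

print(entropy([0.5, 0.5]))  # fair coin toss -> 1.0 bit
print(entropy([0.9, 0.1]))  # biased coin -> ~0.47 bits (more predictable)
```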
Overfitting is a disadvantage because the tree fails to create the clear separation necessary to generalize; this can be addressed using "Random Forests", i.e., ensembles of multiple decision trees.
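A minimal random forest sketch with scikit-learn; the synthetic data and parameter values here are arbitrary illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Many trees, each trained on a random subset of rows and features,
# vote on the final class; this averages away single-tree overfitting.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```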
Python
The main question one might ask is why look into Python as the language for machine learning. It turns out that Python has basically become the de facto standard language for scientific tools.
Books
- Python for Data Analysis
Libraries
- NumPy
- SciPy
- scikit-learn
Resources
Kaggle.com is a machine learning competition site.
Coursera classes
- Machine Learning - Stanford
- Intro to Data Science - University of Washington
- Discrete Optimization