My notes from the @pdxruby talk on 2010/04/06
Machine Learning and Data Mining
Randall Thomas
Engine Yard
- Randall’s Slides from Talk
- netflix, amazon, google: recommending movies, books and music, links based on your personal experience
- the future is about information…not data (how many gigabytes of data do you have sitting around?)
- if it’s so cool, how come everyone isn’t doing it? it’s hard
- world’s shortest stats course
- two types of statistics
- descriptive: the average height in this room is 5’ 6”
- inferential: odds are, this horse is going to come in first.
- the two tasks
- classification: you try to come up with a system for sorting data into groups (cluster analysis, decision trees)
- prediction: card counting, i predict that this deck is hot
- or both: we want to both classify the data and draw inferences about new data
- two types of learning
- supervised learning
- unsupervised learning: the way a Bayesian filter works… I have no idea what the inputs were, but I can look at the macro behavior and then make predictions. This is also the way Markov models and spam filters work.
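The "look at the macro behavior, then make predictions" idea can be sketched with a first-order Markov chain: count which state tends to follow which, then predict the most frequent follower. This is only a minimal illustration; the method names and the weather data are made up for the example and are not from the talk.

```ruby
# Count how often each state is followed by each other state.
def transition_counts(sequence)
  counts = Hash.new { |hash, key| hash[key] = Hash.new(0) }
  sequence.each_cons(2) { |from, to| counts[from][to] += 1 }
  counts
end

# Predict the most frequently observed follower of a state,
# or nil if the state was never seen with a follower.
def predict_next(counts, state)
  followers = counts.fetch(state, {})
  return nil if followers.empty?
  followers.max_by { |_to, n| n }.first
end

observed = %w[sun sun rain sun sun sun rain rain]
counts = transition_counts(observed)
puts predict_next(counts, "sun") # "sun" follows "sun" 3 times, "rain" 2 times
```

Nothing here knows why the weather changes; the prediction comes purely from the observed sequence.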
- R
- heavy-lifting tool for statistics
- has an interactive shell for working with statistics
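- 5 numbers, one picture: the five-number summary (minimum, lower quartile, median, upper quartile, maximum), drawn as a box plot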
- pallas.telperion.info/ruby-stats
- RSRuby
- lets you eval R code
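The "5 numbers, one picture" bullet presumably refers to the five-number summary that a box plot draws. Here is a minimal pure-Ruby sketch, assuming Tukey-style hinges (each half includes the median when the count is odd), which is my understanding of how R's `fivenum` behaves. The method names are illustrative and not from the talk.

```ruby
# Median of an already-sorted array of numbers.
def median(sorted)
  n = sorted.length
  mid = n / 2
  n.odd? ? sorted[mid].to_f : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Returns [min, lower hinge, median, upper hinge, max].
def five_number_summary(values)
  raise ArgumentError, "need at least one value" if values.empty?
  sorted = values.sort
  half = (sorted.length + 1) / 2 # each half includes the median when n is odd
  lower = sorted.first(half)
  upper = sorted.last(half)
  [sorted.first.to_f, median(lower), median(sorted), median(upper), sorted.last.to_f]
end

p five_number_summary([4, 1, 3, 2]) # [1.0, 1.5, 2.5, 3.5, 4.0]
```

Other quartile conventions exist and give slightly different middle values, so treat the hinges as one reasonable choice.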
- Computer friendly data descriptions
- feature vector: a simple 0 or 1 for each feature. beer, wine, whiskey, gin are the features (1 if you like it, 0 if you don’t)
- try a bitwise AND of two vectors to see which features they share
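A small Ruby sketch of the drinks example: each person becomes an array of 0s and 1s, and an element-wise AND shows which tastes two people have in common. The names `alice` and `bob` and the helper methods are hypothetical, chosen only for illustration.

```ruby
FEATURES = %w[beer wine whiskey gin].freeze

# Build a 0/1 vector: 1 if the person likes the drink, 0 otherwise.
def feature_vector(likes)
  FEATURES.map { |feature| likes.include?(feature) ? 1 : 0 }
end

# Element-wise bitwise AND: 1 only where both vectors have a 1.
def shared_features(a, b)
  a.zip(b).map { |x, y| x & y }
end

alice = feature_vector(%w[beer whiskey])      # [1, 0, 1, 0]
bob   = feature_vector(%w[beer wine whiskey]) # [1, 1, 1, 0]

overlap = shared_features(alice, bob)
p overlap     # [1, 0, 1, 0]
p overlap.sum # 2 drinks in common
```

The count of shared 1s is a crude similarity score; it ignores drinks that both people dislike.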
- Clustering…
- Simple Geometric: just use the distance formula. There is a simple formula for 2 or 3 dimensions, and it generalizes to N dimensions.
- R code: plot(sort(mydata$profits))
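The distance formula mentioned above is the Euclidean distance, which works the same way for any number of dimensions. A minimal Ruby sketch follows; `nearest_center` is a hypothetical helper showing the basic step a geometric clustering method repeats (assign each point to its closest center), not a full clustering algorithm.

```ruby
# Euclidean distance between two points with the same number of dimensions.
def euclidean_distance(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.length == b.length
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Index of the center closest to the point.
def nearest_center(point, centers)
  centers.each_index.min_by { |i| euclidean_distance(point, centers[i]) }
end

p euclidean_distance([0, 0], [3, 4])            # 5.0
p nearest_center([1, 1], [[0, 0], [10, 10]])    # 0
```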
- Not Simple Geometric Clustering
- Support Vector Machines: create maximal separation of data that is not separable as given by projecting it onto different planes (that is, into a higher-dimensional space).
- You can separate into two groups: one that is good and one that is bad; people attacking your IP ports and people who aren’t; spam and not spam.
- You can apply the SVM over and over again recursively… this turns into a decision tree.
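The projection idea can be shown without a real SVM. In the sketch below (an illustration only, not an SVM implementation), two groups of one-dimensional points cannot be split by a single cut-off, but after mapping each x to x squared they can, and the cut-off halfway between the groups leaves the widest gap. The data and method names are invented for the example.

```ruby
# True if one threshold on the values puts all of a on one side and all of b on the other.
def separable_by_threshold?(a, b)
  a.max < b.min || b.max < a.min
end

# Project a 1-D point into a second dimension by squaring it.
def project(x)
  [x, x * x]
end

# Midpoint between the groups, assuming every value in low is below every value in high.
def midpoint_threshold(low, high)
  (low.max + high.min) / 2.0
end

inner = [-1.0, 0.0, 1.0]  # e.g. "good" points near zero
outer = [-3.0, 3.0, -2.5] # e.g. "bad" points on both sides

p separable_by_threshold?(inner, outer) # false: outer surrounds inner

inner_sq = inner.map { |x| project(x)[1] }
outer_sq = outer.map { |x| project(x)[1] }

p separable_by_threshold?(inner_sq, outer_sq) # true
p midpoint_threshold(inner_sq, outer_sq)      # 3.625
```

A real SVM finds the projection implicitly (through a kernel) and solves an optimization problem for the separating boundary; this only shows why the extra dimension helps.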
- Read:
- First: Introductory Statistics with R by Peter Dalgaard (2nd edition) – teaching you the basics in a tutorial fashion
- Second: A Handbook of Statistical Analyses Using R by Brian S Everitt and Torsten Hothorn
- load the free PDF in R: vignette(package = "HSAUR")
- The Elements of Statistical Learning by Hastie, Tibshirani, Friedman
- www-stat.stanford.edu/~hastie/Papers/ESLII.pdf
- Regression in R
- Examples of companies doing this…
- Collective Intellect: doing mining of memes
Hi,
I found several free-to-download books about machine learning: Machine Learning, Application of Machine Learning, and New Advances in Machine Learning.
If you are interested, the download link is: http://www.intechopen.com/search?q=Machine+learning
Hope you will enjoy it.
I’m coming to this from the opposite side. Familiar with all the maths and stats and trying to learn Ruby by finding a Ruby implementation of familiar machine learning concepts. The iWork link seems to be for rubyists who want to conquer the world with AI.