My notes from the @pdxruby talk on 2010/04/06

Machine Learning and Data Mining
Randall Thomas
Engine Yard
  • Randall’s Slides from Talk
  • Netflix, Amazon, Google: recommending movies, books, music, and links based on your personal experience
    • the future is about information…not data (how many gigabytes of data do you have sitting around?)
    • if it’s so cool, why isn’t everyone doing it? because it’s hard
  • world’s shortest stats course
    • two types of statistics
      • descriptive: the average height in this room is 5’ 6”
      • inferential: odds are, this horse is going to come in first. 
    • the two tasks
      • classification: you try to come up with a system for classification (cluster analysis, decision trees)
      • prediction: card counting, i predict that this deck is hot
      • or both: we want to both classify the data and draw inferences about new data
  • two types of learning
    • supervised learning
    • unsupervised learning: the way a Bayesian filter works… i have no idea what the inputs were, but i can look at the macro behavior, and then make predictions. this is also the way Markov models work, the way spam filters work.
  • R
    • heavy-lifting tool for statistics
    • has an interactive shell for working with statistics
  • 5 numbers, one picture
    • pallas.telperion.info/ruby-stats
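A minimal sketch of the "5 numbers" idea in Ruby, assuming the heading refers to the five-number summary (minimum, lower quartile, median, upper quartile, maximum), with a box plot as the "one picture". The method names are made up for illustration. Quartiles are taken here as the medians of the lower and upper halves, which is only one of several conventions, so results can differ slightly from what R reports.

```ruby
# Median of an array that is already sorted.
def median(sorted)
  n = sorted.length
  mid = n / 2
  n.odd? ? sorted[mid].to_f : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Five-number summary: min, lower quartile, median, upper quartile, max.
# Assumption: quartiles are the medians of the lower and upper halves,
# leaving out the middle value when the count is odd.
def five_number_summary(values)
  raise ArgumentError, "need at least one value" if values.empty?

  sorted = values.sort
  n = sorted.length
  lower = sorted[0, n / 2]
  upper = sorted[(n + 1) / 2, n / 2]
  lower = upper = sorted if n == 1 # a single value is its own quartile

  {
    min: sorted.first.to_f,
    q1: median(lower),
    median: median(sorted),
    q3: median(upper),
    max: sorted.last.to_f
  }
end

heights = [66, 70, 62, 68, 72, 64, 74] # hypothetical heights in inches
puts five_number_summary(heights).inspect
```

These five values are the ones a box plot draws: the box spans q1 to q3, the line inside it is the median, and the whiskers reach toward min and max.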
  • RSRuby
    • lets you eval R code
  • Computer friendly data descriptions
    • feature vector: a simple 0 or 1 for each feature. beer, wine, whiskey, gin are the features (1 if you like it, 0 if you don’t)
      • try a bitwise AND of two vectors to see which features they share
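A rough Ruby sketch of the feature-vector idea, using the drinks from the notes. The method names and the two example people are hypothetical; the point is only that a 0/1 encoding makes "what do these two have in common?" a cheap element-wise AND.

```ruby
# Encode a list of liked items as a 0/1 vector, in the order given by features.
def feature_vector(likes, features)
  features.map { |f| likes.include?(f) ? 1 : 0 }
end

# Element-wise AND of two 0/1 vectors: 1 only where both vectors have a 1.
def shared_features(a, b)
  a.zip(b).map { |x, y| x & y }
end

features = %w[beer wine whiskey gin]

alice = feature_vector(%w[beer whiskey], features)      # hypothetical person
bob   = feature_vector(%w[beer wine whiskey], features) # hypothetical person

overlap = shared_features(alice, bob)

# Map the 1 bits back to feature names.
shared_names = features.zip(overlap).select { |_, bit| bit == 1 }.map(&:first)
puts "shared tastes: #{shared_names.join(', ')}"
```

Counting the 1s in the overlap gives a crude similarity score between two people.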
  • Clustering…
    • Simple Geometric: just use the distance formula. In 2 or 3 dimensions there is a simple formula, and that formula generalizes to N dimensions
    • R code: plot(sort(mydata$profits))
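The distance formula mentioned above can be sketched in Ruby for any number of dimensions: square the difference along each dimension, sum the squares, and take the square root. The helper that picks the closest center is an illustrative assumption about how distance might be used for clustering, not something shown in the talk.

```ruby
# Euclidean distance between two points with the same number of dimensions.
def euclidean_distance(a, b)
  unless a.length == b.length
    raise ArgumentError, "points must have the same number of dimensions"
  end

  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Hypothetical helper: return whichever center lies closest to the point.
def nearest_center(point, centers)
  centers.min_by { |center| euclidean_distance(point, center) }
end

puts euclidean_distance([0, 0], [3, 4])             # 2 dimensions
puts euclidean_distance([1, 0, 1, 0], [1, 1, 1, 0]) # 4 dimensions, e.g. feature vectors
puts nearest_center([1, 1], [[0, 0], [10, 10]]).inspect
```

Assigning every point to its nearest center is the basic step that distance-based clustering repeats.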
  • Not Simple Geometric Clustering
    • Support Vector Machines: create maximal separation of data that isn’t otherwise separable by projecting it into a different, higher-dimensional space.
    • You can separate into two groups: one that is good, and one that is bad. people attacking your IP ports, and people who aren’t. mail that is spam, and mail that isn’t.
    • You can apply the SVM over and over again recursively… this turns into a decision tree.
  • Read: 
    • First: Introductory Statistics with R by Peter Dalgaard (2nd edition) – teaching you the basics in a tutorial fashion
    • Second: A Handbook of Statistical Analyses Using R by Brian S Everitt and Torsten Hothorn
      • load the free PDF in R: vignette(package = "HSAUR")
    • The Elements of Statistical Learning by Hastie, Tibshirani, Friedman
      • www-stat.stanford.edu/~hastie/Papers/ESLII.pdf
  • Regression in R
  • Examples of companies doing this…
    • Collective Intellect: doing mining of memes