Katherine Daniels

@beerops
kd@gamechanger.io
  • The site is going down.
  • But everything seemed to be fine.
    • Checked web servers, databases, Mongo, and more.
  • What was wrong? The monitoring tool wasn’t telling us.
  • One idea: monitor more. Monitor everything.
    • But if you’re looking for a needle in a haystack, the solution is not to add more hay.
    • Monitoring everything just adds more stuff to weed through, including thousands of things that might not be good (e.g. disk quota too high) but aren’t actually what’s causing the problem.
  • Monitor some of the things. The right things. But which things? If we knew, we’d already be monitoring them.
  • Talked to Amazon…
    • “try switching the load balancer”
    • “try switching the web server”
  • We had written a service called healthd that was supposed to monitor api1 and api2.
  • But we didn’t have logging for healthd, so we didn’t know what was wrong.
  • We needed more detail.
  • So we added logging, so we could tell which API had the problem (a rough sketch follows).
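    • A rough sketch of that kind of check-plus-logging (the endpoints, ports, and log format here are assumptions, not the actual healthd code):

          # Check each API and log *which* one is failing, and why.
          import logging
          import urllib.request

          logging.basicConfig(filename="healthd.log", level=logging.INFO,
                              format="%(asctime)s %(levelname)s %(message)s")

          ENDPOINTS = {
              "api1": "http://localhost:8001/health",
              "api2": "http://localhost:8002/health",
          }

          def check_all():
              for name, url in ENDPOINTS.items():
                  try:
                      with urllib.request.urlopen(url, timeout=5) as resp:
                          logging.info("%s healthy (HTTP %s)", name, resp.status)
                  except Exception as exc:
                      # The log now says exactly which API is unhappy.
                      logging.error("%s unhealthy: %s", name, exc)

          if __name__ == "__main__":
              check_all()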
  • We also had some people who tried the monitor-everything approach.
  • They uncovered a user who seemed to be scripting the site.
  • They added metrics for where the time was being spent in the API handlers.
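    • A sketch of that kind of per-handler timing (the handler and the logging sink are made up; a real setup would more likely ship these numbers to something like statsd or Graphite):

          import functools
          import logging
          import time

          logging.basicConfig(level=logging.INFO)

          def timed(handler):
              """Record how long each API handler takes."""
              @functools.wraps(handler)
              def wrapper(*args, **kwargs):
                  start = time.monotonic()
                  try:
                      return handler(*args, **kwargs)
                  finally:
                      elapsed_ms = (time.monotonic() - start) * 1000
                      logging.info("handler=%s elapsed_ms=%.1f",
                                   handler.__name__, elapsed_ms)
              return wrapper

          @timed
          def get_schedule(team_id):
              # Stand-in for real handler work.
              time.sleep(0.05)
              return {"team": team_id, "games": []}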
  • The site would go down for a minute each time things would blip.
  • We set the timeouts to be lower.
  • We found some database queries to be optimized.
  • We found some old APIs that we didn’t need and we removed them.
  • The end result was that things got better. The servers were mostly happy.
  • But the real question is: How did we get to a point where our monitoring didn’t tell us what we needed? We thought we were believers in monitoring. And yet we got stuck.
  • Black Boxes (of mysterious mysteries)
    • Using services in the cloud gives you less visibility
  • Why did we have two different API services…cohabiting…and not being well monitored?
    • No one had the goal of creating a bad solution.
    • But we’re stuck. So how do we fix it?
    • We stuck nginx in front and let it route between them.
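      • Roughly what that routing looks like in nginx (the ports, paths, and timeout values here are hypothetical, not the actual config):

          upstream api_old { server 127.0.0.1:8001; }   # the older API service
          upstream api_new { server 127.0.0.1:8002; }   # the newer API service

          server {
              listen 80;

              # Keep timeouts short so one slow backend can't tie up connections.
              proxy_connect_timeout 2s;
              proxy_read_timeout    10s;

              # Route by path prefix; everything else goes to the new API.
              location /api/v1/ { proxy_pass http://api_old; }
              location /        { proxy_pass http://api_new; }
          }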
  • What things should you be thinking about?
    • Services: 
      • Are the services that should be running actually running?
      • Use Sensu or Nagios (see the check sketch at the end of this list).
    • Responsiveness:
      • Is the service responding?
    • System metrics:
      • CPU utilization, disk space, etc.
      • What’s worth an alert depends on the host’s role: a web server shouldn’t be using all its memory, a Mongo box should be, and if it isn’t, that’s a problem.
    • Application metrics:
      • Are we monitoring performance, errors?
      • Do we have the thresholds set right?
      • We don’t want to look at a sea of red: “Oh, just ignore that. It’s supposed to be red.”
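    • A minimal sketch of a Sensu/Nagios-style check covering “is it running / is it responding” (the URL and thresholds are invented; what counts as WARNING vs CRITICAL depends on the host’s role):

          #!/usr/bin/env python3
          # Exit codes follow the plugin convention: 0=OK, 1=WARNING, 2=CRITICAL.
          import sys
          import time
          import urllib.request

          URL = "http://localhost:8000/health"
          WARN_SECONDS = 0.5
          CRIT_SECONDS = 2.0

          def main():
              start = time.monotonic()
              try:
                  with urllib.request.urlopen(URL, timeout=CRIT_SECONDS) as resp:
                      if resp.status != 200:
                          print(f"CRITICAL: {URL} returned HTTP {resp.status}")
                          return 2
              except Exception as exc:
                  print(f"CRITICAL: {URL} not responding ({exc})")
                  return 2
              elapsed = time.monotonic() - start
              if elapsed > WARN_SECONDS:
                  print(f"WARNING: {URL} responded in {elapsed:.2f}s")
                  return 1
              print(f"OK: {URL} responded in {elapsed:.2f}s")
              return 0

          if __name__ == "__main__":
              sys.exit(main())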
  • Work through what happens when something breaks:
    • Had 20 servers running 50 queues each. 
    • Each one has its own Sensu check, and HipChat shows an alert for each one… a thousand alerts.
  • You must load test your monitoring system: Will it behave correctly under outages and other problems?
  • “Why didn’t you tell me my service was down?” “Service, what service? You didn’t tell us you were running a service.”