Katherine Daniels

@beerops
kd@gamechanger.io
  • The site is going down.
  • But everything seemed to be fine.
    • Checked web servers, databases, Mongo, and more.
  • What was wrong? The monitoring tool wasn’t telling us.
  • One idea: monitor more. Monitor everything.
    • But if you’re looking for a needle in a haystack, the solution is not to add more hay.
    • Monitoring everything just adds more stuff to weed through, including thousands of things that might not be good (e.g. disk quota too high) but aren’t actually what’s causing the problem.
  • Monitor some of the things. The right things. But which things? If we knew, we’d already be monitoring them.
  • Talked to Amazon…
    • “try switching the load balancer”
    • “try switching the web server”
  • We had written a service called healthd that was supposed to monitor api1 and api2.
  • But we didn’t have logging for healthd, so we didn’t know what was wrong.
  • We needed more detail.
  • So we added logging, so we could tell which API had the problem (a rough sketch follows).
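    • A rough sketch of that kind of check-plus-logging (the endpoints, ports, and log format here are assumptions, not the actual healthd code):

          # Check each API and log *which* one is failing, and why.
          import logging
          import urllib.request

          logging.basicConfig(filename="healthd.log", level=logging.INFO,
                              format="%(asctime)s %(levelname)s %(message)s")

          ENDPOINTS = {
              "api1": "http://localhost:8001/health",
              "api2": "http://localhost:8002/health",
          }

          def check_all():
              for name, url in ENDPOINTS.items():
                  try:
                      with urllib.request.urlopen(url, timeout=5) as resp:
                          logging.info("%s healthy (HTTP %s)", name, resp.status)
                  except Exception as exc:
                      # The log now says exactly which API is unhappy.
                      logging.error("%s unhealthy: %s", name, exc)

          if __name__ == "__main__":
              check_all()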
  • We also had some people who tried the monitor-everything approach.
  • They uncovered a user who seemed to be scripting the site.
  • They added metrics for where the time was being spent in the API handlers.
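    • A sketch of that kind of per-handler timing (the handler and the logging sink are made up; a real setup would more likely ship these numbers to something like statsd or Graphite):

          import functools
          import logging
          import time

          logging.basicConfig(level=logging.INFO)

          def timed(handler):
              """Record how long each API handler takes."""
              @functools.wraps(handler)
              def wrapper(*args, **kwargs):
                  start = time.monotonic()
                  try:
                      return handler(*args, **kwargs)
                  finally:
                      elapsed_ms = (time.monotonic() - start) * 1000
                      logging.info("handler=%s elapsed_ms=%.1f",
                                   handler.__name__, elapsed_ms)
              return wrapper

          @timed
          def get_schedule(team_id):
              # Stand-in for real handler work.
              time.sleep(0.05)
              return {"team": team_id, "games": []}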
  • The site would go down for a minute each time things would blip.
  • We set the timeouts to be lower.
  • We found some database queries to be optimized.
  • We found some old APIs that we didn’t need and we removed them.
  • The end result was that things got better. The servers were mostly happy.
  • But the real question is: How did we get to a point where our monitoring didn’t tell us what we needed? We thought we were believers in monitoring. And yet we got stuck.
  • Black Boxes (of mysterious mysteries)
    • Using services in the cloud gives you less visibility
  • Why did we have two different API services…cohabiting…and not being well monitored?
    • No one had the goal of creating a bad solution.
    • But we’re stuck. So how do we fix it?
    • We stuck nginx in front and let it route between them.
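      • Roughly what that routing looks like in nginx (the ports, paths, and timeout values here are hypothetical, not the actual config):

          upstream api_old { server 127.0.0.1:8001; }   # the older API service
          upstream api_new { server 127.0.0.1:8002; }   # the newer API service

          server {
              listen 80;

              # Keep timeouts short so one slow backend can't tie up connections.
              proxy_connect_timeout 2s;
              proxy_read_timeout    10s;

              # Route by path prefix; everything else goes to the new API.
              location /api/v1/ { proxy_pass http://api_old; }
              location /        { proxy_pass http://api_new; }
          }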
  • What things should you be thinking about?
    • Services: 
      • Are the services that should be running actually running?
      • Use Sensu or Nagios (see the check sketch at the end of this list).
    • Responsiveness:
      • Is the service responding?
    • System metrics:
      • CPU utilization, disk space, etc.
      • What’s worth an alert depends on the host’s role: a web server shouldn’t be using all its memory, a Mongo box should be, and if it isn’t, that’s a problem.
    • Application metrics:
      • Are we monitoring performance, errors?
      • Do we have the thresholds set right?
      • We don’t want to look at a sea of red: “Oh, just ignore that. It’s supposed to be red.”
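    • A minimal sketch of a Sensu/Nagios-style check covering “is it running / is it responding” (the URL and thresholds are invented; what counts as WARNING vs CRITICAL depends on the host’s role):

          #!/usr/bin/env python3
          # Exit codes follow the plugin convention: 0=OK, 1=WARNING, 2=CRITICAL.
          import sys
          import time
          import urllib.request

          URL = "http://localhost:8000/health"
          WARN_SECONDS = 0.5
          CRIT_SECONDS = 2.0

          def main():
              start = time.monotonic()
              try:
                  with urllib.request.urlopen(URL, timeout=CRIT_SECONDS) as resp:
                      if resp.status != 200:
                          print(f"CRITICAL: {URL} returned HTTP {resp.status}")
                          return 2
              except Exception as exc:
                  print(f"CRITICAL: {URL} not responding ({exc})")
                  return 2
              elapsed = time.monotonic() - start
              if elapsed > WARN_SECONDS:
                  print(f"WARNING: {URL} responded in {elapsed:.2f}s")
                  return 1
              print(f"OK: {URL} responded in {elapsed:.2f}s")
              return 0

          if __name__ == "__main__":
              sys.exit(main())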
  • Work through what happens when something breaks:
    • Had 20 servers running 50 queues each. 
    • Each one has its own Sensu check, and HipChat shows an alert for each one… a thousand alerts.
  • You must load test your monitoring system: Will it behave correctly under outages and other problems?
  • “Why didn’t you tell me my service was down?” “Service, what service? You didn’t tell us you were running a service.”