Katherine Daniels
@beerops
kd@gamechanger.io
- The site is going down.
- But everything seemed to be fine.
-
- Checked web servers, databases, Mongo, and more.
- What was wrong? The monitoring tool wasn’t telling us.
- One idea: monitor more. Monitor everything.
-
- But if you’re looking for a needle in a haystack, the solution is not to add more hay.
- Monitoring everything just adds more stuff to weed through, including thousands of things that might not be good (e.g. disk quota too high) but aren’t actually what’s causing the problem.
- Monitor some of the things. The right things. But which things? If we knew, we’d already be monitoring.
- Talk to Amazon…
-
- “try switching the load balancer”
- “try switching the web server”
- We had written a service called healthd that was supposed to monitor api1 and api2.
- But we didn’t have logging for healthd, so we didn’t know what was wrong.
- We needed more detail.
- So we added logging, so we knew which API had a problem.
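- Roughly something like the sketch below — hypothetical only; healthd’s real code, its log format, and the API health endpoints aren’t described in the talk.

```python
# Hypothetical sketch of a healthd-style checker with per-API logging.
# The endpoint URLs and log file name are assumptions, not from the talk.
import logging
import urllib.request

logging.basicConfig(filename="healthd.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

APIS = {
    "api1": "http://api1.internal/health",
    "api2": "http://api2.internal/health",
}

def check(name, url):
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            logging.info("%s responded with HTTP %s", name, resp.status)
            return resp.status == 200
    except Exception as exc:
        # Now the log says *which* API is unhealthy, and why.
        logging.error("%s check failed: %s", name, exc)
        return False

if __name__ == "__main__":
    for name, url in APIS.items():
        check(name, url)
```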
- We also had some people who tried the monitor-everything approach.
- They uncovered a user who seemed to be scripting the site.
- They added metrics for where the time was being spent in the API handlers (see the timing sketch below).
- The site would go down for a minute each time things would blip.
- We set the timeouts to be lower.
- We found some database queries to be optimized.
- We found some old APIs that we didn’t need and we removed them.
- The end result was that things got better. The servers were mostly happy.
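- The per-handler timing mentioned above might look something like this — an illustrative sketch with made-up metric and handler names and an assumed local statsd agent; the talk only says such metrics were added.

```python
# Illustrative sketch of per-handler timing: a decorator that emits a
# statsd-style timer over UDP. Metric names, the handler, and the statsd
# address are assumptions.
import socket
import time
from functools import wraps

STATSD = ("127.0.0.1", 8125)
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def timed(metric):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                ms = (time.monotonic() - start) * 1000
                _sock.sendto(f"{metric}:{ms:.1f}|ms".encode(), STATSD)
        return wrapper
    return decorator

@timed("api.handlers.list_games")  # hypothetical handler name
def list_games(team_id):
    ...
```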
- But the real question is: How did we get to a point where our monitoring didn’t tell us what we needed? We thought we were believers in monitoring. And yet we got stuck.
- Black Boxes (of mysterious mysteries)
-
- Using services in the cloud gives you less visibility
- Why did we have two different API services…cohabiting…and not being well monitored?
-
- No one had the goal of creating a bad solution.
- But we’re stuck. So how do we fix it?
- We stuck nginx in front and let it route between them.
- What things should you be thinking about?
-
- Services:
-
- Are the services that should be running actually running?
- Use Sensu or Nagios.
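- A check is just a script that reports status through its exit code (0 = OK, 1 = warning, 2 = critical). A minimal sketch, with a placeholder process name:

```python
#!/usr/bin/env python3
# Minimal Sensu/Nagios-style check: status is signaled via exit codes
# (0 = OK, 1 = WARNING, 2 = CRITICAL). The process name is a placeholder.
import subprocess
import sys

PROCESS = "healthd"  # whatever should be running on this box

# pgrep exits 0 if at least one matching process exists
result = subprocess.run(["pgrep", "-f", PROCESS], stdout=subprocess.DEVNULL)
if result.returncode == 0:
    print(f"OK: {PROCESS} is running")
    sys.exit(0)
print(f"CRITICAL: {PROCESS} is not running")
sys.exit(2)
```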
- Responsiveness:
-
- Is the service responding?
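- A responsiveness check uses the same exit-code convention but actually makes a request and times it — the URL and thresholds below are assumptions:

```python
# Sketch of a responsiveness check; endpoint and thresholds are illustrative.
import sys
import time
import urllib.request

URL = "http://localhost:8080/health"
WARN_S, CRIT_S = 1.0, 5.0

start = time.monotonic()
try:
    urllib.request.urlopen(URL, timeout=CRIT_S)
except Exception as exc:
    print(f"CRITICAL: {URL} not responding ({exc})")
    sys.exit(2)
elapsed = time.monotonic() - start
if elapsed > WARN_S:
    print(f"WARNING: {URL} took {elapsed:.2f}s")
    sys.exit(1)
print(f"OK: {URL} responded in {elapsed:.2f}s")
sys.exit(0)
```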
- System metrics:
-
- CPU utilization, disk space, etc.
- What’s worth an alert depends on the host: a web server shouldn’t use all of its memory; a Mongo DB should, and if it isn’t, that’s a problem.
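- For example, a memory check might carry different thresholds per role — the role names and numbers here are illustrative, not from the talk:

```python
# Sketch: the same metric warrants different thresholds on different roles.
import sys

MAX_MEMORY = {
    "webserver": 0.80,  # a web server using all its RAM is a problem
    "mongodb": 0.98,    # mongo is *expected* to use nearly all of it
}
MIN_MEMORY = {
    "mongodb": 0.50,    # mongo using very little RAM is also suspicious
}

def check_memory(role, used):
    if used > MAX_MEMORY.get(role, 0.90):
        print(f"CRITICAL: {role} memory at {used:.0%}")
        return 2
    if used < MIN_MEMORY.get(role, 0.0):
        print(f"WARNING: {role} memory unusually low at {used:.0%}")
        return 1
    print(f"OK: {role} memory at {used:.0%}")
    return 0

if __name__ == "__main__":
    sys.exit(check_memory("mongodb", 0.97))
```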
- Application metrics:
-
- Are we monitoring performance, errors?
- Do we have the thresholds set right?
- We don’t want to look at a sea of red: “Oh, just ignore that. It’s supposed to be red.”
- Work through what happens:
-
- Had 20 servers running 50 queues each.
- Each one has its own Sensu monitor. HipChat shows an alert for each one… a thousand alerts.
- You must load test your monitoring system: Will it behave correctly under outages and other problems?
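- One common pattern (illustrative only, not necessarily what GameChanger did) is to aggregate related alerts into a single summary before they hit chat:

```python
# Illustrative sketch: collapse a flood of per-queue alerts into one summary
# line per check before posting to chat.
from collections import defaultdict

def summarize(alerts):
    """alerts: iterable of (check_name, host) pairs from the monitoring system."""
    by_check = defaultdict(list)
    for check, host in alerts:
        by_check[check].append(host)
    return "\n".join(
        f"{check}: {len(hosts)} hosts affected (e.g. {', '.join(sorted(hosts)[:3])})"
        for check, hosts in by_check.items()
    )

if __name__ == "__main__":
    flood = [("queue_depth", f"worker{n:02d}") for n in range(20)]
    print(summarize(flood))  # one summary line instead of twenty messages
```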
- “Why didn’t you tell me my service was down?” “Service, what service? You didn’t tell us you were running a service.”
Sensu: http://sensuapp.org/