Adrian Cockcroft
@adrianco
Battery Ventures
Please, no More Minutes, Milliseconds, Monoliths… Or Monitoring Tools!
#Monitorama May 2014
- Why am I at a monitoring talk when I’m known as the Cloud guy?
- 20 Years of free and open source tools for monitoring
- “Virtual Adrian” rules
-
- disk rule, applied to all disks at once: look for slow and unbalanced usage
- network rule: look for slow and unbalanced usage
- …
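A rule of this shape can be sketched in a few lines. The thresholds, input format, and function name below are illustrative assumptions for the sketch, not the actual “virtual Adrian” implementation:

```python
from statistics import mean

def check_disks(disks, slow_ms=20.0, imbalance_ratio=2.0):
    """Apply a slow/unbalanced rule to all disks at once.

    disks: dict of disk name -> (service_time_ms, busy_pct).
    Flags disks whose service time exceeds slow_ms, and disks that
    are busier than imbalance_ratio times the average busy%.
    Thresholds are illustrative policy choices, not canonical values.
    """
    alerts = []
    avg_busy = mean(busy for _, busy in disks.values()) or 1e-9
    for name, (svc_ms, busy) in disks.items():
        if svc_ms > slow_ms:
            alerts.append((name, "slow", svc_ms))
        if busy > imbalance_ratio * avg_busy:
            alerts.append((name, "unbalanced", busy))
    return alerts
```

The point of checking all disks together is that imbalance is only visible relative to the peer disks, not from any single disk’s numbers.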
- No more monitoring tools
-
- We have too many already
- We need more analysis tools
- Rule #1: Spend more time working on code that analyzes the meaning of metrics than the code that collects, moves, stores, and displays metrics.
- What’s wrong with minutes?
-
- Takes too long to see a problem
- Something breaks at 2m20s
- 40s of failure pass before the next collection; the 1st high metric is seen by the agent on the instance (3m)
- 1st high metric makes it to the central server (3m30s)
- One data point isn’t enough to alert on, so it takes 3 data points (5m30s)
- More than 5 minutes after the failure, we finally take action
- Should be monitoring by the second
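The timeline above can be checked with a tiny model. The 30s transport delay is inferred from the 3m30s figure on the slide; the per-second case assumes a 1s interval and 1s transport:

```python
import math

def detection_time(failure_t, interval, transport, points=3):
    """When does alerting confirm a failure at time failure_t (seconds)?

    The agent samples every `interval` seconds, each sample takes
    `transport` seconds to reach the central server, and alerting
    waits for `points` consecutive bad samples. A simplified model
    of the timeline in the talk, not any specific tool's behavior.
    """
    first_sample = math.ceil(failure_t / interval) * interval
    last_needed = first_sample + (points - 1) * interval
    return last_needed + transport

# Failure at 2m20s (140s), minute-level collection: confirmed at 5m30s.
minute_level = detection_time(140, interval=60, transport=30)

# Same failure with second-level collection: confirmed ~3s after it starts.
second_level = detection_time(140, interval=1, transport=1)
```

Switching from minutes to seconds turns a five-minute detection lag into a few seconds, which is the whole argument of this section.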
- SaaS based products show what can be done
-
- monitoring by the second
- Netflix: Streaming metrics directly from front end services to a web browser
- Rule #2: Metric to display latency needs to be less than human attention span (~10s)
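One way to get per-second metrics into a browser with latency well under the ~10s attention span is Server-Sent Events, which an `EventSource` in the browser can consume directly. This is a generic sketch, not Netflix’s actual streaming mechanism; `metric_source` is an assumed callable returning the current metrics:

```python
import json
import time

def sse_stream(metric_source, interval_s=1.0, max_events=None):
    """Yield Server-Sent Events frames carrying per-second metrics.

    metric_source: callable returning a dict of current metric values
    (an assumption for this sketch). Each yielded string is one SSE
    frame in the standard "data: ...\\n\\n" wire format.
    """
    sent = 0
    while max_events is None or sent < max_events:
        payload = json.dumps(metric_source())
        yield "data: " + payload + "\n\n"  # SSE wire format
        sent += 1
        if max_events is None or sent < max_events:
            time.sleep(interval_s)
```

A web framework would serve this generator with the `text/event-stream` content type; the browser redraws on every event, so metric-to-display latency is roughly one collection interval.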
- What’s wrong with milliseconds?
-
- Some JVM tools measure response times in ms
-
- Network round trip within a datacenter is less than 1ms
- SSD access latency is usually less than 1 ms
- Cassandra response times can be less than 1ms
- Rounding errors make 1ms insufficient to accurately measure and detect problems.
- Rule #3: Validate that your measurement system has enough accuracy and precision
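A worked illustration of the rounding problem: a regression from 0.2ms to 0.9ms matters for SSD reads, in-datacenter RPCs, or fast Cassandra responses, yet a tool that truncates to whole milliseconds reports both as 0ms. Values here are synthetic:

```python
def percentile(samples, p):
    """Naive percentile: the value at rank p% of the sorted samples."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]

# True latencies in milliseconds: one service answers in 0.2 ms,
# one has degraded to 0.9 ms -- a 4.5x regression.
fast = [0.2] * 100
slow = [0.9] * 100

# Truncating to whole milliseconds hides the regression entirely.
assert int(percentile(fast, 99)) == int(percentile(slow, 99)) == 0

# Measuring in microseconds preserves the difference.
fast_us = [x * 1000 for x in fast]
slow_us = [x * 1000 for x in slow]
assert percentile(slow_us, 99) / percentile(fast_us, 99) == 4.5
```

The fix is not a better percentile formula but a finer unit: record in microseconds (or nanoseconds) and round only at display time.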
- Monolithic Monitoring Systems
-
- Simple to build and install, but problematic
- What happens when it goes down? How does it get deployed?
- Should be a pool of analysis/display aggregators and a pool of distributed collection systems, all monitoring a large number of applications
- Scalability:
-
- problems scaling data collection, analysis, and reporting throughput
- limitations on the number of metrics that can be monitored
- In-Band, Out-of-band, or both?
-
- In-band monitoring can leave you blind during an outage
- SaaS is out of band, but can also go down sometimes
- So the right answer is to have both SaaS and internal monitoring, so that no single outage can take everything out
- Rule #4: Monitoring systems need to be more available and scalable than the systems being monitored.
- Issues with Continuous Delivery and Microservices
-
- High rate of change
-
- Code pushes can cause floods of new instances and metrics
- Short baseline for alert threshold analysis: everything looks unusual
- Ephemeral configurations
-
- short lifetimes make it hard to aggregate historical views
- Hand-tweaked monitoring tools take too much work to keep running
- Microservices with complex calling patterns
-
- end-to-end request flow measurements are very important
- Request flow visualizations get very complicated
- How many microservices? Some companies go from zero to 450 in a year
- “Death Star” Architecture Diagrams
-
- You have to spend time thinking about visualizations
- You need hierarchy: ways to see micro services but also groups of services
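One way to get that hierarchy is to roll service-to-service call edges up into group-to-group edges. The grouping map below is an assumed ownership mapping; a real system might derive it from service metadata:

```python
from collections import Counter

def collapse_edges(call_edges, group_of):
    """Collapse a microservice call graph into a group-level view.

    call_edges: iterable of (caller, callee) service-name pairs.
    group_of: dict mapping each service to its group (assumed input).
    Returns a Counter of (caller_group, callee_group) -> call count,
    dropping within-group edges so the picture stays readable.
    """
    rollup = Counter()
    for caller, callee in call_edges:
        g1, g2 = group_of[caller], group_of[callee]
        if g1 != g2:
            rollup[(g1, g2)] += 1
    return rollup
```

Drawing the collapsed graph first, then expanding one group at a time, keeps a 450-service “death star” navigable.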
- Autoscaled ephemeral instances at Netflix (the old way)
-
- Largest services use autoscaled red/black code pushes
- average lifetime of an instance is 36 hours
- Uses trailing load indicators
- Scryer: Predictive Auto-scaling at Netflix
-
- Higher load in the mornings; Sat/Sun have high traffic
- Lower load on Wednesday
- 24 hours of predicted traffic vs. actual
- Uses forward prediction to scale based on expected load.
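The core idea of forward prediction can be sketched as “average the same time-of-week slots over the last few weeks” rather than reacting to trailing load. This is a simplified illustration, not Scryer’s actual algorithm:

```python
def predict_load(history, horizon_slots, week_slots, weeks=3):
    """Forward-predict traffic from the weekly pattern.

    history: list of load samples, one per time slot.
    week_slots: number of slots in one week at this granularity.
    For each future slot, averages the observations at the same
    time-of-week over the past `weeks` weeks. A toy sketch of the
    idea, not Netflix's Scryer implementation.
    """
    preds = []
    n = len(history)
    for h in range(horizon_slots):
        slot = n + h
        past = [history[slot - w * week_slots]
                for w in range(1, weeks + 1)
                if slot - w * week_slots >= 0]
        preds.append(sum(past) / len(past))
    return preds
```

Because the prediction leads the load instead of trailing it, capacity can be added before the morning ramp rather than minutes into it.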
- Monitoring Tools for Developers
-
- Most monitoring tools are built to be used by operations people
-
- Focus on individual systems rather than applications
- Focus on utilization rather than throughput and response time.
- Hard to integrate and extend
- Developer oriented monitoring tools
-
- Application Performance Measurement (APM) and Analysis
- Business transactions, response time, JVM internal metrics
- Logging business metrics directly
- APIs for integration, data extraction, deep linking and embedding
-
- deep linking: should be able to cut and paste a link to show anyone exactly the data I’m seeing
- embedding: to be able to put it in a wiki page or elsewhere
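Deep linking works when the metric, the absolute time range, and the filters all live in the URL, so pasting it reproduces the exact view. The parameter names below are made up for the sketch; a real tool defines its own query schema:

```python
from urllib.parse import urlencode

def deep_link(base_url, metric, start, end, filters=None):
    """Build a shareable URL that reproduces the current view.

    Encodes metric name, absolute start/end timestamps, and any
    extra filters as query parameters (sorted for stable URLs).
    Parameter names are illustrative assumptions.
    """
    params = {"metric": metric, "start": start, "end": end}
    if filters:
        params.update(filters)
    return base_url + "?" + urlencode(sorted(params.items()))
```

Using absolute timestamps rather than relative ones (“last hour”) matters: the recipient should see the same data, not the same window shifted to their clock.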
- Dynamic and Ephemeral Challenges
-
- Datacenter Assets
-
- Arrive infrequently, disappear infrequently
- Stick around for three years or so before they get retired
- Have unique IP and MAC addresses
- Cloud Assets
-
- Arrive in bursts. A Netflix code push creates over a hundred per minute
- Stick around for a few hours before they get retired
- Often reuse the IP and MAC address that was just vacated.
- Use Netflix OSS Edda to record a full history of your configuration
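The key Edda idea is to append timestamped configuration versions instead of overwriting state, so you can answer “which instance had this IP at 02:00?” even after the IP was reused. This is a toy in-memory sketch of that idea; Edda itself polls AWS APIs and stores the resulting documents:

```python
import time

class ConfigHistory:
    """Append-only record of instance configuration over time.

    A minimal sketch of the Edda concept: record() never overwrites,
    so historical queries work even after IPs and MACs are reused
    by newer instances.
    """
    def __init__(self):
        self._versions = []  # list of (timestamp, instance_id, config)

    def record(self, instance_id, config, ts=None):
        self._versions.append(
            (ts if ts is not None else time.time(), instance_id, dict(config)))

    def who_had_ip(self, ip, at_ts):
        """Return the last instance recorded with this IP at or before at_ts."""
        owner = None
        for ts, iid, cfg in self._versions:
            if ts <= at_ts and cfg.get("ip") == ip:
                owner = iid
        return owner
```

With a three-year datacenter server this history barely matters; with instances living 36 hours and IPs recycled within minutes, it is the only way to attribute old metrics correctly.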
- Distributed Cloud Application Challenges
-
- Cloud provider data stores don’t have the usual monitoring hooks: there is no way to install an agent on AWS MySQL.
- Dependency on web services as well as code.
- Cloud applications span zones and regions: monitoring tools also need to span and aggregate zones and regions.
- Links
-
- http://techblog.netflix.com: read about Netflix tools and solutions to these problems
- Adrian’s blog: http://perfcap.blogspot.com
- Slideshare: http://slideshare.com/adriancockcroft
- Q&A:
-
- Post-commit, pre-deploy statistical tests. What do you test?
-
- Error rate. Performance. Latency.
- Using JMeter to drive load.
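A pre-deploy gate of that kind can be sketched as a comparison of error rate and latency between the current build and the candidate under identical load. The input shape and thresholds are illustrative policy choices, not a description of any specific pipeline:

```python
from statistics import mean

def canary_passes(baseline, canary, max_latency_ratio=1.2,
                  max_error_rate_delta=0.01):
    """Post-commit, pre-deploy statistical gate (sketch).

    baseline/canary: dicts with 'latencies' (ms samples under the
    same driven load, e.g. from JMeter), 'errors', and 'requests'.
    Passes if mean latency grew by at most max_latency_ratio and
    error rate grew by at most max_error_rate_delta (absolute).
    """
    def err_rate(run):
        return run["errors"] / run["requests"]

    latency_ok = (mean(canary["latencies"])
                  <= max_latency_ratio * mean(baseline["latencies"]))
    errors_ok = err_rate(canary) <= err_rate(baseline) + max_error_rate_delta
    return latency_ok and errors_ok
```

Comparing against a freshly measured baseline, rather than a fixed threshold, sidesteps the short-baseline problem that continuous delivery creates for alerting.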