Sorry, I got to “Beyond LAMP” about 25 minutes late – my notes don’t include anything from the first part of the meeting.
Updated: Here are some additional great notes covering the beginning part of the session, as well as some more organized notes from the end part.
- twitter uses cassandra
- no disk seeks when you do a write
- no master, you can do write on any machine
- when you post a twitter, it gets written into the queue for each of your followers. so if you have 1,000 followers, then it’s written 1,000 times.
- cassandra designed to use commodity servers
- monitoring
- one of the tricks is to know when a machine needs to be replaced when all you have are hundreds of commodity servers
- monitor, monitor, monitor
- cpu load, file descripters, bandwidth, database connections, database performance, disk space, etc.
- need monitoring system, ability to graph, all centrally, so you don’t have to go to individual machines.
- Being on the front page of digg crashed server (digg sends a lot of traffic)
- Why not memcache the front page? It gets loaded all day long, and it is always the same.
- Rewrote the system to only read from database when it’s not memcached. Refresh memcache once per day.
- Before change, was 60% db writes, 40% reads. After change was 99% db writes, only 1% reads. All the reads were now coming from memcache.
- At twitter, expect that people will come along and read what you’ve written. So they do write-thru caching. The tweet is put first into the cache, then into the database. This way they never need to read the database to get the recent tweets.
- when you get beyond a certain point, you can’t analyze the data on a single machine. you have terabytes and terabytes of data.
- Hadoop lets you run distributed jobs, that automatically retry when systems go down or fail.
- Without this, as the data grows, you end up asking simpler and simpler questions.
- With this, you can ask more sophisticated questions.
- Scaling search…
- they can process 1 to 2 searches per second
- search is hard
- What was the first thing to blow up for you?
- 1st was mysql, 2nd was apache.
- made the switch over to engineX for serving up images. much, much faster. Using Apache was like using a sledgehammer to server up images.
- connection issues with postgres.
- migrating data schema when at scale is really hard… turn off indexes before copying data
- twitter: one thing that kept our ops team awake at night…
- we are a rails app
- how do we maintain relationships?
- we had it normalize with a follower table: user_id, follower_id in a single table.
- lookups against this table were table
- they built an intermediate solution… denormalized data structure
- while they worked on a longer term solution… they built a custom social graph data tool.
- it need to work across 7 orders of magnitude: from someone with 1 follower to someone with 1M followers
- Questions
- Deployment
- twitter uses murder: bit-torrent for deployment. seed some servers, then those servers help feed others. brought main app deployment time from 12 minutes to 37 seconds. check out twitter opensource – they open source most of their tools
- Hardware databases
- twitter is using some, facebook experimenting with them. they are PCI express cards almost as fast as main memory.
- databases versus key stores
- it’s natural to go to denormalized – you just want the data you want in the form you want
- over time, more of the logic goes into the application code, so database indexes are less useful
- how do you manage when data is on particular servers
- at twitter, using cassandra, are already consistent
- there are bunch of new systems that have different tradeoffs. some have eventual consistency, some don’t. if your application can’t handle eventual consistency, then cassandra isn’t for you.
- did anyone consider any of the top ten database, like say oracle
- twitter: we strongly prefer open source systems. as we scale, we like to be able to peek under the hood, and see what is going on.
- facebook: we like open source, we like the way open source projects work together, we like to be nimble – these proprietary systems are not so nimble.
- it’s a combination of openness, ideology, and cost.
- berkeleydb versus memcached
- memcached is just a wrapper on top of berkely db
her find out in our online store lamps and other home products which will make you home like heaven.
Thanks
Cheap Lamp Seller