You're viewing old version number 8. - Current version
Key-Value database info
(Not everything here is related to key-value database storage.)
Culling posts tagged #database at ToledoTalk.com
http://toledotalk.com/cgi-bin/tt.pl/search/%23database
[X] - https://speakerdeck.com/sunnygleason/practical-tips-for-using-mysql-as-a-key-value-store
Adam D'Angelo, MySQL user since 2004
If you partition your data at the application level, MySQL scalability isn't an issue. Facebook reported [1] running 1800 MySQL servers with just two DBAs in 2008. You can't do joins across partitions, but the NoSQL databases don't allow this anyway. Facebook hasn't confirmed using Cassandra as the primary source for any data, and it seems like inbox search might be their only use of it. [2]
These distributed databases like Cassandra, MongoDB, and CouchDB3 aren't actually very scalable or stable. Twitter apparently has been trying to move from MySQL to Cassandra for over a year. When someone reports using one of these systems as their primary data store for over 1000 machines for over a year, I'll reconsider my opinion on this.
<< Update as of August 2011: after I wrote this, foursquare reported an 11-hour downtime because of MongoDB. [4] Separately, a friend's startup that was going through explosive growth tried to switch to MongoDB and gave up after a month due to instability. Twitter gave up on the Cassandra migration. [5] Facebook is moving away from Cassandra. [6] HBase is getting better but is still risky if you don't have people around with a deep understanding of it. [7] >>
The primary online data store for an application is the worst place to take a risk with new technology. If you lose your database or there's corruption, it's a disaster that could be impossible to recover from. If you're not the developer of one of these new databases, and you're one of a very small number of companies using them at scale in production, you're at the mercy of the developer to fix bugs and handle scalability issues as they come up.
You can actually get pretty far on a single MySQL database and not even have to worry about partitioning at the application level. You can "scale up" to a machine with lots of cores and tons of ram, plus a replica. If you have a layer of memcached servers in front of the databases (which are easy to scale out) then the database basically only has to worry about writes. You can also use S3 or some other distributed hash table to take the largest objects out of rows in the database. There's no need to burden yourself with making a system scale more than 10x further than it needs to, as long as you're confident that you'll be able to scale it as you grow.
Many of the problems created by manually partitioning the data over a large number of MySQL machines can be mitigated by creating a layer below the application and above MySQL that automatically distributes data. FriendFeed described a good example implementation of this [8].
Personally, I believe the relational data model is the "right" way to structure most of the data for an application like Quora (and for most user-generated content sites). Schemas allow the data to persist in a typed manner across lots of new versions of the application as it's developed, they serve as documentation, and prevent a lot of bugs. And SQL lets you move the computation to the data as necessary rather than having to fetch a ton of data and post-process it in the application everywhere. I think the "NoSQL" fad will end when someone finally implements a distributed relational database with relaxed semantics.
http://kylerush.net/blog/meet-the-obama-campaigns-250-million-fundraising-platform/
https://news.ycombinator.com/item?id=4842510
http://kev.inburke.com/kevin/reddits-database-has-two-tables
http://news.ycombinator.com/item?id=4468265
http://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html
http://code.flickr.net/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/
[X] http://backchannel.org/blog/friendfeed-schemaless-mysql
Lesson 3: Open Schema
The essence of this lesson is: don't worry about the schema.
They used to spend a lot of time worrying about the database, keeping everthing nice and normalized. You shouldn’t have to worry about the database. Schema updates are very slow when you get bigger. Adding a column to 10 million rows takes locks and doesn’t work. They used replication for backup and for scaling. Schema updates and maintaining replication is a pain. They would have to restart replication and could go a day without backups. Deployments are a pain because you have to orchestrate how new software and new database upgrades happen together.
Instead, they keep a Thing Table and a Data Table. Everything in Reddit is a Thing: users, links, comments, subreddits, awards, etc. Things keep common attribute like up/down votes, a type, and creation date. The Data table has three columns: thing id, key, value. There’s a row for every attribute. There’s a row for title, url, author, spam votes, etc. When they add new features they didn’t have to worry about the database anymore. They didn’t have to add new tables for new things or worry about upgrades. Easier for development, deployment, maintenance. The price is you can’t use cool relational features. There are no joins in the database and you must manually enforce consistency. No joins means it’s really easy to distribute data to different machines. You don’t have to worry about foreign keys are doing joins or how to split the data up. Worked out really well. Worries of using a relational database are a thing of the past.
The essence of this lesson is: the key to speed is to precompute everything and cache it.
The way to make a slow website is have a perfectly normalized database, collect it all on demand, and then render it. It takes forever on every single request. So if you have data that might be displayed in several different formats, like links on front page, in-box, or profile, store all those representations separately. So when somebody comes and gets the data it’s already there.
Every listing has 15 different sort orders (hot, new, top, old, this week). When someone submits a link they recalculate all the possible listing that link could effect. It may be a little wasteful upfront, but it’s it better to wasteful upfront than slow. Wasting disk and memory is better than keeping users waiting.
http://kev.inburke.com/kevin/reddits-database-has-two-tables/
This fits with a piece I read the other day about how MongoDB has high adoption for small projects because it lets you just start storing things, without worrying about what the schema or indexes need to be. Reddit’s approach lets them easily add more data to existing objects, without the pain of schema updates or database pivots. Of course, your mileage is going to vary, and you should think closely about your data model and what relationships you need.
Update, 11:31PM PDT: A former engineer at reddit adds this comment.
There isn’t a “table” for a subreddit. There is a thing/data pair that stores metadata about a subreddit, and there is a thing/data pair for storing links. One of the properties of a link is the subreddit that it is in. Same with the comments. There is one thing/data pair for comments and the subreddit it is in is a property.
Still today I tell people that even if you want to do key/value, postgres is faster than any NoSQL product currently available for doing key/value.
Update, 7:11PM PDT: From Hacker News, it looks like they use two tables for each “thing”, so a thing/data pair for accounts, a thing/data pair for links, etc.
From JR's : articles
1132 words - 7598 chars
- 6 min read
created on
updated on
- #
source
- versions