Could someone with Mongo experience help me gut-check this?
I want my data store to be durable and unsurprising -- barring a hardware failure or such, if I submit data it should either tell me that it failed to commit or it should be stored durably and without surprises (e.g., it should not truncate a long string to fit).
I've read some of the Mongo docco, and it's pretty exciting, but the lack of ACID -- primarily the Durability -- has kept me from really using it.
With a WAL journal, it sounds like maybe the durability issue is fixed. Is it? Could I use Mongo with relatively out-of-the-box settings plus --journal and count on a level of durability equivalent to a traditional RDBMS?
Yes, if you combine journaling with safe write operations. This allows you to call the appropriate driver method to confirm that the data has been written. You can also wait for it to be written to n slaves in replication. I believe that "written" means either to the data files or the journal. In the case of the journal, a hard crash would result in the journal being replayed so that the data is then written to disk and you don't get corruption.
"If safe is an integer, will replicate the insert to that many machines before returning success (or throw an exception if the replication times out, see wtimeout)."
Don't know yet; I'm going to be testing this tomorrow. I came to work today to a server that was grinding to a halt due to runaway node.js processes. I restarted the server, but that doesn't cleanly shut down Mongo, and I spent an hour repairing everything and setting permissions right (somehow they had gotten reset). That's this exact problem, so I hope the --journal switch in the init file makes the difference.
Added "journal = true" to the config and it certainly stops and starts the service nicely now. No more repairs and such, my data is currently query only, so it's mainly "optical", cleaning up locks and such.
Journaling should give you the crash-safety you're looking for. You should combine it with safe writes to get the commit safety you want (not the default, but easy to enable; see your driver's documentation).
My understanding is that this release is the first stable release with single-node durability available as an option, which is why I'm considering picking Mongo back up.
Can't wait until they start implementing filtered indexes ( http://jira.mongodb.org/browse/SERVER-785 ). Sparse indexes are a step in the right direction, but filtered ones would be just a bit cooler :)
"New map/reduce options for incremental updates" would also be really cool if they had a way to do something like CouchDB's incremental views. This would require keeping track of changes or a "trigger" functionality that runs the m/r task after every x inserts.
Starting to get excited once again about MongoDB. I was kind of down about it after having some issues with real world implementations. Considering I never thought journaling would make it in, I wonder if they will come around on the memory mapped I/O like everyone else eventually does.
EDIT: Also... does the group commit mean that ALL write transactions will be un-acknowledged to the client until the group commit finishes?
I'm curious what you're getting at IRT memory mapped I/O. To me, one of Mongo's selling points is the way it lets the OS manage caching for you instead of bothering you with tuning a bunch of buffers and things along those lines. For the sake of disclosure, I'm running 6 servers in a pretty high volume cluster, and while I've had to address some issues here and there, memory management hasn't been one of them.
"You can wait for group commit acknowledgement with the getLastError command. When running with --journal, the fsync:true option returns after the data is physically written to the journal (rather than actually fsync'ing all the data files). Note that the group commit interval (see above) is considerable: you may prefer to call getLastError without fsync, or with a w: parameter instead with replication. In releases after 1.8.0 the delay for commit acknowledgement will be shorter."
RTFM'ing myself in public. While it's a great step forward, the whole thing sounds pretty immature at this point. It'll be a good day when this stuff matures and MongoDB has fsync:true and --journal by default, much to the chagrin of sensationalists everywhere.
They'll really need to add some major smarts into the journaling and group commit if they're going to be able to stack up concurrent I/Os to help feed big disk arrays and even to get the best use out of SSDs which make back-and-forth latency on the I/O pipes even more significant.
Clients are able to wait for the next group commit by adding fsync:true to getLastError calls (some drivers allow you to add this to WriteConcern). We already have some enhancements to this planned for the 1.9 series.
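For anyone who wants to see what those acknowledgement options look like in practice, here's a rough sketch of the getLastError command document built as a plain Python dict. The helper name build_get_last_error is made up for illustration, and how you actually send the command depends on your driver, so treat this as an assumption-laden sketch rather than any driver's API.

```python
from collections import OrderedDict


def build_get_last_error(fsync=False, w=None, wtimeout=None):
    """Build a getLastError command document (hypothetical helper name).

    fsync    -- with --journal, wait for the next group commit to the journal
    w        -- number of servers the write must reach before acknowledging
    wtimeout -- milliseconds to wait for replication before giving up
    """
    # Command documents are order-sensitive: the command name goes first.
    cmd = OrderedDict([("getlasterror", 1)])
    if fsync:
        cmd["fsync"] = True
    if w is not None:
        cmd["w"] = w
        if wtimeout is not None:
            cmd["wtimeout"] = wtimeout
    return cmd


# Wait for the journal group commit on a single node.
journal_ack = build_get_last_error(fsync=True)

# Wait until the write has reached 2 servers, or time out after 5 seconds.
replica_ack = build_get_last_error(w=2, wtimeout=5000)

print(dict(journal_ack))
print(dict(replica_ack))
```

Most drivers wrap this behind a "safe" flag or a WriteConcern object, so you rarely build the command by hand.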
A lot of people like it because it makes development faster. It's like the scripting language of databases: you can get stuff out the door really fast (with the obvious power/responsibility caveats).
This was the main reason why I picked it for a project. It's just sick the amount of code you DON'T have to write to use this database. I literally have 1 class called GenericCRUD with 5 functions that do all my database functions for all my models. Plus not having to worry about stored procedures anymore is a heavy weight lifted off your shoulders.
I've used it once, for an app that holds data which was not a good fit for a traditional relational database. The app in question essentially involved collecting data in a web app and then using that data to fill out hundreds of PDF forms. It gets really complicated as the data potentially has to be formatted (say a phone number might need to be split 555-555-5555 on one form but one number per square on another), concatenated (a name might need to join first, last, MI), as well as data about what page and x/y coordinates things go on for each form.
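To make the formatting problem concrete, here's a tiny Python sketch of the two phone layouts mentioned above. The function names are hypothetical, and it assumes 10-digit US-style numbers; the real app presumably had many more field types than this.

```python
import re


def digits_only(raw):
    """Strip everything except digits from a phone number string."""
    return re.sub(r"\D", "", raw)


def as_dashed(raw):
    """Format for forms that want 555-555-5555 in a single field."""
    d = digits_only(raw)
    return "-".join([d[:3], d[3:6], d[6:]])


def as_boxes(raw):
    """Format for forms that want one digit per square."""
    return list(digits_only(raw))


print(as_dashed("(555) 555-5555"))
print(as_boxes("555-555-5555"))
```

The same stored value feeds both layouts; only the per-form formatting rule differs.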
Initial attempts in SQL were painful. The only real way to do it was a key value table, but that gets painful when it comes to formatting for web presence (notably, each document has sections with a group of fields, plus some fields may need to be grouped together such as a series of checkboxes, or parts of a name). So at that point we're looking at writing up XML files to describe the presentation of these 200 forms from a key/value table to the web app.
At that point I realized this was doable, but going to be a mess. Enter mongo. Mongo essentially lets us store a dynamic schema of documents. For each form we can stick it all in a single document, as a series of embedded models, with all metadata and values needed in one go. We also get nice revision control within that using mongoid. We can now fetch all the data for a form, as well as save all the data for the form, in one VERY fast atomic operation (we're talking 100-800 field definitions for each form). Having never used mongo, it only took me a few days to implement this complete with handling for all field types and performance was fantastic.
Mongo also made it quite easy to populate our data since we're essentially just storing a tree of keys and values. We wrote up a tool that loads up the PDFs and lets us draw boxes on top of the fields and set up the metadata, then export that to a YAML file for each form. The YAML is then stored in a traditional SQL database and is used to create a new form in the system by simply converting it to a nested hash and having mongoid save it. Slick.
I'm getting a bit wordy here, but I think it's a great real world example of the type of problem mongo is a good fit for. I wouldn't personally use mongo for something that a relational database is a good fit for, but for something like this it allows you to solve the problem quicker and with significantly less code to maintain (really, the CRUD code for forms is no more than with SQL and probably less since it's only one operation on a document, and my pdf form generator is < 200 lines of ruby).
It's schema-less, and as the data format is a binary form of JSON, you can store arbitrary, even nested data structures directly. Indexes can be put on fields deep within the structure.
This saves a lot of time you'd normally spend defining schemas, and is very flexible. It didn't completely replace SQL for me, but it's a good fit for the heterogeneous free-form data generally encountered on the web.
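A quick illustration of what "fields deep within the structure" means: nested fields are addressed with dot notation, and that dotted name is what you'd hand to your driver when creating an index. The sketch below only walks a plain Python dict; get_path and the sample document are made up for illustration and nothing here talks to a server.

```python
def get_path(doc, dotted):
    """Follow a dotted path such as "address.geo.lat" through nested dicts.

    Returns None when any step of the path is missing.
    """
    current = doc
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical document with two levels of nesting.
user = {
    "name": "Ann",
    "address": {"city": "Oslo", "geo": {"lat": 59.91, "lon": 10.75}},
}

print(get_path(user, "address.geo.lat"))
print(get_path(user, "address.zip"))
```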
I use it for offline operations since we only have a single overloaded MS SQL database server. So I sync the data to my local mongo instance, creating documents with only the fields that I need. And I have scripts written to do analytics and partition the data to reveal patterns, generate reports and so on.
It's very fast and the flexible schema makes the code much more flexible and easy to write. And did I mention it was fast?
You should definitely give it a try and consider using it for such systems.
My only issue with it is that I am running it on a 32-bit system and so I'm limited to 2GB per database.
If you are working with location data, Mongo has had geospatial indexing built in since 1.4 (earlier in the unstable builds) - http://www.mongodb.org/display/DOCS/Geospatial+Indexing which has been a big draw for a number of people I know using it (it looks like 1.8 brings spherical distances to the stable branch, which makes the geo lookups a lot more useful if you need accurate distances and not just nearby lookups).
If by "location data" you mean points, and the only location operations you need are distance or bounding-box searches, it may do the trick.
If you're interested in polygons, lines, etc; more physically accurate (and completely implemented) distance queries, spatial joins, aggregation, 3D and surveyor-annotated data, set-theoretic operations ... PostGIS is far and away the way to go. It's far more mature and debugged than any of the NoSQL geospatial stuff I've seen, not only WRT correctness but also performance.
As a point of reference: There's a growing legion of geographers who do all their vector work in SQL using PostGIS.
All that said, for some applications, being tied to the relational model is a deal breaker. Just know that in terms of capability and maturity on the geospatial front, you'll be trading off a Cadillac for a partially assembled rocket sled.
> We don't currently handle wrapping at the poles or at the transition from -180° to +180° longitude, however we detect when a search would wrap and raise an error.
generalized grumble
Why does everyone always seem to punt on doing geospatial right? It's not _that_ hard.
Why not go ahead and help them? I'm sure they'd be happy to have the assistance. Fork mongo from github at https://github.com/mongodb/mongo . Happy hacking!
You know, nobody ever replies to these comments with an "I will", but I'm seriously considering it.
(It's a nice opportunity to publicly show off a specialty/core competency and brush up a bit on C++ at the same time. I'm not that easily provoked into action by internet commentary! ;) )
But, looking at the source, I think I will probably be a Bad Contributor and end up with a gigantic pull request and a (mostly) full re-implementation...
> Why does everyone always seem to punt on doing geospatial right? It's not _that_ hard.
Do you mean you think they don't know how to do it?
As I followed the roadmap on this specific point, it looks more like an incremental development to me: they first used rectangular coordinates in 1.6, then a spherical model in 1.7 etc.
It lets them bring a more lightweight solution quickly to people who need it (like me), then evolve it based on the feedback etc.
The problem I have is at least partly one of truth in advertising.
For example: If they were truly using a "spherical model", then you would not expect queries to fail at the poles & dateline, would you?
At least it is documented and fails hard with an error rather than giving wrong results, so a developer can quickly figure out the weak spots --- though I bet a lot of people would prefer the wrong results to queries that cause exceptions in their systems.
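To show why the poles and dateline shouldn't matter to a truly spherical model, here's a plain haversine great-circle distance in Python. This is only the textbook formula, not a claim about how Mongo's index works internally; the 6371 km mean Earth radius is an assumption, and the argument order (longitude first) is just a choice made for this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption for this sketch


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (longitude, latitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to guard against tiny floating point overshoot past 1.0.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Two points on the equator, either side of the -180/+180 transition.
across_dateline = haversine_km(179.9, 0.0, -179.9, 0.0)

# Two points near the north pole, on opposite meridians.
across_pole = haversine_km(0.0, 89.9, 180.0, 89.9)

print(round(across_dateline, 1), round(across_pole, 1))
```

Both pairs are about 0.2 degrees of arc apart, so they come out at roughly the same short distance with no special casing for the wrap.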
> Do you mean you think they don't know how to do it?
I think it has more to do with the absurdly low bar they've set for themselves to check the "geospatial" box than it does with competence.
I don't really know - if the limitations are documented, as they apparently are, it's really not an issue for me - I don't feel cheated (but that's just my opinion!).
Your mileage may vary, as they say: I've used the geospatial support since 1.6 and it was already very helpful to me in this form :)
The limitations are documented as foot/side notes.
Analogy: It's like seeing "ACID compliance!" on a feature list, then finding buried in the documentation that this is only the case for single-document transactions in unordered collections on a single machine.
The new feature might be useful to some but including it on a feature list without disclaimer is misleading.
My experience has been that many applications (web apps in particular) use a relational database to break up documents into an SQL schema so they can be indexed, then assemble them again when needed. Mongo really ratchets down the friction on that operation. Instead of spending time building a schema and writing lots and lots of insert and update statements, you just build a JSON object and send it over.
Sure - I'm just pointing out something that could be a big deal (streaming results through an ordered structure in memory, that may or may not be backed by a lazy disk store) is actually not a big deal, due to a current implementation constraint.
All you need for full-text search is to add a multikey of tokens to the documents you want indexed. Tokenizers and stemmers are actually really easy to write, and there are libraries you can use to do that for you.
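Here's roughly what that looks like in Python: a toy tokenizer plus a very crude suffix-stripping stemmer that produces the token array you'd store on the document and put a multikey index on. The stopword list, the stemming rules and the "_keywords" field name are all assumptions for illustration; a real stemming library would do much better.

```python
import re

# Tiny stopword list, just for the sketch.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def stem(word):
    """Very crude stemmer: strip one common English suffix."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split into words, drop stopwords, stem and de-duplicate."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return sorted({stem(w) for w in words if w not in STOPWORDS})


# The token list is stored as an array field on the document itself.
doc = {"title": "Indexing documents in MongoDB"}
doc["_keywords"] = tokenize(doc["title"])

print(doc["_keywords"])
```

At query time you run the search terms through the same tokenize function and match against the array field.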
As a newb to non-relational databases, but planning on learning one soon, what is the advantage of MongoDB vs Redis? I'm planning to use ruby with either, but was interested if there was a reason to pick one over the other.
I don't think you should just learn one; they are all used for different niches. It depends how your app trades off different things (consistency, reliability of reads, reliability of writes, speed of reads, speed of writes, efficiency of hardware use, and so on).
They're really two different beasts. Mongo stores things in a manner similar to a relational DB (minus the relations). Think of it as a store for JSON that allows indexing and SQL-style queries using a JSON style syntax. Redis data structures are closer to what you'd find in computer science books (lists, sets, hashes, etc.). You should explore both and see what meets your needs.
Redis is for flat structures. MongoDB is for nested.
In both cases, (unlike CouchDB) you can alter data structures by more complex means than simply replacing the whole thing (such as incrementing a counter). In both (again unlike CouchDB) the updates overwrite in place and do not waste space (but also do not preserve past versions or allow readers to overlap writers).
Redis is for stuff that fits in memory. MongoDB scales up to "big data", provided the individual items are moderately sized.
Redis runs in RAM so it's blazingly fast. MongoDB is about as fast as MySQL.
Redis is single threaded so only one operation runs at once (the speed makes this mostly not a problem). Some operations globally block MongoDB, some can run in parallel.
In both, operations are atomic. Redis has transactions of a sort that group operations and ensure the data they relate to is unchanged. MongoDB operations can't be grouped into a transaction, but they can be a lot more complex so they effectively become a transaction (limited to operating on one data item).
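As a rough picture of "more complex operations on one data item": a single Mongo update can carry several modifiers ($inc, $push, $set) that are applied to one document in one go. The sketch below is a toy in-memory imitation in Python so you can see the shape of the update document; apply_update is made up, handles only top-level fields, and says nothing about how the server applies these.

```python
import copy


def apply_update(doc, update):
    """Toy imitation of three update modifiers on top-level fields only."""
    new_doc = copy.deepcopy(doc)  # leave the caller's document untouched
    for field, value in update.get("$set", {}).items():
        new_doc[field] = value
    for field, amount in update.get("$inc", {}).items():
        new_doc[field] = new_doc.get(field, 0) + amount
    for field, item in update.get("$push", {}).items():
        new_doc.setdefault(field, []).append(item)
    return new_doc


post = {"_id": 1, "title": "Hello", "views": 10, "comments": []}

# One update document, three changes to the same item.
update = {
    "$inc": {"views": 1},
    "$push": {"comments": "nice post"},
    "$set": {"title": "Hello again"},
}

updated = apply_update(post, update)
print(updated)
```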
That totally depends on what you want to do with Redis/MongoDB. Do you intend to use it as your primary data store? AFAIK One of MongoDB's goals is to be useful for a lot of things you'd normally use a relational database for. Redis on the other hand has a lot of nice things to handle more specialized cases of data.
(I work on MongoDB, but trying to give a balanced opinion.)
Redis is a great key-value store, MongoDB is more of a fully-featured database. Redis has some nice set operations and is pretty easy to learn (all of the commands are here: http://redis.io/commands). MongoDB is also pretty easy to learn (click the "Try it out" button at http://mongodb.org/), but there are a lot of advanced features to learn about.
So, if you need a key-value store, Redis is a great choice. If you want to do something more complex, MongoDB would probably work better.
Wow. I knew about the durability changes but I had no idea that sparse and covered indexes were coming. The lack of these three things was the biggest drawback to mongo for me.
I got in so much trouble for a very similar web scale comment. Never shall trite jokes be used on hacker news. You'll get your head bit off. Personally, it gives me a giggle, and I guess that means I am much less mature than most consumers of Hacker News. http://news.ycombinator.com/item?id=2104276