Apache Geode: Distributed, in-memory database

sandstrom · on Nov 19, 2015

Didn't understand much from the linked page, but I found this website (from Pivotal, the commercial entity behind Geode) quite informative. Perhaps it's useful to others.

https://pivotal.io/big-data/pivotal-gemfire

===

I found this interesting deployment:

    China National Railways use Geode to run railway ticketing for the entire 
    country with a 10 node cluster, managing 2 TB of "hot data" in memory, 
    and 10 backup nodes for high availability and elastic scale.
    
    Holiday travel periods [Chinese New Year's] create peaks of 15,000 tickets 
    sold per minute, 1.4 billion page views per day and 40,000 visits per second.

http://pivotal.io/big-data/case-study/scaling-online-sales-f...

trhway · on Nov 19, 2015

To understand where it comes from historically and ideologically/architecturally: the keyword here is Smalltalk https://en.wikipedia.org/wiki/Gemstone_%28database%29

The guys have been sitting on a forested Oregon river bank for 3 decades (the place is called Beavertown for a reason :) thinking straight and clear in Smalltalk... :)

GregChase · on Nov 19, 2015

Geode and its predecessor GemFire are pure Java.

The original engineering expertise came from the team that built the Gemstone OODB.

Beaverton isn't so forested anymore, but lots of indigenous object oriented expertise for sure.

trhway · on Nov 19, 2015

>Geode and its predecessor GemFire are pure Java.

I know (I was talking at some point about joining, already here in SV), and this is why I wrote only "3 decades": I don't know the precise dates of GemFire and only suppose it to be the beginning of the 200x. A simple arithmetic mistake on my part though: 2002 - 1982 is 2 decades. Sorry, I see where the confusion comes from :)

sudsmenon · on Nov 19, 2015

FWIW, GemFire is all written in Java and has native clients for C++ and .NET (as well as REST clients). The product was built from the ground up in Java and runs highly scaled-up, low-latency systems across the globe. The place is called Beaverton, and while there are many beautiful forested riverbanks, the Pivotal team does not sit next to one.

mclarenfan · on Nov 19, 2015

Also Indian railways:

     200,000 Concurrent ticket Purchases

https://pivotal.io/big-data/case-study/distributed-in-memory...

apache_rvs · on Nov 19, 2015

Geode wiki is quite good though: https://cwiki.apache.org/confluence/display/GEODE/Index

tuyguntn · on Nov 19, 2015

The landing page and documentation should be reconsidered. Lots of unanswered questions:

- Distributed, in-memory database - how does it compare to SAP HANA or Redis? If it's a "... database", where is the query language? If you have your own query language other than SQL, please show us.

- Performance is key - tell us about some benchmarks

- Consistency is a must - CA or CP in CAP theorem?

- Can I fully drop RDBMS in favor of Geode?

- The website is built as a landing page/PR for the product, then all of a sudden the Community/Contributors/Getting started columns start, with a loooooong list of mailing lists, less content for contributors, and the very important part (getting started) presented as a tiny link :(. Please put some useful info about the product there; you already have Community/Contribute menus.

finnh · on Nov 19, 2015

> - Consistency is a must - CA or CP in CAP theorem?

There is no CA. So, they must mean CP.

Remember, kids: if they promise CA, run away.

fweespeech · on Nov 19, 2015

CA is possible if it's read-only, essentially, with all updates synchronized when a partition is not present [which is the majority of the time in the real world].

slashdev · on Nov 19, 2015

It's semantics, but I'd argue if you're read-only then you're changing the definition of available.

GregChase · on Nov 20, 2015

Geode is write-intensive, and generally optimizes for consistency over availability. That being said - there's a lot built into Geode to ensure availability as well.

fweespeech · on Nov 20, 2015

Fair enough but read-only in the event of a network partition is a viable real world use case.
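To make the "read-only during a partition" idea concrete, here is a toy sketch in Python. Everything in it (the `ReadOnlyOnPartitionNode` class, the `partitioned` flag, the seat keys) is hypothetical and only illustrates the behaviour described above; it says nothing about how Geode itself handles partitions.

```python
class ReadOnlyOnPartitionNode:
    """Toy node: serves reads always, rejects writes while partitioned."""

    def __init__(self):
        self.data = {}
        self.partitioned = False  # hypothetical flag set by failure detection

    def read(self, key):
        # Reads are answered from local state even during a partition.
        return self.data.get(key)

    def write(self, key, value):
        # Writes need the peers, so they are refused while partitioned.
        if self.partitioned:
            raise RuntimeError("read-only: cannot reach peers")
        self.data[key] = value


node = ReadOnlyOnPartitionNode()
node.write("seat:12A", "sold")      # accepted while the cluster is healthy
node.partitioned = True             # simulate a network partition

value_during_partition = node.read("seat:12A")
try:
    node.write("seat:12B", "sold")
    write_rejected = False
except RuntimeError:
    write_rejected = True
```

Whether a node that refuses writes still counts as "available" is exactly the definitional question raised upthread.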

markito · on Nov 19, 2015

Thanks for the feedback. We're currently updating the website and the work is being tracked here -> https://issues.apache.org/jira/browse/GEODE-53

Here is a quick preview of what the new one will look like: markito.github.io/geode-website/

Any feedback is welcome!

metatype · on Nov 19, 2015

Complex objects can be stored by key and automatically partitioned and replicated. Query the data with OQL (an ODMG standard). For example:

select * from /person p where p.name = 'foo' and p.age > 42
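For anyone unfamiliar with OQL, the query above is roughly a filter over the values stored in the /person region. A minimal plain-Python sketch of the same predicate follows; it does not use Geode's API at all, and the `Person` class, the `person_region` dict, and `query_people` are made-up stand-ins for illustration.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int


# Stand-in for the /person region: a plain dict of key -> object.
person_region = {
    "k1": Person("foo", 50),
    "k2": Person("foo", 30),
    "k3": Person("bar", 60),
}


def query_people(region, name, min_age):
    """Return the values matching: p.name = name and p.age > min_age."""
    return [p for p in region.values() if p.name == name and p.age > min_age]


# Same predicate as: select * from /person p where p.name = 'foo' and p.age > 42
matches = query_people(person_region, "foo", 42)
```

In a real cluster the region is partitioned across members, so the filter would run on each member and the results would be merged; this sketch ignores that.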

tuyguntn · on Nov 19, 2015

From the docs at [1], they only support SELECT statements in OQL. Calling an object method in a query reminds me of the Esper complex event processing engine.

[1] - http://geode-docs.cfapps.io/docs/getting_started/querying_qu...

Brainix · on Nov 19, 2015

Feedback on the documentation:

It's not immediately obvious to me how Geode is different from Redis. When would I want to use Geode over Redis, and vice versa?

GregChase · on Nov 19, 2015

Redis and in-memory data grids are pretty different animals. I would characterize IMDGs like Geode as concurrent, write-intensive systems with flexible data models. They also scale out better than Redis, in a more automated fashion.

Redis is a great read-intensive cache. It also has a powerful data model, but you have to use their data models. Example: If you want to run calculations on lists or sets, they have powerful operations you can call.

IMDGs such as Geode were built with the rise of automated trading in the finance industry.

mclarenfan · on Nov 19, 2015

Geode also understands the Redis protocol, so you can point your Redis clients to Geode. This is interesting because the problem with Redis data structures is that they cannot scale beyond the memory available to the Redis server. The further limitation that the Redis server is single-threaded and cannot really be scaled up only makes the problem worse.

With Geode your Redis data structures can scale horizontally.

glogla · on Nov 19, 2015

This

> Data is persisted in write-optimized disk storage. Consistency checking is configurable between highest performance caching and ACID transactions.

seems pretty different to redis. I don't think Redis does ACID transactions.

krenoten · on Nov 19, 2015

Redis is single-threaded, so everything it does on a per-node basis is implicitly atomic. You can also force the single thread to handle a block of commands from a single socket at once by using MULTI (otherwise commands can be interleaved with other commands from other sockets).
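A toy model of that behaviour, in Python rather than real Redis: a single loop applies one command at a time, and commands sent after MULTI are queued per client and applied back to back on EXEC. The `ToyServer` class and its methods are invented for illustration and are not a Redis client or server API; only the MULTI/EXEC command names come from Redis.

```python
class ToyServer:
    """Single-threaded toy model: commands are applied one at a time."""

    def __init__(self):
        self.data = {}
        self.log = []      # order in which commands were actually applied
        self.pending = {}  # client id -> commands queued since MULTI

    def _apply(self, client, cmd):
        op, key = cmd[0], cmd[1]
        if op == "INCR":
            self.data[key] = self.data.get(key, 0) + 1
        elif op == "SET":
            self.data[key] = cmd[2]
        self.log.append((client, op, key))

    def send(self, client, cmd):
        if cmd[0] == "MULTI":
            self.pending[client] = []            # start queueing for this client
        elif cmd[0] == "EXEC":
            for queued in self.pending.pop(client, []):
                self._apply(client, queued)      # run the block without interleaving
        elif client in self.pending:
            self.pending[client].append(cmd)     # queued, not applied yet
        else:
            self._apply(client, cmd)             # applied immediately


server = ToyServer()
server.send("A", ("MULTI",))
server.send("A", ("INCR", "x"))
server.send("B", ("SET", "x", 100))  # arrives in the middle of A's block
server.send("A", ("INCR", "x"))
server.send("A", ("EXEC",))
```

B's SET is applied when it arrives, but A's two INCRs run together at EXEC, so nothing from another socket lands between them. The model says nothing about crashes or multiple nodes.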

jrallison · on Nov 19, 2015

It's not guaranteed to be atomic in failure conditions. Also, the biggest difference between Redis and Geode would be the "distributed" part, which involves maintaining these guarantees across a cluster of machines (which Redis demonstrably doesn't do).

krenoten · on Nov 20, 2015

That's pretty hand-wavey. All databases can be made to lose all guarantees under failure conditions.

espeed · on Nov 19, 2015

Here's an old post showing how to create custom indexing schemes in Geode (QuadTreeIndex for spatial data): http://blogs.vmware.com/vfabric/2012/12/gemfire-patternspart...

espeed · on Nov 19, 2015

Two Videos:

1. "Open Sourced GemFire In-Memory Distributed Database and Apache Contributors" (https://www.youtube.com/playlist?list=PL62pIycqXx-TTMXsq09BE...)

2. "Creating a Highly Scalable Stock Prediction System with R, Geode & Spring XD" (https://www.youtube.com/playlist?list=PL62pIycqXx-Rzd_HcjU7Y...)

orless · on Nov 19, 2015

I'm starting to get lost in the new Apache animals.

How does Apache Geode compare to Apache Ignite (advertised as an "in-memory data fabric")?

espeed · on Nov 19, 2015

Ignite vs Pivotal GemFire (Apache Geode) https://ignite.apache.org/use-cases/compare/gemfire.html

This is cool: "Ignite allows for most of the data structures from java.util.concurrent framework to be used in a distributed fashion." https://ignite.apache.org/features/datastructures.html

GregChase · on Nov 19, 2015

Apache Geode and Apache Ignite are more similar than they are different.

Apache Ignite, based on the commercial distribution GridGain, is newer to market.

Apache Geode, based on the commercial distribution GemFire, has a long history in the market.

orless · on Nov 19, 2015

I wonder why Apache would need to have two "more similar than different" products.

wmf · on Nov 19, 2015

Apache accepts anything that companies donate. (They say they don't, but it's hard to find anything they've rejected.)

rectang · on Nov 20, 2015

That's not quite the right emphasis.

Apache is happy to provide a home for any community that is willing to adhere to our governance rules and traditions. Competing projects are OK.

Projects are almost never rejected because preparing a proposal for incubation is rigorous and many projects who would be a poor fit self-select out.

Source: former VP Apache Incubator, who has both helped prepare successful proposals and privately counseled projects who decided not to come to Apache.

osi · on Nov 19, 2015

because people are interested in maintaining each

hkuser · on Nov 19, 2015

It looks better. Since 2002 in dev.

atombender · on Nov 20, 2015

Can anyone comment on Geode's non-Java support?

I'm asking because a lot of Java "big data" stuff tends to prioritize Java clients (ZooKeeper, Kafka, Hadoop HDFS, Storm, VoltDB and HBase come to mind), and while there are sometimes clients in other languages, they tend to be second-class citizens that take years to reach feature/performance parity with the Java stuff.

For example, last I checked there still wasn't a mature, feature-complete Kafka client (consumer and producer with built in offset management) for Go.

mclarenfan · on Nov 20, 2015

GemFire (on which Geode is based) has a fully featured C, C++, C# client which has feature parity with the Java client. I don't know if Pivotal is going to open source these clients too.

There is, however, a REST API and a Python client: https://github.com/gemfire/py-gemfire-rest

wiradikusuma · on Nov 20, 2015

Sorry for the stupid question: how do in-memory DBs deal with power failures? E.g. someone walking into the server room and tripping over a power cable.

I can understand for read-only in-memory, but what about writes?

markito · on Nov 20, 2015

In the case of Geode, you can make a "Region" (think table) persistent on disk and use a shared-nothing architecture [1] to avoid SPOFs.

What's also interesting is that we offer a very efficient way to recover data from disk as well[2] in the case of a crash of a single node or the entire cluster.

[1] https://en.wikipedia.org/wiki/Shared_nothing_architecture

[2] http://gemfire.docs.pivotal.io/docs-gemfire/latest/managing/...

metatype · on Nov 20, 2015

Also, Geode supports redundancy zones, so data is replicated such that a cluster can survive a rack failure without data loss.
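The idea behind redundancy zones can be sketched in a few lines of Python: when choosing where the copies of a piece of data live, never put two copies in the same zone (rack). This is only an illustration of the concept, not Geode's actual placement algorithm; `place_copies`, the member names, and the zone names are all hypothetical.

```python
def place_copies(members, copies):
    """Pick `copies` members for one bucket, each from a different zone.

    `members` is a list of (member_name, zone_name) pairs.
    """
    chosen, used_zones = [], set()
    for name, zone in members:
        if zone not in used_zones:
            chosen.append(name)
            used_zones.add(zone)
        if len(chosen) == copies:
            return chosen
    raise ValueError("not enough distinct zones for the requested copies")


# Hypothetical cluster: two servers on rackA, one on rackB.
members = [("server1", "rackA"), ("server2", "rackA"), ("server3", "rackB")]

# One primary plus one redundant copy, forced onto different racks.
placement = place_copies(members, 2)
```

With that constraint, losing all of rackA still leaves a copy on rackB, which is the "survive a rack failure" property described above.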

jacques_chester · on Nov 20, 2015

FWIW, Pivotal is hiring in our Big Data team, largely based in Palo Alto. Geode (incubating), HAWQ (incubating), Greenplum, Pivotal HD, MADlib etc are all mostly developed with engineering effort that we donate.

Hit me up with an email (jchester@pivotal.io) or visit pivotal.io/careers if you're interested.

avereveard · on Nov 19, 2015

> Gemcached > Geode servers can be configured to talk memcached protocol.

Hey, this is very interesting: it could work as a persistent, ACID, drop-in replacement for memcached!

metatype · on Nov 19, 2015

Yep. Also some work underway to talk redis protocol at https://cwiki.apache.org/confluence/display/GEODE/Geode+Redi....

hyc_symas · on Nov 19, 2015

There's already memcacheDB for that https://github.com/LMDB/memcachedb

avereveard · on Nov 20, 2015

That one doesn't seem distributed. But I just glanced at the readme.

hyc_symas · on Nov 20, 2015

Distribution is usually handled by the memcache client libraries.
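A minimal Python sketch of what client-side distribution means: the client hashes each key and uses the result to pick one server from its configured list, so the servers never need to know about each other. The `pick_server` function and the host names are hypothetical, and simple modulo hashing is used here for brevity; real memcached clients often use consistent hashing instead, so that adding or removing a server remaps fewer keys.

```python
import hashlib


def pick_server(key, servers):
    """Map a key to one server by hashing the key (simple modulo sharding)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[index]


# Hypothetical server list configured in the client, not on the servers.
servers = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]

# Every client with the same list and hash sends a given key to the same server.
placement = {k: pick_server(k, servers) for k in ["user:1", "user:2", "session:9"]}
```

The trade-off is that replication and failover are also left to the client or to something outside the cache.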

mclarenfan · on Nov 20, 2015

Some advantages of gemcached over memcached are here:

https://cwiki.apache.org/confluence/display/GEODE/Moving+fro...

rdtsc · on Nov 19, 2015

Source here:

https://github.com/apache/incubator-geode

dberg · on Nov 19, 2015

I believe this competes with another well-known open-source in-memory data grid, Hazelcast. Worth checking out.

threeseed · on Nov 19, 2015

Last time I checked Hazelcast couldn't be run standalone. Also I wouldn't use it as a database for anything more than a few MB of data.

This is probably closer to Apache Ignite aka Gridgain.

dberg · on Nov 19, 2015

Can you explain why? I remember reading a ton about off-heap memory work they were doing.

markito · on Nov 20, 2015

Geode FAQ Wiki page https://cwiki.apache.org/confluence/display/GEODE/Technology...

GregChase · on Nov 19, 2015

If you are interested in joining the developer community around Apache Geode, subscribe via: dev-subscribe@geode.incubator.apache.org

Thaxll · on Nov 19, 2015

It looks similar to Hazelcast / Oracle Coherence?

wener · on Nov 20, 2015

Yes, Geode, Coherence, Ignite and Hazelcast are all quite similar: grid computing.

avodonosov · on Nov 19, 2015

Would you choose it over Datomic?

espeed · on Nov 19, 2015

Datomic's immutable storage and time-travel query capabilities are awesome, and I often miss them in other DBs. But Datomic currently isn't designed for write-intensive workloads. And while you can shard Datomic's transactor and then combine multiple DBs in a query (http://nosql.mypopescu.com/post/19310504456/thoughts-about-d...), that's only going to get you so far.

However, Apache Geode lets you add custom indexes so it might not be too hard to add Clojure's persistent data structures as a custom index scheme and hook in Apache Geode as a backend to Clojure Datalog:

Clojure Datalog: https://github.com/fogus/bacwn

DataScript: https://github.com/tonsky/datascript

Clojure's Persistent Data Structures for Java: https://github.com/grignaak/clj-ds

fiatjaf · on Nov 19, 2015

Let me guess: it's written in Java?

fiatjaf · on Nov 19, 2015

Yes.

Why are all Apache projects written in Java?

dang · on Nov 19, 2015

All: this is a major project (with a long history: http://geode.incubator.apache.org/about) and a major release. It deserves a substantive thread, so please let's not get sidetracked by a language troll.

zidad · on Nov 19, 2015

Apache isn't written in Java :)

https://en.wikipedia.org/wiki/Apache_HTTP_Server

rylee · on Nov 19, 2015

That's Apache httpd. Apache is a software organization.

nkozyra · on Nov 19, 2015

There are a few good reasons, but more importantly, why does it matter?

timClicks · on Nov 19, 2015

Can't speak for others, but I find deploying the JVM alongside non-JVM software to create a full product pretty frustrating.