Apache Geode: Distributed, in-memory database (apache.org)
187 points by espeed on Nov 19, 2015 | 62 comments


Didn't understand much from the linked page, but I found this website (from Pivotal, the commercial entity behind Geode) quite informative. Perhaps it's useful to others.

https://pivotal.io/big-data/pivotal-gemfire

===

I found this interesting deployment:

    China National Railways use Geode to run railway ticketing for the entire 
    country with a 10 node cluster, managing 2 TB of "hot data" in memory, 
    and 10 backup nodes for high availability and elastic scale.
    
    Holiday travel periods [Chinese New Year's] create peaks of 15,000 tickets 
    sold per minute, 1.4 billion page views per day and 40,000 visits per second.
http://pivotal.io/big-data/case-study/scaling-online-sales-f...
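
For a sense of scale, the quoted figures can be converted into per-second and per-node rates. This is only a back-of-the-envelope sketch; it assumes the 2 TB is spread evenly over the 10 primary nodes and that 1 TB = 1000 GB, neither of which is stated in the case study.

```python
# Figures quoted from the case study above.
tickets_per_minute = 15_000
page_views_per_day = 1_400_000_000
hot_data_tb = 2
primary_nodes = 10

# Assumptions: even spread of data across the primary nodes, 1 TB = 1000 GB.
tickets_per_second = tickets_per_minute / 60
avg_page_views_per_second = page_views_per_day / (24 * 60 * 60)
gb_per_node = hot_data_tb * 1000 / primary_nodes

print(f"{tickets_per_second:.0f} tickets/s at peak")
print(f"{avg_page_views_per_second:.0f} page views/s averaged over a day")
print(f"{gb_per_node:.0f} GB of hot data per node")
```

That works out to roughly 250 tickets per second at peak and about 200 GB of hot data per node under those assumptions.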


To understand where it comes from historically and ideologically/architecturally, the keyword here is Smalltalk: https://en.wikipedia.org/wiki/Gemstone_%28database%29

The guys have been sitting on a forested Oregon river bank for 3 decades (the place is called Beavertown for a reason :) thinking straight and clear in Smalltalk... :)


Geode and its predecessor GemFire are pure Java.

The original engineering expertise came from the team that built the Gemstone OODB.

Beaverton isn't so forested anymore, but lots of indigenous object oriented expertise for sure.


>Geode and its predecessor GemFire are pure Java.

I know (I was talking at some point about joining, and I'm already here in SV), and this is why I wrote only "3 decades": I don't know the precise dates of GemFire and only suppose it started at the beginning of the 2000s. It was a simple arithmetic mistake on my part, though: 2002 - 1982 is 2 decades. Sorry, I see where the confusion comes from :)


FWIW, GemFire is written entirely in Java and has native clients for C++ and .NET (as well as REST clients). The product was built from the ground up in Java and runs highly scaled-up, low-latency systems across the globe. The place is called Beaverton, and while there are many beautiful forested riverbanks, the Pivotal team does not sit next to one.


Also Indian railways:

     200,000 Concurrent ticket Purchases
https://pivotal.io/big-data/case-study/distributed-in-memory...



The landing page and documentation should be reconsidered. There are lots of unanswered questions:

- Distributed, in-memory database - how does it compare to SAP HANA? Or Redis? If it is a "... database", where is the query language? If you have your own query language other than SQL, please show us.

- Performance is key - tell us about some benchmarks

- Consistency is a must - CA or CP in CAP theorem?

- Can I fully drop RDBMS in favor of Geode?

- The website is built as a landing page/PR for the product, then all of a sudden starts Community/Contributors/Getting started columns, with a loooooong list of mailing lists, less content for contributors, and the very important part (getting started) presented as a tiny link :(. Please put some useful info about the product there; you already have Community/Contribute menus.


> - Consistency is a must - CA or CP in CAP theorem?

There is no CA. So, they must mean CP.

Remember, kids: if they promise CA, run away.


CA is possible if it's read-only, essentially, with all updates synchronized when a partition is not present [which is the majority of the time in the real world].


It's semantics, but I'd argue that if you're read-only then you're changing the definition of available.


Geode is write-intensive, and generally optimizes for consistency over availability. That being said - there's a lot built into Geode to ensure availability as well.


Fair enough but read-only in the event of a network partition is a viable real world use case.


Thanks for the feedback. We're currently updating the website and the work is being tracked here -> https://issues.apache.org/jira/browse/GEODE-53

Here is a quick preview of what the new one will look like: markito.github.io/geode-website/

Any feedback is welcome!


Complex objects can be stored by key and automatically partitioned and replicated. Query the data with OQL (an ODMG standard). For example:

select * from /person p where p.name = 'foo' and p.age > 42
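
Read as a filter over the values stored in the /person region, that query behaves roughly like the Python below. This only illustrates the predicate's semantics on a plain dict; the `Person` class and the sample data are made up, and it does not use Geode's API.

```python
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int

# Hypothetical stand-in for the /person region: key -> object.
person_region = {
    "k1": Person("foo", 50),
    "k2": Person("foo", 30),
    "k3": Person("bar", 60),
}

# select * from /person p where p.name = 'foo' and p.age > 42
matches = [p for p in person_region.values() if p.name == "foo" and p.age > 42]
print(matches)
```

Only the entry under `k1` satisfies both conditions.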


From the docs at [1], they only support SELECT statements in OQL. Calling object methods in a query reminds me of the Esper complex event processing engine.

[1] - http://geode-docs.cfapps.io/docs/getting_started/querying_qu...


Feedback on the documentation:

It's not immediately obvious to me how Geode is different from Redis. When would I want to use Geode over Redis, and vice versa?


Redis and in-memory data grids are pretty different animals. I would characterize IMDGs like Geode as concurrent-write-intensive, with flexible data models. Geode also scales out better than Redis, in a more automated fashion.

Redis is a great read-intensive cache. It also has a powerful data model, but you have to use their data models. Example: If you want to run calculations on lists or sets, they have powerful operations you can call.

IMDGs such as Geode were built with the rise of automated trading in the finance industry.


Geode also understands the Redis protocol, so you can point your Redis clients to Geode. This is interesting because the problem with Redis data structures is that they cannot scale beyond the memory available to the Redis server. The further limitation that the Redis server is single-threaded and cannot really be scaled up only makes the problem worse.

With Geode your Redis data structures can scale horizontally.


This

> Data is persisted in write-optimized disk storage. Consistency checking is configurable between highest performance caching and ACID transactions.

seems pretty different from Redis. I don't think Redis does ACID transactions.


Redis is single-threaded, so everything it does on a per-node basis is implicitly atomic. You can also force the single thread to handle a block of commands from a single socket at once by using MULTI (otherwise commands can be interleaved with other commands from other sockets).
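
A toy model of that behaviour, in Python rather than against a real Redis server: a single-threaded loop takes one queued command at a time from each client in turn, unless a client has opened a MULTI block, in which case its commands are buffered and run back to back at EXEC. The scheduler and the `run_server` name are invented for illustration.

```python
def run_server(clients):
    """Single-threaded toy server: round-robin, one command per client.

    Commands between MULTI and EXEC are buffered and executed together
    when EXEC arrives, so nothing from another client lands in between.
    """
    executed = []                                  # global execution order
    queues = {name: list(cmds) for name, cmds in clients.items()}
    pending = {name: None for name in clients}     # open MULTI buffers

    while any(queues.values()):
        for name, queue in queues.items():
            if not queue:
                continue
            cmd = queue.pop(0)
            if cmd == "MULTI":
                pending[name] = []
            elif cmd == "EXEC":
                executed.extend((name, c) for c in pending[name])
                pending[name] = None
            elif pending[name] is not None:
                pending[name].append(cmd)          # buffered, not run yet
            else:
                executed.append((name, cmd))
    return executed

plain = run_server({"A": ["INCR x", "INCR y"], "B": ["GET x", "GET y"]})
multi = run_server({"A": ["MULTI", "INCR x", "INCR y", "EXEC"],
                    "B": ["GET x", "GET y"]})
print(plain)
print(multi)
```

In the first run client B's reads land between A's two increments; in the second run A's increments execute as one uninterrupted block.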


It's not guaranteed to be atomic in failure conditions. Also, the biggest difference between Redis and Geode would be the "distributed" part, which involves maintaining these guarantees across a cluster of machines (which Redis demonstrably doesn't do).


That's pretty hand-wavey. All databases can be made to lose all guarantees under failure conditions.


Here's an old post showing how to create custom indexing schemes in Geode (QuadTreeIndex for spatial data): http://blogs.vmware.com/vfabric/2012/12/gemfire-patternspart...


Two Videos:

1. "Open Sourced GemFire In-Memory Distributed Database and Apache Contributors" (https://www.youtube.com/playlist?list=PL62pIycqXx-TTMXsq09BE...)

2. "Creating a Highly Scalable Stock Prediction System with R, Geode & Spring XD" (https://www.youtube.com/playlist?list=PL62pIycqXx-Rzd_HcjU7Y...)


I'm starting to get lost in the new Apache animals.

How does Apache Geode compare to Apache Ignite (advertised as an "in-memory data fabric")?


Ignite vs Pivotal GemFire (Apache Geode): https://ignite.apache.org/use-cases/compare/gemfire.html

This is cool: "Ignite allows for most of the data structures from java.util.concurrent framework to be used in a distributed fashion." https://ignite.apache.org/features/datastructures.html


Apache Geode and Apache Ignite are more similar than they are different.

Apache Ignite, based off the commercial distribution GridGain, is newer to market.

Apache Geode, based off the commercial distribution GemFire, has a long history in the market.


I wonder why Apache would need to have two "more similar than different" products.


Apache accepts anything that companies donate. (They say they don't, but it's hard to find anything they've rejected.)


That's not quite the right emphasis.

Apache is happy to provide a home for any community that is willing to adhere to our governance rules and traditions. Competing projects are OK.

Projects are almost never rejected because preparing a proposal for incubation is rigorous and many projects who would be a poor fit self-select out.

Source: former VP Apache Incubator, who has both helped prepare successful proposals and privately counseled projects who decided not to come to Apache.


because people are interested in maintaining each


It looks better. It has been in development since 2002.


Can anyone comment on Geode's non-Java support?

I'm asking because a lot of Java "big data" stuff tends to prioritize Java clients (ZooKeeper, Kafka, Hadoop HDFS, Storm, VoltDB and HBase come to mind), and while there are sometimes clients in other languages, they tend to be second-class citizens that take years to reach feature/performance parity with the Java stuff.

For example, last I checked there still wasn't a mature, feature-complete Kafka client (consumer and producer with built in offset management) for Go.


GemFire (on which Geode is based) has a fully featured C, C++, C# client which has feature parity with the Java client. I don't know if Pivotal is going to open source these clients too.

There is, however, a REST API and a Python client: https://github.com/gemfire/py-gemfire-rest


Sorry for the stupid question: how do in-memory DBs deal with power failures? E.g. someone walking into the server room and tripping over the power cable.

I can understand for read-only in-memory, but what about writes?


In the case of Geode, you can make a "Region" (think table) persistent on disk and use a shared-nothing architecture [1] to avoid SPOFs.

What's also interesting is that we offer a very efficient way to recover data from disk as well[2] in the case of a crash of a single node or the entire cluster.

[1] https://en.wikipedia.org/wiki/Shared_nothing_architecture

[2] http://gemfire.docs.pivotal.io/docs-gemfire/latest/managing/...
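
As a generic illustration of how an in-memory store can survive a power cut: every write is appended to a file and forced to disk before memory is updated, and on restart the file is replayed to rebuild the in-memory state. This is the textbook append-only-log idea, not Geode's actual disk format; the `PersistentRegion` class, file name and record layout are invented for the example.

```python
import json
import os
import tempfile

class PersistentRegion:
    """In-memory dict backed by an append-only log of put operations."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        if os.path.exists(log_path):
            # Recovery: replay every logged put in order.
            with open(log_path) as log:
                for line in log:
                    record = json.loads(line)
                    self.data[record["key"]] = record["value"]

    def put(self, key, value):
        # Append and force to disk before updating memory.
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.data[key] = value

path = os.path.join(tempfile.mkdtemp(), "region.log")
region = PersistentRegion(path)
region.put("ticket:1", "sold")
region.put("ticket:1", "refunded")

recovered = PersistentRegion(path)   # simulates a restart after a crash
print(recovered.data)
```

Because the log is replayed in order, the recovered region holds the latest value written for each key.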


Also, Geode supports redundancy zones so data is replicated such that a cluster can survive a rack failure without data loss.
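
The idea can be sketched as a placement rule: the redundant copy of each bucket of data must live on a member in a different zone from the primary copy. The Python below is a simplified illustration with made-up member and zone names; it is not Geode's actual placement algorithm.

```python
# Hypothetical cluster: member name -> redundancy zone (e.g. a rack).
members = {"m1": "rackA", "m2": "rackA", "m3": "rackB", "m4": "rackB"}

def place_bucket(bucket_id, members):
    """Pick a primary and one redundant copy in different zones."""
    names = sorted(members)
    primary = names[bucket_id % len(names)]
    # Walk the ring from the primary until a member in another zone is found.
    for step in range(1, len(names)):
        candidate = names[(bucket_id + step) % len(names)]
        if members[candidate] != members[primary]:
            return primary, candidate
    raise ValueError("need members in at least two zones")

placement = {b: place_bucket(b, members) for b in range(8)}

def survives_loss_of(zone):
    """True if every bucket still has a copy outside the failed zone."""
    return all(any(members[m] != zone for m in copies)
               for copies in placement.values())

print(placement)
print(survives_loss_of("rackA"), survives_loss_of("rackB"))
```

With every bucket's two copies in different racks, losing either rack still leaves one copy of each bucket.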


FWIW, Pivotal is hiring in our Big Data team, largely based in Palo Alto. Geode (incubating), HAWQ (incubating), Greenplum, Pivotal HD, MADlib etc are all mostly developed with engineering effort that we donate.

Hit me up with an email (jchester@pivotal.io) or visit pivotal.io/careers if you're interested.


> Gemcached > Geode servers can be configured to talk memcached protocol.

Hey, this is very interesting: it could work as a persistent ACID memcached drop-in replacement!


Yep. There is also some work underway to talk the Redis protocol: https://cwiki.apache.org/confluence/display/GEODE/Geode+Redi...


There's already memcacheDB for that https://github.com/LMDB/memcachedb


That one doesn't seem distributed. But I just glanced at the readme.


Distribution is usually handled by the memcache client libraries.
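
In its simplest form that means the client hashes each key to pick a server, so the servers never need to know about each other. A minimal Python sketch, with hypothetical server addresses and plain modulo hashing (real client libraries often use consistent hashing instead, so that adding a server remaps fewer keys):

```python
import hashlib

# Hypothetical server list; each one is an independent memcached-style node.
servers = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]

def server_for(key, servers):
    """Map a key to a server with a stable hash (not Python's salted hash())."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]

for key in ["user:1", "user:2", "session:abc"]:
    print(key, "->", server_for(key, servers))
```

Every client that uses the same server list and hash function sends a given key to the same server.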


Some advantages of gemcached over memcached are here:

https://cwiki.apache.org/confluence/display/GEODE/Moving+fro...



I believe this competes with another well-known open-source in-memory data grid, Hazelcast. Worth checking out.


Last time I checked Hazelcast couldn't be run standalone. Also I wouldn't use it as a database for anything more than a few MB of data.

This is probably closer to Apache Ignite aka Gridgain.


Can you explain why? I remember reading a ton about off-heap memory work they were doing.



If you are interested in joining the developer community around Apache Geode, subscribe via: dev-subscribe@geode.incubator.apache.org


It looks similar to Hazelcast / Oracle Coherence?


Yes, Geode, Coherence, Ignite and Hazelcast are all quite similar: grid computing.


Would you choose it over Datomic?


Datomic's immutable storage and time-travel query capabilities are awesome, and I often miss them in other DBs. But Datomic currently isn't designed for write-intensive workloads. And while you can shard Datomic's transactor and then combine multiple DBs in a query (http://nosql.mypopescu.com/post/19310504456/thoughts-about-d...), that's only going to get you so far.

However, Apache Geode lets you add custom indexes so it might not be too hard to add Clojure's persistent data structures as a custom index scheme and hook in Apache Geode as a backend to Clojure Datalog:

Clojure Datalog: https://github.com/fogus/bacwn

DataScript: https://github.com/tonsky/datascript

Clojure's Persistent Data Structures for Java: https://github.com/grignaak/clj-ds


Let me guess: it's written in Java?


Yes.

Why are all Apache projects written in Java?


All: this is a major project (with a long history: http://geode.incubator.apache.org/about) and a major release. It deserves a substantive thread, so please let's not get sidetracked by a language troll.



That's Apache httpd. Apache is a software organization.


There are a few good reasons, but more importantly, why does it matter?


Can't speak for others, but I find deploying the JVM alongside non-JVM software to create a full product pretty frustrating.



