Ex-GlusterFS person here (used to work at Red Hat on the project side, leaving mid last year).
"Small file access", and "lots of files in a directory" have been a pain point with GlusterFS for ages. The 3.7.0 release had some important improvements in it, specificially designed to help fix that:
The latest Gluster release is 3.7.8 (the same series as 3.7.0), and is worth looking at if you need a good distributed file system. If you have something like 1 million files in a single directory though... hrmmm... NFS or other technologies might still be a better idea. ;)
Hi. I'm one of the GlusterFS developers. Hi, Justin. ;)
The basic answer to your question is that networks are slow compared to local storage. In order to get decent performance, you must either avoid network round trips or amortize their cost over many operations. We - not just GlusterFS but distributed filesystems in general - can do this pretty well for some operations. We can batch, buffer, cache, etc. This works great for plain old reads and writes to large files (for example). It doesn't work so well for operations that have to touch many small files. For those, in order to ensure the required level of consistency/currency, we have two choices.
(1) Send a request per file to get current metadata.
(2) Cache metadata, and participate in some sort of consistency protocol to make sure we don't serve stale cached information.
Both approaches have workloads where they perform better and workloads where they perform worse. In addition, the second approach adds a lot of complexity, especially in a system where failures are common and a loosely coordinated set of servers must respond (systems with a single master server have an easier time here but are less resilient). The inherent difficulty of this approach is why e.g. CephFS has taken so long to mature.
GlusterFS took the first route instead. It does mean that "ZOT" (Zillions Of Tiny files) workloads will perform poorly. I won't deny that. On the other hand, it's easier to test or prove correct, and the time not spent on solving the hard version of that problem - often for little eventual benefit - can instead be spent on other kinds of improvements. Some people are happy with that. Some are not. Some spread FUD. Some try to implement the practical equivalent of a distributed filesystem on top of some alternative (e.g. object stores) with their own even more serious limitations, and experience even more pain as a result. Some are initially unhappy with these tradeoffs, but work with us and learn to work more effectively with these limitations to enjoy other benefits. That's life in the big city.
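The cost of route (1) above comes down to simple arithmetic: one network round trip per file dominates everything else. A rough back-of-the-envelope sketch (the latency numbers are assumptions, not measurements of any real system):

```python
# Hypothetical illustration (not GlusterFS code): why per-file network
# round trips dominate for "zillions of tiny files" workloads.
RTT_MS = 0.5  # assumed network round-trip time per metadata request

def per_file_cost(n_files, rtt_ms=RTT_MS):
    """Route (1): one metadata request per file, paid sequentially."""
    return n_files * rtt_ms

def batched_cost(n_files, batch_size=1000, rtt_ms=RTT_MS):
    """Hypothetical batching: amortize one round trip over many files.
    This is exactly what's hard to do safely for metadata without the
    consistency protocol of route (2)."""
    n_batches = -(-n_files // batch_size)  # ceiling division
    return n_batches * rtt_ms

if __name__ == "__main__":
    n = 1_000_000
    print(f"per-file: {per_file_cost(n) / 1000:.0f} s")  # minutes of waiting
    print(f"batched:  {batched_cost(n) / 1000:.1f} s")
```

Even a sub-millisecond round trip turns into minutes when multiplied by a million files, which is why batching, caching, and amortization matter so much for large-file workloads and help so little here.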
It's not uncommon with distributed storage to err one way or another, because you have to store large amounts of data and metadata. Designing a system that does both extremely well is non-trivial. You can optimize to store a few very large files or many very small files, but doing both while scaling reliably is harder than you might think.
Naively it seems weird, since metadata is just data, right? So if you can store a large amount of data in a file, you should be able to store the metadata too.
But I don't know shit about filesystem design, soooooo...
Well, the metadata I'm referring to is the metadata required to locate a file: mappings from directory structure, filename, and possibly permissions/ownership to the blocks and replicas distributed across a cluster. If you get to the point where you need to start locating metadata, because it might also be arbitrarily distributed, you now need meta-metadata. And I'm sure you can see how complex that's going to get :)
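A toy sketch of what that per-file location metadata looks like (the names and shapes here are illustrative, not any real filesystem's schema):

```python
# Toy model of distributed-FS location metadata: path -> attributes
# plus the blocks and their replica placements across the cluster.
file_metadata = {
    "/photos/cat.jpg": {
        "mode": 0o644,
        "owner": "alice",
        "blocks": [
            # each block is replicated on several nodes
            {"id": 0, "replicas": ["node-3", "node-7"]},
            {"id": 1, "replicas": ["node-1", "node-4"]},
        ],
    },
}

def locate(path):
    """Resolve a path to the nodes holding each of its blocks."""
    meta = file_metadata[path]
    return [b["replicas"] for b in meta["blocks"]]

# If this table itself outgrows one server and has to be sharded across
# the cluster, you need a second table saying where each shard lives --
# that's the "meta-metadata" problem described above.
```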
From memory, it was something about not having a central metadata server combined with needing to talk with each of the nodes to get their idea of "the latest version" of file info. Especially tricky when taking into account edge cases, such as when some nodes are failed/down/etc.
Unfortunately, I don't remember most of the details. It's fallen out of my headspace in the months since I've left. ;)
Well, if you use XFS/ext4/etc. (local file systems), they do decently. You can make them reasonably resilient by using multiple volumes, split across hosts using SRP (e.g. the InfiniBand block storage interconnect). That's pretty strictly not WAN-suitable though. ;)
I worked with a GlusterFS deployment in production about 2 years ago, and it was such a nightmare that I both feel compelled to write about it and never touch anything made by that team ever again.
It was the whole shebang: kernel panics, inconsistent views, data loss, very slow performance, split-brain problems all the time. Our setup, IIRC, was very simple: two bricks in a replicated volume. It worked so poorly that we had to take it out of production. Some of our experience can be explained by GlusterFS performing poorly under network partitions, but nothing could justify kernel panics. It blew my mind that Red Hat acquired that company and product.
Edit: I hope there's been a big improvement to the reliability and performance of GlusterFS. Can anyone with more recent experience running it in production comment?
I'm not a GlusterFS expert, and haven't used it before, but you should know that most consensus algorithms (Paxos, Raft, etc.) only tolerate failures well with an odd number of nodes. I have to wonder if your problems were mostly self-inflicted from having 2 nodes. Of course, any network partition in a 2-node cluster has a huge potential for data corruption, as each node now thinks it is the master (split-brain).
In a 3-node cluster, any system with a decent consensus algorithm (to be clear, I'm not sure if GlusterFS has one) would know that during a partition the cluster can only continue to operate if at least 2 nodes can communicate with each other to elect a new master.
Uh, well, if you base your GlusterFS experience on that, no wonder you have a bad view of the technology. I can only assume that your setup only used two nodes, which is not really supported, and of course will cause split-brain problems. And two bricks, really? GlusterFS is aimed at scale on both the server and client side; two bricks is not really scale.
There's probably a good case study to be made here on why so many people have had bad experiences with GlusterFS. If the problem is that it's too easy to set it up in a fragile way, then they need to fix that, because otherwise you're going to end up with a lot of unhappy users (as you can see in the other comments).
Still, if I can't make it work well on a reasonably small scale, why would I expect it to work any better when I scale it up?
I would suggest reading the documentation and understanding the architecture. It works well for many applications but some are less ideal. Maybe yours was a corner case? I don't know. But it's used by many, for instance Facebook (https://www.socallinuxexpo.org/scale/14x/presentations/scali...).
> Of course, any network partition in a 2-node cluster has a huge potential for data corruption, as each node now thinks it is the master (split-brain).
This is not a given; the cluster can and should refuse to operate if it cannot get a majority vote for the master, if that is required to prevent data corruption. With two nodes, that means both nodes must be active and reachable; with three nodes, one node can fail.
Typically, clusters without a majority (2 out of 3, 3 out of 5, 4 out of 7, etc) of nodes present will shut themselves down to prevent data corruption.
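The majority rule above is just integer arithmetic, which makes it easy to see why a 2-node cluster is the worst case. A minimal sketch:

```python
def majority(cluster_size):
    """Smallest node count that constitutes a majority of the cluster."""
    return cluster_size // 2 + 1

def has_quorum(reachable, cluster_size):
    """A partition may keep serving only if it holds a majority."""
    return reachable >= majority(cluster_size)

# 2-node cluster: any partition leaves each side with 1 < 2, so both
# halt (or, without quorum enforcement, both keep writing: split-brain).
# 3-node cluster: the 2-node side keeps quorum; the 1-node side halts.
```

Note that with 2 nodes, majority(2) == 2, so a single node failure takes the whole cluster down: you get the cost of replication with none of the availability.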
You're right: nothing justifies kernel panics. There is nothing GlusterFS, or any other user-space program, should be able to do that causes one. We (yes, I'm a GlusterFS developer) don't really do anything, as far as the kernel is concerned, that any application shouldn't be able to do. If what we do on behalf of our callers causes a crash, it's the kernel developers' fault, and you should engage with them instead of blaming your peers out in user-land.
As far as performing poorly under network partitions, I'd love to hear more. That is our responsibility, and sounds like something we can/should fix.
My experience was not as bad as yours, but the problems I saw at a similarly small scale (6 servers, 3 2X replica sets) make me glad I'm not having to scale up a GlusterFS infrastructure. The biggest problem I saw was with absolutely atrocious performance when healing after a temporary node loss. Even if it was just a loss of a minute or two (server reboot), it wouldn't lose availability or corrupt files, but performance would get so bad for the next 2 hours that it might as well have been down.
If I had to make a comparison, I'd say GlusterFS reminds me a lot of MongoDB in the beginning. It wins a lot of kudos at the outset based on ease of setup, management and CLI UI, plus it has a good "story" on ability to scale up that gradually begins to fray when pushed. Hopefully there have been big improvements.
Edit, based on another comment: I forgot to add that I always had an odd number of servers in the GlusterFS cluster to help with consensus, even if one of those servers did not actually hold any data bricks.
I feel compelled to write a disagreement. I've been running GlusterFS for 6 months with 15TB of data on a 43TB cluster using 5 servers with zero issues. I have no idea what your particular combination of bad luck was, but I don't think your experience is truly reflective of the product, the team, or Red Hat's sensibilities.
You're living the dream then, and I'm glad to hear this is possible. :)
Do you mind sharing which GlusterFS version you're on, which kernel, and roughly what your load profile is? (eg. lots of small files, read-heavy, write-heavy, etc.)
Also, are you using Red Hat Gluster Storage or the open source version?
Last time I tried GlusterFS was in 2012. The way it worked was very impressive back then and I would have loved to actually put it into production.
Unfortunately, I hit a roadblock in relation to enumeration of huge directories: Even with just 5K files in a directory, performance started to drop really badly to the point where enumerating a directory containing 10K files would take longer than 5 minutes.
Yes. You're not supposed to store many files in a directory, but this was about giving third parties FTP upload access for product pictures and I can't possibly ask them to follow any schema for file and folder naming. These people want a directory to put stuff to with their GUI FTP client and they want their client to be able to not upload files if the target already exists. So having all files in one directory was a huge improvement UX-wise.
So in the end, I had to move to NFS on top of DRBD to provide shared backend storage. Enumerating 20K files over NFS still isn't fast, but it completes within 2 seconds instead of more than 5 minutes.
Of course, now that we're talking about GlusterFS, I wonder whether this has been fixed since?
I started using Gluster this last fall. At first it was not a contender - the articles I read were not encouraging. However, many of those articles were older, and it appears a lot of progress has been made with Gluster since version 3.5.
It may be worth a second look - however, a directory with a large file count might still be a problem.
Couldn't your FTP server have handled this in a clever way? For example, sort files into directories by first letter, then by first two letters, while still providing a virtual flat view to the FTP client user. It'd be a simple mapping away.
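The mapping could be as simple as the following sketch (the two-level prefix scheme and the `shard_path` name are just one illustrative choice, not any FTP server's actual feature):

```python
import os

def shard_path(filename, levels=2):
    """Map a flat filename to nested shard directories by prefix,
    e.g. 'widget.jpg' -> 'w/wi/widget.jpg'. The client still sees a
    flat directory; the server applies this mapping on every access."""
    name = filename.lower()
    # one directory level per prefix length: 'w', then 'wi', ...
    parts = [name[: i + 1] for i in range(min(levels, len(name)))]
    return os.path.join(*parts, filename)
```

This keeps each physical directory small (at most a few hundred entries per prefix for typical filename distributions), which sidesteps the large-directory enumeration problem entirely.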
I'm not sure what is being announced here. GlusterFS has been around for a few years already (version 3.6 now), and the article doesn't mention any managed service based on it being launched. It reads more like a reminder that you can set up a distributed file system on your cloud servers using Gluster. There isn't even a step-by-step tutorial on how to do that.
That's what I find confusing. Was it not available before now? Couldn't you just spin up some Red Hat instances and run it? It's not clear what's actually being announced - it doesn't sound like a managed service.
How will this availability be implemented, then? I can install it now, so it is already available to me. Will they provide a service like Amazon Elastic File System? Or will there be some pre-configured images?
I think, though I'm not sure, that it's an official, active support relationship between Google and Red Hat for GlusterFS on GCE, related to Red Hat's vendor certification programs.
If so, this seems like the kind of thing that has the most impact on people with support contracts with Red Hat, but should also have peripheral impacts on the stability, quality of support, etc., of the product on the GCE platform generally.
Basically, GlusterFS is trying to solve a hard problem: make a distributed/remote filesystem feel like a local filesystem to applications built on top of it. For the client, you can choose from NFS, SMB, or its homegrown FUSE client, which makes the remote system accessible as if everything were on the local file system. I used to build similar systems in house and found them extremely painful to design and maintain; we did lots of custom hacks to make our system suit our needs. GlusterFS, as a general solution, won't have that much flexibility and may or may not suit your custom needs.
Overall, I feel AWS S3 is a better (or at least simpler) approach. Just acknowledge that files are not locally stored and use them as is. AWS is experimenting with EFS as well, which we also found less than desirable.
Edit: I am not saying that you cannot make GlusterFS or EFS perform great. My point is that it's hard to do so, and might not be worth the effort to develop such a system given that S3 can serve most needs of distributed file storage.
Aren't you comparing apples to oranges? S3 is an object store (non-POSIX, and only eventually consistent); GlusterFS is neither of those. They simply solve different problem spaces.
Parent is saying that it might be simpler for many applications to forgo the need for a file system abstraction and just use S3's more limited API directly. That way, you don't get bit by thorny edge cases like "ls" of 10K files taking orders of magnitude more time than you'd expect given you're used to the speed of a local disk.
I needed a shared volume across multiple EC2 instances in a VPC. My use case is that multiple "ingress" boxes write files to the shared volume, and then a single "worker" box processes those files. This is a somewhat unusual use case in that it means one box is responsible for 99% of IO heavy operations, and the other boxes are responsible only for writing to the volume, with no latency requirements.
My solution was to mount an EBS on the "worker box," along with an NFS server. Each "ingress box" runs an NFS client that connects to the server via its internal VPC IP address, and mounts the NFS volume to a local directory. It works wonderfully. In three months of running this setup, I've had no downtime or issues, not even minor ones. Granted I don't need any kind of extreme I/O performance, so I haven't measured it, but this system took less than an hour to setup and fit my needs perfectly.
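For anyone wanting to replicate that setup, it boils down to a handful of commands; the paths and the 10.0.x.x VPC addresses below are placeholders for your own, and the package names assume a Debian/Ubuntu image:

```shell
# On the "worker" box (NFS server), with the EBS volume mounted at /data:
sudo apt-get install -y nfs-kernel-server
# Export /data to the VPC subnet; sync favors safety over write latency.
echo '/data 10.0.0.0/24(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra

# On each "ingress" box (NFS client), mounting via the worker's VPC IP:
sudo apt-get install -y nfs-common
sudo mkdir -p /mnt/shared
sudo mount -t nfs 10.0.0.5:/data /mnt/shared
```

The single-server design means the worker box is a single point of failure, which is the trade-off accepted above in exchange for simplicity.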
It's a bit light for a press release. Considering Red Hat is officially promoting AWS on their website, it would have been helpful to provide more information so people know whether the offering on Google Cloud will be better than, or merely similar to, that.
Ugh. So it's literally still RH Gluster, needing manual setup.
Also, there's a sentence on the end of the GlusterFS section there saying:
"If you want to deploy a Red Hat Gluster Storage cluster on Compute Engine, see this white paper for instructions on how to provision a multi-node cluster that includes cross-zone and cross-region replication:"
Apart from the typo (pedant alert!), the URL on "white paper" goes to a non-public document only Red Hat subscribers have access to. That should probably be fixed, so non-Red-Hat-subscribers can read the doc and know what they'll need to do up front.
If people need to subscribe to RH in order to get that info... just to know what they need to do... that's probably going to hinder adoption. Potentially by a lot. ;)
Before reading the article, I was going to ask if it solves the "high read access of many small files" I/O problem, but alas, it's on GlusterFS, so only insomuch as Gluster has been making improvements these last few minor releases.
Is anyone here running a GlusterFS setup with high read/write volume on small files successfully? If so, what's your secret?
If you are looking for a POSIX-compatible file system for GCE or EC2, we think our ObjectiveFS[1] is the easiest way to get started. It is a log-structured filesystem using GCS or S3 for backend storage, with a ZFS-like interface to manage your filesystems.
Glad to see Gluster is still making waves. I was an early customer. It's impressive when a brand survives acquisition much less a transition into a new type of offering like this. Kudos to everyone who helped make Gluster special!
I wonder why they even did this. They already have a state of the art distributed filesystem (Colossus) which doesn't have any scalability problems at all, since they use it for everything.
Disclaimer: I work on GCP.
GlusterFS works best on RHEL, and consumes normal GCP resources like GCE and PD-SSD storage. To host a rocking fast, best practices, HA, 3TB all-SSD filer, it'd be less than $900 on GCP: https://cloud.google.com/products/calculator/#id=e76e9a5a-bf...
I see that the GlusterFS FAQ says it is fully POSIX compliant. That's a pretty good trick. Ten years ago or so I had a suite of compliance tests I would use to embarrass salesmen from iBrix and Panasas. The only actually POSIX-compliant distributed filesystem I could find in those days was Lustre (unrelated to Gluster, despite the naming). Lustre works well, but it's almost impossible to install and operate.
I remember at around the same time I was playing with GlusterFS, and it was interesting, but painful to configure as well. It was prone to client/server configuration mismatches, and you had to reason carefully about where you enabled a feature, the client or the server, to make sure it failed in a safe way. It did have support for fcntl and flock file locking, though, which was interesting at the time. Unfortunately it was also somewhat unstable, and I would see segfaults every few weeks. My focus for work projects moved into different areas and I didn't keep up with its development, so I'm not sure what they've done in the last decade, but it was promising and refreshingly new back then. I should take another look.
I invite you to check out our product, Quobyte (www.quobyte.com). Not open-source, but a parallel high-performance POSIX file system, with split-brain safe quorum replication of all components, can also do erasure coding for files, with policy-based data placement, running on standard server hardware.
We designed it to yield top performance both for (parallel) file system workloads and block storage workloads. So you can run VMs and databases on it with a performance better than any other partition-tolerant software storage system. The goal is to provide customers with a scalable automated general storage platform for all workloads, a la Google, but for real world applications.
In HPC/HTC, Lustre is very common, especially in DOE labs. I've never tried to install it, but I don't think my colleagues who have are some kind of special genius.
Most HPC labs run distributions that go to quite a lot of trouble to ensure Lustre works well.
The issue with Lustre is that it usually requires kernel patches if you are not running the aforementioned distributions designed for HPC. Specifically, the Lustre OSD used a modified version of ext3/ext4 (ldiskfs), among other issues.
That could be getting better now with the new ZFS based OSD though I wouldn't put money on it being "easy" to install.
Lustre also doesn't provide any means of replication. If you want to achieve HA with Lustre you need to make each OSD individually HA. This can be done with multi-pathed SAS arrays and a ton of scripting but it's still not exactly a walk in the park.
Hopefully one day we will see a real high-performance distributed filesystem that also bundles replication, tiering, and some semblance of POSIX compatibility.
I doubt Ceph is going to be it so we are probably still 5-10 years from a solution to the problem.
Well, most HPC installations I know of are just running RHEL6/7 (except for NERSC, which is using Cray Linux based on SUSE still, I think?), which I wouldn't peg as particularly exotic. The HA thing is known; typically Lustre is just one piece of the file system, along with local disks and GPFS (and now Ceph), and HPSS/tape (which can be a tier of GPFS).
I'm not so sure the hope for "one filesystem to rule them all" will ever really work out, but Ceph is the best positioned.
Not mentioned yet is that nearly all object stores (S3, GCS, etc.) are pretty awful for small, random reads and writes. For workloads like rendering, the renderer is often reading a small tile of a texture (an EXR or tiff) with about 32x32 pixels (so anywhere between 1 KiB and 16 KiB depending on precision and compression). Object stores on the other hand are tuned for big files and so often work in 1 to 4 MiB chunks at best.
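The mismatch is easy to quantify. A minimal sketch, assuming uncompressed tiles stored contiguously in row-major order (a simplification: real EXR/TIFF files use per-tile offset tables rather than fixed strides):

```python
def tile_byte_range(tile_x, tile_y, tiles_per_row, bytes_per_tile):
    """Byte range for one tile under the simplified contiguous layout.
    Returns an inclusive (start, end) pair, HTTP Range-header style."""
    index = tile_y * tiles_per_row + tile_x
    offset = index * bytes_per_tile
    return offset, offset + bytes_per_tile - 1

# A 32x32 tile of 4-channel half-float pixels: 32*32*4*2 = 8192 bytes.
# An object store that internally reads 4 MiB chunks pulls ~512x the
# data the renderer actually needed for that one tile.
```

Ranged GETs help, but the store's internal chunking and per-request latency still make thousands of 8 KiB reads far slower than the same reads against a local or distributed POSIX filesystem.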
Because if you have apps that weren't written at a time when S3 pricing made any sense at all (it wasn't that long ago that you could afford to buy a new file server and JBOD every month for what it would've cost to store a few dozen terabytes on S3), migration is a huge burden and expense.
Really looking forward to EFS to take a big bite out of that commitment since it has reasonable pricing for projects with smaller storage requirements.
Just switching a portion (photos) of the project with the most files to S3 took on the order of 800 (mostly developer) hours.
At some point you have to decide what you can do with the hand you're dealt, and whether you want to distribute those tens of thousands of dollars to your employees, pursue new business, or keep it; or reinvest it into an existing project despite having no mandate to do so. It's not an easy or cut-and-dried decision to make (even today) for legacy projects, IME.
If you're starting a new project, yes. S3 is the most reliable service in the AWS stack (AFAIK) by a big margin. The pricing is very reasonable today unless you're in the petabyte range, maybe, and you have no operations overhead. Agreed, it's a no-brainer.