This week I started the process of migrating from GFS to GlusterFS. The hardware running my GFS cluster is older and I decided it was better to replace it than continue maintaining it.
Back in 2003 I needed to find a storage solution that was fast, reliable, and fault-tolerant. It also needed to be accessible to multiple clients simultaneously. After doing much research, I ended up going with an IBM FAStT 600 2Gb Fibre Channel (FC) based storage system. The intent was to replace a single server running a JBOD SCSI disk chassis and a RAID-5 card, acting as an NFS server.
So, with 10 new servers, 6 FC HBAs, a pair of 8-port FC switches, and RedHat Enterprise Linux Advanced Server 2.1 with the GFS add-on, I began configuring the new cluster setup. The primary purpose of this cluster was to serve several thousand customer home directories to web and mail servers. Mail was delivered in Maildir format and stored in each user's home directory. Each user had a disk quota, so with everything contained in one directory per user, things were organized very cleanly. I split the home directories up logically, creating 3 base home directories (one per server as the "primary") and then breaking them down by the first letter of the login, i.e. /homes1/a, /homes2/a, /homes3/a, /homes1/b, etc. When all was said and done, my home directory was in /homes1/m/marc. So far so good.
The next step was to get the servers talking to the FAStT. So, using the QLogic configuration tools, I set up the multi-pathing through redundant switches and connected everything up. I partitioned the storage array and got GFS up and running. Since I had never worked with a shared storage system like this before, it took me several weeks of trial and error to get things going. Once everything was set up, I began the migration process, which went off without a hitch. Time to make it live.
What an abysmal failure that was. It ran for only a day or two before I migrated everything back. The problem was GFS's distributed lock manager (DLM). In order to perform an action on a file, all of the active nodes had to agree that the file was locked before the operation could begin. This was a mail store for thousands of users, moving hundreds of thousands of email messages on and off of it each day. In addition, web pages were served from the same store. The DLM simply couldn't keep up with the sheer number of file operations that I needed.
So, the FAStT sat mostly idle for a few years.
Current Cluster Configuration
Several years ago, I sold the ISP business and started focusing on consulting and my VoIP products. I still had the FAStT setup and several of the servers as a starting point. RHEL was now up to version 5, so I decided to test again after 5 more years of development in shared storage and GFS. This time my needs were very different. I still needed highly available shared storage for the VoIP system. The intent was to share voicemail files and other configuration files between servers in a way that left no single point of failure. So, four systems, each with dual-port HBAs, multi-pathing, and two FC switches, ensured there was no single point of failure in accessing the FAStT. I set the FAStT up to run RAID 1+0 with two hot spares. Very fast, and very fault tolerant.
GFS in this environment works really well. It is very fast, and I can exceed 200MB per second throughput for reading and writing. A single system can fail and the others all still have access to shared storage. Perfect. Most of the time. There have been a couple of instances in the last 3 years where I have had to shut down the entire cluster and bring it back up because of DLM and quorum issues. This requires a multi-system failure, though, and is fairly rare. Over the last three years, this configuration has provided me with five nines of reliability, i.e. 99.999% uptime.
Time to Expand the Cluster
So now it's time to expand the cluster. I need to add several more machines to it to meet the growing need for capacity. The new servers are in a different cabinet on the other side of the data center and have no HBAs in them. That means I need network-based access to the shared storage. This part is not really a problem, as I have an HA NFS setup running on the cluster.
The FAStT system is also now quite old. The disks need to be replaced and the replacement cost for FC disks is very high compared to commodity SATA disks.
Where the problem comes in is that I also need to create a backup cluster at another physical location using the same shared storage. This is where things get more complicated. If I had an unlimited budget and an unlimited amount of bandwidth, I would have several options available to me from vendors like EMC and IBM, to name just two. Since I have neither unlimited budget nor unlimited bandwidth, I have to explore other avenues.
Looking at GlusterFS
Early last year a colleague of mine suggested I look at GlusterFS as a possible solution. The GlusterFS approach is very different from traditional network storage solutions. At the time I didn't feel that it was quite ready. That has changed in the last year. It has become much more polished, better documented, and more feature complete.
GlusterFS uses the concept of "storage bricks", which are not much more than local storage on each Gluster server. You then combine bricks into volumes, which determine how those bricks are accessed by the clients using what Gluster calls translators. The latest GlusterFS release contains four different translators: distributed, replicated, striped, and distributed replicated. Each has its place in different types of networks. Gluster also lets you stack volumes, which is a very cool feature.
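To make that concrete, here is roughly what creating a distributed-replicated volume looks like. The shared-dr volume name, the server3 and server4 hosts, and the brick paths are all made up for illustration; with "replica 2" and four bricks, Gluster mirrors each consecutive pair of bricks and distributes files across the pairs:

server1# gluster volume create shared-dr replica 2 transport tcp server1:/exports/brick1 server2:/exports/brick1 server3:/exports/brick2 server4:/exports/brick2
server1# gluster volume start shared-dr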
GlusterFS also stores all of its files using standard file systems with extended attributes. This means that normal backup software can easily backup and restore data to the storage bricks.
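You can see this for yourself by looking at a file directly on a brick. Something along these lines (the file name is just a placeholder, and getfattr needs to run as root to see the trusted.* namespace) will show Gluster's bookkeeping attributes, such as trusted.gfid, sitting alongside an otherwise ordinary file:

server1# getfattr -d -m . -e hex /exports/shared/somefile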
The biggest deviation from "traditional" network file systems, as I see it, is the fact that the translators are all client driven. What that means is that the client decides which servers to write files to and how to distribute them. The servers do coordinate things by tracking their own storage bricks, but the heavy lifting is done by the clients themselves. Given where technology stands, LAN speeds will almost certainly outperform hard drive speeds. Gigabit networks are the norm in data centers these days, and 10-gigabit networks are right around the corner. Processors have been following Moore's Law, but hard drives have not, since they still have mechanical parts. SSDs are still far too expensive to be practical, and even then SSDs would still not be cost effective over a gigabit or 10-gigabit network. So with today's technology this approach makes sense.
This same approach also helps it scale. Rather than having expensive file servers spend all of their resources coordinating disk access, file distribution, replication, and so on among themselves, it shifts that work to the clients. This distributes resources in a cost-effective way and makes the system massively scalable.
The latest GlusterFS (3.2.1) also includes geo-replication, which is intended to keep storage volumes in sync even if they are running in different physical locations. This was the last piece I needed before GlusterFS was feature complete enough for my application.
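Setting up a geo-replication session looks to be just as terse as everything else in the CLI. Something along these lines should do it, where backuphost and the target directory are placeholders for the remote site (the slave end can be a directory reachable over SSH or another Gluster volume):

server1# gluster volume geo-replication shared backuphost:/data/shared-backup start
server1# gluster volume geo-replication shared backuphost:/data/shared-backup status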
Actually deploying GlusterFS was very straightforward. I had two servers with freshly installed CentOS 5.5 x86_64 on them; I downloaded the RPMs, installed them, and 10 minutes later I had a replicated volume set up. I installed the glusterfs-core and glusterfs-fuse packages on a third server that will act as a client and, just like the server setup, was off and running in a matter of minutes.
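For the record, the server-side install amounted to something like the following. The exact package file names here are only illustrative and will vary with the release you grab; glusterd is the management daemon that ships with the core package:

server1# rpm -Uvh glusterfs-core-3.2.1-1.x86_64.rpm glusterfs-fuse-3.2.1-1.x86_64.rpm
server1# service glusterd start
server1# chkconfig glusterd on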
Each of the two "server" machines has a pair of 500GB drives running RAID1 (hardware RAID on the first, software RAID on the second). While not necessary, it does add one more level of protection, since the servers won't be tasked solely with running GlusterFS. Creating the volumes was simple:
server1# mkdir -p /exports/shared
server2# mkdir -p /exports/shared
server1# gluster peer probe server2
server1# gluster volume create shared replica 2 transport tcp server1:/exports/shared server2:/exports/shared
server1# gluster volume start shared
client1# mkdir /shared
client1# mount -t glusterfs server1:/shared /shared
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/md0 149408868 2414152 139282720 2% /
tmpfs 1028712 0 1028712 0% /dev/shm
server1:/shared 466539648 66220160 376238336 15% /shared
That was all it took to get GlusterFS up and running. Total time spent was just a few minutes. The GFS setup took many hours to get things just right.
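One small follow-up: to have the client mount come back after a reboot, an fstab entry along these lines should do it (_netdev keeps the mount from being attempted before the network is up):

server1:/shared  /shared  glusterfs  defaults,_netdev  0 0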
There was one last thing I needed to do. Since GlusterFS prefers the 64-bit architecture and I have a mixture of 32- and 64-bit systems, I decided that the 64-bit clients will run the native Gluster client (as illustrated above) and that the 32-bit clients will access it via Gluster's built-in NFS server. So, I needed to tune the volume to have the NFS server return 32-bit inode numbers for NFS access. This was also very simple:
server1# gluster volume set shared nfs.enable-ino32 on
From any 32-bit clients, I can now use:
mount -t nfs server1:/shared /shared
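Gluster's built-in NFS server speaks NFSv3 over TCP, so on clients that try to negotiate something else, it may be necessary to spell the options out (some clients may also want nolock):

mount -t nfs -o vers=3,tcp server1:/shared /shared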
I'm quite impressed with the latest GlusterFS. It made transitioning away from GFS quite painless. It performs very well and is tunable to different workloads. Any future projects I work on that require highly available shared storage will almost certainly be using GlusterFS.