This week I started the process of migrating from GFS to GlusterFS. The hardware running my GFS cluster is older and I decided it was better to replace it than continue maintaining it.
Back in 2003 I needed to find a storage solution that was fast, reliable, and fault-tolerant. It also needed to be accessible to multiple clients simultaneously. After doing much research, I ended up going with an IBM FAStT 600 2Gb Fibre Channel (FC) based storage system. The intent was to replace a single server running a JBOD SCSI disk chassis and a RAID-5 card, acting as an NFS server.
So, with 10 new servers, 6 FC HBAs, a pair of 8-port FC switches, and RedHat Enterprise Linux Advanced Server 2.1 with the GFS add-on, I began configuring the new cluster setup. The primary purpose of this cluster was to serve several thousand customer home directories to web and mail servers. Mail was delivered in Maildir format and stored in each user's home directory. Each user had a disk quota, so with everything contained in one directory per user, things were organized very cleanly. I split the home directories up logically, creating 3 base home directories (one per server as the "primary") and then breaking them down by the first letter of the login, i.e. /homes1/a, /homes2/a, /homes3/a, /homes1/b, etc. When all was said and done, my home directory was in /homes1/m/marc. So far so good.
The next step was to get the servers talking to the FAStT. So, using the QLogic configuration tools, I set up the multi-pathing through redundant switches and connected everything up. I partitioned the storage array and got GFS up and running. Since I had never worked with a shared storage system like this before, it took me several weeks of trial and error to get things going. Once everything was set up, I began the migration process, which went off without a hitch. Time to make it live.
What an abysmal failure that was. It ran for only a day or two before I migrated everything back. The problem was GFS's distributed lock manager (DLM). In order to perform an action on a file, all of the active nodes had to agree that the file was locked before the operation could begin. This was a mail store for thousands of users, moving hundreds of thousands of email messages on and off of it each day. In addition, web pages were served from the same store. The DLM simply couldn't keep up with the sheer number of file operations that I needed.
So, the FAStT sat mostly idle for a few years.
Current Cluster Configuration
Several years ago, I sold the ISP business and started focusing on consulting and my VoIP products. I still had the FAStT setup and several of the servers as a starting point. RHEL was now up to version 5, so I decided to test again after 5 more years of development in shared storage and GFS. This time my needs were very different. I still needed highly available shared storage for the VoIP system. The intent was to share voicemail files and other configuration files between servers in a way that left no single point of failure. So, four systems, each with dual-port HBAs, multi-pathing, and two FC switches, ensured there was no single point of failure in accessing the FAStT. I set the FAStT up to run RAID 1+0 with two hot spares. Very fast, and very fault tolerant.
GFS in this environment works really well. It is very fast, and I can exceed 200MB per second throughput for reading and writing. A single system can fail and the others all still have access to shared storage. Perfect. Most of the time. There have been a couple of instances in the last 3 years where I have had to shut down the entire cluster and bring it back up because of DLM and quorum issues. This requires a multi-system failure, though, and is fairly rare. Over the last three years, this configuration has provided me with five nines of reliability, i.e. 99.999% uptime.
Time to Expand the Cluster
So now it's time to expand the cluster. I need to add several more machines to it to meet the growing need for capacity. The new servers are in a different cabinet on the other side of the data center and have no HBAs in them. That means I need network-based access to the shared storage. This part is not really a problem, as I have an HA NFS setup running on the cluster.
The FAStT system is also now quite old. The disks need to be replaced and the replacement cost for FC disks is very high compared to commodity SATA disks.
Where the problem comes in is that I also need to create a backup cluster at another physical location using the same shared storage. This is where things get more complicated. If I had an unlimited budget and an unlimited amount of bandwidth, I would have several options available to me from vendors like EMC and IBM, to name just two. Since I have neither unlimited budget nor unlimited bandwidth, I have to explore other avenues.
Looking at GlusterFS
Early last year a colleague of mine suggested I look at GlusterFS as a possible solution. The GlusterFS approach is very different from traditional network storage solutions. At the time I didn't feel that it was quite ready. That has changed in the last year. It has become much more polished, better documented, and more feature complete.
GlusterFS uses the concept of "storage bricks", which are not much more than local storage on each Gluster server. You then combine bricks into volumes, which determine how those bricks are accessed by the clients using what Gluster calls translators. The latest GlusterFS release contains four different translators: distributed, replicated, striped, and distributed replicated. Each has its place in different types of networks. Gluster also lets you stack volumes, which is a very cool feature.
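To make that concrete, here is roughly what creating a distributed-replicated volume looks like. The shared-dr volume name, the server3 and server4 hosts, and the brick paths are all made up for illustration; with "replica 2" and four bricks, Gluster mirrors each consecutive pair of bricks and distributes files across the pairs:

server1# gluster volume create shared-dr replica 2 transport tcp server1:/exports/brick1 server2:/exports/brick1 server3:/exports/brick2 server4:/exports/brick2
server1# gluster volume start shared-dr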
GlusterFS also stores all of its files using standard file systems with extended attributes. This means that normal backup software can easily backup and restore data to the storage bricks.
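You can see this for yourself by looking at a file directly on a brick. Something along these lines (the file name is just a placeholder, and getfattr needs to run as root to see the trusted.* namespace) will show Gluster's bookkeeping attributes, such as trusted.gfid, sitting alongside an otherwise ordinary file:

server1# getfattr -d -m . -e hex /exports/shared/somefile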
The biggest deviation from "traditional" network file systems, as I see it, is the fact that the translators are all client driven. What that means is that the client decides which servers to write files to and how to distribute them. The servers do coordinate things by tracking their own storage bricks, but the heavy lifting is done by the clients themselves. Given where technology stands, LAN speeds will almost certainly outperform hard drive speeds. Gigabit networks are the norm in data centers these days, and 10-gigabit networks are right around the corner. Processors have been following Moore's Law, but hard drives have not, since they still have mechanical parts. SSDs are still far too expensive to be practical, and even then SSDs would still not be cost effective over a gigabit or 10-gigabit network. So with today's technology this approach makes sense.
This same approach also helps it scale. Rather than having expensive file servers spend all of their resources coordinating disk access, file distribution, replication, and so on among themselves, it shifts that work to the clients. This distributes resources in a cost-effective way and makes the system massively scalable.
The latest GlusterFS (3.2.1) also includes geo-replication, which is intended to keep storage volumes in sync even if they are running in different physical locations. This was the last piece I needed before GlusterFS was feature complete enough for my application.
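Setting up a geo-replication session looks to be just as terse as everything else in the CLI. Something along these lines should do it, where backuphost and the target directory are placeholders for the remote site (the slave end can be a directory reachable over SSH or another Gluster volume):

server1# gluster volume geo-replication shared backuphost:/data/shared-backup start
server1# gluster volume geo-replication shared backuphost:/data/shared-backup status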
Actually deploying GlusterFS was very straightforward. I had two servers with freshly installed CentOS 5.5 x86_64 on them; I downloaded the RPMs, installed them, and 10 minutes later I had a replicated volume set up. I installed the glusterfs-core and glusterfs-fuse packages on a third server that will act as a client and, just like the server setup, was off and running in a matter of minutes.
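For the record, the server-side install amounted to something like the following. The exact package file names here are only illustrative and will vary with the release you grab; glusterd is the management daemon that ships with the core package:

server1# rpm -Uvh glusterfs-core-3.2.1-1.x86_64.rpm glusterfs-fuse-3.2.1-1.x86_64.rpm
server1# service glusterd start
server1# chkconfig glusterd on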
Each of the two "server" machines has a pair of 500GB drives running RAID1 (hardware RAID on the first, software RAID on the second). While not necessary, it does add one more level of protection, since the servers won't be tasked solely with running GlusterFS. Creating the volumes was simple:
server1# mkdir -p /exports/shared
server2# mkdir -p /exports/shared
server1# gluster peer probe server2
server1# gluster volume create shared replica 2 transport tcp server1:/exports/shared server2:/exports/shared
server1# gluster volume start shared
client1# mkdir /shared
client1# mount -t glusterfs server1:/shared /shared
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/md0 149408868 2414152 139282720 2% /
tmpfs 1028712 0 1028712 0% /dev/shm
server1:/shared 466539648 66220160 376238336 15% /shared
That was all it took to get GlusterFS up and running. Total time spent was just a few minutes. The GFS setup took many hours to get things just right.
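One small follow-up: to have the client mount come back after a reboot, an fstab entry along these lines should do it (_netdev keeps the mount from being attempted before the network is up):

server1:/shared  /shared  glusterfs  defaults,_netdev  0 0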
There was one last thing I needed to do. Since GlusterFS prefers the 64-bit architecture and I have a mixture of 32- and 64-bit systems, I decided that the 64-bit clients will run the native Gluster client (as illustrated above) and that the 32-bit clients will access it via Gluster's built-in NFS server. So, I needed to tune the volume to have the NFS server return 32-bit inode numbers for NFS access. This was also very simple:
server1# gluster volume set shared nfs.enable-ino32 on
From any 32-bit clients, I can now use:
mount -t nfs server1:/shared /shared
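Gluster's built-in NFS server speaks NFSv3 over TCP, so on clients that try to negotiate something else, it may be necessary to spell the options out (some clients may also want nolock):

mount -t nfs -o vers=3,tcp server1:/shared /shared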
I'm quite impressed with the latest GlusterFS. It made transitioning away from GFS quite painless. It performs very well and is tunable to different workloads. Any future projects I work on that require highly available shared storage will almost certainly be using GlusterFS.