Avoiding SPOFs with GlusterFS and Windows

glusterfs · high-availability · windows-7

We have a GlusterFS cluster we use for our processing function. We want to get Windows integrated into it, but are having some trouble figuring out how to avoid the single-point-of-failure that is a Samba server serving a GlusterFS volume.

Our file-flow works like this:

[Diagram: GlusterFS document flow]

  1. Files are read by a Linux processing node.
  2. The files are processed.
  3. Results (can be small, can be quite large) are written back to the GlusterFS volume as they're done.
    • Results can be written to a database instead, or may include several files of various sizes.
  4. The processing node picks up another job off of the queue and GOTO 1.
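
In Ruby terms, each processing node's loop boils down to something like the sketch below. JobQueue, process, job.path and job.id are hypothetical stand-ins for our actual code, and the volume is assumed to be FUSE-mounted at /mnt/gluster:

    require 'fileutils'

    # Hypothetical mount point for the FUSE-mounted GlusterFS volume.
    GLUSTER_MOUNT = '/mnt/gluster'
    RESULT_DIR    = File.join(GLUSTER_MOUNT, 'results')

    loop do
      job  = JobQueue.pop                        # 4. take the next job off the queue
      src  = File.join(GLUSTER_MOUNT, job.path)  # 1. read the file from the volume
      data = File.binread(src)

      result = process(data)                     # 2. do the actual work

      FileUtils.mkdir_p(RESULT_DIR)              # 3. write the results back
      File.binwrite(File.join(RESULT_DIR, "#{job.id}.out"), result)
    end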

Gluster is great since it provides a distributed volume, as well as instant replication. Disaster resilience is nice! We like it.

However, as Windows doesn't have a native GlusterFS client, we need some way for our Windows-based processing nodes to interact with the file store in a similarly resilient way. The GlusterFS documentation states that the way to provide Windows access is to set up a Samba server on top of a mounted GlusterFS volume. That would lead to a file flow like this:

[Diagram: GlusterFS doc-flow via Winders]

That looks like a single-point-of-failure to me.

One option is to cluster Samba, but that appears to be based on unstable code right now and thus out of the running.

So I'm looking for another method.

Some key details about the kinds of data we throw around:

  • Original file-sizes can be anywhere from a few KB to tens of GB.
  • Processed file-sizes can be anywhere from a few KB to a GB or two.
  • Certain processes, such as digging into an archive file like a .zip or .tar, can cause a LOT of further writes as the contained files are imported into the file-store.
  • File-counts can get into the 10's of millions.

This workload does not work with a "static workunit size" Hadoop setup. Similarly, we've evaluated S3-style object-stores, but found them lacking.

Our application is custom written in Ruby, and we do have a Cygwin environment on the Windows nodes. This may help us.

One option I'm considering is a simple HTTP service on a cluster of servers that have the GlusterFS volume mounted. Since all we're doing with Gluster is essentially GET/PUT operations, that seems easily transferable to an HTTP-based file-transfer method. Put them behind a load-balancer pair and the Windows nodes can HTTP PUT to their little blue hearts' content.
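
To make that concrete, the service on each Gluster-mounted box wouldn't need to be much more than the rough Ruby/WEBrick sketch below. The mount point /mnt/gluster/docstore and the /files URL prefix are made up, and a real version would need auth and streaming for the multi-GB files:

    require 'webrick'
    require 'fileutils'

    # Hypothetical mount point where the GlusterFS volume is FUSE-mounted.
    ROOT = '/mnt/gluster/docstore'

    server = WEBrick::HTTPServer.new(Port: 8080)

    server.mount_proc '/files' do |req, res|
      # Map the request path onto the Gluster mount, refusing path traversal.
      rel  = req.path.sub(%r{\A/files/?}, '')
      path = File.expand_path(rel, ROOT)
      raise WEBrick::HTTPStatus::Forbidden unless path.start_with?(ROOT)

      case req.request_method
      when 'GET'
        raise WEBrick::HTTPStatus::NotFound unless File.file?(path)
        res.body = File.binread(path)      # fine for a sketch; stream in real life
      when 'PUT'
        FileUtils.mkdir_p(File.dirname(path))
        File.open(path, 'wb') do |f|
          f.write(req.body)                # ditto: buffers the whole upload
          f.fsync                          # flush to the volume before we ACK
        end
        res.status = 201
      else
        raise WEBrick::HTTPStatus::MethodNotAllowed
      end
    end

    trap('INT') { server.shutdown }
    server.start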

What I don't know is how GlusterFS coherency would be maintained. The HTTP-proxy layer introduces enough latency between when the processing node reports that it is done with a write and when the file is actually visible on the GlusterFS volume that I'm worried later processing stages attempting to pick up the file won't find it. I'm pretty sure that using the direct-io-mode=enable mount-option will help, but I'm not sure if that is enough. What else should I be doing to improve coherency?
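
One mitigation I'm toying with is having the downstream stage poll its own Gluster mount until the file both exists and has stopped growing before it declares the file missing. Something like the helper below, with timeouts that are pure guesses; it feels like a band-aid rather than actual coherency, though.

    # Hypothetical helper for downstream stages: wait until a freshly
    # PUT file is visible, and its size has stopped changing, on this
    # node's own Gluster mount before trying to process it.
    def wait_for_file(path, timeout = 30, settle = 2)
      deadline = Time.now + timeout
      until File.exist?(path)
        return false if Time.now > deadline
        sleep 0.5
      end
      size = -1
      while (new_size = File.size(path)) != size
        size = new_size
        sleep settle
      end
      true
    end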

Or should I be pursuing another method entirely?


As Tom pointed out below, NFS is another option, so I ran a test. Since the above-mentioned files have client-supplied names that can come in any language, we need to preserve those file-names exactly. So I built a directory with these files:

[Screenshot: NFS directory with good names, on the server]

When I mount it from a Server 2008 R2 system with the NFS Client installed, I get a directory listing like this:

[Screenshot: NFS directory with bad names, on the client]

Clearly, Unicode is not being preserved. So NFS isn't going to work for me.

Best Answer

I like GlusterFS. Actually, I adore GlusterFS. As long as you can give it some dedicated bandwidth, everything's fine.

One of the best things about GlusterFS is using it with NFS. One of the surprising things I've been working with lately is NFS on Windows 7 and 2k8R2.

Here's what I'd do.

  1. Set up 2 GlusterFS servers that can export NFS.
  2. Set up a heartbeat link between them.
  3. Deploy something like Heartbeat/Pacemaker perhaps?
  4. Set up a virtual IP (VIP) between your Gluster nodes (see the sketch after this list).
  5. Connect the Windows boxen's mapped network drives using the IP address of the VIP.
  6. Test everything you can possibly imagine.
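
For step 4, the VIP itself doesn't need to be anything fancy. A rough sketch in crm shell syntax, with a placeholder address:

    # Hypothetical Pacemaker/crm snippet: a floating VIP over the two
    # Gluster/NFS servers. 192.0.2.50/24 is a placeholder address.
    crm configure primitive p_gluster_vip ocf:heartbeat:IPaddr2 \
        params ip="192.0.2.50" cidr_netmask="24" \
        op monitor interval="10s"

The Windows boxes then map their drives against the VIP (e.g. mount -o anon \\192.0.2.50\yourvolume Z: with the NFS client installed) and never need to care which Gluster node is actually answering.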

Clustering Samba sounds scary, and even if you do do that, Samba still lacks the ability to behave reliably in some Windows networks (all that NT4 domain compatibility; I never seem to be able to get past that).

I think that because each Gluster node is in distributed-replicated mode, you should theoretically be able to connect to either one and let it worry about moving your data around. As a result, heartbeatd should be the thing that does the redirection and controls which node you're talking to.
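
For reference, the kind of layout I mean looks roughly like this (hostnames and brick paths are placeholders). With the bricks paired across the two servers, each server ends up holding a complete copy of the volume:

    # Hypothetical 2-node distributed-replicated volume; bricks are paired
    # across servers, so gluster1 and gluster2 each hold a full copy.
    gluster volume create docstore replica 2 transport tcp \
        gluster1:/export/brick1 gluster2:/export/brick1 \
        gluster1:/export/brick2 gluster2:/export/brick2
    gluster volume start docstore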

As for your

  • File-counts can get into the 10's of millions.

I suggest that you investigate using XFS as the underlying file system, as it's pretty good with big filesystems and is supported under GlusterFS.
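
If you go that route, the usual recommendation for Gluster bricks is to build XFS with a larger inode size so the extended attributes Gluster relies on stay in the inode. Roughly (device and mount point are placeholders):

    # Hypothetical brick prep: bigger inodes for Gluster's xattrs, and
    # inode64 so inode allocation doesn't stall on a large filesystem.
    mkfs.xfs -i size=512 /dev/sdb1
    mount -o noatime,inode64 /dev/sdb1 /export/brick1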
