
A not so short guide to ZFS on Linux

Updated Oct 16 2013: shadow copies, memory settings and links for further learning.
Updated Nov 15 2013: shadow copies example, samba tuning.

Unless you've been living under a rock you should have by now heard many stories about how awesome ZFS is and the many ways it can help with saving your bacon.

The downside is that ZFS is not available (natively) for Linux because the CDDL license under which it is released is incompatible with the GPL. Assuming you are not interested in converting to one of the many Illumos distributions or FreeBSD, this guide might serve as a starting point if you are attracted by ZFS features but are reluctant to try it out on production systems.

Basically in this post I note down both the thought process and the actual commands for implementing a fileserver for a small office. The fileserver will run as a virtual machine in a large ESXi host and use ZFS as the filesystem for shared data.

Scenario

A small office with fewer than 10 Windows clients. The reasons for choosing ZFS over other filesystems (already tried and tested on Linux) are:
  1. snapshots: we want to take daily snapshots so that users can easily and autonomously recover previous or deleted versions of documents/files
  2. compression: well...to save space and perhaps improve i/o. A quick test with lz4 showed a compression ratio between 2.3X and 1.27X with no performance loss
Deduplication will not be activated because it is heavy, the data that will sit on the disks is highly varied and, IMHO, it is simply not stable enough yet for first tier storage.

The plan

Create a new VM on the ESXi host to which we need to assign 4 CPUs and 8GB of RAM (as ZFS is CPU and RAM hungry). The VM will use a small disk for the OS and then a number of larger disks for the ZFS pool.

OS is CentOS x64, ZFS will be built from source rpms from zfsonlinux.org.

Plan b

Very important: as an emergency recovery plan we also want to make sure that we can boot the VM from an Illumos live CD, mount the zfs pool and access the data. After lurking on the zfsonlinux Google Group for a while I can tell you that mounting your pool on an Illumos derivative to fix errors or just to regain access to data is a suggestion I have seen far too often to ignore it.

1. OS and ZFS installation

Install a base CentOS server. I started with a small 8GB disk for the root, boot and swap volumes.

Tip n.1: do not use the default VMWare paravirtual SCSI adapter because it is not supported by Illumos/Solaris: opt for LSI Logic Parallel instead.

Tip n.2: Since ZFS recommends referencing disks by id we will have to edit the vm and set an advanced option to enable that feature (VMWare does not support it by default). See the image below for directions or follow this link.
Enable disk id support in VMWare before creating zpools

During the install I resized the swap partition to 2GB instead of the default 4GB (half the RAM).
After the installer is done remember to disable SELinux (edit /etc/sysconfig/selinux and set SELINUX to disabled). I also usually run yum -y update && reboot to bring the system up to date and then move on with configuration.
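
The SELinux edit can also be scripted. What follows is a minimal sketch, not part of the original procedure: the function name is made up, and it assumes the stock CentOS layout where /etc/sysconfig/selinux is a symlink to /etc/selinux/config. The real file is edited so that sed -i does not replace the symlink with a regular file.

```shell
#!/bin/sh
# Hypothetical helper: set SELINUX=disabled in the given config file.
# Defaults to /etc/selinux/config (assumed target of /etc/sysconfig/selinux).
disable_selinux() {
    conf="${1:-/etc/selinux/config}"
    # Rewrite only the SELINUX= line; SELINUXTYPE= is left alone.
    sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$conf"
}

# Usage (as root): disable_selinux && yum -y update && reboot
```

The change only takes effect after the reboot, which is why it pairs naturally with the update step.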

When the system comes up online again install vmware tools, then proceed with setting up zfs.
Since we need to compile zfs there are a number of packages to install. The following command should install all required dependencies in two shots:

yum -y groupinstall "Development Tools"
yum -y install wget zlib-devel e2fsprogs-devel libuuid-devel libblkid-devel bc lsscsi mdadm parted mailx

ZOL documentation only mentions the "Development Tools" dependency, but I found out that the others listed in the second command are required later in the build process. Mailx is not exactly a dependency for ZFS, but we will need it later to send periodic email reports on the ZFS pool status.

To build ZFS we need to download the source rpms first. At the time of this writing 0.6.0-rc14 is the latest release, so that is what the commands below use. Remember to update the version numbers/urls!

cd /root
mkdir zfs
cd zfs
wget http://archive.zfsonlinux.org/downloads/zfsonlinux/spl/spl-0.6.0-rc14.src.rpm
wget http://archive.zfsonlinux.org/downloads/zfsonlinux/spl/spl-modules-0.6.0-rc14.src.rpm
wget http://archive.zfsonlinux.org/downloads/zfsonlinux/zfs/zfs-0.6.0-rc14.src.rpm
wget http://archive.zfsonlinux.org/downloads/zfsonlinux/zfs/zfs-modules-0.6.0-rc14.src.rpm
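
Since the version string appears in every URL, it can be kept in one place. The sketch below is only an illustration: the function name is made up, and it assumes that other releases follow the same URL layout as the four wget commands above, which may not hold.

```shell
#!/bin/sh
# Hypothetical helper: print the four source RPM URLs for a given version.
# The URL layout is copied from the wget commands in this guide; it is an
# assumption that other releases follow the same pattern.
zol_srpm_urls() {
    ver="$1"
    base="http://archive.zfsonlinux.org/downloads/zfsonlinux"
    for pkg in spl spl-modules; do
        echo "$base/spl/$pkg-$ver.src.rpm"
    done
    for pkg in zfs zfs-modules; do
        echo "$base/zfs/$pkg-$ver.src.rpm"
    done
}

# Usage: zol_srpm_urls 0.6.0-rc14 | xargs -n 1 wget
```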

We can now build the binary packages with the following commands, YMMV:

rpmbuild --rebuild spl-0.6.0-rc14.src.rpm
rpm -ivh /root/rpmbuild/RPMS/x86_64/spl-0.6.0-rc14.el6.x86_64.rpm
rpmbuild --rebuild spl-modules-0.6.0-rc14.src.rpm
rpm -ivh /root/rpmbuild/RPMS/x86_64/spl-modules-*
rpmbuild --rebuild zfs-modules-0.6.0-rc14.src.rpm
rpmbuild --rebuild zfs-0.6.0-rc14.src.rpm
rpm -ivh /root/rpmbuild/RPMS/x86_64/zfs-*

If everything went well the zfs kernel module and utilities should have been installed. Let's try loading the kernel module:
modprobe zfs
If it loads correctly we should also be able to run the zpool/zfs commands, for example:
zpool list

which should report no available pools.

Have ZFS load on boot

As it stands, ZFS will not load automatically on boot, which means that your data will not be available. The following script takes care of loading the ZFS module; pools and filesystems will then be detected by the kernel module and mounted automatically.

cat > /etc/sysconfig/modules/zfs.modules <<EOF
#!/bin/bash
if [ ! -e /sys/module/zfs ] ; then
   modprobe zfs;
fi
EOF
chmod +x /etc/sysconfig/modules/zfs.modules

Restart the server and verify that ZFS is correctly loaded with:
lsmod | grep zfs

If it's there then we can start creating our first pool.

2. Creating a pool

To create a pool we first need to add a disk to the vm. I chose to hot-add a thin-provisioned 100GB drive. To activate the drive we need to issue a rescan command to the SCSI controller:

echo "- - -" > /sys/class/scsi_host/host0/scan
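
The command above only rescans host0. If the new disk hangs off a different controller it will not show up, so it can help to rescan every host. A minimal sketch follows; the function name and the optional directory argument are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: trigger a rescan on every SCSI host, not just host0.
# The optional argument overrides the sysfs directory (handy for dry runs).
rescan_scsi_hosts() {
    base="${1:-/sys/class/scsi_host}"
    for scan in "$base"/host*/scan; do
        [ -e "$scan" ] || continue   # the glob matched nothing
        echo "- - -" > "$scan"
    done
}

# Usage (as root): rescan_scsi_hosts
```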

The new disk should now be ready for use; inspect dmesg or list /dev/disk/by-id to confirm.
Supposing the disk id is scsi-36000c2978b3f413efb817a086ccfd31b the new pool can be created with the following command:

zpool create officedata1 scsi-36000c2978b3f413efb817a086ccfd31b

When the command returns, the pool will be automatically mounted under /officedata1, ready for use.
Let's now create the filesystem which we will share with Samba later on:
zfs create -o casesensitivity=mixed -o compression=lz4 officedata1/data1

I have enabled compression and also mixed case sensitivity support which is needed to correctly support Windows clients. Important: casesensitivity can only be set at fs creation time and cannot be changed later.

Install Samba and configure it to share /officedata1/data1 as usual, then copy some data to the share. You should be able to review compression stats with the following command:
[root@server ~]# zfs get all officedata1/data1 | grep compressratio
officedata1/data1  compressratio         1.28x                  -
officedata1/data1  refcompressratio      1.27x                  -

You can see that in my case compression yields a ratio of about 1.27x, which works out to roughly 21% less disk space (1 - 1/1.27). Not bad.
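
The compressratio property is the uncompressed size divided by the compressed size, so the fraction of space saved is 1 - 1/ratio. The small helper below (its name is made up for illustration) does the conversion; for 1.27x it prints 21.3.

```shell
#!/bin/sh
# Convert a compressratio value such as "1.27x" into the percentage of
# disk space saved: saved = (1 - 1/ratio) * 100.
ratio_to_saving() {
    ratio="${1%x}"    # strip the trailing "x" printed by zfs
    awk -v r="$ratio" 'BEGIN { printf "%.1f\n", (1 - 1 / r) * 100 }'
}

# Usage:
#   ratio_to_saving "$(zfs get -H -o value compressratio officedata1/data1)"
```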

3. Maintenance

ZFS gurus suggest that zfs pools be periodically scrubbed to detect data corruption before it's too late.
Scrubbing can be performed with a cron job like the following:

  0  5  *  *  0 root /sbin/zpool scrub officedata1 > /dev/null 2>&1

Note: the command returns immediately, and the scrub process continues in background.
While we are at it we also want to receive a monthly report on the pool status.
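
One possible way to send that report, sketched under a few assumptions: the script location (/etc/cron.monthly), the function name and the recipient address are all placeholders, and the mailx package installed earlier is used for delivery.

```shell
#!/bin/sh
# Hypothetical report script, e.g. saved as /etc/cron.monthly/zpool-report.
# The pool name comes from this guide; the recipient address is a placeholder.
send_pool_report() {
    pool="$1"
    rcpt="$2"
    # zpool status prints pool health, the last scrub result and
    # per-device error counters.
    zpool status "$pool" | mailx -s "ZFS pool status: $pool" "$rcpt"
}

# Usage: send_pool_report officedata1 admin@example.com
```

Scheduling the report shortly after a scrub makes it more useful, because the scrub outcome is part of the zpool status output.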

4. When all else fails

When for whatever reason you cannot mount your ZFS pool(s) on Linux and everything else has failed, a common last step before recovering from backups is to try an Illumos build. To be ready for this situation, download an unstable build of OmniOS, boot your virtual machine from it and make sure that you can access your zpools (import them, then export them when you're done). At this point I really hope you followed tip n.1 and used an LSI Logic adapter instead of the VMWare Paravirtual one, which is offered to you by default.

This might sound paranoid but booting and importing your pools from Solaris/Illumos is an often heard last-resort suggestion on the ZOL mailing list to regain access to otherwise lost data.
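
The drill itself boils down to three zpool commands run from the rescue system's shell. The following is a sketch rather than a tested recovery procedure; the function name is made up and the pool name is the one used in this guide.

```shell
#!/bin/sh
# Hypothetical drill, run from the rescue system's shell.
# -f is needed when the pool was last in use on another host and was
# not exported there.
pool_rescue_drill() {
    pool="$1"
    zpool import -f "$pool" &&
    zpool status "$pool" &&
    zpool export "$pool"
}

# Usage: pool_rescue_drill officedata1
```

Running zpool import with no arguments first lists the pools the rescue system can see, which is a quick way to confirm that the disks were detected at all.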

5. Daily (and possibly more frequent) snapshots

ZFS without snapshots is just not cool. To enable automated snapshotting features we will use the zfs auto snapshot script which can be found here. Install/copy the script on your system and make it executable.
To start taking daily snapshots with a seven days retention policy place a script in /etc/cron.daily with the following content (adjust the path to zfs-auto-snapshot.sh):

#!/bin/sh
exec /root/bin/zfs-auto-snapshot.sh --quiet --syslog --label=daily --keep=7 //
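
To make the retention policy concrete: with --keep=7 only the seven newest daily snapshots survive each run. The sketch below illustrates that selection. It is not the code of zfs-auto-snapshot.sh, the function name is made up, and the grep pattern in the usage line assumes that the label appears in the snapshot name.

```shell
#!/bin/sh
# Illustration only: this is not the code of zfs-auto-snapshot.sh.
# Reads snapshot names on stdin, oldest first, and prints those that fall
# outside a "keep the newest N" policy.
snapshots_to_prune() {
    keep="$1"
    awk -v keep="$keep" '
        { name[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print name[i] }
    '
}

# Usage (names sorted by creation time, oldest first):
#   zfs list -H -t snapshot -o name -s creation | grep '_daily-' | snapshots_to_prune 7
```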

Users should be able to access the snapshots directly through Samba by manually typing the (normally hidden) path:
\\server\data1\.zfs\snapshot
in Windows Explorer and autonomously retrieve previous file versions without bothering the sysadmin.

Update 16/10/2013: Samba can be told to access ZFS snapshots and expose them as Shadow Copies. Microsoft clients (XP requires an add-on, newer OSes support them natively) can then browse previous file versions directly from the properties tab of each file. Samba must be configured to use the vfs_shadow_copy2 module. This comment explains how.
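
As a rough idea of what such a configuration looks like, the helper below prints share-level settings for vfs_shadow_copy2. Treat it as a sketch under assumptions: the function name is made up, and the shadow:format value assumes snapshots named zfs-auto-snap_<label>-YYYY-MM-DD-HHMM, which you should verify against your own snapshot list.

```shell
#!/bin/sh
# Hypothetical helper: print share-level smb.conf settings for vfs_shadow_copy2.
# ASSUMPTION: snapshots are named zfs-auto-snap_<label>-YYYY-MM-DD-HHMM;
# check `zfs list -t snapshot` and adjust shadow:format if yours differ.
# shadow_copy2 expects the timestamps to be in UTC unless
# shadow:localtime = yes is also set.
shadow_copy_settings() {
    label="${1:-daily}"
    cat <<EOF
vfs objects = shadow_copy2
shadow:snapdir = .zfs/snapshot
shadow:sort = desc
shadow:format = zfs-auto-snap_${label}-%Y-%m-%d-%H%M
EOF
}

# Usage: shadow_copy_settings daily   (paste the output into the share section)
```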

6. Memory issues

ZFS does not use the Linux VM for caching but brings its own cache (the ARC), which runs on top of the Solaris Porting Layer (SPL, the other kernel module that must be installed with ZFS). This explains the issues that some users experience with metadata-heavy operations, like rsync.
Depending on the number of files and their size you might never hit memory issues; however, to err on the safe side I applied the following configuration to my systems:

  1. set vm.min_free_kbytes to 512MB (on a 36GB RAM system)
  2. limit ZFS ARC usage by imposing a lower limit than the default (1/2 of physical RAM). This link provides pretty good instructions on how to do it.
  3. last resort: if all else fails, schedule the following command to run regularly from crontab:

echo 1 > /proc/sys/vm/drop_caches
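
The first two settings can be sketched as follows. The ARC cap is set through the zfs_arc_max module parameter, which is expressed in bytes; the function name and the 4 GiB figure are only example values, not a recommendation.

```shell
#!/bin/sh
# Hypothetical helper: print a modprobe.d line capping the ARC at N GiB.
# zfs_arc_max is expressed in bytes.
arc_max_option() {
    gib="$1"
    echo "options zfs zfs_arc_max=$((gib * 1024 * 1024 * 1024))"
}

# Usage (as root; the 4 GiB cap is only an example value):
#   arc_max_option 4 > /etc/modprobe.d/zfs.conf   # applied when the module is next loaded
#   sysctl -w vm.min_free_kbytes=524288           # 512MB expressed in kB
```

To make the sysctl value survive a reboot it also has to go into /etc/sysctl.conf.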

7. Samba performance

You might find that Samba performs poorly under ZFS on Linux, especially while browsing directories. Throughput is generally good, but browsing directories (even small trees) can occasionally stall Windows Explorer.
The following settings improve Samba (and ZFS) performance in general:

zfs set xattr=sa tank/fish
zfs set atime=off tank/fish
The first one (source) tells ZFS to store extended attributes in the inodes instead of in a hidden directory, which, rather surprisingly, is the default! The improvement after applying it should be immediately visible. The second one disables atime updates, which you should always do on any filesystem. Replace tank/fish with your own dataset (officedata1/data1 in this guide).

Also apply the following modifications to smb.conf :

socket options = IPTOS_LOWDELAY TCP_NODELAY
max xmit = 65536
For those interested issue 1773 on github tracks Samba/ZoL performance problems.

8. Additional resources

Links to resources that provide in-depth explanation on ZFS:

Others:
http://www.matisse.net/bitcalc/ (to facilitate bit/bytes/kbytes conversions)
