Introduction to ZFS
1 Summary
This article gives a short introduction to ZFS: what it is, its main features, the most important terms (pools, VDEVs, ARC, L2ARC, ZIL), and the basic usage of the zpool and zfs command line tools.
2 What is ZFS?
ZFS is an alternative to traditional filesystems like UFS, ext2fs, etc. It is also a replacement for Logical Volume Managers; we will discuss the reasons later.
3 Feature Overview
ZFS uses 128-bit pointers to address all kinds of objects, be it data, files (inodes), volumes, filesystems, etc. It also checksums all data on your filesystem, not only metadata as traditional filesystems do.
- 128-bit pointers
- checksums on all objects
- uses transactions instead of logs
- all pool and filesystem configurations (including mountpoints) are stored within the pool, which makes it easy to hand a pool over to another system
4 Terms
4.1 Pools
ZFS Pools (zpools for short, or just pools) are groups of disks. A disk can be a whole physical disk (recommended), a slice (partition), or a disk image. A zpool uses so-called VDEVs (Virtual Devices) to create various RAID levels.
4.2 VDEV (Virtual Devices)
Virtual Devices are a way of organizing physical devices into logical groups that support various features.
VDEV | Explanation
---|---
disk | a regular block device; a whole disk or a slice
file | an image file; use it only for testing
mirror | a mirror of two or more devices
raidz (raidz1..3) | a variation of RAID5 that distributes data and parity across the set; raidz1 (or just raidz) uses single, raidz2 double, and raidz3 triple parity
spare | specifies hot-spare devices for this pool
log | specifies a device (or a mirror) for the ZFS Intent Log
cache | specifies one or more devices for the Level 2 ARC cache
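To see how these vdev types combine in practice, here is a minimal sketch of a zpool create invocation that uses several of them at once (the pool name and all device names are hypothetical):

```
# Hypothetical layout: three disks as a raidz1 data vdev, two SSDs
# as a mirrored log (ZIL), one SSD as cache (L2ARC), one hot spare.
zpool create tank raidz c0t0d0 c0t1d0 c0t2d0 \
    log mirror c1t0d0 c1t1d0 \
    cache c2t0d0 \
    spare c3t0d0
```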
4.3 ARC
The Adaptive Replacement Cache (ARC) is an in-memory cache that ZFS uses for both data and metadata. While traditional filesystem caches can be invalidated (and hence freed) at any given time, the ARC cannot be freed immediately. On illumos systems the ARC size is dynamic, and its maximum defaults to 75% of memory. The size of the ARC can be configured by tuning kernel settings, even on-line.
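As a sketch of such tuning on illumos (the 8 GiB value below is only an example, not a recommendation), the zfs_arc_max tunable can be set persistently in /etc/system, or patched on-line with mdb(1):

```
# In /etc/system - cap the ARC at 8 GiB after the next reboot:
set zfs:zfs_arc_max = 0x200000000

# On-line via the kernel debugger (use with care on production systems):
echo "zfs_arc_max/Z 0x200000000" | mdb -kw
```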
4.4 Cache (L2ARC)
If your memory is not large enough, you can configure an L2ARC - or cache - vdev, which provides a second-level read cache for data evicted from the ARC. This cache should not live on media of the same speed as your data, but on faster disks: for example, put a cache vdev on an SSD while your actual pool data sits on spinning disks.
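For example, a cache vdev can be added to an existing pool like this (the SSD device name is hypothetical):

```
# Add a (hypothetical) SSD as L2ARC cache device to the pool:
zpool add testpool cache c5t0d0
```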
4.5 ZFS Intent Log (ZIL)
All write operations on ZFS are committed to disk asynchronously, which increases write performance. For applications (like databases) that demand synchronous writes, these are satisfied by first writing into a special log called the ZFS Intent Log (ZIL). If no ZIL is explicitly configured, some disk space of the zpool will be used for it. The recommendation is to set up an explicit ZIL on mirrored SSDs (NVMe preferred at the time of this writing). Once the data has been committed to the ZIL, the synchronous write operation returns to the application, as the data is now persisted on disk. Later, an asynchronous IO job picks up the data and transfers it to the pool devices.
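Following that recommendation, a mirrored log vdev could be added to an existing pool as follows (device names again hypothetical):

```
# Add two (hypothetical) SSDs as a mirrored ZIL device:
zpool add testpool log mirror c5t0d0 c5t1d0
```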
5 Command line tools
Since ZFS is always consistent due to its transactional behaviour, there is no “fsck” for ZFS.
5.1 zpool
The zpool(1M) command is used to create and manage ZPools - the "Volume Manager" side of ZFS.
5.2 zfs
The zfs(1M) command configures all filesystem related objects, including snapshots, clones, etc.
5.3 zdb
The zdb(1M) command is the ZFS debugger; it does not modify data, but enables you to inspect on-disk and in-memory structures.
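As a small read-only example, zdb -C prints the configuration of a pool:

```
# Display the configuration of the pool (inspection only, no changes):
zdb -C testpool
```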
6 Basic Usage
6.1 Single disk pool
We will start with a very simple task: we want to set up a single disk called /dev/dsk/c4t2d0 as a zpool, name it "testpool", create a filesystem on it, and mount it at /testpool.
```
# zpool create testpool /dev/dsk/c4t2d0
```
That’s it.
Or, as a simple demo, we will use an image (created by mkfile(1) or dd(1)):
```
(600) x230:/root# mkfile 128M testfile
(601) x230:/root# zpool create testpool /root/testfile
(602) x230:/root# zpool list testpool
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
testpool   112M   516K   111M        -         -     2%     0%  1.00x  ONLINE  -
(603) x230:/root# zpool status testpool
  pool: testpool
 state: ONLINE
  scan: none requested
config:

        NAME              STATE     READ WRITE CKSUM
        testpool          ONLINE       0     0     0
          /root/testfile  ONLINE       0     0     0

errors: No known data errors
(604) x230:/root# df -h /testpool
Filesystem             Size   Used  Available Capacity  Mounted on
testpool              55.8M    23K      55.5M     1%    /testpool
(605) x230:/root#
```
6.2 Create a simple mirror
We want to create a pool similar to the one in the first example, but it should be resilient against disk failure; we prefer a mirror in this case. Again we use image files, this time of course two (or more!).
```
(605) x230:/root# mkfile 128M mirrfile1 mirrfile2
(607) x230:/root# zpool create mirrpool mirror /root/mirrfile1 /root/mirrfile2
(608) x230:/root# zpool list mirrpool
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
mirrpool   112M  90.5K   112M        -         -     2%     0%  1.00x  ONLINE  -
(609) x230:/root# zpool status mirrpool
  pool: mirrpool
 state: ONLINE
  scan: none requested
config:

        NAME                 STATE     READ WRITE CKSUM
        mirrpool             ONLINE       0     0     0
          mirror-0           ONLINE       0     0     0
            /root/mirrfile1  ONLINE       0     0     0
            /root/mirrfile2  ONLINE       0     0     0

errors: No known data errors
(610) x230:/root# df -h /mirrpool
Filesystem             Size   Used  Available Capacity  Mounted on
mirrpool              56.0M    23K      55.9M     1%    /mirrpool
```
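As a side note: should a two-way mirror later turn out to be not redundant enough, zpool attach can extend it. A sketch with a hypothetical third image file:

```
# Extend the mirror to three-way by attaching another (hypothetical) device:
mkfile 128M /root/mirrfile3
zpool attach mirrpool /root/mirrfile1 /root/mirrfile3
```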
6.3 Create an additional filesystem on an existing pool
A pool bundles filesystems, but so far we have not explicitly created a filesystem. When we set up a new pool, the root filesystem of this pool is implicitly created and mounted.
So you can create a new filesystem by simply issuing:
```
(612) x230:/root# zfs create testpool/samplefs
(613) x230:/root# df -h | grep samplefs
testpool/samplefs     56.0M    23K      55.8M     1%    /testpool/samplefs
```
Creating filesystems looks a bit like just creating directories - but of course they are complete filesystems with unique inode number ranges, their own size, etc.
But we can actually create complete structures of filesystems, just like mkdir -p creates sub-directory structures:
```
(614) x230:/root# zfs create -p testpool/struct/sub_a/sub_b/sub_c
(615) x230:/root# df -h | grep sub
testpool/struct/sub_a               56.0M    23K   55.7M   1%   /testpool/struct/sub_a
testpool/struct/sub_a/sub_b         56.0M    23K   55.7M   1%   /testpool/struct/sub_a/sub_b
testpool/struct/sub_a/sub_b/sub_c   56.0M    23K   55.7M   1%   /testpool/struct/sub_a/sub_b/sub_c
```
As we can see, a new filesystem has been created for every level, and all of them have been mounted. You can also see that they all report the same size and the same amount of free space. This is because we have not set a filesystem size - or quota, as ZFS calls it. Before we discuss how to change filesystem attributes, we will have a look at the zfs list and zfs get commands.
6.4 Inspecting your filesystems
6.4.1 Listing filesystems
If you run zfs list without any arguments, it will list all filesystems on your system.
```
(616) x230:/root# zfs list
NAME                                USED  AVAIL  REFER  MOUNTPOINT
mirrpool                             77K  55.9M    23K  /mirrpool
rpool                               124G   325G    33K  /rpool
rpool/ROOT                         37.9G   325G    23K  legacy
rpool/ROOT/openindiana-10          21.8M   325G  17.5G  /
[...]
rpool/swap                         8.34G   333G   175M  -
testpool                            250K  55.7M    23K  /testpool
testpool/samplefs                    23K  55.7M    23K  /testpool/samplefs
testpool/struct                      92K  55.7M    23K  /testpool/struct
testpool/struct/sub_a                69K  55.7M    23K  /testpool/struct/sub_a
testpool/struct/sub_a/sub_b          46K  55.7M    23K  /testpool/struct/sub_a/sub_b
testpool/struct/sub_a/sub_b/sub_c    23K  55.7M    23K  /testpool/struct/sub_a/sub_b/sub_c
```
You may specify a list of filesystems to list as arguments to zfs list.
```
(617) x230:/root# zfs list testpool/samplefs testpool/struct/sub_a
NAME                    USED  AVAIL  REFER  MOUNTPOINT
testpool/samplefs        23K  55.7M    23K  /testpool/samplefs
testpool/struct/sub_a    69K  55.7M    23K  /testpool/struct/sub_a
```
Or you can specify a recursive listing from any filesystem:
```
(618) x230:/root# zfs list -r testpool/struct/sub_a
NAME                                USED  AVAIL  REFER  MOUNTPOINT
testpool/struct/sub_a                69K  55.7M    23K  /testpool/struct/sub_a
testpool/struct/sub_a/sub_b          46K  55.7M    23K  /testpool/struct/sub_a/sub_b
testpool/struct/sub_a/sub_b/sub_c    23K  55.7M    23K  /testpool/struct/sub_a/sub_b/sub_c
```
In all cases the output has five columns:
- NAME: specifies the filesystem name
- USED: shows the "used" space of the filesystem
- AVAIL: shows the available space
- REFER: shows the space that is actually referenced by this dataset
- MOUNTPOINT: shows either the mountpoint, "legacy", "none", or a "-"
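If you only need some of these columns, zfs list can restrict its output with the -o option; a small example:

```
# Show only selected columns for all filesystems:
zfs list -o name,used,available,mountpoint
```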
6.4.2 Inspecting and setting attributes for a filesystem
We now know how to retrieve a list of filesystems with some very basic attributes.
But there are a whole lot more, which we can retrieve with the zfs get command. The basic syntax is zfs get attribute filesystem; a special attribute is "all", which will show us all attributes of the specified filesystem:
```
(619) x230:/root# zfs get all testpool/struct/sub_a
NAME                   PROPERTY              VALUE                   SOURCE
testpool/struct/sub_a  type                  filesystem              -
testpool/struct/sub_a  creation              Wed Dec 12 20:16 2018   -
testpool/struct/sub_a  used                  69K                     -
testpool/struct/sub_a  available             55.7M                   -
testpool/struct/sub_a  referenced            23K                     -
testpool/struct/sub_a  compressratio         1.00x                   -
testpool/struct/sub_a  mounted               yes                     -
testpool/struct/sub_a  quota                 none                    default
testpool/struct/sub_a  reservation           none                    default
testpool/struct/sub_a  recordsize            128K                    default
testpool/struct/sub_a  mountpoint            /testpool/struct/sub_a  default
testpool/struct/sub_a  sharenfs              off                     default
testpool/struct/sub_a  checksum              on                      default
testpool/struct/sub_a  compression           off                     default
testpool/struct/sub_a  atime                 on                      default
testpool/struct/sub_a  devices               on                      default
testpool/struct/sub_a  exec                  on                      default
testpool/struct/sub_a  setuid                on                      default
testpool/struct/sub_a  readonly              off                     default
testpool/struct/sub_a  zoned                 off                     default
testpool/struct/sub_a  snapdir               hidden                  default
testpool/struct/sub_a  aclmode               discard                 default
testpool/struct/sub_a  aclinherit            restricted              default
testpool/struct/sub_a  createtxg             250                     -
testpool/struct/sub_a  canmount              on                      default
testpool/struct/sub_a  xattr                 on                      default
testpool/struct/sub_a  copies                1                       default
testpool/struct/sub_a  version               5                       -
testpool/struct/sub_a  utf8only              off                     -
testpool/struct/sub_a  normalization         none                    -
testpool/struct/sub_a  casesensitivity       sensitive               -
testpool/struct/sub_a  vscan                 off                     default
testpool/struct/sub_a  nbmand                off                     default
testpool/struct/sub_a  sharesmb              off                     default
testpool/struct/sub_a  refquota              none                    default
testpool/struct/sub_a  refreservation        none                    default
testpool/struct/sub_a  guid                  15442038872638995515    -
testpool/struct/sub_a  primarycache          all                     default
testpool/struct/sub_a  secondarycache        all                     default
testpool/struct/sub_a  usedbysnapshots       0                       -
testpool/struct/sub_a  usedbydataset         23K                     -
testpool/struct/sub_a  usedbychildren        46K                     -
testpool/struct/sub_a  usedbyrefreservation  0                       -
testpool/struct/sub_a  logbias               latency                 default
testpool/struct/sub_a  dedup                 off                     default
testpool/struct/sub_a  mlslabel              none                    default
testpool/struct/sub_a  sync                  standard                default
testpool/struct/sub_a  refcompressratio      1.00x                   -
testpool/struct/sub_a  written               23K                     -
testpool/struct/sub_a  logicalused           34.5K                   -
testpool/struct/sub_a  logicalreferenced     11.5K                   -
testpool/struct/sub_a  filesystem_limit      none                    default
testpool/struct/sub_a  snapshot_limit        none                    default
testpool/struct/sub_a  filesystem_count      none                    default
testpool/struct/sub_a  snapshot_count        none                    default
testpool/struct/sub_a  redundant_metadata    all                     default
```
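Instead of "all", one or more comma-separated attributes can be requested directly, for example:

```
# Query only the quota and mountpoint attributes of one filesystem:
zfs get quota,mountpoint testpool/struct/sub_a
```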
As you can see in the full listing above, the "mountpoint" is an attribute, just like the "quota". Not all of these attributes can be set, but we will now set the quota and have a look at the effect.
```
(620) x230:/root# zfs set quota=32M testpool/struct/sub_a
(623) x230:/root# zfs list -r testpool/struct
NAME                                USED  AVAIL  REFER  MOUNTPOINT
testpool/struct                      92K  55.7M    23K  /testpool/struct
testpool/struct/sub_a                69K  31.9M    23K  /testpool/struct/sub_a
testpool/struct/sub_a/sub_b          46K  31.9M    23K  /testpool/struct/sub_a/sub_b
testpool/struct/sub_a/sub_b/sub_c    23K  31.9M    23K  /testpool/struct/sub_a/sub_b/sub_c
(624) x230:/root# df -h | grep struct
testpool/struct                     56.0M    23K   55.7M   1%   /testpool/struct
testpool/struct/sub_a                 32M    23K   31.9M   1%   /testpool/struct/sub_a
testpool/struct/sub_a/sub_b           32M    23K   31.9M   1%   /testpool/struct/sub_a/sub_b
testpool/struct/sub_a/sub_b/sub_c     32M    23K   31.9M   1%   /testpool/struct/sub_a/sub_b/sub_c
```
This filesystem and its children now share an upper quota of 32M in total.
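If the quota is no longer needed, it can be removed by setting it back to "none":

```
# Remove the quota again:
zfs set quota=none testpool/struct/sub_a
```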