FreeBSD and 65TB of ZFS

Fri 08 December 2017 by incin

SERVER

Chassis: Supermicro SuperStorage Server 6049P-E1CR36L - 36x SATA/SAS - LSI 3008 12G SAS - Dual 10-Gigabit Ethernet - 1200W Redundant
Processor: 2 x Intel Xeon Silver 4110, 8-core 2.10GHz, 11.00MB Cache (85W)
Memory: 12 x 16GB PC4-21300 2666MHz DDR4 ECC Registered DIMM
Storage Drives: 36 x 4.0TB SAS 3.0 12.0Gb/s 7200RPM 3.5" Hitachi Ultrastar 7K6000 (512e)
Controller Card: Supermicro AOC-S3008L-L8E SAS 3.0 12Gb/s 8-port Host Bus Adapter
Chipset: Intel C624 - Socket LGA3647
I/O Processor: LSI SAS3008

Install Notes

  1. Put FreeBSD 11.1 on a USB drive and went through the installer.

  2. From the partitioning screen selected ‘Auto (ZFS) Guided Root-on-ZFS’

  3. From the looks of it there is a new ZFS option to select a raid10 configuration along with all the others (raidz1, raidz2, mirror, etc.). I selected the new raid10 option.
      a. It took some time to probe all the devices, but after that I could see all 36 drives and the flash drive in the probe list.
      b. I selected all 36 drives to be in the mirrors and continued the installer.

  4. The install finished and I rebooted, unplugging the USB drive.

  5. During the boot up process it would halt on:

Photo

  6. I booted the server into the USB installer so I could get to a live shell.

  7. When in the live shell I could see all 36 drives and I could import the pool (a sketch of the import follows the gpart output below). “gpart show” listed all the drives with their partitioning (the data below is from a different server, but the output is the same):

=>        40  5860533088  da4  GPT  (2.7T)
          40        1024    1  freebsd-boot  (512K)
        1064         984       - free -  (492K)
        2048     4194304    2  freebsd-swap  (2.0G)
     4196352  5856336776    3  freebsd-zfs  (2.7T)
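
Importing the pool from the live environment is the standard zpool import dance; a minimal sketch of the usual approach (not a transcript of the exact commands run here):

    # List the pools the live system can see, then import zroot under an
    # alternate root so it does not collide with the live environment.
    zpool import
    zpool import -f -R /mnt zroot
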
  8. The main reason I wanted to use the raid10 option from the FreeBSD installer was that each of the 36 drives would have a boot partition, swap, and a partition I could put into ZFS. I found a note on the Internet yesterday that said ZFS can only put boot code on 12 drives, or maybe I am reading the article wrong, but here it is (a gpart bootcode sketch follows the screenshot below): https://blog.multiplay.co.uk/2012/01/zfs-io-error-all-block-copies-unavailable-on-large-disk-number-machines/

  9. Since I was having issues I thought about doing it the old-fashioned way. I booted the FreeBSD installer again and from the partitioning screen selected ‘mirror’. I selected da0 and da1 to be the mirror and installed FreeBSD.

  10. When the server rebooted I got to a screen where it was trying to boot root and failing, and then the server instantly rebooted. I was able to grab a screenshot before it fully started to reboot.

Photo
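
For reference, on a BIOS/legacy setup the boot blocks live in each drive's freebsd-boot partition (index 1 in the gpart output above) and get written roughly like this; whether gptzfsboot copes with boot code spread across 36 pool members is exactly the question the article linked above raises. A sketch of the standard procedure, not output from this server:

    # Write the protective MBR and gptzfsboot into partition 1 of a drive
    gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0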

  11. Some things to note here:
      a. The motherboard BIOS has an option in the boot menu for ‘Legacy, Dual, or UEFI’; trying the different options made no difference when trying to install FreeBSD.
      b. The LSI card was set to boot in ‘BIOS and OS mode’; I did try OS only with no difference.
      c. In the firmware of the LSI card it shows two enclosures, one for the 24 front disks and one for the 12 rear disks.

  12. The SuperMicro chassis we purchased also has two 2.5in drive bays separate from the original 36 bays. My thought was to try installing FreeBSD to these 2 drives as the boot drives, so I did for testing.
      a. During the FreeBSD install I selected ‘mirror’ and installed FreeBSD to the 2 SSDs.

  13. FreeBSD installed just fine and I could log in:

# zpool list
    zroot     109G  1.67G   107G         -     0%     1%  1.00x  ONLINE  -

# zpool status
  pool: zroot
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    zroot       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        ada0p3  ONLINE       0     0     0
        ada1p3  ONLINE       0     0     0

=>       40  234441568  ada0  GPT  (112G)
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048    4194304     2  freebsd-swap  (2.0G)
    4196352  230244352     3  freebsd-zfs  (110G)
  234440704        904        - free -  (452K)

=>       40  234441568  ada1  GPT  (112G)
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048    4194304     2  freebsd-swap  (2.0G)
    4196352  230244352     3  freebsd-zfs  (110G)
  234440704        904        - free -  (452K)
diskinfo -v ada0
ada0
    512             # sectorsize
    120034123776    # mediasize in bytes (112G)
    234441648       # mediasize in sectors
    4096            # stripesize
    0               # stripeoffset
    232581          # Cylinders according to firmware.
    16              # Heads according to firmware.
    63              # Sectors according to firmware.
    S21TNXAG703407K    # Disk ident.
    Not_Zoned       # Zone Mode
#zdb
zroot:
    version: 5000
    name: 'zroot'
    state: 0
    txg: 101
    pool_guid: 5556269825277037670
    hostid: 3026903576
    hostname: ''
    com.delphix:has_per_vdev_zaps
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 5556269825277037670
        create_txg: 4
        children[0]:
            type: 'mirror'
            id: 0
            guid: 12760901219743727848
            metaslab_array: 38
            metaslab_shift: 30
            ashift: 12
            asize: 117880389632
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 35
            children[0]:
                type: 'disk'
                id: 0
                guid: 7583594044857238745
                path: '/dev/ada0p3'
                whole_disk: 1
                create_txg: 4
                com.delphix:vdev_zap_leaf: 36
            children[1]:
                type: 'disk'
                id: 1
                guid: 3550890253934674866
                path: '/dev/ada1p3'
                whole_disk: 1
                create_txg: 4
                com.delphix:vdev_zap_leaf: 37
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
  14. Now I can play with the 36 disks in a different pool. A few notes here:
      a. I didn't want ZFS to use the entire disk (giving it da0, da1, etc.), because I'm worried about replacement drives not being exactly the same size across vendors.
      b. Partitions are easier to play with and I feel like I have more control over my disks.

  15. 36 disk mirrors (the two gpart commands below are shorthand, run once per drive; see the loop sketch that follows them):

    gpart create -s GPT da0-da35

    gpart add -t freebsd-zfs -s 3700G da0-da35
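
Since gpart does not actually accept a device range, the two commands above are shorthand for running them once per drive; a minimal sh sketch that does the same thing across all 36 disks (assuming the data disks really are da0 through da35 - check camcontrol devlist first) would be:

    #!/bin/sh
    # Label each data disk GPT and add a single 3700G freebsd-zfs partition,
    # leaving the rest of the drive free as replacement-size headroom.
    i=0
    while [ "$i" -le 35 ]; do
        gpart create -s GPT "da${i}"
        gpart add -t freebsd-zfs -s 3700G "da${i}"
        i=$((i + 1))
    done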

    zpool create storage mirror da0p1 da1p1 mirror da2p1 da3p1 mirror da4p1 da5p1 mirror da6p1 da7p1 mirror da8p1 da9p1 mirror da10p1 da11p1 mirror da12p1 da13p1 mirror da14p1 da15p1 mirror da16p1 da17p1 mirror da18p1 da19p1 mirror da20p1 da21p1 mirror da22p1 da23p1 mirror da24p1 da25p1 mirror da26p1 da27p1 mirror da28p1 da29p1 mirror da30p1 da31p1 mirror da32p1 da33p1 mirror da34p1 da35p1
pool: storage
 state: ONLINE
  scan: none requested
config:

    NAME         STATE     READ WRITE CKSUM
    storage      ONLINE       0     0     0
      mirror-0   ONLINE       0     0     0
        da0p1    ONLINE       0     0     0
        da1p1    ONLINE       0     0     0
      mirror-1   ONLINE       0     0     0
        da2p1    ONLINE       0     0     0
        da3p1    ONLINE       0     0     0
      mirror-2   ONLINE       0     0     0
        da4p1    ONLINE       0     0     0
        da5p1    ONLINE       0     0     0
      mirror-3   ONLINE       0     0     0
        da6p1    ONLINE       0     0     0
        da7p1    ONLINE       0     0     0
      mirror-4   ONLINE       0     0     0
        da8p1    ONLINE       0     0     0
        da9p1    ONLINE       0     0     0
      mirror-5   ONLINE       0     0     0
        da10p1   ONLINE       0     0     0
        da11p1   ONLINE       0     0     0
      mirror-6   ONLINE       0     0     0
        da12p1   ONLINE       0     0     0
        da13p1   ONLINE       0     0     0
      mirror-7   ONLINE       0     0     0
        da14p1   ONLINE       0     0     0
        da15p1   ONLINE       0     0     0
      mirror-8   ONLINE       0     0     0
        da16p1   ONLINE       0     0     0
        da17p1   ONLINE       0     0     0
      mirror-9   ONLINE       0     0     0
        da18p1   ONLINE       0     0     0
        da19p1   ONLINE       0     0     0
      mirror-10  ONLINE       0     0     0
        da20p1   ONLINE       0     0     0
        da21p1   ONLINE       0     0     0
      mirror-11  ONLINE       0     0     0
        da22p1   ONLINE       0     0     0
        da23p1   ONLINE       0     0     0
      mirror-12  ONLINE       0     0     0
        da24p1   ONLINE       0     0     0
        da25p1   ONLINE       0     0     0
      mirror-13  ONLINE       0     0     0
        da26p1   ONLINE       0     0     0
        da27p1   ONLINE       0     0     0
      mirror-14  ONLINE       0     0     0
        da28p1   ONLINE       0     0     0
        da29p1   ONLINE       0     0     0
      mirror-15  ONLINE       0     0     0
        da30p1   ONLINE       0     0     0
        da31p1   ONLINE       0     0     0
      mirror-16  ONLINE       0     0     0
        da32p1   ONLINE       0     0     0
        da33p1   ONLINE       0     0     0
      mirror-17  ONLINE       0     0     0
        da34p1   ONLINE       0     0     0
        da35p1   ONLINE       0     0     0

# zpool list
storage  64.7T  1.94T  62.7T         -     1%     2%  1.00x  ONLINE  -

Disk layout for all 36 disks is as follows. The 3700G freebsd-zfs partition leaves about 26G unused at the end of each drive as headroom in case a replacement disk comes up slightly smaller, and 18 mirrored pairs of these partitions is where the roughly 65T of pool capacity comes from:

=>        40  7814037088  da0  GPT  (3.6T)
          40  7759462400    1  freebsd-zfs  (3.6T)
  7759462440    54574688       - free -  (26G)
--- Part of the zdb output 
storage:
    version: 5000
    name: 'storage'
    state: 0
    txg: 5
    pool_guid: 2885632977191828883
    hostid: 302690357
    com.delphix:has_per_vdev_zaps
    vdev_children: 18
    vdev_tree:
        type: 'root'
        id: 0
        guid: 2885632977191828883
        create_txg: 4
        children[0]:
            type: 'mirror'
            id: 0
            guid: 3273745293975073811
            metaslab_array: 108
            metaslab_shift: 35
            ashift: 12
            asize: 3972840030208
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 35
            children[0]:
                type: 'disk'
                id: 0
                guid: 13368448578724005097
                path: '/dev/da0p1'
                phys_path: 'id1,enc@n500304801867787d/type@0/slot@1/elmdesc@Slot00'
                whole_disk: 1
                create_txg: 4
                com.delphix:vdev_zap_leaf: 36
            children[1]:
                type: 'disk'
                id: 1
                guid: 13789828341468859637
                path: '/dev/da1p1'
                phys_path: 'id1,enc@n500304801867787d/type@0/slot@2/elmdesc@Slot01'
                whole_disk: 1
                create_txg: 4
                com.delphix:vdev_zap_leaf: 37
diskinfo -v da0-da35
da0
    512             # sectorsize
    4000787030016    # mediasize in bytes (3.6T)
    7814037168      # mediasize in sectors
    4096            # stripesize
    0               # stripeoffset
    486401          # Cylinders according to firmware.
    255             # Heads according to firmware.
    63              # Sectors according to firmware.
    K4KR2YRL        # Disk ident.
    id1,enc@n500304801867787d/type@0/slot@1/elmdesc@Slot00    # Physical path
    Not_Zoned       # Zone Mode
  16. On Monday (a sketch of the matching zpool add commands is at the end of this post):
      a. A Samsung NVMe stick will be added to the motherboard (since there are two dedicated slots) for a cache device.
      b. Two Intel SSDs will be used for the SLOG in a mirror: https://www.newegg.com/Product/Product.aspx?Item=9SIA8PV5VV1499

  17. Looking on the web, it seems like SLC SSDs are not produced any more the way they once were, and I wanted to use SLC drives for the SLOG. When trying to figure out what drives to use for the SLOG, the enterprise Intel ones listed above were interesting since they have ‘Enhanced Power Loss Data Protection’, which means that in the event of a power failure the drive flushes whatever is in its RAM to flash. Most of Intel's products are MLC now.

  18. Things to consider:
      a. Do I need to add a swap partition to each of the 36 drives? Does this benefit me?
      b. Are the Intel SSDs from NewEgg the right way to go for a SLOG? Should they not be on a PCI slot, in case one fails? Then again, this server can be shut down if needed.
      c. There are pros and cons to having boot only on two drives, but should we really try to get the 36 drives booting without the 2 SSDs?
      d. Maybe I don't know what the hell I'm doing, but this is why I am testing. :)

  19. Testing: power failure while writing data over NFS shares.
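
Once the cache and log devices are in, attaching them to the pool should look roughly like the following. This is a sketch: nvd0, da36, and da37 are placeholder device names for the NVMe stick and the two Intel SSDs, and in practice they would be partitioned first just like the data disks.

    # Mirrored SLOG on the two Intel SSDs (placeholder device names)
    zpool add storage log mirror da36p1 da37p1

    # L2ARC cache device on the Samsung NVMe stick (placeholder name)
    zpool add storage cache nvd0p1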