Terraforming NixOS hosts

I’ve made a provider to deploy NixOS hosts with Terraform.

Here is a list of features it supports at the moment:

  • configuration deployment
  • secrets deployment
  • SSH bastions
  • provider, Nix, and SSH settings overridable on a per-host basis
  • host address prioritization

I’ll update this article as new features become available.

Before we begin…

Requirements:

  • NixOS
  • Nix version >= 2.10.3 (lower may also work, but that’s what I use)
  • QEMU (to run example deployment)

This article is based on terraform-provider-nixos >= 0.0.14.

Why

Why make a new Terraform provider? There are already:

Because:

  • there was nothing I was comfortable with
  • I love to make things

I’ve used Tweag’s HCL module in production for some time. Some cons I’ve noticed during use:

  • bad support for SSH bastions
  • poor secrets support
  • it is a HCL module with all HCL limitations

I haven’t used the provider from Andrew Chambers (I discovered it in the middle of making my own). From what I found:

  • no support for SSH bastions
  • no support for secrets

I want more control over the configuration of the provider, how it connects to the servers, and what known hosts file it uses (preferably on a per-host basis). Almost every time I have deployed something into the cloud I have needed support for bastion SSH servers, but not many tools implement this in a controllable and hackable way.

Also, secret management schemes are a thing in NixOS because every user on a system can read /nix/store. I am not a big fan of solving this problem with encryption (I have done some crunching around this problem in the past). There is no secrets support (or it is very poor) in the mentioned providers.

Why not just use NixOps?

NixOps is OK. I am not a fan of Terraform (actually I hate Terraform and Hashicorp with a passion!), but it has so many integrations. So… it is worth having a way to deploy NixOS configurations with it.

Installation

There are two ways to install it:

  • with Nix
  • with Terraform

This article is about NixOS, so I will show how to install the provider with Nix first.
Installing with Nix is optional, you can skip to initialization.
Also, it is good to know that Hashicorp has blocked access from some countries because of government sanctions; installing with Nix is a way to bypass such nonsense.

I have a release.nix in the repo which is updated with every new version. It shows how to build Terraform with the NixOS provider.

Let’s write a shell.nix which will provide Terraform with a predefined set of providers.

let
  inherit (pkgs)
    stdenv
  ;

  nixpkgs = <nixpkgs>;
  config = {};
  pkgs = import nixpkgs { inherit config; };

  # mkProvider builds a provider from source (here: corpix/terraform-provider-nixos)
  mkProvider = pkgs.terraform_1.plugins.mkProvider;
  terraform = pkgs.terraform_1.withPlugins (p: [
    p.vultr
    p.linode
    (mkProvider rec {
      owner = "corpix";
      repo = "terraform-provider-nixos";
      rev = "0.0.14";
      version = rev;
      sha256 = "sha256-4QATev3WtwpEwc4/+JjOBfvUVzUre15VZT7tXLkSrXM=";
      vendorSha256 = null;
      provider-source-address = "registry.terraform.io/corpix/nixos";
    })
  ]);
in stdenv.mkDerivation {
  name = "nix-shell";
  buildInputs = with pkgs; [
    terraform
  ];
}

By the way, other providers can be added to the list, just like I did with vultr and linode in the example. Available providers can be listed using the REPL:

 λ  nix repl '<nixpkgs>'
nix-repl> pkgs.terraform-providers.<TAB>
...

Then issue a nix-shell in the directory where shell.nix is stored and boom! Terraform with preinstalled providers is available in the shell.

This works by using a directory as a Terraform registry, so providers are built from source rather than downloaded from the Terraform registry.

Initialization

But providers still need initialization. And here we are, moving on to the second point of the list: installing providers with Terraform itself.

At this step an HCL file is required; the name is not important, but let’s call it main.tf:

Versions are optional; the latest will be installed by default.

terraform {
  required_providers {
    nixos = {
      source = "corpix/nixos"
      # version = "0.0.14"
    }
    vultr = {
      source = "vultr/vultr"
    }
    linode = {
      source = "linode/linode"
    }
  }
}

Issue a terraform init in the shell:

I’ve cut the output to make it shorter.

 λ  terraform init
Initializing provider plugins...
- Finding latest version of corpix/nixos...
- Finding latest version of vultr/vultr...
- Finding latest version of linode/linode...
- Installing corpix/nixos v0.0.14...
- Installed corpix/nixos v0.0.14
- Installing vultr/vultr v2.11.3...
- Installed vultr/vultr v2.11.3
- Installing linode/linode v1.28.1...
- Installed linode/linode v1.28.1

Terraform has been successfully initialized!

It will create:

Updating providers that were installed using Nix will require deleting these artifacts and re-running terraform init.

  • .terraform.lock.hcl lock file with provider hashes
  • .terraform directory where provider executables are stored

Configuration deployment

We have the provider installed. It is time to write some configuration and deploy it.

Save this example configuration into configuration.nix, replacing my SSH public key with yours. It runs an SSH server with the root user password disabled (so you can’t log in using a TTY).

{ pkgs, lib, ... }: {
  imports = [
    <nixpkgs/nixos/modules/profiles/qemu-guest.nix>
  ];

  config = {
    users = rec {
      mutableUsers = false;
      extraUsers.root = {
        isNormalUser = false;
        hashedPassword = users.root.hashedPassword;
        openssh.authorizedKeys.keys = [
          "ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBACa4D4ycVdMtyIt1WUeoG3S/cdCARlyffhn6LsogFLHURvKtoMVV4cgZBrexju4SjpO/nAlHio8y8T1U0nV5WKDJAAIH0PhPt79HWQOi6HB4d/7UUncMndktyVYar0Mneir/Ci2yQEVmq6vYKKPTuwVynCB2r6yG1IzD1rhFEAG5OUeSg=="
        ];
      };
      users.root.hashedPassword = "!";
    };

    services = {
      openssh.enable = true;
      openssh.passwordAuthentication = false;
      # haveged.enable = true;
    };

    i18n.defaultLocale = "en_US.UTF-8";
    time.timeZone = "UTC";

    # NOTE: just to build faster & use less space
    documentation.nixos.enable = false;
    documentation.man.man-db.enable = false;

    fileSystems."/" = {
      device = "/dev/disk/by-label/nixos";
      autoResize = true;
      fsType = "ext4";
    };

    # NOTE: these are QEMU-specific settings
    boot.growPartition = true;
    boot.kernelParams = ["console=ttyS0"];
    boot.loader.grub.device = "/dev/sda";
    boot.loader.timeout = 0;
  };
}

QEMU will be used to run a virtual machine; I assume the reader has it installed on the system. Before running the VM a disk image should be built, and this is where the nixos-generate tool helps. Run a shell to get it and build an image:

 λ  nix-shell -p nixos-generators
 λ  nixos-generate -f qcow -c configuration.nix
...
/nix/store/f0wwhg3vh1h6n06913hd4h763w3nzz5m-nixos-disk-image/nixos.qcow2

It prints the path to the disk image before exiting. Copy it to the working directory and chmod it, because it will be read-only after copying from the Nix store:

 λ  cp /nix/store/f0wwhg3vh1h6n06913hd4h763w3nzz5m-nixos-disk-image/nixos.qcow2 ./
 λ  chmod 644 nixos.qcow2

On BTRFS it could be a good idea to disable CoW using chattr +C nixos.qcow2.

Dispatch, we are ready to launch! Run QEMU using this command, which will pass through the SSH port from the VM, making it available at 2222/tcp on the host machine:

 λ  qemu-kvm -boot d -m 2048 -net nic -net user,hostfwd=tcp::2222-:22 -hda ./nixos.qcow2

If everything went smoothly, here is how the QEMU window will look:

To enter the VM shell, issue:

  λ  ssh -p 2222 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@127.0.0.1
[root@nixos:~]#

The next step is to describe an instance, letting Terraform know how to connect to the host and what configuration to apply. Time to append some stuff to the main.tf which was created previously.

I need to mention: disabling userKnownHostsFile & strictHostKeyChecking is just for testing purposes; it is insecure to use these settings for real deployments.

ssh could be defined globally or per-instance.

resource "nixos_instance" "vm" {
  address = ["127.0.0.1"]
  configuration = "./configuration.nix"

  ssh {
    port = 2222
    config = {
      userKnownHostsFile = "/dev/null"
      strictHostKeyChecking = "no"
    }
  }
}
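
As noted above, the same ssh block could instead be defined at the provider level so every instance inherits it; a minimal sketch:

provider "nixos" {
  ssh {
    port = 2222
    config = {
      userKnownHostsFile    = "/dev/null"
      strictHostKeyChecking = "no"
    }
  }
}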

Ready to apply! Run Terraform, which will show the plan and ask whether the changes are expected or not.

Other things may be asked as well, for example: if userKnownHostsFile & strictHostKeyChecking are enabled, the SSH client would ask to approve the host key fingerprint interactively.

Terraform may be started with the environment variable TF_LOG set to INFO; this will produce a lot of noise with details about the Nix derivation build progress and upload (there is no clean way to print something to the screen from a Terraform provider).

  λ  terraform apply -target nixos_instance.vm
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with
the following symbols:
  + create

Terraform will perform the following actions:

  # nixos_instance.vm will be created
  + resource "nixos_instance" "vm" {
      + address            = [
          + "127.0.0.1",
        ]
      + configuration      = "./configuration.nix"
      + id                 = (known after apply)
      + secret_fingerprint = (known after apply)
      + settings           = jsonencode({})
      + system             = "x86_64-linux"

      + derivations {
          + outputs = (known after apply)
          + path    = (known after apply)
        }

      + ssh {
          + config = {
              + "strictHostKeyChecking" = "no"
              + "userKnownHostsFile"    = "/dev/null"
            }
          + port   = 2222
          + user   = "root"
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

After typing yes and pressing Enter it should say something like:

nixos_instance.vm: Creating...
nixos_instance.vm: Creation complete after 7s [id=e64be02d-a4b6-7a0a-1cb2-23cc3cfab449]

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Prove the VM has a new generation deployed and switched to:

[root@nixos:~]# nix-env --list-generations --profile /nix/var/nix/profiles/system
   1   2022-08-10 12:07:01
   2   2022-08-10 12:13:36   (current)

We deployed our first configuration with Terraform.

Secrets handling

The provider has support for provisioning secrets onto hosts. It supports reading secrets from various places, but by default it uses the local filesystem.

Here is how to transfer a secret from the filesystem to the vm instance (more settings):

secret could be defined globally or per-instance.

resource "nixos_instance" "vm" {
  # ...

  secret {
    source = "./secrets/key"
    destination = "/root/secrets/key"
  }
}

When using a different secret provider, source is used as an identifier
for the secret that the concrete provider can retrieve.
For example, there may be a secret example.com/key in GoPass, which will
have source = "example.com/key" if the NixOS provider is configured to use GoPass as the secret provider.

This will upload the ./secrets/key file to the host at the path /root/secrets/key. If /root/secrets does not exist, it will be created.

The NixOS provider uses tar to deliver secrets via SSH; parent directory creation is handled transparently.

To specify multiple secrets, just repeat the secret block multiple times:

resource "nixos_instance" "vm" {
  # ...

  secret {
    source = "./secrets/key"
    destination = "/root/secrets/key"
  }
  secret {
    source = "./secrets/another"
    destination = "/root/secrets/anotherkey"
  }
}

In addition to source and destination, access information can be specified (see the example after this list):

  • group name of the group the file should belong to (root)
  • owner name of the user the file should belong to (root)
  • permissions octal permission representation (600 = rw- --- ---)
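
For example, a key readable only by root might look like this (a sketch; the values shown match the defaults that appear in the plan output later in this section):

resource "nixos_instance" "vm" {
  # ...

  secret {
    source      = "./secrets/key"
    destination = "/root/secrets/key"
    owner       = "root"
    group       = "root"
    permissions = 600
  }
}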

Known problem:
if the group/owner does not exist at the moment of secrets provisioning,
which happens before derivation deployment,
then ownership will fall back to root/root.

I have some thoughts, but no solution at the moment.

Let’s try to deploy a sample secret. This is how the NixOS instance definition inside main.tf should look now:

resource "nixos_instance" "vm" {
  address = ["127.0.0.1"]
  configuration = "./configuration.nix"

  ssh {
    port = 2222
    config = {
      userKnownHostsFile = "/dev/null"
      strictHostKeyChecking = "no"
    }
  }

  secret {
    source = "./secrets/key"
    destination = "/root/secrets/key"
  }
}

Create a sample secret with these commands:

  λ  mkdir -p secrets
  λ  echo "hello world" > secrets/key

Apply configuration:

  λ  terraform apply -target nixos_instance.vm
nixos_instance.vm: Refreshing state... [id=e64be02d-a4b6-7a0a-1cb2-23cc3cfab449]

Terraform will perform the following actions:

  # nixos_instance.vm will be updated in-place
  ~ resource "nixos_instance" "vm" {
        id                 = "e64be02d-a4b6-7a0a-1cb2-23cc3cfab449"
      ~ secret_fingerprint = {
          - "kdf_iterations" = "35"
          - "salt"           = "acd73fe9592089ec514626720afe7f29949ad38394fd934c149c6b2b8f3faa53"
          - "sum"            = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
        } -> (known after apply)
        # (4 unchanged attributes hidden)

      + secret {
          + destination = "/root/secrets/key"
          + group       = "root"
          + owner       = "root"
          + permissions = 600
          + source      = "./secrets/key"
        }

        # (2 unchanged blocks hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

...

nixos_instance.vm: Modifying... [id=e64be02d-a4b6-7a0a-1cb2-23cc3cfab449]
nixos_instance.vm: Modifications complete after 9s [id=e64be02d-a4b6-7a0a-1cb2-23cc3cfab449]

Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

The NixOS provider maintains a salted fingerprint of all secret contents.
This is done to speed up deployment and save some traffic:
there is no need to transfer secrets if they haven’t changed.

Prove the secret really reached its destination:

  λ  ssh -p 2222 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@127.0.0.1
[root@nixos:~]# ls -la /root/secrets/
total 12
drwxr-xr-x 2 root root 4096 Aug 10 20:28 .
drwx------ 4 root root 4096 Aug 10 20:28 ..
-rw------- 1 root root   12 Aug 10 20:28 key

[root@nixos:~]# cat /root/secrets/key
hello world

Other secret stores can be used (for example GoPass). See the available settings.

Here is an example which transfers the secret identified by the path secrets/key from GoPass to the vm instance:

secrets setting may be defined globally or per-instance.

resource "nixos_instance" "vm" {
  # ...

  secret {
    source = "secrets/key"
    destination = "/root/secrets/key"
  }
  secrets {
    provider = "gopass"
    gopass = {
      store = "./secrets"
    }
  }
}
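
Since the secrets block may also be defined globally, the same GoPass configuration could live at the provider level so all instances share it; a sketch:

provider "nixos" {
  secrets {
    provider = "gopass"
    gopass = {
      store = "./secrets"
    }
  }
}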

Bastion

Sometimes called "jump host".

Bastion servers are edge servers responsible for access control to network resources, in this case SSH: internal hosts are not exposed to the wild internet, only the bastion is. People connect to the bastion and from there connect to hosts inside the network.

Many companies use bastions to control and log access, so support for them is a “must have” for a good tool.

Bastion settings (the bastion schema) can be defined:

  • globally, for the whole provider
  • locally, for a single instance
  • mixed, overriding global bastion settings for a single instance

These semantics are shared by most NixOS provider settings; it should become clearer as you read, and there is a separate section just about this.

The bastion section supports the same keys as ssh, extending them with a host key which contains the remote address of the bastion server:

bastion could be defined globally or per-instance.

  bastion {
    host = "127.0.0.1"
    port = 2222
  }
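
Since it accepts the same keys as ssh, a more complete bastion block could look like this (a sketch reusing the keys from the ssh examples above):

  bastion {
    host = "127.0.0.1"
    port = 2222
    user = "root"
    config = {
      userKnownHostsFile    = "/dev/null"
      strictHostKeyChecking = "no"
    }
  }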

For demonstration purposes the NixOS VM (which was deployed previously) will be used as a bastion to deploy a configuration onto itself. These settings tell the NixOS provider to connect to 127.0.0.1:2222, which will forward the SSH connection to 127.0.0.1:22, the SSH port on the VM.

But first, let’s prove the vm instance works as an SSH bastion (forwarding connections to itself). Turn on the DEBUG1 logging level for sshd:

# ...
    services = {
      # ...
      openssh.logLevel = "DEBUG1";
    };
# ...

This will make sshd write connection forwarding information to the log. Run:

  λ  terraform apply -target nixos_instance.vm
...

Now change instance settings in main.tf to look like this:

The bastion will inherit settings defined for ssh under the config key.

resource "nixos_instance" "vm" {
  address = ["127.0.0.1"]
  configuration = "./configuration.nix"

  ssh {
    port = 22
    config = {
      userKnownHostsFile = "/dev/null"
      strictHostKeyChecking = "no"
    }
  }

  bastion {
    host = "127.0.0.1"
    port = 2222
  }

  secret {
    source = "./secrets/key"
    destination = "/root/secrets/key"
  }
}

Then do another configuration apply:

  λ  terraform apply -target nixos_instance.vm
...
│ Error: subcommand "/run/current-system/sw/bin/ssh -F /run/user/1000/ssh_config.356167380 127.0.0.1 tar -x -C /" exited with: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
│ @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
│ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
│ IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
│ Someone could be eavesdropping on you right now (man-in-the-middle attack)!
│ It is also possible that a host key has just been changed.
...

I should say it out loud: this apply will fail because of the fragile nature of Terraform and the poor engineering of its SDK.
I don’t know how to solve this; if you do, please let me know, file an issue. Thanks!

Here is why it is failing. This part of the plan diff causes an error:

It deletes the ssh block and then re-adds it. This makes the provider act as if the ssh block does not exist at all during terraform apply, so it uses the defaults. And the default is StrictHostKeyChecking=yes. This is why it fails.

This bug is not new. Nested data structures are a huge pain in the Terraform SDK, and one of the many reasons why I hate Terraform. I’ve wasted an unacceptable amount of time debugging this and would need quadruple (?) that to find out how to fix it (the whole SDK is one big chunk of what we call «over-engineering»).

A second apply should make everything consistent. If Terraform says everything is up to date, then change configuration.nix, for instance uncomment haveged. After the apply, the command journalctl -u sshd.service --follow should print the following lines when the bastion is used:

I’ve highlighted the lines of interest; pay attention to debug1: server_request_direct_tcpip.

Address matching and filtering

Sometimes hosts have multiple addresses, for instance IPv4 & IPv6. In this case priorities for address families or subnets can be defined.

Here is an example where IPv6 addresses take precedence over IPv4 addresses; in this example ::1 will be used to connect to the vm instance:

address_priority may be defined only globally.

provider "nixos" {
  # ...
  address_priority = {
    "0.0.0.0/0" = 0,
    "::/0"      = 1,
  }
}

resource "nixos_instance" "vm" {
  address = [
    "127.0.0.1",
    "::1",
  ]
  # ...
}

Larger numbers raise the priority of a subnet, moving matched addresses closer to the start of the instance address list, so those addresses will be used when connecting to the host with SSH.

This gives a reordering capability on top of the set of addresses. Despite nearly 40% IPv6 adoption, people sometimes struggle with IPv6 misconfiguration at the ISP level, etc. What if somebody wants to use just IPv4 or IPv6 to connect to the hosts and does not care about priorities?

There is an address_filter setting which filters the addresses used by instances with CIDRs; in this example the known vm addresses will contain just 127.0.0.1:

address_filter may be defined only globally.

provider "nixos" {
  # ...
  address_filter = ["0.0.0.0/0"]
}

resource "nixos_instance" "vm" {
  address = [
    "127.0.0.1",
    "::1",
  ]
  # ...
}

Multiple filters can be defined. In this case they use or semantics, which basically means: an address must match at least one CIDR to be present in the instance address list.
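
For example, to allow only loopback IPv4 and a private subnet (hypothetical CIDRs, just to illustrate the or semantics):

provider "nixos" {
  # ...
  address_filter = [
    "127.0.0.0/8",
    "10.0.0.0/8",
  ]
}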

Filters defined in address_filter are applied before sorting with the address_priority data.

Retries

The network may not be reliable, so it is good to have a way to retry a broken connection.

Here is an example which tells the provider to retry the SSH connection 3 times with a delay of 1 second:

retry and retry_wait could be defined only globally.

provider "nixos" {
  # ...
  retry = 3
  retry_wait = 1
}

If the retry limit is exceeded, the apply will fail with an error.

Nix settings

There are some settings which can be altered for the Nix package manager:

  • activation_action Activation script action, one of: switch|boot|test|dry-activate
  • activation_script Path to the system profile activation script
  • build_wrapper Path to the configuration wrapper in Nix language (function which returns drv_path & out_path)
  • cores Number of CPU cores which Nix should use to perform builds
  • mode Nix mode (0 - compat, 1 - default)
  • output System derivation output name
  • profile Path to the current system profile
  • show_trace Show Nix package manager trace on error
  • use_substitutes Whether or not Nix should use substitutes
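
Assuming all of the listed keys live under the nix block (as cores does in the override example later in this article), here is a sketch with a few of them set to illustrative values:

provider "nixos" {
  nix {
    mode              = 1
    cores             = 4
    show_trace        = true
    use_substitutes   = true
    activation_action = "switch"
  }
}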

I will focus on 3 of them:

  • activation_action
  • build_wrapper
  • mode

The activation_action setting is an action which will be passed as the first argument to the activation script of the system profile:

λ  /nix/var/nix/profiles/system/bin/switch-to-configuration
Usage: /nix/var/nix/profiles/system/bin/switch-to-configuration [switch|boot|test]

switch:       make the configuration the boot default and activate now
boot:         make the configuration the boot default
test:         activate the configuration, but don't make it the boot default
dry-activate: show what would be done if this configuration were activated

It can be empty; for the provider this basically means: do not run activation. This was done for tests but probably may be used for some other things.
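
For example, to build and upload the system closure without activating it (a sketch, assuming activation_action is set under the nix block):

provider "nixos" {
  nix {
    activation_action = ""
  }
}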

The build_wrapper setting may contain a path to the system derivation builder. This wrapper is written in the Nix language and should return an attribute set with at least two keys:

  • drv_path
  • out_path

Here is how the wrapper built into the provider looks.

The mode setting controls whether to use the experimental Nix CLI or not. By default it is 0, which means “try not to use experimental CLI flags”, and 1 means the opposite, “use experimental CLI flags”. I’ve introduced this setting to make the provider work with Nix versions older than 2.10.x.

Even older versions may still work, even with mode=1, but I haven’t tested this.

Configuration override

The provider gives the user a way to define settings for:

  • nix Nix package manager
  • ssh SSH client
  • bastion SSH tunneling
  • secrets secrets providers

Each of them can appear either globally or at the instance level.

There are more settings, but some of them may be defined only globally.

Settings defined at the instance level override global settings for that instance:

provider "nixos" {
  nix {
    cores = 2
  }
  ssh {
    port = 2222
  }
  # secrets { ... }
}

resource "nixos_instance" "vm" {
  nix {
    cores = 4
  }

  ssh {
    port = 22
  }

  bastion {
    host = "bastion.example.com"
    port = 2222
  }

  secrets {
    provider = "gopass"
    gopass = {
      store = "./secrets"
    }
  }
}

This will result in the following settings for vm:

  • globally Nix will use 2 CPU cores for builds, but for the vm instance it will use 4
  • the global SSH port is 2222, but for the vm instance port 22 will be used
  • globally no bastion is defined, but for the vm instance bastion.example.com will be used
  • globally no secrets provider is defined (the filesystem will be used), but for the vm instance the gopass provider is defined (with the store located in the ./secrets directory)

Each section (a set, like ssh or bastion, …) will not be merged deeply. This is a limitation of the Terraform SDK (it has no way to distinguish user-provided values from default values, so a correct merge is not possible). Because of this, the following will not work as expected:

The bastion port for the vm instance will be 22 instead of the expected 2222.

provider "nixos" {
  bastion {
    host = "bastion.example.com"
    port = 2222
  }
}

resource "nixos_instance" "vm" {
  # ...
  bastion {
    host = "other-bastion.example.com"
  }
  # ...
}
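
The workaround is to repeat the whole block on the instance, spelling out every key you care about, so nothing falls back to a default; a sketch:

resource "nixos_instance" "vm" {
  # ...
  bastion {
    host = "other-bastion.example.com"
    port = 2222
  }
  # ...
}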

Licensing

I don’t like the concept of intellectual property on source code and other intangible things.

That’s why this project and this article are in the public domain; feel free to do anything with this code, this is the Internet, I don’t care much. But I would be glad to be mentioned.