Integrating Vault With Apache Airflow
Storing Airflow connections and variables with HashiCorp Vault
In my previous article, I discussed six different ways of creating and managing variables in Apache Airflow. In production deployments, we need to ensure that the way we store variables (and connections) is both secure and portable.
By default, Apache Airflow reads connections and variables from the metastore database and/or environment variables. However, Airflow can also read variables and connections from Secrets Backends rather than from its own metastore database.
In most cases, large organisations already have secret stores where teams can keep their secrets in a secure and accessible way. This means that it is usually easier to use the same secret stores for Airflow deployments too, so that some secrets can also be shared across different technologies.
Airflow provides several different integrations for Secrets Backends, including Amazon (Secrets Manager & Systems Manager Parameter Store), Google (Cloud Secret Manager), Microsoft (Azure Key Vault) and HashiCorp (Vault).
In the following sections, we will go through a step-by-step process you can follow in order to integrate Airflow with HashiCorp Vault. In fact, the tutorial can be useful to anyone looking to integrate any of the aforementioned Secrets Backends supported by Airflow. Let’s get started!
Step 1: Update airflow.cfg file
In order to integrate HashiCorp Vault with Airflow such that the latter will retrieve the connections and variables from the former, we first need to specify VaultBackend as the backend in the [secrets] section of airflow.cfg.
[secrets]
backend = airflow.providers.hashicorp.secrets.vault.VaultBackend
backend_kwargs = {"connections_path": "connections", "variables_path": "variables", "url": "http://127.0.0.1:8200", "mount_point": "airflow"}
The above configuration assumes that your Airflow connections are stored as secrets under the airflow mount point and the connections path (i.e. airflow/connections). Likewise, your variables are stored under airflow/variables.
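If the airflow mount point does not exist in your Vault instance yet, you can create it with the Vault CLI. As a hedged sketch, the command below enables a KV secrets engine (version 2, which is what VaultBackend expects by default) at that path:
vault secrets enable -path=airflow -version=2 kv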
You should also ensure that you specify the additional parameters that will allow Airflow to authenticate against Vault. For instance, you may need to pass the auth_type (e.g. approle), role_id and secret_id arguments (or perhaps just the token argument). You can see the full list of available arguments here.
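For example, a backend_kwargs using AppRole authentication might look like the line below; the role_id and secret_id values are placeholders you would replace with your own credentials:
backend_kwargs = {"connections_path": "connections", "variables_path": "variables", "url": "http://127.0.0.1:8200", "mount_point": "airflow", "auth_type": "approle", "role_id": "<your-role-id>", "secret_id": "<your-secret-id>"}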
Step 2: Add connections as secrets to Vault
The connections in Vault should be stored under the connections_path specified in the Airflow configuration, as illustrated in the previous step.
For every connection, you would have to create one Vault secret having at least one of the following keys:
- conn_id (str) -- The connection ID.
- conn_type (str) -- The connection type.
- description (str) -- The connection description.
- host (str) -- The host.
- login (str) -- The login.
- password (str) -- The password.
- schema (str) -- The schema.
- port (int) -- The port number.
- extra (Union[str, dict]) -- Extra metadata. Non-standard data such as private/SSH keys can be saved here. JSON encoded object.
- uri (str) -- URI address describing connection parameters.
This is because the secret’s content should align with the expected parameters of the airflow.models.connection.Connection class.
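As an illustration, a hypothetical Postgres connection with connection ID my_postgres (all of the values below are made up) could be stored under the paths configured earlier like so:
vault kv put airflow/connections/my_postgres conn_type=postgres host=pg.example.com login=airflow password=airflow-pass port=5432
In your DAGs you would then refer to it as BaseHook.get_connection('my_postgres'). Depending on the version of the HashiCorp provider you are running, connections may also be stored as a single conn_uri key instead of individual fields.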
Step 3: Add variables as Vault secrets
Moving on to variables, you should be a bit careful when adding them as secrets in Vault, as the expected format may not be as straightforward as you might expect.
Let’s suppose that in airflow.cfg you’ve specified variables_path as variables and mount_point as airflow. If you wish to store a variable called my_var having the value hello, then you would have to store the secret as:
vault kv put airflow/variables/my_var value=hello
Note that the secret Key is value, and the secret Value is hello!
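To double-check that the secret has the expected shape, you can read it back with the Vault CLI:
vault kv get airflow/variables/my_var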
Step 4: Accessing connections in Airflow DAGs
You can use the following code to access Airflow connections that are stored as Vault secrets and inspect their details:
import json
import logging

from airflow.hooks.base_hook import BaseHook

# 'secret_name' is the connection ID, i.e. the secret stored under airflow/connections/secret_name
conn = BaseHook.get_connection('secret_name')
logging.info(
    f'Login: {conn.login}, '
    f'Password: {conn.password}, '
    f'URI: {conn.get_uri()}, '
    f'Host: {conn.host}, '
    f'Extra: {json.loads(conn.get_extra())}'
    # ...
)
Note that Vault secrets will be searched first, followed by environment variables and then the metastore (i.e. the connections added through the Airflow UI). This search order is not configurable.
Step 5: Accessing variables in Airflow DAGs
Likewise, the following code snippet will help you retrieve the values of variables stored in Vault:
import logging
from airflow.models import Variable
my_var = Variable.get('var_name')
logging.info(f'var_name value: {my_var}')
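Note that Variable.get will raise an error if the variable cannot be found in any backend. If that is not what you want, you can pass a default, as in this minimal sketch (my_var is a hypothetical variable name):
from airflow.models import Variable

# Falls back to 'fallback' when my_var is not found in Vault, the environment variables or the metastore
my_var = Variable.get('my_var', default_var='fallback')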
Step 6: Testing Vault Integration using a DAG
In the code snippet below you can find an example Airflow DAG that you can use to test that Airflow can correctly read variables and connections from the Vault instance specified in the airflow.cfg file.
import logging
from datetime import datetime
from airflow import DAG
from airflow.models import Variable
from airflow.hooks.base_hook import BaseHook
from airflow.operators.python_operator import PythonOperator
def get_secrets(**kwargs):
    # Test that the connection can be resolved (Vault is searched first)
    conn = BaseHook.get_connection(kwargs['my_conn_id'])
    logging.info(
        f"Password: {conn.password}, Login: {conn.login}, "
        f"URI: {conn.get_uri()}, Host: {conn.host}"
    )
    # Test that the variable can be resolved
    test_var = Variable.get(kwargs['var_name'])
    logging.info(f"{kwargs['var_name']}: {test_var}")


with DAG(
    'test_vault_connection',
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,  # run only when triggered manually
) as dag:
    test_task = PythonOperator(
        task_id='test-task',
        python_callable=get_secrets,
        op_kwargs={
            'my_conn_id': 'connection_to_test',
            'var_name': 'my_test_var',
        },
    )
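Once the DAG file has been picked up by Airflow, a quick way to exercise it without waiting for the scheduler is to run the task directly from the CLI (this assumes the connection connection_to_test and the variable my_test_var already exist in Vault):
airflow tasks test test_vault_connection test-task 2021-01-01
Alternatively, airflow dags trigger test_vault_connection will queue a regular DAG run, and the retrieved values will appear in the task logs.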
Final Thoughts
Ensuring that your Airflow variables and connections are stored securely is extremely important, especially when deploying the tool in production environments. Secrets Backend integrations make it possible for Airflow to connect to third-party secret stores in order to read configurations, variables and connections.
Apart from security, this approach also offers portability, since it is extremely easy to plug the same secret store into other Airflow deployments and environments, while also making it possible for other tools to integrate with the same secret store. This reduces the need for developers to duplicate the same definitions across different stores.