Databricks destination for batch exports
Batch exports can be used to export data to a Databricks Delta table.
The Databricks destination is currently in beta. This means the configuration and features are subject to change.
You can request access via the in-app support form.
We love feature requests and feedback. Please tell us what you think using the same link above.
Requirements
To create a batch export to Databricks, you need the following:
- A Databricks account with serverless SQL warehouses enabled
- A Databricks workspace with Unity Catalog enabled
- A SQL warehouse
- A catalog and schema you wish to use
- A service principal with OAuth Machine-to-Machine (M2M) authentication configured. This is the only authentication type we support currently. This service principal must have the appropriate permissions (see below)
Setup instructions
1. SQL warehouse
Batch exports use SQL warehouses to export data into your Delta tables.
You can use either an existing warehouse, or you can create a new one.
You will need to make a note of the Server hostname and HTTP path for your warehouse:
- In Databricks, navigate to SQL Warehouses.
- Select your SQL warehouse.
- Go to the Connection details tab.
- Copy the Server hostname (e.g. `abc123.cloud.databricks.com`).
- Copy the HTTP path (e.g. `/sql/1.0/warehouses/abc123def456`).
2. Set up catalog and schema
- In your Databricks workspace, create a catalog (or use an existing one) where PostHog will export data.
- Create a schema within that catalog (or use an existing one).
- Note the names of both the catalog and schema – you'll need these when configuring the batch export.
Note: You don't need to create tables manually. PostHog will create destination tables for you to ensure the data schemas are correct.
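If you prefer SQL over the UI for this step, a minimal sketch is shown below (the names posthog and batch_exports are placeholders; pick whatever catalog and schema you want PostHog to write into):

```sql
-- Placeholder names: replace with the catalog and schema you want PostHog to use.
CREATE CATALOG IF NOT EXISTS posthog;
CREATE SCHEMA IF NOT EXISTS posthog.batch_exports;
```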
3. Create a service principal
Service principals give automated tools and scripts API-only access to Databricks resources, providing greater security than using user accounts.
- In your Databricks workspace, navigate to Settings > Identity and access > Service principals.
- Click Add service principal.
- Click Add new.
- Enter a name for your service principal (e.g. "PostHog Batch Exports").
- Click Add.
4. Generate OAuth credentials
- Select the service principal you just created.
- Navigate to the Secrets tab.
- Click Generate secret.
- Choose a lifetime for the new secret.
- Copy and save both the Client ID (which is the same as the service principal's UUID) and the Secret securely - you'll need these for the PostHog configuration.
Important: The client secret is only shown once. Make sure to save it securely before closing the dialog.
5. Grant permissions to the service principal
The service principal needs the following permissions:
At the catalog level:
- `USE CATALOG` - Required to access the catalog
At the schema level:
- `USE SCHEMA` - Required to access the schema
- `CREATE TABLE` - Required to create new tables (if the destination table doesn't exist)
- `CREATE VOLUME` - Required to create temporary volumes for staging data
At the table level:
(Only needed if your table already exists. In general it is recommended to allow PostHog to create tables for you to ensure the data schemas are correct)
- `SELECT` - Required to read the table schema
- `MODIFY` - Required to insert and update data
You can grant these permissions using SQL commands in Databricks:
Note: Replace `<catalog_name>`, `<schema_name>`, `<table_name>`, and `<service_principal_id_or_group>` with your actual values. You can either assign permissions to the service principal directly or via a group.
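For example, a sketch of the corresponding GRANT statements when granting directly to the service principal:

```sql
-- Catalog level
GRANT USE CATALOG ON CATALOG <catalog_name> TO `<service_principal_id_or_group>`;

-- Schema level
GRANT USE SCHEMA ON SCHEMA <catalog_name>.<schema_name> TO `<service_principal_id_or_group>`;
GRANT CREATE TABLE ON SCHEMA <catalog_name>.<schema_name> TO `<service_principal_id_or_group>`;
GRANT CREATE VOLUME ON SCHEMA <catalog_name>.<schema_name> TO `<service_principal_id_or_group>`;

-- Table level (only if the destination table already exists)
GRANT SELECT ON TABLE <catalog_name>.<schema_name>.<table_name> TO `<service_principal_id_or_group>`;
GRANT MODIFY ON TABLE <catalog_name>.<schema_name>.<table_name> TO `<service_principal_id_or_group>`;
```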
Creating the batch export
- In PostHog, click Data pipelines in the navigation and go to the Destinations tab.
- Click + New destination in the top-right corner.
- Search for Databricks.
- Click the + Create button.
- Fill in the necessary configuration details.
- Create a Databricks integration (if you haven't already).
- Finalize the creation by clicking on Create.
- Done! The batch export will schedule its first run at the start of the next period.
Databricks configuration
You'll need to specify the following configuration settings in PostHog:
Connection settings:
- Integration: Select the Databricks integration you created (or create a new one). The integration stores your server hostname and OAuth credentials securely.
- HTTP Path: The HTTP path value for your SQL warehouse or all-purpose compute cluster (e.g. `/sql/1.0/warehouses/abc123def456`).
Destination settings:
- Catalog: The name of the Databricks catalog where the table will be created.
- Schema: The name of the schema within the catalog where the table will be created.
- Table name: The name of the table where data will be exported. If the table doesn't exist, it will be created automatically (this is the recommended approach).
Advanced settings (optional):
- Use VARIANT type: Whether to use Databricks' `VARIANT` type for JSON fields (like `properties` and `person_properties`). Enabled by default. The `VARIANT` type is optimized for semi-structured data and provides better querying performance. However, it requires Databricks Runtime 15.3 or above. If you're using an older runtime version, disable this setting to use the `STRING` type instead.
Models
This section describes the models that can be exported to Databricks.
Note: New fields may be added to these models over time. PostHog uses Databricks' automatic schema evolution feature, which means these new fields will be automatically added to your destination tables. Existing columns are never removed or modified.
Events model
This is the default model for Databricks batch exports. The schema of the model as created in Databricks is:
| Field | Type | Description |
|---|---|---|
| uuid | STRING | The unique ID of the event within PostHog |
| event | STRING | The name of the event that was sent |
| properties | VARIANT or STRING | A JSON object with all the properties sent along with an event |
| distinct_id | STRING | The distinct_id of the user who sent the event |
| team_id | BIGINT | The team_id for the event |
| timestamp | TIMESTAMP | The timestamp associated with an event |
| databricks_ingested_timestamp | TIMESTAMP | An additional metadata column, indicating when the event was ingested into Databricks |
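For example, once the export has run, a quick sanity check against the events table could look like this (the posthog.batch_exports.events name is a placeholder for your configured catalog, schema, and table):

```sql
-- Placeholder table name: replace with your own catalog.schema.table.
SELECT
  event,
  COUNT(*) AS event_count
FROM posthog.batch_exports.events
WHERE timestamp >= date_sub(current_date(), 7)
GROUP BY event
ORDER BY event_count DESC;
```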
Persons model
The schema of the model as created in Databricks is:
| Field | Type | Description |
|---|---|---|
| team_id | BIGINT | The id of the project (team) the person belongs to |
| distinct_id | STRING | A distinct_id associated with the person |
| person_id | STRING | The id of the person associated with this (team_id, distinct_id) pair |
| properties | VARIANT or STRING | A JSON object with all the latest properties of the person |
| person_version | BIGINT | Internal version of the person properties, used by batch export in merge operation |
| person_distinct_id_version | BIGINT | Internal version of the person to distinct_id mapping, used by batch export in merge operation |
| created_at | TIMESTAMP | The timestamp when the person was created |
The Databricks table will contain one row per (team_id, distinct_id) pair, with each pair mapped to its corresponding person_id and latest properties.
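For example, a sketch of resolving exported events to persons by joining on the (team_id, distinct_id) pair (table names are placeholders):

```sql
-- Placeholder table names: replace with your own catalog.schema.table values.
SELECT
  p.person_id,
  e.event,
  e.timestamp
FROM posthog.batch_exports.events AS e
JOIN posthog.batch_exports.persons AS p
  ON e.team_id = p.team_id
 AND e.distinct_id = p.distinct_id
WHERE e.timestamp >= date_sub(current_date(), 1);
```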
FAQ
How does the batch export process work?
The batch export process follows a consistent pattern regardless of the model being exported, with variations depending on whether the model is mutable (like persons) or immutable (like events).
For immutable models (e.g. events):
- Create a temporary volume to store Parquet files.
- Create the destination table if it doesn't exist.
- Query data from PostHog and transform it into Parquet format.
- Upload Parquet files to the temporary volume.
- Copy data directly from the volume into the destination table using Databricks' `COPY INTO` command.
- Delete the temporary volume.
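For illustration only, the copy step is along these lines (the volume path and table name are placeholders, and the exact statements PostHog runs may differ):

```sql
-- Placeholder names; illustrates loading staged Parquet files into the destination table.
COPY INTO posthog.batch_exports.events
FROM '/Volumes/posthog/batch_exports/posthog_staging/'
FILEFORMAT = PARQUET
COPY_OPTIONS ('mergeSchema' = 'true');
```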
For mutable models (e.g. persons, sessions):
- Create a temporary volume to store Parquet files.
- Create the destination table if it doesn't exist.
- Create a temporary stage table with the same schema as the destination table.
- Query data from PostHog and transform it into Parquet format.
- Upload Parquet files to the temporary volume.
- Copy data from the volume into the stage table using `COPY INTO`.
- Execute a `MERGE` operation between the stage table and the destination table:
  - Update existing rows where the stage table has newer versions.
  - Insert new rows that don't exist in the destination table.
- Delete the stage table and temporary volume.
The merge operation for mutable models ensures that:
- Rows that match in the destination table and have newer version fields in the stage table are updated.
- New rows not found in the destination table are inserted.
- Your data always reflects the latest state while maintaining historical records that haven't been updated.
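For illustration only, a merge along these lines for the persons model (the stage table name is a placeholder, and the exact statement PostHog runs may differ):

```sql
-- Placeholder names; updates rows with newer versions and inserts rows that are missing.
MERGE INTO posthog.batch_exports.persons AS target
USING posthog.batch_exports.persons_stage AS stage
  ON target.team_id = stage.team_id
 AND target.distinct_id = stage.distinct_id
WHEN MATCHED AND (
       stage.person_version > target.person_version
    OR stage.person_distinct_id_version > target.person_distinct_id_version
  ) THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```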
Should I use VARIANT or STRING type for JSON fields?
We recommend using VARIANT (the default) for the following reasons:
- Better query performance
- Native support for JSON operators and functions
However, you should use STRING if:
- You're using Databricks Runtime older than 15.3
- You need to export the data to another system that doesn't support VARIANT
- You prefer to handle JSON parsing in your application code
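As a rough illustration of the difference when querying (the plan property key and table name are hypothetical):

```sql
-- With VARIANT (the default), extract fields with the colon path operator and cast as needed.
SELECT properties:plan::string AS plan
FROM posthog.batch_exports.events;

-- With STRING, parse the JSON explicitly, for example with get_json_object.
SELECT get_json_object(properties, '$.plan') AS plan
FROM posthog.batch_exports.events;
```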