Linking S3 as a source

Last updated:

|

The data warehouse can link to data in your object storage system like S3. To start, you'll need to:

  1. Create a bucket in S3
  2. Set up an access key and secret
  3. Add data to the bucket (we'll use Airbyte)
  4. Create the table in PostHog

Step 1: Creating a bucket in S3

  1. Log in to AWS.
  2. Open S3 in the AWS console and create a new bucket. We suggest us-east-1 if you're using PostHog Cloud US, or eu-central-1 if you're using PostHog Cloud EU.

created bucket

Make sure to note both the name and region of your bucket, we'll need these later.

Step 2: Set up access policy and key

Next, we need to create a new user in our AWS console with programmatic access to our newly created bucket.

  1. Open IAM and create a new policy to enable access to this bucket
  2. On the left under Access management, select Policies and click Create policy
  3. Under the service, choose S3
  4. Under Actions, select:
  5. List -> ListBucket and ListBucketMultipartUploads
  6. Read -> GetBucketLocation and GetObject
  7. Write -> AbortMultipartUpload and PutObject
  8. Permission Management -> PutObjectAcl
  9. Under Resources, select Specific. Under object, click Add ARNs
  10. Specify your bucket name and choose Any object name. In the example below, replace posthog-s3-export with the bucket name you chose in the previous section

bucket arn config

  1. Your policy in JSON should look like this:
JSON
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:PutObjectAcl",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:AbortMultipartUpload",
"s3:GetBucketLocation"
],
"Resource": "arn:aws:s3:::posthog-s3-export/*"
}
]
}
  1. Click "Next" until you end up on the "Review Policy" page
  2. Give your policy a name and click "Create policy"

The final step is to create a new user and give them access to our bucket by attaching our newly created policy.

  1. Open IAM and navigate to Users on the left
  2. Click Create user, add a user name, and click Next
  3. Select Attach policies directly
  4. Search for the policy you just created and click the checkbox on the far left to attach it to this user
  5. Click Next and then click Create user

User

  1. Once created, search and click on the user name and then click Create access key
  2. Select Application running outside AWS and then click Next
  3. Add a description tag value and click Create access key
  4. Copy the access key and secret access key values to somewhere safe. You will need to recreate these values if you lose them.

AWS access keys

Step 3: Add data to the bucket

For this section, we'll be using Airbyte. However, we accept any data in CSV or Parquet format, so if you already have data in S3 you can skip this section.

  1. Go to Airbyte and sign up for an account if you haven't already.
  2. Go to connections and click "New connection"
  3. Select a source. For this example, we'll grab data from Stripe, but you can use any of Airbyte's sources.
  4. Click "Set up a new destination"
  5. Select "S3" as the destination
  6. Fill in the "S3 Bucket Name", "S3 Bucket Region" with the name and region you created earlier.
  7. For "S3 Bucket Path", use airbyte.
  8. For the "Output Format", pick Parquet. You can use the default settings
  9. Under "Optional fields", you'll want to add the access key and secret from step 2.
  10. In the next step, pick the streams you want to fill. Given you'll manually need to create a table for each stream, we suggest being selective.
  11. Wait for the sync to finish

creating the s3 destination

Step 4: Create the table in PostHog

  1. Go to the Data pipeline page and the sources tab in PostHog
  2. Click New source and under self managed, look for S3 and click Link
  3. Fill in the table name.
  4. For the URL pattern, copy the URL from S3, up to the bucket name. Replace the file name with *, as Airbyte will split large streams out into multiple files.
    • For example: https://your-bucket.s3.amazonaws.com/airbyte/invoices/*
  5. For file format, select Parquet. Fill in the access key and secret key.

You'll want to repeat this step for each "stream" or folder that Airbyte created in your S3 bucket.

creating table in posthog

Step 5: Query the table.

Amazing! You can now query your new table.

Troubleshooting

  • Create table failed: Could not get columns: Check that your Files URL pattern is correct and that your file format is correct. For example, make sure your columns don't have spaces and there aren't commas in cells in your .csv file.

  • Create table failed: Access was denied when reading the provided file: Make sure your access policies are correct.

  • Create table failed: The provided file is not in Parquet format: Make sure you've selected the correct file format.

Questions? Ask Max AI.

It's easier than reading through 793 pages of documentation

Community questions

Was this page useful?

Next article

Querying the data warehouse with SQL

PostHog provides the full flexibility of SQL to query your data warehouse using the SQL editor . To create a query, go to the SQL editor . Here you can see and search the schema of all available sources and PostHog tables as well as saved views. Here you can: Write your SQL query using your table like SELECT * FROM stripe.prod.charge Click Run to see the results. Modify your query using SQL commands and functions as needed to get the data you want like select amount / 100 as dollar…

Read next article