Ingest historical data

Historical data ingestion (or data importing), as opposed to live data ingestion, is the process of moving data from external sources into PostHog so that you can apply PostHog product analytics to historical data. You may have historical data that you want to analyze alongside new live data, or you may need to periodically import data from third-party sources to augment your live data.

Whatever the reason for the historical data ingestion, this guide covers what to consider during that process.

The three main factors to consider are:

  • Data ingestion process: How to get the event data from the third-party source into PostHog
  • Importing events: Sending the events captured in the third-party data source into PostHog as custom events
  • User identification: How to identify users within PostHog and tie them back to the users in the original data source

Historical data is sent to PostHog using either a server library or the PostHog API. For more information see the importing events section.

Data ingestion process

Because third-party data sources typically offer an API, you can write a script that exports the data from one or more sources and imports it into PostHog using the PostHog API.

The following factors are important in the export and then import process:

  • The volume of data
  • API rate limits of both the data source and PostHog
  • Exporting only the data required for events
  • Handling error scenarios so that the process can resume from the last successful point
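
The last factor can be handled by recording a checkpoint after every successful item. The following is a minimal sketch, not a PostHog API: `processWithCheckpoint`, `handleItem`, and the `checkpoint` object are hypothetical names, and a real run would persist the checkpoint to a file or database instead of keeping it in memory.

```javascript
// Hypothetical helper: processes items in order and records progress so that
// a failed run can resume from the last successful item.
async function processWithCheckpoint(items, handleItem, checkpoint) {
  // checkpoint.next is the index of the first item that has not succeeded yet.
  for (let i = checkpoint.next; i < items.length; i++) {
    // handleItem may throw, for example on a rate limit or a network error.
    await handleItem(items[i]);
    // Only advance after success; persist this value in a real run.
    checkpoint.next = i + 1;
  }
  return checkpoint;
}
```

If `handleItem` throws, the error propagates to the caller and `checkpoint.next` still points at the failed item, so calling the function again with the same checkpoint retries from that item without reprocessing earlier ones.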

With these factors in mind, we recommend breaking the process into steps such as the following:

  1. Sequentially export the selected data that represents key events from the data source, keeping track of where you are in the sequence so that you can restart from the last successful point if a problem occurs.

  2. Store the exported data in a separate data store for faster access in later steps.

  3. Transform the data into the format you will use with PostHog and save the result to storage again for faster access in later steps.

    The data format should be:

    {
      "event": "[event name]",
      "distinct_id": "[your users' distinct id]",
      "properties": {
        "key1": "value1",
        "key2": "value2"
      },
      "timestamp": "[optional timestamp in ISO 8601 format]"
    }

    At this stage it's also important to consider the following:

    1. Use the same event name that you're going to use with your live data ingestion so that historical and live events are seen as the same type within PostHog
    2. Use the same unique identifier in the distinct_id field as you use in your live data ingestion so that historical and live events are associated with the same user
    3. Convert the old event property names to the property names you are using in your live events
    4. Ensure that the timestamp is the original timestamp converted to ISO 8601 format so that PostHog correctly identifies when the original event occurred
    5. You may want to set an additional property within properties that identifies the original event within the data source
  4. Sequentially import the events into PostHog, keeping track of the last successfully imported event so that you can restart from the last successful point if a problem occurs
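
The transformation in step 3 could look roughly like the sketch below. The source record shape (`id`, `type`, `user_id`, `occurred_at`, `attributes`), the two mapping tables, and the `source_event_id` property are all hypothetical; replace them with whatever your data source and live events actually use. The source timestamp is assumed to be Unix time in seconds.

```javascript
// Hypothetical mapping from source event names to the names used by live events.
const EVENT_NAMES = { video_start: 'movie played' };
// Hypothetical mapping from source property names to live property names.
const PROPERTY_NAMES = { movie_id: 'movieId', genre: 'category' };

// Converts one exported record (hypothetical shape) into the event format above.
function toPostHogEvent(record) {
  const properties = {};
  for (const [key, value] of Object.entries(record.attributes)) {
    // Rename known properties; pass unknown ones through unchanged.
    properties[PROPERTY_NAMES[key] ?? key] = value;
  }
  // Keep a reference to the original record so imported events can be traced back.
  properties.source_event_id = record.id;
  return {
    event: EVENT_NAMES[record.type] ?? record.type,
    // Must match the distinct_id that your live events use for this user.
    distinct_id: String(record.user_id),
    properties,
    // Assumes the source stores Unix time in seconds; converts to ISO 8601.
    timestamp: new Date(record.occurred_at * 1000).toISOString(),
  };
}

const event = toPostHogEvent({
  id: 'evt_1',
  type: 'video_start',
  user_id: 123,
  occurred_at: 1577836800,
  attributes: { movie_id: '123', genre: 'romcom' },
});
// event.timestamp is "2020-01-01T00:00:00.000Z"
```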

Importing events

Once you are ready to import the data into PostHog, you can use either a PostHog server library or the PostHog API.

As mentioned above, the data should be in the following format:

{
  "event": "[event name]",
  "distinct_id": "[your users' distinct id]",
  "properties": {
    "key1": "value1",
    "key2": "value2"
  },
  "timestamp": "[optional timestamp in ISO 8601 format]"
}

The server libraries handle batching of capture requests. If you decide to use the API directly, you will need to manage batching yourself.

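// Setup assumed here: the posthog-node package and your own project API key.
import { PostHog } from 'posthog-node'

const client = new PostHog('<ph_project_api_key>')

client.capture({
  distinctId: 'distinct id',
  event: 'movie played',
  properties: {
    movieId: '123',
    category: 'romcom'
  }
})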

For more information see the Node.js docs.
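
If you call the API directly instead of using a server library, you need to group events into batches yourself. The sketch below shows one way to do that. The helper names are hypothetical, and the request body shape (an `api_key` field plus a `batch` array of events in the format above) is an assumption that you should confirm against the PostHog API docs, along with any limits on batch size.

```javascript
// Splits an array of events into batches of at most `size` events.
function chunk(events, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < events.length; i += size) {
    batches.push(events.slice(i, i + size));
  }
  return batches;
}

// Builds one request body per batch. The body shape is an assumption;
// check it against the PostHog API docs before sending anything.
function buildBatchBodies(events, apiKey, size) {
  return chunk(events, size).map((batch) => ({ api_key: apiKey, batch }));
}
```

Each body can then be sent in sequence, which makes it straightforward to respect rate limits and to record the last batch that was imported successfully.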

User identification

As discussed in the data ingestion process section, each event should carry a unique user identifier in distinct_id. In addition to setting the user on each event, you can add more information about that user by setting properties:

client.identify({
  distinctId: 'user:123',
  properties: {
    email: 'john@doe.com',
    proUser: false
  }
})

For more information see the Node.js docs.
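
When the data source also exports user records, the identify payloads can be derived from them. This is a sketch under assumptions: the user record shape (`id`, `email`, `plan`) and the `user:` prefix are hypothetical, and the distinct ID built here must follow the same scheme that your events use.

```javascript
// Builds one identify payload per source user (hypothetical record shape).
function toIdentifyCalls(users) {
  const byId = new Map();
  for (const user of users) {
    // Must match the distinct_id used for this user's events.
    const distinctId = `user:${user.id}`;
    // If a user appears more than once, the later record wins.
    byId.set(distinctId, {
      distinctId,
      properties: { email: user.email, proUser: user.plan === 'pro' },
    });
  }
  return [...byId.values()];
}

// Usage with a server library client:
// for (const call of toIdentifyCalls(users)) client.identify(call)
```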