Migrate from Google Analytics to PostHog


Prior to starting a historical data migration, ensure you do the following:

  1. Create a project on our US or EU Cloud.
  2. Sign up for a paid product analytics plan on the billing page (historical imports are free, but this unlocks the necessary features).
  3. Set the historical_migration option to true when capturing events in the migration (see the sketch below).
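For example, with our Python SDK, a minimal setup sketch looks like this (replace the API key placeholder with your own, and use https://eu.i.posthog.com if your project is on EU Cloud):

Python
from posthog import Posthog

posthog = Posthog(
    '<ph_project_api_key>',
    host='https://us.i.posthog.com',
    historical_migration=True
)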

Migrating data from Google Analytics is a three-step process:

  1. Setting up the Google Analytics BigQuery export
  2. Querying Google Analytics data from BigQuery
  3. Converting Google Analytics event data to the PostHog schema and capturing in PostHog

Want a higher-level overview of PostHog? Check out our introduction to PostHog for Google Analytics users.

1. Setting up the Google Analytics BigQuery export

Unfortunately, Google Analytics' historical data exports are limited. The best way to get data from Google Analytics is to set up the BigQuery export.

To do this, start by creating a Google Cloud account and project. Afterwards, do the basic setup for BigQuery, including enabling the BigQuery API. The BigQuery sandbox works for this.

Once done, go to the Admin section of your Google Analytics account. Under Product links, select BigQuery links and then click the Link button.

Google Analytics BigQuery link

Choose your project, choose your export type (you can make either daily or streaming work), click Next, and then Submit. This sets up a link to your BigQuery account, but it can take more than 24 hours to see data in BigQuery.

2. Querying Google Analytics data from BigQuery

After the link is set up, data is automatically added to BigQuery under a dataset named analytics_ followed by your Google Analytics property ID, like analytics_123456789. Within this is an events table based on your export type. If you used daily exports, events tables have names like events_20240731.

BigQuery events table
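To confirm what the export created, you can list the tables in the dataset with the BigQuery Python client (set up in the next step). A minimal sketch, assuming the same placeholder project and dataset IDs used below:

Python
from google.cloud import bigquery

client = bigquery.Client(project='cool')

# Daily export tables are named like events_20240731
for table in client.list_tables('analytics_123456789'):
    print(table.table_id)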

To query this data, use the BigQuery Python client. This requires setting up and authenticating with the Google Cloud CLI.
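If you prefer not to authenticate through the gcloud CLI, the client can also be constructed from a service account key file with BigQuery read access. A minimal sketch, using a hypothetical key file path:

Python
from google.cloud import bigquery

# Hypothetical path to a service account key file
client = bigquery.Client.from_service_account_json('path/to/service-account.json')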

Once done, we can then query events like this:

Python
from google.cloud import bigquery

# Set the project ID explicitly
project_id = 'cool'
client = bigquery.Client(project=project_id)

QUERY = (
    'SELECT * FROM `analytics_123456789.events_20240731` '
    'WHERE event_name NOT IN ("session_start", "first_visit") '
    'LIMIT 100'
)
query_job = client.query(QUERY)
rows = query_job.result()

This returns a BigQuery RowIterator object.
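Each row in the iterator behaves like a read-only mapping, so you can inspect the raw Google Analytics fields before converting them. A quick sketch, reusing the rows variable from the query above:

Python
# Rows support .get(), .items(), and conversion to a plain dict
for row in rows:
    print(row.get('event_name'), row.get('event_timestamp'))
    print(dict(row))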

3. Converting GA event data to the PostHog schema and capturing

The schema of Google Analytics' exported event data is similar to PostHog's schema, but it requires conversion to work with the rest of PostHog's data. You can see details on the Google Analytics schema in their docs and the events and properties PostHog autocaptures in our docs.

For example, many PostHog events and properties are prefixed with the $ character.

To convert to PostHog's schema, we need to:

  1. Convert the event_name values like user_engagement to $autocapture and page_view to $pageview.

  2. Convert the event_timestamp from microseconds since the Unix epoch to ISO 8601 (see the sketch after this list).

  3. Flatten the event data by pulling useful data out of records like event_params and items as well as dictionaries like device and geo.

  4. Create an event properties dictionary by looping through Google Analytics event data, dropping irrelevant data, converting some to PostHog's schema, and including data unique to Google Analytics or custom properties.

  5. Do the same conversion with person properties. These are then added to the $set property.
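As an example of step 2, the exported event_timestamp is in microseconds since the Unix epoch, so a minimal conversion sketch (using a hypothetical timestamp value) looks like this:

Python
from datetime import datetime, timezone

# Hypothetical event_timestamp from a BigQuery row (microseconds since the Unix epoch)
event_timestamp = 1722384000000000

ph_timestamp = datetime.fromtimestamp(event_timestamp / 1000000, tz=timezone.utc)
print(ph_timestamp.isoformat())  # 2024-07-31T00:00:00+00:00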

Once this is done, you can capture events into PostHog using the Python SDK with historical_migration set to true.

Here's an example version of a full Python script that gets data from BigQuery, converts it to PostHog's schema, and captures it in PostHog.

Python
from posthog import Posthog
from datetime import datetime
from google.cloud import bigquery

posthog = Posthog(
    '<ph_project_api_key>',
    host='https://us.i.posthog.com',
    debug=True,
    historical_migration=True
)

key_mapping = {
    'page_title': 'title',
    'page_location': '$current_url',
    'page_referrer': '$referrer',
    'category': '$device_type',
    'operating_system': '$os',
    'operating_system_version': '$os_version',
    'language': '$browser_language',
    'web_info.browser': '$browser',
    'web_info.browser_version': '$browser_version',
    'web_info.hostname': '$host',
    'city': '$geoip_city_name',
    'country': '$geoip_country_name',
    'continent': '$geoip_continent_name',
    'region': '$geoip_subdivision_1_name'
}

omitted_keys = [
    'event_date',
    'event_timestamp',
    'event_name',
    'event_bundle_sequence_id',
    'event_server_timestamp_offset',
    'user_id',
    'user_pseudo_id',
    'stream_id',
    'platform',
    'batch_event_index',
    'batch_page_id',
    'batch_ordering_id',
    'session_traffic_source_last_click'
]
def get_record_key_value(param):
    # Each record has a 'key' and a 'value' dict with one non-null entry
    record_key = param['key']
    value_dict = param['value']
    record_value = next((v for v in value_dict.values() if v is not None), None)
    return record_key, record_value

def flatten_row(row):
    flat_row = {}
    for key, value in row.items():
        if value is None:
            continue
        elif isinstance(value, list) and len(value) == 0:
            continue
        elif key in omitted_keys:
            continue
        elif key in ['user_properties', 'user_first_touch_timestamp', 'user_ltv']:
            # We'll handle user properties separately
            continue
        elif key in ['event_params', 'items']:
            # Flatten records
            for param in value:
                record_key, record_value = get_record_key_value(param)
                flat_row[record_key] = record_value
        elif key in ['privacy_info', 'device', 'geo', 'app_info', 'collected_traffic_source', 'traffic_source', 'event_dimensions', 'ecommerce']:
            # Flatten dicts and nested dicts
            for sub_key, sub_val in value.items():
                if isinstance(sub_val, dict):
                    for nested_key, nested_val in sub_val.items():
                        # We might not want to do this
                        flat_row[f"{sub_key}.{nested_key}"] = nested_val
                else:
                    flat_row[sub_key] = sub_val
        else:
            flat_row[key] = value
    return flat_row
def get_person_properties(row):
    person_properties = {}
    user_properties = row.get('user_properties', [])
    for prop in user_properties:
        key, value = get_record_key_value(prop)
        person_properties[key] = value

    user_first_touch_timestamp = row.get('user_first_touch_timestamp')
    user_ltv = row.get('user_ltv')
    if user_first_touch_timestamp is not None:
        person_properties['user_first_touch_timestamp'] = user_first_touch_timestamp
    if user_ltv is not None:
        person_properties['user_ltv'] = user_ltv

    return person_properties

def get_properties(flat_row):
    properties = {}
    for key, value in flat_row.items():
        if value is None:
            continue
        elif key in omitted_keys:
            continue
        elif value in ['(not provided)', '(not set)']:
            continue
        elif key in key_mapping:
            properties[key_mapping[key]] = value
        else:
            properties[key] = value
    return properties
# Get events from BigQuery
project_id = 'cool'
client = bigquery.Client(project=project_id)

QUERY = (
    'SELECT * FROM `analytics_123456789.events_20240731` '
    'WHERE event_name NOT IN ("session_start", "first_visit") '
    'LIMIT 100'
)
query_job = client.query(QUERY)
rows = query_job.result()

# Convert and capture events in PostHog
for row in rows:
    distinct_id = row.get('user_id') or row.get('user_pseudo_id')

    ph_event_name = row.get('event_name')
    if ph_event_name == 'page_view':
        ph_event_name = '$pageview'
    elif ph_event_name == 'user_engagement':
        ph_event_name = '$autocapture'

    event_timestamp = row.get('event_timestamp')
    ph_timestamp = datetime.fromtimestamp(event_timestamp / 1000000)

    flat_row = flatten_row(row)
    properties = get_properties(flat_row)
    properties['$set'] = get_person_properties(row)

    posthog.capture(
        distinct_id=distinct_id,
        event=ph_event_name,
        properties=properties,
        timestamp=ph_timestamp
    )

This script may need modification depending on the structure of your Google Analytics data, but it gives you a start.
