When we join different data sources on the user level, on many occasions the user has different types of identifiers( or might be multiple identifiers of the same type). Think of different devices (cookies)-IDs, Customer IDs, or lead IDs. Grouping all these identifiers into a matching or mapping table (identity graph) can be useful when you want to stitch online behavior to customers and build a 360° customer view. GA4 Bigquery provides a great solution.
This blog will show you how to create an identity graph containing all default and custom user identifiers that can be collected in Google Analytics 4, using BigQuery exports.
- Group all available user identifiers in GA4 into one BigQuery table.
- Create sub-tables based on the master table for joining other data easily.
Identity Resolution: Master Data/Golden record
An identity graph is a unified collection of user IDs (like customer or device IDs) that can be associated with a user, based on their interactions with a website, app, or platform.
Platforms like Facebook or Google have their own identity graphs, but the context of this article is to build an identity graph based on your first-party sources, so the identity graph will be specific to your organization.
Different types of identifiers
The mapping table we are creating in this blog will relate all the (configured) default and custom user IDs to a single user. Generic identifiers will be of the following types:
As we’re basically grouping all online behavior based on cookie IDs (device/browser) to (deterministic) user identifiers such as a customer ID, online behavior can be joined with other data sets like CRM data, marketing cloud, and service cloud data. Consent management is the most important aspect of any behavior tracking of customers.
There are some things you should keep in mind when working with online behavioral data in GA4:
- Devices and browsers can be shared between multiple users. You can’t be 100% sure that you’re dealing with the same user for every session in Google Analytics 4. Devices/browsers can be shared for example.
- Practical example: when customer #1 logs in on browser #1, a customer ID is tied to the cookie ID of browser #1. If that same browser is used by another user, this behavior is tied to customer #1 as well.
- IDs can be shared (unintentionally). For example, sharing links with click-IDs (?gclid=) or some other form of user-specific identifiers (commonly used in email links)
- Cookies (device IDs) can be deleted. Both by tracking prevention mechanisms such as Apple’s ITP or manually.
Use cases & business value
The identity graph can be used to;
- Join historical (browsing) behavior, thus giving insights into the path leading to a conversion
- Provide insights into cross-device user behavior (and attribution)
- Input for modeling like the purchase or churn predictions
- Joining CRM/Service/commerce/marketing or other data sources with online behavior
GA 4 BigQuery– Prerequisites & definitions
End user identifiers that are collected in GA4:
- The
user_pseudo_id
is the device/browser ID identifier and is collected automatically. - You need to ‘manually’ populate the
user_id
variable with a business-specific user identifier like a customer ID. This identifier can tie together cross-device and cross-browser behavior. Not collected automatically. - Additionally, you could add other identifiers in custom parameters such as secondary customer IDs, lead IDs, or click IDs (and make sure you have the appropriate user consent).
- The e-commerce.transaction_id parameter is populated when you use the e-commerce parameters of GA4 within the purchase event.
Mapping table details
The mapping table consists of two parts:
- Collecting all identifiers > Collecting all the configured user identifiers hidden away within the events of the GA4 dataset. The query has a variable called
lookback_window
to set the window in which to look for identifiers. You could query the complete dataset every time (resource intensive) or daily update a table with this data. - Grouping the identifiers > Make one identifier the unique key (in this tutorial, the customer ID). Other identifiers will be aggregated into an array using the
ARRAY_AGG
function, including the timestamp of the last recognition.
Query: Mapping Table
This query will collect all the (configured) user IDs within the GA4 dataset for all the configured user-ID and group the results on the customer_id
.
{
"customer_id": "123",
"ga_client_id": [
{ "id": "789", "timestamp": "123123" },
{ "id": "189", "timestamp": "123123" }
],
"customer_id_secondary": [
{ "id": "897", "timestamp": "123123" }
],
"gclid": [
{ "id": "89134", "timestamp": "123123" },
{ "id": "72347", "timestamp": "123123" }
],
"...": []
}
Query: Unique ID-specific match table
Based on the result of the previous table, you could create a match table that can quickly provide a customer ID for a device ID (user_pseudo_id
).
select
ga_id.id AS ga_client_id,
customer_id
from `<your-project>.<your-dataset>.<your-mapping-table>`,
unnest(ga_client_id) as ga_id
group by 1, 2
Summary & next steps
The above example can serve as input for a lot of interesting use cases as mentioned earlier. Make sure to have appropriate consent.
- Group mapping table on other user identifiers (for example on device ID)
- Automate the generation of the mapping table. As mentioned, my advice would be to add new identification events to a master table (daily) and from there build your mapping table
- Use the mapping table to create a single customer view or single user view, aggregating data into a user profile, which can be used for targeting or personalization.