Big Query Schema for GA4 events tracking
Big Query Schema for GA4 events tracking

Big Query Schema for GA4 events tracking

The most significant improvement in GA4 w.r.t Universal Analytics is its event-based data model and free BigQuery connection to export data to other systems which open up many possibilities.

Big Query Schema is important as the data is event-level, meaning that there’s one record per event, you are going to need formulas for getting the basic stuff you can see in the GA4 UI. As the SQL query always starts from the event level, it means that you will often need the same formulas again and again.

Big Query Schema

Table of Contents

Normalization Nesting an event parameter into a column

The GA4 exports data in a nested form for storing the data. “A BigQuery table contains individual records organized in rows. Each record is composed of columns (also called fields). Every table is defined by a schema that describes the column names, data types, and other information.”

The events parameters in the GA4 export are nested like this. Event_params is a property of the event record and contains key-value pairs for all the related parameters. The key field is the name of the event parameter, and a value field is an object containing the parameter’s value in one of its four fields: string_value, int_value, float_value, or double_value.

To extract the event parameter’s value and turn it into a column, you must write another select statement inside your query on the row level. Then, using the normalization function, you can access the contents of event_paramters.

select
  event_name,
  (select value.string_value from unnest(event_params) where key = 'page_location') as page_location
from
  `<your-project>.<ga4-dataset>.<ga4-table>`

The above example returns the value from the string_value field. If you want to retrieve the value of the numeric parameter, you should pick the value from one of the fields instead.

Setting up Dynamic Query Date Range

The GA4 BigQuery export stores the event data in daily exports tables. Each table name starts with “events_” and ends with that day’s date as the suffix. You can use a wildcard with the table suffix to input a dynamic date range into the query.

In the below query, it counts all events b/w the current date minus seven days and the previous date.

select
  count(*) as events
from
  `<your-project>.<ga4-dataset>.events_*`
where
  _table_suffix between cast(
    current_date() -7 as string format 'YYYYMMDD'
  )
  and cast(
    current_date() -1 as string format 'YYYYMMDD'
  )

Note that the table suffix filter is combining strings, not actual dates. If you set it to include, for example, all tables with suffixes greater than “20220722” you’ll also end up including all the intraday tables. That’s why it’s good to define the filter using a between clause.

You can always check what’s happening by running the below query

select
  'intraday_20220218' > '20220217'

Extraction of page_path info from page_location

The default GA4 page_location field contains the page’s full URL, including any query parameters. However, to make it a bit more readable and more useful in reports, you can exclude the protocol and hostname as well as any query parameters from the string.

select
  regexp_replace(
  regexp_replace(
    (
      select
        p.value.string_value
      from
        unnest(event_params) as p
      where
        p.key = 'page_location'
    ),
    r'^https?://[^/]+',
    ''
  ),
  r'[\?].*',
  ''
) as page_path
from
  `<your-project>.<ga4-dataset>.events_*`

Extraction of Query Parameters from URL

In the previous example, we took out the query string from the page_location. However, the query string can contain valuable information as well. In most cases, though, it’s more useful as a separate event dimension or table column.

Below, I show two examples of how to extract query parameters: The simpler one just extracts the selected parameter value from the page_location string. The other method reads the page’s query parameters into a nested array of key-value pairs similar to the default event_params array. From this array, you can unnest the parameter of choice later.

Method1: Extracting the selected parameter using regexp_sxtract

select
  -- extract utm_source from page_location
  regexp_extract(
    (select value.string_value from unnest(event_params) where key = 'page_location'), 
    r'(?:\?|&)utm_source=([^&]+)'
  ) as utm_source
from
  `<your-project>.<ga4-dataset>.<ga4-table>`

Method2: Reading all parameters into an array

with url_params as (
  select
    -- read all query parameters into an array
    array(
      (
        select
          as struct split(keyval, '=') [safe_offset(0)] as key,
          split(keyval, '=') [safe_offset(1)] as value
        from
          unnest(
            split(
              split((select value.string_value from unnest(event_params) where key = 'page_location'), '?') [safe_offset(1)],
              '&'
            )
          ) as keyval
      )
    ) as url_query,
  from
    `<your-project>.<ga4-dataset>.<ga4-table>`
)
select
  *,
  -- extract utm_source
  (select value from unnest(url_query) where key = 'utm_source') as utm_source
from
  url_params

Defining and calculating sessions

All GA4 events contain a ga_session_id parameter nested inside event parameters. However, this parameter is not a unique session id. Instead, it’s the session’s start time saved as a Unix timestamp. You can create a unique session id by combining the ga_session_id parameter with user_pseudo_id.

select
  concat(
    user_pseudo_id,
    (
      select
        value.int_value
      from
        unnest(event_params)
      where
        key = 'ga_session_id'
    )
  ) as unique_session_id
from
  `<your-project>.<ga4-dataset>.<ga4-table>`

However, calculating the number of sessions using this unique session id will not exactly match the numbers in the GA4 UI. This is because GA4 uses the session_start event instead for session calculation.

If you compare the two, you might notice that the numbers can be a bit off: The same unique session id can have multiple session_start events. That’s why I’ve opted to disregard the session_start events and not try to match the numbers in the UI.

Check how many session_start events each unique_session_id has

select
  concat(
    user_pseudo_id,
    (
      select
        value.int_value
      from
        unnest(event_params)
      where
        key = 'ga_session_id'
    )
  ) as unique_session_id,
  sum(if(event_name = 'session_start', 1, 0)) as session_start_events
from
  `<your-project>.<ga4-dataset>.<ga4-table>`
group by
  1
order by
  2 desc

Turning a hit-level dimension into session-level

At the time of writing this, GA4 doesn’t yet have session-level custom dimensions. In BigQuery, that’s not an issue; you can take any event parameter and turn it into a session or user-level dimension.

Session-level custom dimensions are on the product roadmap for GA4.

The best way to make an event-level field session-level is by utilizing analytic functions. An analytic function uses a window frame to define a group of rows and compute a value over that group. The group of rows could be all rows that share the same unique_session_id (defined earlier).

For example, let’s look at the GA4 default session_engaged dimension. This dimension tells if GA4 measured an engagement during the session or not. However, the dimension is not included in all of the session’s events and the value can change from not engaged to engaged between the session’s events. If you’d like to show all events from engaging sessions, the dimension needs to be converted to the session level first.

Turning any event-level parameter into session-level

with events as (
  select
    concat(
      user_pseudo_id,
      (
        select
          value.int_value
        from
          unnest(event_params)
        where
          key = 'ga_session_id'
      )
    ) as unique_session_id,
    (select value.string_value from unnest(event_params) where key = 'session_engaged') as event_session_engaged
  from
    `<your-project>.<ga4-dataset>.<ga4-table>`
)
select
  unique_session_id,
  event_session_engaged,
  -- the same dimension as session-level included with all session's events
  max(event_session_engaged) over(partition by unique_session_id) as session_session_engaged
from
  events

This kind of a query is also useful when working with custom event parameters that don’t get included in any of the automatically generated events, like user_engagement

In the example, we look for the max value of the session_engaged parameter over the unique_session_id. In other cases, for example, you might want to return the last first value instead. You can reference the BigQuery documentation for other options.

If you want to make the dimension user-level instead, you can use user_pseudo_id in the window’s partition field.

Landing page sessions

We can use an analytic function, similar to the previous example, to get the session’s landing page. However, this time, instead of returning the maximum event parameter value of the session, we would return the first one.

with events as (
  select
    concat(
      user_pseudo_id,
      (
        select
          value.int_value
        from
          unnest(event_params)
        where
          key = 'ga_session_id'
      )
    ) as unique_session_id,
    event_name,
    (select value.string_value from unnest(event_params) where key = 'page_location') as page_location,
    event_timestamp
  from
    `<your-project>.<ga4-dataset>.<ga4-table>`
)
select
  unique_session_id,
  -- session's first page_location retrieved from any event
  first_value(page_location) over(
    partition by unique_session_id
    order by
      event_timestamp asc
  ) as session_landing_page,
  -- session's first page_location only retrieved from page_view events
  first_value(
    if(
      event_name = 'page_view',
      page_location,
      null
    ) ignore nulls
  ) over(
    partition by unique_session_id
    order by
      event_timestamp asc
  ) as session_landing_page2
from
  events

Previous and next page path

Previous and next page paths can help visualize how the visitors navigate the site’s content and how the different pages are related. Again, we can utilize an analytic function.

with events as (
  select
    concat(
      user_pseudo_id,
      (
        select
          value.int_value
        from
          unnest(event_params)
        where
          key = 'ga_session_id'
      )
    ) as unique_session_id,
    event_name,
    event_timestamp,
    regexp_replace(
  regexp_replace(
    (
      select
        p.value.string_value
      from
        unnest(event_params) as p
      where
        p.key = 'page_location'
    ),
    r'^https?://[^/]+',
    ''
  ),
  r'[\?].*',
  ''
) as page_path
  from
    `<your-project>.<ga4-dataset>.<ga4-table>`
)
select
  unique_session_id,
  event_name,
  page_path,
  event_timestamp,
  -- look for the previous page_path
  if(
    event_name = 'page_view',
    coalesce(
      last_value(
        if(event_name = 'page_view', page_path, null) ignore nulls
      ) over(
        partition by unique_session_id
        order by
          event_timestamp asc rows between unbounded preceding
          and 1 preceding
      ),
      '(entrance)'
    ),
    null
  ) as previous_page,
  -- look for the next page_path
  if(
    event_name = 'page_view',
    coalesce(
      first_value(
        if(event_name = 'page_view', page_path, null) ignore nulls
      ) over(
        partition by unique_session_id
        order by
          event_timestamp asc rows between 1 following
          and unbounded following
      ),
      '(exit)'
    ),
    null
  ) as next_page
from
  events

In this example, we define the analytic function’s window with the previous page’s path to consist of all the events before the current event. Then, from that window, the formula extracts the page path of the events. However, we only want the page path from the page_view events, so other event_names are set to return nulls. The last_value function, used with the window, ignores those null values.

This example query returns the preceding and following page paths only if the event itself is a page_view event. It gets a bit more complex if you want to return the same for all event types. Then, you’ll have to take into account that events on the same page can occur before or after the page_view event.

Conclusion

When working with GA4 data in BigQuery, the SQL queries very often share a lot of the same formulas when starting from the event-level data. This article was a collection of some of the most common SQL formulas I find myself repeatedly using. Please let me know in the comments if I missed something important.

(Visited 816 times, 1 visits today)

One comment

  1. Pingback: Events Tracking from GTM-Server Side to BigQuery - Analytics Smart

Leave a Reply

Your email address will not be published. Required fields are marked *