Customers that have their own data warehouse often want data pushed into an AWS S3 bucket, from which they can load it into Snowflake, Databricks, Redshift, Athena, etc.
Many users set up multiple exports:
- an initial export to match existing data tables or exports from other rank trackers, to maintain reporting continuity during a transition period
- a second full export to opt into richer Nozzle data
Table of Contents
- Choose data format
- Define filename template
- Create an S3 bucket
- Create a custom role
- Submit bucket + role details
- What's next?
Data Format
The best way to keep internal reports synced with Nozzle is to define two separate export streams:
- Keywords: these are configured upstream. They could be joined into the rankings data for a single export, but that causes problems because keyword groups / tags inevitably change, and you usually want those changes reflected historically. By keeping keyword data separate, we can replace it as keywords are added/removed and as keyword groups evolve, and you can join it to rankings at query time (see the sketch below).
- Rankings: this is the raw SERP data, itself split into two distinct sections:
- SERP data: e.g. search volume, search intent, etc., that only occur once per SERP
- Result data: this is the bulk of the data, in a nested array of objects, with sub-fields containing specifics for each result
Keyword configuration is only exported when modified in the UI / API, while rankings are exported whenever they are collected and parsed.
You can specify whether to export all projects or just specific projects.
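For example, once both streams land in your warehouse or lake, the join can happen at query time so keyword groups always reflect their current state. Below is a minimal pandas sketch; the file paths are hypothetical local copies of one export file from each stream.

```python
import pandas as pd

# Hypothetical local copies of one export file from each stream
keywords = pd.read_parquet("keywords--2024-04-15--seocompanies-000000000000.parquet")
rankings = pd.read_parquet("rankings--2024-04-15--seocompanies-000000000014.parquet")

# Join rankings to the current keyword configuration at query time,
# so keyword_groups always reflect their latest state in reports.
joined = rankings.merge(
    keywords[["keyword_id", "phrase", "device", "location", "keyword_groups"]],
    on="keyword_id",
    how="left",
)
print(joined.head())
```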
Keywords
Primary Key: keyword_id
Keywords are not exported incrementally as they change; instead, the most recent copy is exported whenever there were any changes in the prior day. Replace your existing table with the new keyword data on each export.
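Because each keywords file is a full snapshot, a straightforward load pattern is to drop and recreate the table on every export. A rough sketch, using DuckDB as a stand-in for your warehouse and a hypothetical file path:

```python
import duckdb

# DuckDB is used here as a stand-in warehouse; the same pattern applies to
# Snowflake, Redshift, etc. The file path below is hypothetical.
con = duckdb.connect("warehouse.duckdb")

path = "keywords--2024-04-16--seocompanies-000000000000.parquet"

# Replace the whole keywords table rather than appending, since each
# export is a full copy of the current keyword configuration.
con.execute(f"CREATE OR REPLACE TABLE keywords AS SELECT * FROM read_parquet('{path}')")
print(con.execute("SELECT COUNT(*) FROM keywords").fetchone())
```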
Below is a description of all the fields in the keywords export. With the exception of `keyword_groups` and `adwords_criteria_id`, all other fields are immutable. Keyword groups can and will change regularly, as they are edited in the UI by SEO teams. The criteria id rarely changes, but sometimes Google will deprecate geotargets and replace them with new ones.
| Field Name | Description | Type |
| --- | --- | --- |
| keyword_id | PK: represents the unique combination of phrase, locale, and device | INTEGER |
| phrase_id | unique id for each phrase | INTEGER |
| phrase | keyword search term / phrase | STRING |
| device | Desktop, iPhone, Android (mobile) | STRING |
| device_code | 1-character abbreviation (d, i, m) | STRING |
| locale_id | represents the unique combination of engine, language, and location | INTEGER |
| engine | search engine (currently only Google) | STRING |
| engine_code | 1-character abbreviation (g) | STRING |
| language | full language name | STRING |
| language_code | 2-letter abbreviation per ISO 639-1, except Chinese, which uses the non-standard zht (traditional) and zhs (simplified) | STRING |
| location_id | represents a specific place | INTEGER |
| location_type | type defined by Google Ads geotargets, most commonly: Country, City, Postal Code | STRING |
| location | empty when geo-targeting at the country level, otherwise the name of the location | STRING |
| country | full country name | STRING |
| country_code | 2-letter abbreviation per ISO 3166-1 | STRING |
| adwords_criteria_id | locations are mapped to Google Ads geotargets | INTEGER |
| keyword_groups | user-defined tags, used in reporting | REPEATED STRING |
Here is a list of supported export file formats, including sample keyword data for testing and evaluation. We recommend Parquet, as it includes the full schema, has the best compression, and is widely supported by query engines. Avro and newline-delimited JSON are also available.
| File Type | Compression | Filesize | Full Sample Data |
| --- | --- | --- | --- |
| Parquet | ZSTD | 60 KB | download |
| Avro | SNAPPY | 240 KB | download |
| JSON (newline delimited) | GZIP | 125 KB | download |
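To evaluate a sample, you can inspect its schema and a few rows locally. A small pyarrow sketch, assuming a downloaded sample saved under a hypothetical filename:

```python
import pyarrow.parquet as pq

# Hypothetical filename for a downloaded sample file
pf = pq.ParquetFile("keywords-sample.parquet")

# Parquet embeds the full schema, so no external schema definition is needed
print(pf.schema_arrow)
print("rows:", pf.metadata.num_rows)

# Peek at the first few rows
print(pf.read().to_pandas().head())
```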
Rankings
Primary Key: ranking_id
Occasionally we will re-parse a SERP. The `ranking_id` is unchanged, so only use the version with the most recent `inserted_at` in reports. Unlike keywords, rankings exports are incremental, so only new or updated SERPs are exported.
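Because a re-parse re-exports the same `ranking_id` with a newer `inserted_at`, deduplicate before reporting. A minimal pandas sketch, assuming the incremental export files have been downloaded into a local `rankings/` directory:

```python
import pandas as pd

# Assumes the incremental export files sit in a local directory;
# pandas (via pyarrow) can read a directory of parquet files as one dataset.
rankings = pd.read_parquet("rankings/")

# Keep only the most recently inserted version of each SERP
latest = (
    rankings.sort_values("inserted_at")
            .drop_duplicates(subset="ranking_id", keep="last")
)
```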
There are close to 1,000 fields in the rankings table, so this table does not attempt to list them all. Below are the key fields most users will be interested in, both for metrics creation and for data pipeline setup.
| Field Name | Description | Type |
| --- | --- | --- |
| ranking_id | PK: represents a single SERP, equivalent to keyword_id + requested | INTEGER |
| inserted_at | timestamp when this row was last updated | TIMESTAMP |
| keyword_id | join to the keywords table | INTEGER |
| requested | timestamp the SERP was scheduled to be collected. Most often this will be at 00:00 (midnight), unless the schedule was hourly, or set for a specific time during the day | TIMESTAMP |
| keyword_metrics.adwords_search_volume | search volume for the keyword, refreshed monthly. For supported location types, this search volume is localized, meaning it is specific to the city, region, etc. | INTEGER |
| keyword_metrics.country_adwords_search_volume | search volume for this term for the entire country. If the location type is Country, this will be the same as keyword_metrics.adwords_search_volume | INTEGER |
| results | this nested field contains the majority of the data, with a single row per result | REPEATED STRUCT |
| results.rank | blended rank for all unpaid results (metrics definitions). This is null for paid results. | INTEGER |
| results.item_rank | the order the result appeared in a pack, like a carousel, sitelinks, top stories, etc. | INTEGER |
| results.paid_adjusted_rank | blends paid results into a combined rank | INTEGER |
| results.url.domain | ranking domain | STRING |
| results.url.url | ranking url | STRING |
| results.title.text | the main title text, usually the link anchor text | STRING |
| results.description.text | the result's description | STRING |
| results.nozzle_metrics.click_through_rate | calculated CTR for the result, if any | FLOAT |
| results.nozzle_metrics.estimated_traffic | keyword_metrics.adwords_search_volume * results.nozzle_metrics.click_through_rate | INTEGER |
| results.measurements.pixels_from_top | pixel distance from the top of the SERP to the top of the result | INTEGER |
| results.measurements.percentage_of_viewport | Above the Fold % | FLOAT |
| results.measurements.percentage_of_dom | SERP % | FLOAT |
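Since most of the data lives in the nested results array, a common first step is to flatten it to one row per result. A rough pandas sketch, assuming a hypothetical local rankings file read with the pyarrow engine (where each results cell comes back as a list of dicts):

```python
import pandas as pd

# Hypothetical local copy of one rankings export file
rankings = pd.read_parquet("rankings-sample.parquet")

# One row per SERP -> one row per result
exploded = (
    rankings[["ranking_id", "keyword_id", "requested", "results"]]
    .explode("results")
    .dropna(subset=["results"])   # SERPs with no results produce NaN after explode
    .reset_index(drop=True)
)

# Flatten the nested result struct into dotted columns (url.domain, title.text, ...)
results = pd.json_normalize(exploded["results"].tolist())
flat = pd.concat([exploded.drop(columns="results"), results], axis=1)

print(flat[["ranking_id", "rank", "url.domain", "title.text"]].head())
```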
There is purposefully no `result_type` column in the default export. Historically, that has caused breaking changes in downstream dashboards as Google updates SERP features. Instead, there are lots of boolean fields, e.g. `featured_snippet.is_featured_snippet`, which you can combine with other fields to target exactly what you are looking for.
We understand that it is often easier to work with a single result type, and we can work with you to define that column, targeting the SERP features you are most interested in. That also puts you in control of breaking changes: you can update those definitions to be more specific and coordinate with your custom dashboards to prevent breakages.
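If you want to derive such a column yourself, you can collapse the boolean flags in your own pipeline. In the sketch below, only `featured_snippet.is_featured_snippet` comes from the documented schema; the other boolean column names and the precedence order are hypothetical placeholders for whatever features you care about.

```python
import numpy as np
import pandas as pd

def derive_result_type(flat: pd.DataFrame) -> pd.Series:
    # Order matters: the first matching condition wins, so list the most
    # specific features first. Only is_featured_snippet is documented above;
    # the other boolean columns here are hypothetical examples.
    conditions = [
        flat["featured_snippet.is_featured_snippet"].fillna(False).astype(bool),
        flat["local_pack.is_local_pack"].fillna(False).astype(bool),  # hypothetical field name
        flat["rank"].isna(),  # paid results have no organic rank
    ]
    choices = ["featured_snippet", "local_pack", "paid"]
    return pd.Series(np.select(conditions, choices, default="organic"), index=flat.index)

# `flat` is the flattened per-result DataFrame from the earlier sketch:
# flat["result_type"] = derive_result_type(flat)
```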
As Google adds new features, we semi-regularly add new fields to the rankings schema. All changes are backwards compatible, but make sure that your pipeline isn't going to break when new fields are added.
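One simple way to keep a pipeline resilient to new fields is to select an explicit column list instead of reading everything, so newly added fields are ignored until you opt in. A small pyarrow sketch with a hypothetical filename:

```python
import pyarrow.parquet as pq

# Pin the top-level columns your pipeline depends on; new fields that Nozzle
# adds later are simply ignored until you add them here.
WANTED = ["ranking_id", "inserted_at", "keyword_id", "requested", "results"]

table = pq.read_table("rankings-sample.parquet", columns=WANTED)  # hypothetical file
df = table.to_pandas()
```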
Here is a list of supported export file formats, including sample rankings data for testing and evaluation. We recommend Parquet, as it includes the full schema, has the best compression, and is widely supported by query engines. Avro and newline-delimited JSON are also available.
| File Type | Compression | Filesize | Full Sample Data |
| --- | --- | --- | --- |
| Parquet | ZSTD | 2.0 MB | download |
| Avro | SNAPPY | 4.0 MB | download |
| JSON (newline delimited) | GZIP | 3.2 MB | download |
Define Filename Template
By default, we set up a daily export: BigQuery checks for any new data since the last export and writes only those rows to output files. We will need to set up two filename templates, which can match almost any internal naming convention. Templates can also be used in conjunction with AWS permissions to restrict access to a prefix inside a bucket. Here are some examples users have set up:
- `rankings--2024-04-15--seocompanies-000000000014.parquet` (default)
- `exports/year=2024/month=04/day=05/rankings-000000000014.jsonl.gz`
- `dt=2024-08-14/rankings--000000000014.avro`
- `batch/schema=seo.NozzleKeyword/contentType=parquet/year=2024/month=04/day=05/hour=00/000000000014.parquet`
If not specified, the default filename templates are:
- Keywords: `keywords--YYYY-MM-DD--{project_slug}-{file number}.{extension}`
- Rankings: `rankings--YYYY-MM-DD--{project_slug}-{file number}.{extension}`
If you want to customize the templates, here are the template fields available to you:
- Year: default is 4 digits (`YYYY`), but can be customized
- Month: default is 2 digits (`MM`), but can be customized
- Day: default is 2 digits (`DD`), but can be customized
- Project Slug: useful for differentiating output from multiple projects
- Content Type: one of parquet, jsonl, or avro
- Content Encoding: one of zstd, gzip, or snappy
The date in the filename references the date the export was created, not the date of the SERPs inside the file. As called out in the rankings section above, data may arrive late, due to either Google introducing new SERP features / layout changes, or Nozzle initiating a re-parse of SERPs to improve or update data from the original parse.
Since BigQuery is creating the export files, we have minimal control over how they are emitted. There is no guarantee that there will be a consistent number of rows per file, and while unlikely, it is even possible for files to be created that have zero rows of data.
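Putting the naming details together, a downstream job might list new objects, parse the default rankings template, and tolerate empty files. A rough boto3 sketch; the bucket name and prefix are hypothetical, credentials are assumed to be configured, and pagination is omitted.

```python
import io
import re

import boto3
import pandas as pd

# Matches the default template: rankings--YYYY-MM-DD--{project_slug}-{file number}.{extension}
FILENAME_RE = re.compile(
    r"rankings--(?P<export_date>\d{4}-\d{2}-\d{2})--(?P<project_slug>.+)-(?P<file_number>\d+)\.parquet$"
)

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="your-bucket-name", Prefix="exports/")  # hypothetical bucket/prefix

for obj in resp.get("Contents", []):
    match = FILENAME_RE.search(obj["Key"])
    if not match:
        continue
    body = s3.get_object(Bucket="your-bucket-name", Key=obj["Key"])["Body"].read()
    df = pd.read_parquet(io.BytesIO(body))
    if df.empty:
        # Zero-row files are possible; just skip them
        continue
    # ... load df into your warehouse, keyed on the parsed export_date / project_slug
```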
Create an S3 Bucket
To receive files from Nozzle, create a new bucket dedicated to this data pipeline, to prevent any conflicts with other files or users. You can also use an existing bucket if desired. See the AWS documentation below, and keep track of the bucket name and region, as you'll need to submit both to us to start the pipeline running.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html
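If you prefer to script it instead of using the console steps in the guide, bucket creation is a single boto3 call. A sketch with a hypothetical bucket name and region; note that buckets in us-east-1 omit the CreateBucketConfiguration argument.

```python
import boto3

BUCKET = "your-company-nozzle-exports"  # hypothetical name
REGION = "us-west-2"                    # hypothetical region

s3 = boto3.client("s3", region_name=REGION)
s3.create_bucket(
    Bucket=BUCKET,
    CreateBucketConfiguration={"LocationConstraint": REGION},  # omit this for us-east-1
)
# Remember the bucket name and region; both are submitted to Nozzle below.
```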
Create a Custom AWS Role
Nozzle runs on Google Cloud, so we need permissions in order to access your bucket. By creating a role, you can grant us the minimum permissions possible for security purposes. We also take advantage of a trust policy, where Google Cloud and AWS coordinate to ensure that only Nozzle can assume that role.
Below are the prerequisites for role creation: the trust policy and permissions. The only change that needs to be made is to replace `your-bucket-name/optional-prefix/` with the name of the bucket created in the prior step (the prefix is optional).
For a detailed walkthrough, follow the AWS documentation guide. Copy the full role ARN, which should look like `arn:aws:iam::1234567:role/NozzleExportRole`, to submit to Nozzle in the form at the end of the page.
https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-idp_oidc.html#idp_oidc_Create
AWS Permissions
AWS Trust Policy
`118316257528160653604` is the Google Cloud service account id used by Nozzle for the sole purpose of writing export files to AWS.
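The authoritative policy documents come from Nozzle and the AWS guide above. As a hedged sketch of the shape they take, here is how the role might be created with boto3; the specific S3 actions listed are an assumption to confirm with Nozzle, and the bucket name is a placeholder.

```python
import json

import boto3

BUCKET = "your-bucket-name"  # replace with the bucket created above (optional prefix allowed)

# Trust policy: only the Nozzle Google Cloud service account listed above
# can assume the role via web identity federation.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Federated": "accounts.google.com"},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {"accounts.google.com:sub": "118316257528160653604"}
            },
        }
    ],
}

# Permissions: write-oriented access to the export bucket. The exact action
# list here is an assumption; confirm the required permissions with Nozzle.
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:AbortMultipartUpload", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
        }
    ],
}

iam = boto3.client("iam")
role = iam.create_role(
    RoleName="NozzleExportRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.put_role_policy(
    RoleName="NozzleExportRole",
    PolicyName="NozzleExportWrite",
    PolicyDocument=json.dumps(permissions_policy),
)
print(role["Role"]["Arn"])  # submit this ARN to Nozzle
```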
Submit AWS Bucket + Role Details
Once you have created a bucket and a role, submit them to us with this form.
What's Next?
Once your data is in S3, there are myriad ways to consume it, route it to your own data warehouse, or even query the raw files directly. Here are some links for ways to consume that data in an event-driven way, so that you can process the data as soon as Nozzle pushes it into your bucket. Many of these options work best with Parquet as the output format, which is why we strongly recommend it.
- Tutorial: Using an Amazon S3 trigger to invoke a Lambda function
  https://docs.aws.amazon.com/AmazonS3/latest/userguide/ways-to-add-notification-config-to-bucket.html
- Walkthrough: Configuring a bucket for notifications (SNS topic or SQS queue)
  https://docs.aws.amazon.com/AmazonS3/latest/userguide/ways-to-add-notification-config-to-bucket.html
- Snowflake: Automating Snowpipe for Amazon S3
  https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-s3
- Build an ETL process for Amazon Redshift using Amazon S3 Event Notifications and AWS Step Functions
  https://aws.amazon.com/blogs/big-data/build-an-etl-process-for-amazon-redshift-using-amazon-s3-event-notifications-and-aws-step-functions/
- Analyzing Data in S3 using Amazon Athena
  https://aws.amazon.com/blogs/big-data/analyzing-data-in-s3-using-amazon-athena/
- Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake
  https://aws.amazon.com/blogs/big-data/use-apache-iceberg-in-your-data-lake-with-amazon-s3-aws-glue-and-snowflake/
- Databricks: Onboard data from Amazon S3
  https://docs.databricks.com/en/ingestion/cloud-object-storage/onboard-data.html
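As a minimal illustration of the Lambda route from the first link, a handler might look like the sketch below. It assumes the function has read access to the bucket, that pandas/pyarrow are packaged with the function or provided via a layer, and it leaves the downstream load step as a placeholder.

```python
import io
import urllib.parse

import boto3
import pandas as pd  # pandas/pyarrow must be bundled with the function or supplied via a layer

s3 = boto3.client("s3")

def handler(event, context):
    """Triggered by S3 ObjectCreated events; loads each new export file."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        df = pd.read_parquet(io.BytesIO(body))
        if df.empty:
            continue  # zero-row files are possible; skip them

        # Placeholder: push df into your warehouse (Snowflake, Redshift, etc.)
        print(f"loaded {len(df)} rows from s3://{bucket}/{key}")
```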
If you have any questions about how to query the data once you've consumed it, or what types of materialized views might be useful, please reach out; we'd love to help you get the most out of the data.