Customers that have their own data warehouse often want data pushed into an AWS S3 bucket, from which they can load it into Snowflake, Databricks, Redshift, Athena, etc.
Many users set up multiple exports:
- an initial export to match existing data tables or exports from other rank trackers, to maintain reporting continuity during a transition period
- a second full export to opt into richer Nozzle data
Table of Contents
- Choose data format
- Define filename template
- Create an S3 bucket
- Create a custom role
- Submit bucket + role details
- What's next?
Data Format
The best way to keep internal reports synced with Nozzle is to define two separate export streams:
- Keywords: these are configured upstream. They could be joined into the rankings data for a single export, but that causes problems because keyword groups / tags inevitably change, and you usually want those changes reflected historically. By keeping keyword data separate, we can replace it as keywords are added/removed and as keyword groups evolve, and you can join it to rankings at query time (see the sketch below).
- Rankings: this is the raw SERP data, itself split into two distinct sections:
- SERP data: e.g. search volume, search intent, etc., that only occur once per SERP
- Result data: this is the bulk of the data, in a nested array of objects, with sub-fields containing specifics for each result
Keyword configuration is only exported when modified in the UI / API, while rankings are exported whenever they are collected and parsed.
You can specify whether to export all projects or just specific projects.
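For example, once both streams land in your warehouse or lake, the join can happen at query time so keyword groups always reflect their current state. Below is a minimal pandas sketch; the file paths are hypothetical local copies of one export file from each stream.

```python
import pandas as pd

# Hypothetical local copies of one export file from each stream
keywords = pd.read_parquet("keywords--2024-04-15--seocompanies-000000000000.parquet")
rankings = pd.read_parquet("rankings--2024-04-15--seocompanies-000000000014.parquet")

# Join rankings to the current keyword configuration at query time,
# so keyword_groups always reflect their latest state in reports.
joined = rankings.merge(
    keywords[["keyword_id", "phrase", "device", "location", "keyword_groups"]],
    on="keyword_id",
    how="left",
)
print(joined.head())
```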
Keywords
Primary Key: keyword_id
Keywords are not exported incrementally as they change; instead, the most recent copy is exported whenever there were any changes in the prior day. Replace your existing table with the new keyword data on each export.
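Because each keywords file is a full snapshot, a straightforward load pattern is to drop and recreate the table on every export. A rough sketch, using DuckDB as a stand-in for your warehouse and a hypothetical file path:

```python
import duckdb

# DuckDB is used here as a stand-in warehouse; the same pattern applies to
# Snowflake, Redshift, etc. The file path below is hypothetical.
con = duckdb.connect("warehouse.duckdb")

path = "keywords--2024-04-16--seocompanies-000000000000.parquet"

# Replace the whole keywords table rather than appending, since each
# export is a full copy of the current keyword configuration.
con.execute(f"CREATE OR REPLACE TABLE keywords AS SELECT * FROM read_parquet('{path}')")
print(con.execute("SELECT COUNT(*) FROM keywords").fetchone())
```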
Below is a description of all the fields in the keywords export. With the exception of `keyword_groups` and `adwords_criteria_id`, all other fields are immutable. Keyword groups can and will change regularly, as they are edited in the UI by SEO teams. The criteria id rarely changes, but sometimes Google will deprecate geotargets and replace them with new ones.
| Field Name | Description | Type |
| --- | --- | --- |
| keyword_id | PK: represents the unique combination of phrase, locale, and device | INTEGER |
| phrase_id | unique id for each phrase | INTEGER |
| phrase | keyword search term / phrase | STRING |
| device | Desktop, iPhone, Android (mobile) | STRING |
| device_code | 1-character abbreviation (d, i, m) | STRING |
| locale_id | represents the unique combination of engine, language, and location | INTEGER |
| engine | search engine (currently only Google) | STRING |
| engine_code | 1-character abbreviation (g) | STRING |
| language | full language name | STRING |
| language_code | 2-letter abbreviation per ISO 639-1, except Chinese, which uses the non-standard zht (traditional) and zhs (simplified) | STRING |
| location_id | represents a specific place | INTEGER |
| location_type | type defined by Google Ads geotargets, most commonly: Country, City, Postal Code | STRING |
| location | empty when geo-targeting at the country level, otherwise the name of the location | STRING |
| country | full country name | STRING |
| country_code | 2-letter abbreviation per ISO 3166-1 | STRING |
| adwords_criteria_id | locations are mapped to Google Ads geotargets | INTEGER |
| keyword_groups | user-defined tags, used in reporting | REPEATED STRING |
Here is a list of supported export file formats, including sample keyword data for testing and evaluation. We recommend Parquet, as it includes the full schema, has the best compression, and is widely supported by query engines. Avro and newline-delimited JSON are also available.
| File Type | Compression | Filesize | Full Sample Data |
| --- | --- | --- | --- |
| Parquet | ZSTD | 60 KB | download |
| Avro | SNAPPY | 240 KB | download |
| JSON (newline delimited) | GZIP | 125 KB | download |
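To evaluate a sample, you can inspect its schema and a few rows locally. A small pyarrow sketch, assuming a downloaded sample saved under a hypothetical filename:

```python
import pyarrow.parquet as pq

# Hypothetical filename for a downloaded sample file
pf = pq.ParquetFile("keywords-sample.parquet")

# Parquet embeds the full schema, so no external schema definition is needed
print(pf.schema_arrow)
print("rows:", pf.metadata.num_rows)

# Peek at the first few rows
print(pf.read().to_pandas().head())
```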
Rankings
Primary Key: ranking_id
Occasionally we will re-parse a SERP. The `ranking_id` is unchanged, so only use the version with the most recent `inserted_at` in reports. Unlike keywords, rankings exports are incremental, so only new or updated SERPs are exported.
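Because a re-parse re-exports the same `ranking_id` with a newer `inserted_at`, deduplicate before reporting. A minimal pandas sketch, assuming the incremental export files have been downloaded into a local `rankings/` directory:

```python
import pandas as pd

# Assumes the incremental export files sit in a local directory;
# pandas (via pyarrow) can read a directory of parquet files as one dataset.
rankings = pd.read_parquet("rankings/")

# Keep only the most recently inserted version of each SERP
latest = (
    rankings.sort_values("inserted_at")
            .drop_duplicates(subset="ranking_id", keep="last")
)
```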
There are close to 1,000 fields in the rankings table, so this table does not attempt to list them all. Below are the key fields most users will be interested in, both for metrics creation and for data pipeline setup.
| Field Name | Description | Type |
| --- | --- | --- |
| ranking_id | PK: represents a single SERP, equivalent to keyword_id + requested | INTEGER |
| inserted_at | timestamp when this row was last updated | TIMESTAMP |
| keyword_id | join to the keywords table | INTEGER |
| requested | timestamp the SERP was scheduled to be collected. Most often this will be at 00:00 (midnight), unless the schedule was hourly, or set for a specific time during the day | TIMESTAMP |
| keyword_metrics.adwords_search_volume | search volume for the keyword, refreshed monthly. For supported location types, this search volume is localized, meaning it is specific to the city, region, etc. | INTEGER |
| keyword_metrics.country_adwords_search_volume | search volume for this term for the entire country. If the location type is Country, this will be the same as keyword_metrics.adwords_search_volume | INTEGER |
| results | this nested field contains the majority of the data, with a single row per result | REPEATED STRUCT |
| results.rank | blended rank for all unpaid results (metrics definitions). This is null for paid results. | INTEGER |
| results.item_rank | the order the result appeared in a pack, like a carousel, sitelinks, top stories, etc. | INTEGER |
| results.paid_adjusted_rank | blends paid results into a combined rank | INTEGER |
| results.url.domain | ranking domain | STRING |
| results.url.url | ranking url | STRING |
| results.title.text | the main title text, usually the link anchor text | STRING |
| results.description.text | the result's description | STRING |
| results.nozzle_metrics.click_through_rate | calculated CTR for the result, if any | FLOAT |
| results.nozzle_metrics.estimated_traffic | keyword_metrics.adwords_search_volume * results.nozzle_metrics.click_through_rate | INTEGER |
| results.measurements.pixels_from_top | pixel distance from the top of the SERP to the top of the result | INTEGER |
| results.measurements.percentage_of_viewport | Above the Fold % | FLOAT |
| results.measurements.percentage_of_dom | SERP % | FLOAT |
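Since most of the data lives in the nested results array, a common first step is to flatten it to one row per result. A rough pandas sketch, assuming a hypothetical local rankings file read with the pyarrow engine (where each results cell comes back as a list of dicts):

```python
import pandas as pd

# Hypothetical local copy of one rankings export file
rankings = pd.read_parquet("rankings-sample.parquet")

# One row per SERP -> one row per result
exploded = (
    rankings[["ranking_id", "keyword_id", "requested", "results"]]
    .explode("results")
    .dropna(subset=["results"])   # SERPs with no results produce NaN after explode
    .reset_index(drop=True)
)

# Flatten the nested result struct into dotted columns (url.domain, title.text, ...)
results = pd.json_normalize(exploded["results"].tolist())
flat = pd.concat([exploded.drop(columns="results"), results], axis=1)

print(flat[["ranking_id", "rank", "url.domain", "title.text"]].head())
```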
There is purposefully no `result_type` column in the default export. Historically, that has caused breaking changes in downstream dashboards as Google updates SERP features. Instead, there are lots of boolean fields, e.g. `featured_snippet.is_featured_snippet`, which you can combine with other fields to target exactly what you are looking for.
We understand that it is often easier to work with a single result type, and we can work with you to define that column, targeting the SERP features you are most interested in. That also puts you in control of breaking changes: you can update those definitions to be more specific and coordinate with your custom dashboards to prevent breakages.
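If you want to derive such a column yourself, you can collapse the boolean flags in your own pipeline. In the sketch below, only `featured_snippet.is_featured_snippet` comes from the documented schema; the other boolean column names and the precedence order are hypothetical placeholders for whatever features you care about.

```python
import numpy as np
import pandas as pd

def derive_result_type(flat: pd.DataFrame) -> pd.Series:
    # Order matters: the first matching condition wins, so list the most
    # specific features first. Only is_featured_snippet is documented above;
    # the other boolean columns here are hypothetical examples.
    conditions = [
        flat["featured_snippet.is_featured_snippet"].fillna(False).astype(bool),
        flat["local_pack.is_local_pack"].fillna(False).astype(bool),  # hypothetical field name
        flat["rank"].isna(),  # paid results have no organic rank
    ]
    choices = ["featured_snippet", "local_pack", "paid"]
    return pd.Series(np.select(conditions, choices, default="organic"), index=flat.index)

# `flat` is the flattened per-result DataFrame from the earlier sketch:
# flat["result_type"] = derive_result_type(flat)
```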
As Google adds new features, we semi-regularly add new fields to the rankings schema. All changes are backwards compatible, but make sure that your pipeline isn't going to break when new fields are added.
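One simple way to keep a pipeline resilient to new fields is to select an explicit column list instead of reading everything, so newly added fields are ignored until you opt in. A small pyarrow sketch with a hypothetical filename:

```python
import pyarrow.parquet as pq

# Pin the top-level columns your pipeline depends on; new fields that Nozzle
# adds later are simply ignored until you add them here.
WANTED = ["ranking_id", "inserted_at", "keyword_id", "requested", "results"]

table = pq.read_table("rankings-sample.parquet", columns=WANTED)  # hypothetical file
df = table.to_pandas()
```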
Here is a list of supported export file formats, including sample rankings data for testing and evaluation. We recommend Parquet, as it includes the full schema, has the best compression, and is widely supported by query engines. Avro and newline-delimited JSON are also available.
| File Type | Compression | Filesize | Full Sample Data |
| --- | --- | --- | --- |
| Parquet | ZSTD | 2.0 MB | download |
| Avro | SNAPPY | 4.0 MB | download |
| JSON (newline delimited) | GZIP | 3.2 MB | download |
Define Filename Template
By default, we set up a daily export: BigQuery checks for any new data since the last export and writes only those rows to output files. We will need to set up two filename templates, which can match almost any internal naming convention. Templates can also be used in conjunction with AWS permissions to restrict access to a prefix inside a bucket. Here are some examples users have set up:
- `rankings--2024-04-15--seocompanies-000000000014.parquet` (default)
- `exports/year=2024/month=04/day=05/rankings-000000000014.jsonl.gz`
- `dt=2024-08-14/rankings--000000000014.avro`
- `batch/schema=seo.NozzleKeyword/contentType=parquet/year=2024/month=04/day=05/hour=00/000000000014.parquet`
If not specified, the default filename templates are:
- Keywords: `keywords--YYYY-MM-DD--{project_slug}-{file number}.{extension}`
- Rankings: `rankings--YYYY-MM-DD--{project_slug}-{file number}.{extension}`
If you want to customize the templates, here are the template fields available to you:
- Year: default is 4 digits (`YYYY`), but can be customized
- Month: default is 2 digits (`MM`), but can be customized
- Day: default is 2 digits (`DD`), but can be customized
- Project Slug: useful for differentiating output from multiple projects
- Content Type: one of parquet, jsonl, or avro
- Content Encoding: one of zstd, gzip, or snappy
The date in the filename references the date the export was created, not the date of the SERPs inside the file. As called out in the rankings section above, data may arrive late, due to either Google introducing new SERP features / layout changes, or Nozzle initiating a re-parse of SERPs to improve or update data from the original parse.
Since BigQuery is creating the export files, we have minimal control over how they are emitted. There is no guarantee that there will be a consistent number of rows per file, and while unlikely, it is even possible for files to be created that have zero rows of data.
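Putting the naming details together, a downstream job might list new objects, parse the default rankings template, and tolerate empty files. A rough boto3 sketch; the bucket name and prefix are hypothetical, credentials are assumed to be configured, and pagination is omitted.

```python
import io
import re

import boto3
import pandas as pd

# Matches the default template: rankings--YYYY-MM-DD--{project_slug}-{file number}.{extension}
FILENAME_RE = re.compile(
    r"rankings--(?P<export_date>\d{4}-\d{2}-\d{2})--(?P<project_slug>.+)-(?P<file_number>\d+)\.parquet$"
)

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="your-bucket-name", Prefix="exports/")  # hypothetical bucket/prefix

for obj in resp.get("Contents", []):
    match = FILENAME_RE.search(obj["Key"])
    if not match:
        continue
    body = s3.get_object(Bucket="your-bucket-name", Key=obj["Key"])["Body"].read()
    df = pd.read_parquet(io.BytesIO(body))
    if df.empty:
        # Zero-row files are possible; just skip them
        continue
    # ... load df into your warehouse, keyed on the parsed export_date / project_slug
```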
Create an S3 Bucket
To receive files from Nozzle, create a new bucket dedicated to this data pipeline, to prevent any conflicts with other files or users. You can also use an existing bucket if desired. See the AWS documentation below, and keep track of the bucket name and region, as you'll need to submit both to us to start the pipeline running.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html
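If you prefer to script it instead of using the console steps in the guide, bucket creation is a single boto3 call. A sketch with a hypothetical bucket name and region; note that buckets in us-east-1 omit the CreateBucketConfiguration argument.

```python
import boto3

BUCKET = "your-company-nozzle-exports"  # hypothetical name
REGION = "us-west-2"                    # hypothetical region

s3 = boto3.client("s3", region_name=REGION)
s3.create_bucket(
    Bucket=BUCKET,
    CreateBucketConfiguration={"LocationConstraint": REGION},  # omit this for us-east-1
)
# Remember the bucket name and region; both are submitted to Nozzle below.
```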
Create a Custom AWS Role
Nozzle runs on Google Cloud, so we need permissions in order to access your bucket. By creating a role, you can grant us the minimum permissions possible for security purposes. We also take advantage of a trust policy, where Google Cloud and AWS coordinate to ensure that only Nozzle can assume that role.
Below are the prerequisites for role creation: the trust policy and permissions. The only change that needs to be made is to replace `your-bucket-name/optional-prefix/` with the name of the bucket created in the prior step (the prefix is optional).
For a detailed walkthrough, follow the AWS documentation guide. Copy the full role ARN, which should look like `arn:aws:iam::1234567:role/NozzleExportRole`, to submit to Nozzle in the form at the end of the page.
https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-idp_oidc.html#idp_oidc_Create
AWS Permissions
AWS Trust Policy
`118316257528160653604` is the Google Cloud service account id used by Nozzle for the sole purpose of writing export files to AWS.
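The authoritative policy documents come from Nozzle and the AWS guide above. As a hedged sketch of the shape they take, here is how the role might be created with boto3; the specific S3 actions listed are an assumption to confirm with Nozzle, and the bucket name is a placeholder.

```python
import json

import boto3

BUCKET = "your-bucket-name"  # replace with the bucket created above (optional prefix allowed)

# Trust policy: only the Nozzle Google Cloud service account listed above
# can assume the role via web identity federation.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Federated": "accounts.google.com"},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {"accounts.google.com:sub": "118316257528160653604"}
            },
        }
    ],
}

# Permissions: write-oriented access to the export bucket. The exact action
# list here is an assumption; confirm the required permissions with Nozzle.
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:AbortMultipartUpload", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
        }
    ],
}

iam = boto3.client("iam")
role = iam.create_role(
    RoleName="NozzleExportRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.put_role_policy(
    RoleName="NozzleExportRole",
    PolicyName="NozzleExportWrite",
    PolicyDocument=json.dumps(permissions_policy),
)
print(role["Role"]["Arn"])  # submit this ARN to Nozzle
```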
Submit AWS Bucket + Role Details
Once you have created a bucket and a role, submit them to us with this form.
What's Next?
Once your data is in S3, there are myriad ways to consume it, route it to your own data warehouse, or even query the raw files directly. Here are some links for ways to consume that data in an event-driven way, so that you can process the data as soon as Nozzle pushes it into your bucket. Many of these options work best with Parquet as the output format, which is why we strongly recommend it.
- Tutorial: Using an Amazon S3 trigger to invoke a Lambda function
  https://docs.aws.amazon.com/AmazonS3/latest/userguide/ways-to-add-notification-config-to-bucket.html
- Walkthrough: Configuring a bucket for notifications (SNS topic or SQS queue)
  https://docs.aws.amazon.com/AmazonS3/latest/userguide/ways-to-add-notification-config-to-bucket.html
- Snowflake: Automating Snowpipe for Amazon S3
  https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-s3
- Build an ETL process for Amazon Redshift using Amazon S3 Event Notifications and AWS Step Functions
  https://aws.amazon.com/blogs/big-data/build-an-etl-process-for-amazon-redshift-using-amazon-s3-event-notifications-and-aws-step-functions/
- Analyzing Data in S3 using Amazon Athena
  https://aws.amazon.com/blogs/big-data/analyzing-data-in-s3-using-amazon-athena/
- Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake
  https://aws.amazon.com/blogs/big-data/use-apache-iceberg-in-your-data-lake-with-amazon-s3-aws-glue-and-snowflake/
- Databricks: Onboard data from Amazon S3
  https://docs.databricks.com/en/ingestion/cloud-object-storage/onboard-data.html
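As a minimal illustration of the Lambda route from the first link, a handler might look like the sketch below. It assumes the function has read access to the bucket, that pandas/pyarrow are packaged with the function or provided via a layer, and it leaves the downstream load step as a placeholder.

```python
import io
import urllib.parse

import boto3
import pandas as pd  # pandas/pyarrow must be bundled with the function or supplied via a layer

s3 = boto3.client("s3")

def handler(event, context):
    """Triggered by S3 ObjectCreated events; loads each new export file."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        df = pd.read_parquet(io.BytesIO(body))
        if df.empty:
            continue  # zero-row files are possible; skip them

        # Placeholder: push df into your warehouse (Snowflake, Redshift, etc.)
        print(f"loaded {len(df)} rows from s3://{bucket}/{key}")
```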
If you have any questions about how to query the data once you've consumed it, or what types of materialized views might be useful, please reach out; we'd love to help you get the most out of the data.