
Export Data to S3 Bucket

Customers that have their own data warehouse often want data pushed into an AWS S3 bucket, which they can then load into Snowflake, Databricks, Redshift, Athena, etc.

Many users set up multiple exports:

  • an initial export to match existing data tables or exports from other rank trackers, to maintain reporting continuity during a transition period
  • a second full export to opt into richer Nozzle data

Table of Contents

  1. Choose data format
    1. Keywords
      1. Schema / fields
      2. Sample data
    2. Rankings
      1. Schema / fields
      2. Sample data
  2. Define filename template
  3. Create an S3 bucket
  4. Create a custom role
  5. Submit bucket + role details
  6. What's next?

Data Format

The best way to keep internal reports synced with Nozzle is to define two separate export streams:

  1. Keywords: these are configured upstream, and while they can be joined into the rankings data for a single export, that causes problems as keyword groups / tags inevitably change and you want those changes reflected historically. By keeping keyword data separate, we can replace the keyword table as keywords are added/removed and keyword groups evolve, and you can join it to the rankings at query time (see the sketch after this list).
  2. Rankings: this is the raw SERP data, itself split into two distinct sections:
    1. SERP data: fields like search volume, search intent, etc., that occur only once per SERP
    2. Result data: this is the bulk of the data, in a nested array of objects, with sub-fields containing specifics for each result
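
For example, here is a minimal sketch of a query-time join using Python and DuckDB, assuming both export streams have been landed locally as parquet files (the file globs are placeholders for wherever your pipeline puts them):

import duckdb

# Join the rankings stream to the keywords stream at query time.
# The globs below are placeholders; point them at your export files.
con = duckdb.connect()

joined = con.execute("""
    SELECT
        k.phrase,
        k.device,
        k.keyword_groups,
        r.requested,
        keyword_metrics.adwords_search_volume AS search_volume
    FROM read_parquet('rankings/*.parquet') AS r
    JOIN read_parquet('keywords/*.parquet') AS k USING (keyword_id)
""").df()

print(joined.head())

The same join works in any warehouse once both tables are loaded; keyword_id is always the join key.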

Keyword configuration is only exported when modified in the UI / API, while rankings are exported whenever they are collected and parsed.

You can specify whether to export all projects or just specific projects.

Keywords

Primary Key: keyword_id

Keywords are not exported incrementally as they change; instead, the most recent copy is exported if there were any changes in the prior day. Replace the existing table with the new keyword data.
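
A minimal sketch of that full-replace step, using a local DuckDB file as a stand-in for your warehouse (Snowflake, Databricks, and BigQuery all have an equivalent CREATE OR REPLACE TABLE statement):

import duckdb

# Replace the whole keywords table with the latest export in one statement.
# "warehouse.duckdb" is a placeholder for your own warehouse; point the glob
# at the most recent day's keyword files.
con = duckdb.connect("warehouse.duckdb")

con.execute("""
    CREATE OR REPLACE TABLE keywords AS
    SELECT * FROM read_parquet('keywords/*.parquet')
""")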

Below is a description of all the fields in the keywords export. With the exception of keyword_groups and adwords_criteria_id, all the other fields are immutable. Keyword groups can and will change regularly, as they are edited in the UI by SEO teams. The criteria id rarely changes, but sometimes Google will deprecate geotargets and replace them with new ones.

Field Name Description Type
keyword_id PK: represents the unique combination of phrase, locale, and device INTEGER
phrase_id unique id for each phrase INTEGER
phrase keyword search term / phrase STRING
device Desktop, iPhone, Android (mobile) STRING
device_code 1-character abbreviation (d, i, m) STRING
locale_id represents the unique combination of engine, language, and location INTEGER
engine Search engine (currently only Google) STRING
engine_code 1-character abbreviation (g) STRING
language Full language name STRING
language_code 2-letter abbreviation per ISO 639-1, with the exception of Chinese, which uses non-standard zht (traditional) and zhs (simplified) STRING
location_id represents a specific place INTEGER
location_type type defined by Google Ads geotargets, most commonly: Country, City, Postal Code STRING
location Empty when geo-targeting at a country-level, otherwise the name of the location STRING
country Full country name STRING
country_code 2-letter abbreviation per ISO 3166-1 STRING
adwords_criteria_id locations are mapped to Google Ads geotargets INTEGER
keyword_groups user-defined tags, used in reporting REPEATED STRING

 

Here is a list of supported export file formats, including sample keyword data for testing and evaluation. We recommend Parquet, as it includes the full schema, has the best compression, and is widely supported by query engines. Avro and newline-delimited JSON are also available.

File Type Compression Filesize Full Sample Data
Parquet ZSTD 60 KB download
Avro SNAPPY 240 KB download
JSON (newline delimited) GZIP 125 KB download

 

 

Rankings

Primary Key: ranking_id

Occasionally we will re-parse a SERP. The ranking_id is unchanged when that happens, so use only the version with the most recent inserted_at in reports. Unlike keywords, rankings exports are incremental, so only updated SERPs are exported.
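
A minimal de-duplication sketch in Python with DuckDB, assuming the incremental rankings exports are readable as parquet files (swap the glob for your own layout):

import duckdb

# Keep only the most recently parsed version of each SERP.
con = duckdb.connect()

latest = con.execute("""
    SELECT * EXCLUDE (rn)
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY ranking_id
                   ORDER BY inserted_at DESC
               ) AS rn
        FROM read_parquet('rankings/*.parquet')
    ) AS versioned
    WHERE rn = 1
""").df()

Most warehouses support the same window-function pattern, or an equivalent QUALIFY clause.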

There are close to 1,000 fields in the rankings table, so the table below does not attempt to list them all. These are the key fields that most users will be interested in, both for building metrics and for setting up a data pipeline.

Field Name Description Type
ranking_id PK: represents a single SERP, equivalent to keyword_id + requested INTEGER
inserted_at timestamp when this row was last updated TIMESTAMP
keyword_id join to the keywords table INTEGER
requested timestamp the SERP was scheduled to be collected. Most often this will be at 00:00 (midnight), unless the schedule was hourly or set for a specific time during the day TIMESTAMP
keyword_metrics.adwords_search_volume search volume for the keyword, refreshed monthly. For supported location types, this search volume is localized, meaning it is specific to the city, region, etc. INTEGER
keyword_metrics.country_adwords_search_volume search volume for this term for the entire country. If the location type is Country, this will be the same as keyword_metrics.adwords_search_volume INTEGER
results this nested field contains the majority of the data, with a single row per result REPEATED STRUCT
results.rank blended rank for all unpaid results (metrics definitions). This is null for paid results. INTEGER
results.item_rank the order the result appeared in a pack, like a carousel, sitelinks, top stories, etc. INTEGER
results.paid_adjusted_rank blends paid results into a combined rank INTEGER
results.url.domain ranking domain STRING
results.url.url ranking url STRING
results.title.text the main title text, usually the link anchor text STRING
results.description.text the result's description STRING
results.nozzle_metrics.click_through_rate calculated CTR for the result, if any FLOAT
results.nozzle_metrics.estimated_traffic keyword_metrics.adwords_search_volume * results.nozzle_metrics.click_through_rate INTEGER
results.measurements.pixels_from_top measures the pixel distance from the top of the SERP to the top of the result INTEGER
results.measurements.percentage_of_viewport Above the Fold % FLOAT
results.measurements.percentage_of_dom SERP % FLOAT

 

There is purposefully no result_type column in the default export. Historically, that has caused issues with breaking changes in downstream dashboards as Google updates SERP features. Instead, there are many boolean fields, e.g. featured_snippet.is_featured_snippet, which you can combine with other fields to target exactly what you are looking for.

We understand that it is often easier to work with a single result type, and we can work with you to define that column, targeting the SERP features you are most interested in. That also puts you in control of breaking changes, since you can update those definitions to be more specific and coordinate with your custom dashboards to prevent breakages.
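
As an illustration, here is a sketch of deriving a result_type-style column at query time with Python and DuckDB. featured_snippet.is_featured_snippet is the boolean called out above; the exact nesting and the other case branches are assumptions shown only to illustrate the pattern, so substitute the features you actually care about:

import duckdb

con = duckdb.connect()

typed = con.execute("""
    WITH per_result AS (
        -- one row per result, with the nested struct exposed as a column
        SELECT ranking_id, requested, UNNEST(results) AS res
        FROM read_parquet('rankings/*.parquet')
    )
    SELECT
        ranking_id,
        requested,
        res.rank AS blended_rank,
        res.url.url AS url,
        CASE
            WHEN res.featured_snippet.is_featured_snippet THEN 'featured_snippet'
            WHEN res.rank IS NULL THEN 'paid'   -- rank is null for paid results
            ELSE 'organic'
        END AS result_type
    FROM per_result
""").df()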

As Google adds new features, we semi-regularly add new fields to the rankings schema. All changes are backwards compatible, but make sure that your pipeline isn't going to break when new fields are added.
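
One simple way to stay resilient to those additive changes is to read only the columns you rely on, so anything new is ignored until you opt into it. A sketch with pyarrow (the filename is just the example shown elsewhere on this page):

import pyarrow.parquet as pq

# Read an explicit column list so newly added fields never surprise the pipeline.
table = pq.read_table(
    "rankings--2024-04-15--seocompanies-000000000014.parquet",  # example filename
    columns=["ranking_id", "inserted_at", "keyword_id", "requested", "results"],
)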

 

Here is a list of supported export file formats, including sample rankings data for testing and evaluation. We recommend Parquet, as it includes the full schema, has the best compression, and is widely supported by query engines. Avro and newline-delimited JSON are also available.

File Type Compression Filesize Full Sample Data
Parquet ZSTD 2.0 MB download
Avro SNAPPY 4.0 MB download
JSON (newline delimited) GZIP 3.2 MB download

 

Define Filename

By default, we set up a daily export, where BigQuery checks for any new data since the last export and writes only those rows to output files. We will need to set up two file naming templates, which can match almost any internal naming convention. This can also be used in conjunction with AWS permissions to restrict access to a prefix inside of a bucket. Here are a few examples that users have set up:

  • rankings--2024-04-15--seocompanies-000000000014.parquet (default)
  • exports/year=2024/month=04/day=05/rankings-000000000014.jsonl.gz
  • dt=2024-08-14/rankings--000000000014.avro
  • batch/schema=seo.NozzleKeyword/contentType=parquet/year=2024/month=04/day=05/hour=00/000000000014.parquet

 

If not specified, the default filename templates are:

  • Keywords:
    keywords--YYYY-MM-DD--{project_slug}-{file number}.{extension}
  • Rankings:
    rankings--YYYY-MM-DD--{project_slug}-{file number}.{extension}

If you want to customize the templates, here are the template fields available to you:

  • Year: default is 4 digits YYYY, but can be customized
  • Month: default is 2 digits MM, but can be customized
  • Day: default is 2 digits DD, but can be customized
  • Project Slug: this is useful for differentiating multiple project output
  • Content Type: one of parquet, jsonl, or avro
  • Content Encoding: one of zstd, gzip, snappy

The date in the filename references the date the export was created, not the date of the SERPs inside the file. As called out in the rankings section above, data may arrive late, due either to Google introducing new SERP features / layout changes or to Nozzle initiating a re-parse of SERPs to improve / update data from the original parse.
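
If your pipeline routes files based on their names, here is a sketch of parsing the default templates above (the 12-digit file number matches the examples on this page; adjust the pattern if you customize the template):

import re

# Parse the default filename template into its components.
FILENAME_RE = re.compile(
    r"^(?P<stream>keywords|rankings)--(?P<date>\d{4}-\d{2}-\d{2})--"
    r"(?P<project_slug>.+)-(?P<file_number>\d{12})\.(?P<extension>.+)$"
)

match = FILENAME_RE.match("rankings--2024-04-15--seocompanies-000000000014.parquet")
if match:
    print(match.groupdict())
    # {'stream': 'rankings', 'date': '2024-04-15', 'project_slug': 'seocompanies',
    #  'file_number': '000000000014', 'extension': 'parquet'}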

Since BigQuery is creating the export files, we have minimal control over how they are emitted. There is no guarantee that there will be a consistent number of rows per file, and while unlikely, it is even possible for files to be created that have zero rows of data.
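
A defensive sketch for that zero-row edge case when using parquet, checking the file footer's row count before handing it to the rest of the pipeline:

import pyarrow.parquet as pq

def has_rows(path: str) -> bool:
    # The parquet footer records the row count, so this check is cheap.
    return pq.ParquetFile(path).metadata.num_rows > 0

if has_rows("rankings--2024-04-15--seocompanies-000000000014.parquet"):
    ...  # load / transform as usual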

 

Create an S3 Bucket

To receive files from Nozzle, create a new bucket dedicated to this data pipeline, to prevent any conflicts with other files or users. You can also use an existing bucket if desired. See the AWS documentation linked below, and keep track of the bucket name and region, as you'll need to submit both to us to start the pipeline running.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html
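
If you prefer to script it rather than use the console, here is a sketch with boto3 (the bucket name and region are examples, so substitute your own):

import boto3

# Create a dedicated export bucket. In us-east-1, omit CreateBucketConfiguration.
s3 = boto3.client("s3", region_name="us-east-2")
s3.create_bucket(
    Bucket="acme-nozzle-exports",  # example name; bucket names are globally unique
    CreateBucketConfiguration={"LocationConstraint": "us-east-2"},
)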

Create a Custom AWS Role

Nozzle runs on Google Cloud, so we need permission to access your bucket. By creating a role, you can grant us the minimum permissions necessary for security purposes. We also take advantage of a trust policy, where Google Cloud and AWS coordinate to ensure that only Nozzle can assume that role.

Below are the prerequisites for role creation: the trust policy and the permissions. The only change that needs to be made is to replace your-bucket-name/optional-prefix/ with the name of the bucket created in the prior step; the prefix is optional.

For a detailed walkthrough, follow the AWS documentation guide below. Copy the full role ARN, which should look like this: arn:aws:iam::1234567:role/NozzleExportRole, and submit it to Nozzle in the form at the end of this page.

https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-idp_oidc.html#idp_oidc_Create

AWS Permissions

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::your-bucket-name/optional-prefix/*",
            "Effect": "Allow"
        }
    ]
}

AWS Trust Policy

118316257528160653604 is the Google Cloud service account id used by Nozzle for the sole purpose of writing export files to AWS.

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Principal":{
            "Federated":"accounts.google.com"
         },
         "Action":"sts:AssumeRoleWithWebIdentity",
         "Condition":{
            "StringEquals":{
               "accounts.google.com:sub":"118316257528160653604",
               "accounts.google.com:aud":"118316257528160653604"
            }
         }
      }
   ]
}
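
If you would rather script the role than click through the console, here is a sketch with boto3 that creates the role from the two policy documents above (the role and policy names are examples):

import json
import boto3

iam = boto3.client("iam")

# The trust policy shown above: only Nozzle's Google Cloud service account
# may assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Federated": "accounts.google.com"},
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
            "StringEquals": {
                "accounts.google.com:sub": "118316257528160653604",
                "accounts.google.com:aud": "118316257528160653604",
            }
        },
    }],
}

# The permissions shown above: write-only access, scoped to your bucket.
permissions = {
    "Version": "2012-10-17",
    "Statement": [{
        "Action": ["s3:PutObject"],
        "Resource": "arn:aws:s3:::your-bucket-name/optional-prefix/*",
        "Effect": "Allow",
    }],
}

role = iam.create_role(
    RoleName="NozzleExportRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.put_role_policy(
    RoleName="NozzleExportRole",
    PolicyName="NozzleExportWrite",
    PolicyDocument=json.dumps(permissions),
)

print(role["Role"]["Arn"])  # this is the ARN to submit in the form below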

 

Submit AWS Bucket + Role Details

Once you have created a bucket and a role, submit them to us with this form.

What's Next?

Once your data is in S3, there are myriad ways to consume it and route it to your own data warehouse, or even query the raw files directly. Here are some links for ways to consume that data in an event-driven way, so that you can process it as soon as Nozzle pushes it into your bucket. Many of these options work best with Parquet as the output format, which is why we strongly recommend it.
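
As one example of that event-driven pattern, here is a sketch of an AWS Lambda function triggered by s3:ObjectCreated:* notifications on the export bucket. It assumes pyarrow is packaged with the function, and what you do with each file is up to your own pipeline:

import urllib.parse

import boto3
import pyarrow.parquet as pq

s3 = boto3.client("s3")

def handler(event, context):
    # One invocation per S3 event notification, each containing one or more new objects.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Download the new export file and read it; parquet keeps the schema intact.
        local_path = "/tmp/" + key.split("/")[-1]
        s3.download_file(bucket, key, local_path)
        table = pq.read_table(local_path)

        # Hand the rows off to your warehouse loader here.
        print(f"{key}: {table.num_rows} rows")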

If you have any questions about how to query the data once you've consumed it, or about what types of materialized views might be useful, please reach out; we'd love to help you get the most out of the data.