Partitioning in BigQuery

When appending data to a partitioned table in BigQuery using Python and a DataFrame, you can specify the partition to which the data should be written. Here’s how you can do it step by step:


Prerequisites

1. Install the required libraries:


pip install google-cloud-bigquery pandas



2. Ensure your BigQuery table is partitioned (e.g., by date or integer range).


Code Example


Here’s an example of appending a DataFrame to a BigQuery partitioned table:


from google.cloud import bigquery

import pandas as pd


# Set up BigQuery client

client = bigquery.Client()


# Your project and dataset details

project_id = "your-project-id"

dataset_id = "your-dataset-id"

table_id = "your-table-id" # Replace with your table name


# Full table ID (project.dataset.table)

full_table_id = f"{project_id}.{dataset_id}.{table_id}"


# Sample DataFrame to append

data = {

  "partition_column": ["2025-01-01", "2025-01-02"], # Partition column (e.g., DATE)

  "data_column": [100, 200] # Other columns

}

df = pd.DataFrame(data)


# Define job configuration

job_config = bigquery.LoadJobConfig(

  write_disposition=bigquery.WriteDisposition.WRITE_APPEND, # Append data

  schema_update_options=[

    bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION # If needed

  ],

  time_partitioning=bigquery.TimePartitioning( # Specify time partitioning

    field="partition_column" # The column used for partitioning

  ),

)


# Load DataFrame into BigQuery

job = client.load_table_from_dataframe(df, full_table_id, job_config=job_config)

job.result() # Wait for the job to complete


# Print success message

print(f"Data appended to table: {full_table_id}")


Key Points:

1. Partition Column: Ensure that the partition_column in your DataFrame matches the partitioning column of the BigQuery table.

2. Job Configuration:

• Use bigquery.TimePartitioning to specify the partition column if the table is time-partitioned.

• Use bigquery.RangePartitioning for range-partitioned tables.

3. Write Disposition: Set WRITE_APPEND to append data to the table.

4. Schema Update: If the schema changes (e.g., new columns), include SchemaUpdateOption.ALLOW_FIELD_ADDITION.


Partition Types:

• Time Partitioning: Based on a DATE or TIMESTAMP column.

• Range Partitioning: Based on an integer column.


If your table uses integer range partitions, adjust the configuration:


range_partitioning = bigquery.RangePartitioning(

  field="partition_column", # Integer column

  range_=bigquery.PartitionRange(start=0, end=1000, interval=100),

)


Common Errors and Solutions:

• Partition Not Found: Ensure the partition_column values in the DataFrame are compatible with the table’s partitioning.

• Schema Mismatch: Ensure the DataFrame columns match the table schema. Use explicit casting if needed.


Let me know if you need help with a specific part!



From Blogger iPhone client