Use Case

You want to archive Base Mainnet block data as partitioned Parquet files in Amazon S3. Once stored, you can query the data with tools like AWS Athena, Apache Spark, or DuckDB without running any database infrastructure.

Pipeline Configuration

1. Create a new pipeline

In the GoldRush Platform, navigate to Manage Pipelines and click Create Pipeline. Name it block-archive.
2. Configure the object storage destination

Select Object Storage as the destination type, then enter your S3 credentials and choose the file format, compression, and partitioning options.
3. Select your source

Choose Base Mainnet as the chain and Blocks as the data type. This streams block headers and metadata for every Base block.
4. Review your configuration

destination:
  type: "object_storage"
  provider: "s3"
  bucket: "my-blockchain-data"
  base_path: "base-mainnet"
  format: "parquet"
  compression: "snappy"
  partition_by:
    - "day"
  batch_size: 50000
  batch_interval_ms: 300000
  region: "us-east-1"
  access_key_id: "${AWS_ACCESS_KEY_ID}"
  secret_access_key: "${AWS_SECRET_ACCESS_KEY}"
5. Deploy

Review and deploy the pipeline. Data begins writing to S3 as partitioned Parquet files.

File Layout

Once running, files appear in S3 with this structure:
s3://my-blockchain-data/base-mainnet/block-archive/blocks/
  year=2025/
    month=03/
      day=18/
        0-1000_50000-abc123.parquet
        0-50001_100000-def456.parquet
      day=19/
        ...
Each file contains up to 50,000 records (the configured batch_size). The file name encodes the partition, the offset range, and a unique suffix used for deduplication on retry.
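As a concrete illustration of that naming scheme, a small parser can split a file name into its parts. The exact layout below is an assumption inferred from the sample names above; verify it against the files in your own bucket before relying on it:

```python
import re

# Assumed layout: {partition}-{start_offset}_{end_offset}-{unique_id}.parquet
# (inferred from sample names like "0-50001_100000-def456.parquet").
FILE_NAME = re.compile(
    r"^(?P<partition>\d+)-(?P<start>\d+)_(?P<end>\d+)-(?P<uid>[0-9a-f]+)\.parquet$"
)

def parse_block_file(name: str) -> dict:
    """Split an archive file name into partition, offset range, and unique suffix."""
    m = FILE_NAME.match(name)
    if m is None:
        raise ValueError(f"unexpected file name: {name}")
    return {
        "partition": int(m.group("partition")),
        "start_offset": int(m.group("start")),
        "end_offset": int(m.group("end")),
        "uid": m.group("uid"),
    }

print(parse_block_file("0-50001_100000-def456.parquet"))
```

A parser like this is handy when deduplicating retried writes or auditing offset coverage across a day's partition.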

Query with DuckDB

You can query the Parquet files directly without loading them into a database:
-- Install and load the httpfs extension for S3 access
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'your-key';
SET s3_secret_access_key = 'your-secret';

-- Query block data
SELECT height, miner_address, gas_used, gas_limit, transaction_count, signed_at
FROM read_parquet('s3://my-blockchain-data/base-mainnet/block-archive/blocks/year=2025/month=03/day=18/*.parquet')
ORDER BY height DESC
LIMIT 20;
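If you prefer Athena, a sketch of an external table over the same layout might look like the following. The column names and types mirror the DuckDB query above; treat them as assumptions and adjust to the actual Parquet schema written by your pipeline:

```sql
CREATE EXTERNAL TABLE blocks (
  height BIGINT,
  miner_address STRING,
  gas_used BIGINT,
  gas_limit BIGINT,
  transaction_count INT,
  signed_at TIMESTAMP
)
PARTITIONED BY (year STRING, month STRING, day STRING)
STORED AS PARQUET
LOCATION 's3://my-blockchain-data/base-mainnet/block-archive/blocks/';

-- Load the Hive-style partitions before querying
MSCK REPAIR TABLE blocks;
```

Because the files use Hive-style `year=/month=/day=` prefixes, Athena can prune partitions and scan only the days your query touches.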

Compression Options

Format             Characteristics                      Best for
Parquet + Snappy   Fast reads, moderate compression     Interactive queries (Athena, DuckDB)
Parquet + Zstd     Higher compression ratio             Long-term archival, storage cost optimization
JSON + Gzip        Human-readable, widely compatible    Debugging, simple consumers

Parquet with Snappy is the best default for most analytics workloads: it balances compression ratio, read speed, and broad tool compatibility.
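To switch an archival pipeline to Zstd, only the compression field in the destination block changes (everything else stays as shown in the configuration above):

```yaml
format: "parquet"
compression: "zstd"   # higher ratio than snappy, at some read-speed cost
```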

Production Tips

  • Batch size: Larger batches (50,000+) produce fewer, larger files, which query engines scan more efficiently. Smaller batches (1,000-5,000) get data into S3 with lower latency.
  • Partitioning: Partition by day for archival workloads. Use hour if you need finer-grained partitions for time-range queries.
  • GCS and R2: Change provider to gcs or r2 and update credentials accordingly. R2 requires an endpoint field.
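For Cloudflare R2, a destination block might look like the sketch below. The endpoint URL shape and the account-id placeholder are assumptions; check your R2 dashboard for the exact values:

```yaml
destination:
  type: "object_storage"
  provider: "r2"
  bucket: "my-blockchain-data"
  base_path: "base-mainnet"
  format: "parquet"
  compression: "snappy"
  endpoint: "https://<account-id>.r2.cloudflarestorage.com"  # required for R2
  access_key_id: "${R2_ACCESS_KEY_ID}"
  secret_access_key: "${R2_SECRET_ACCESS_KEY}"
```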