Scaling ML Preprocessing: From 8 Days to 1 Day with GCS and Distributed Computing
A technical deep-dive into optimizing large-scale data preprocessing for machine learning
Table of Contents
- The Challenge
- The Initial Architecture
- First Attempt: The 8-Day Estimate
- The gcsfuse Problem
- Optimization Round 1: Aggressive Caching
- Optimization Round 2: File Caching
- Optimization Round 3: Parallelizing GCS Connections
- The Complete gcsfuse Configuration
- The Real Bottleneck: Disk I/O
- Distributed Architecture
- Testing Philosophy: Test Small, Deploy Large
- Monitoring and Observability
- Lessons Learned
- Final Results
- Takeaways for ML Infrastructure
- Code Snippets
- Conclusion
- Appendix: Performance Breakdown
The Challenge
We recently tackled a machine learning project that required preprocessing approximately 160TB of data stored in Google Cloud Storage. The goal was straightforward: download, merge, and transform this data into training-ready formats across hundreds of spatial locations. The execution? Not so simple.
Our initial estimates suggested this would take about 8 days per machine. With three machines running in parallel, we were looking at over a week of preprocessing before we could even start training models. For any ML project, that’s a significant bottleneck. This is the story of how we reduced that time to under 2 days through systematic optimization and learning some hard lessons about cloud storage systems.
The Initial Architecture
The project required processing data split across multiple sources in GCS, with different feature sets stored in different buckets. Our architecture needed to:
- Access 160TB+ of data from Google Cloud Storage
- Merge data from multiple sources for each processing batch
- Write preprocessed outputs to local disk for training
- Scale across multiple machines for parallel processing
- Handle coordinate-based filtering for different geographic regions
The straightforward approach? Mount GCS buckets with gcsfuse and treat cloud storage like a local filesystem. Simple, elegant, and as we’d learn, surprisingly slow.
First Attempt: The 8-Day Estimate
Our initial implementation was conceptually simple:
import xarray as xr

# Open GCS-mounted data
dataset1 = xr.open_zarr('/mnt/gcs-bucket/path1')
dataset2 = xr.open_zarr('/mnt/gcs-bucket/path2')

# Merge and process (process_data is our project-specific transformation)
merged = xr.merge([dataset1, dataset2])
processed = process_data(merged)

# Write to local disk
processed.to_zarr('/local/output')
Running this on our test data (5 variables out of 200+), we measured approximately 40-50 minutes per machine. Simple extrapolation suggested:
- 200+ variables × 8 minutes per variable = ~1,600 minutes
- Plus overhead = approximately 32 hours per machine
With three machines running in parallel, we could split the work, but that still meant ~30-32 hours of wall-clock time. Not 8 days, thankfully, but even these figures were based on the suboptimal default configuration we were about to discover.
The gcsfuse Problem
Here’s where things got interesting. Our initial gcsfuse mount used default settings:
gcsfuse bucket-name /mount/point
This worked, but monitoring revealed concerning patterns:
- Repeated metadata lookups for the same files
- Sequential reads where parallel would be faster
- Cache misses causing re-downloads of frequently accessed data
- Network connections far below what our machines could handle
We were treating GCS like a traditional filesystem, but GCS has fundamentally different characteristics:
- High latency for metadata operations (100-200ms per stat call)
- Rate limits (5,000 requests/second per bucket)
- Excellent throughput when properly parallelized
- No state between requests (caching is critical)
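You can see the metadata latency for yourself from Python. Here's a minimal sketch, assuming a gcsfuse mount at a hypothetical /mnt/gcs-bucket: timing a batch of stat calls on a cold cache makes the per-operation round-trip obvious, and shows exactly why client-side caching pays off.

import os
import time

MOUNT = '/mnt/gcs-bucket'  # hypothetical gcsfuse mount point
paths = [os.path.join(MOUNT, p) for p in os.listdir(MOUNT)[:50]]

start = time.time()
for p in paths:
    os.stat(p)  # each stat is a metadata round-trip when the cache is cold
elapsed = time.time() - start
print(f"{len(paths)} stat calls in {elapsed:.1f}s "
      f"({1000 * elapsed / len(paths):.0f}ms per call)")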
Optimization Round 1: Aggressive Caching
The first major win came from understanding that weather data doesn’t change. Once written, it’s immutable. This meant we could cache aggressively:
# Cache file metadata and directory structure for 10 minutes,
# and directory listings for 30 minutes:
gcsfuse \
  --stat-cache-ttl 600s \
  --type-cache-ttl 600s \
  --kernel-list-cache-ttl-secs 1800 \
  bucket-name /mount/point
Result: Directory operations went from 5-10 seconds to under 1 second on subsequent access.
But we could do better.
Optimization Round 2: File Caching
Our machines had 20TB of local disk and 756GB of RAM. The default gcsfuse configuration wasn’t using any of it for caching. We enabled file-level caching:
# 512GB file cache on fast local disk, with up to 200 parallel downloads:
gcsfuse \
  --cache-dir /fast/local/disk \
  --file-cache-max-size-mb 512000 \
  --file-cache-enable-parallel-downloads \
  --file-cache-max-parallel-downloads 200 \
  bucket-name /mount/point
With 512GB of cache on fast local SSDs, frequently accessed chunks were served at local disk speeds (~2-5 GB/s) rather than network speeds (~500-800 MB/s).
Result: Repeated reads of the same data became 5-10x faster.
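A quick way to verify the file cache is actually working: read the same file twice and compare throughput. This is a rough sketch with a hypothetical path; the first read comes over the network, the second should be served from the local cache (note the kernel page cache can also contribute to the warm read).

import time

def timed_read(path):
    # Stream the file in 64MB chunks and report effective throughput
    start, total = time.time(), 0
    with open(path, 'rb') as f:
        while chunk := f.read(64 * 1024 * 1024):
            total += len(chunk)
    elapsed = time.time() - start
    print(f"{total / 1e9:.2f} GB in {elapsed:.1f}s ({total / 1e9 / elapsed:.2f} GB/s)")

timed_read('/mnt/gcs-bucket/some/large/chunk')  # cold read: network speed
timed_read('/mnt/gcs-bucket/some/large/chunk')  # warm read: local cache speed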
Optimization Round 3: Parallelizing GCS Connections
The breakthrough came when we realized our 96-core machines were barely using the available network bandwidth. Default gcsfuse uses 10 connections per host. We cranked it up:
# 80 concurrent connections, with generous timeouts for large files:
gcsfuse \
  --max-conns-per-host 80 \
  --max-idle-conns-per-host 100 \
  --http-client-timeout 30m \
  --max-retry-sleep 30s \
  bucket-name /mount/point
But wait: wouldn't 3 machines × 80 connections = 240 concurrent connections blow past GCS's 5,000 requests/second per-bucket limit?
This is where understanding your workload matters. Our preprocessing wasn’t constantly hammering GCS. The timeline looked like:
Hour 0-1: Light metadata operations (scanning)
Hour 1-2: Heavy read operations (loading data)
Hour 2-30: Pure local disk I/O (writing output)
Only 5% of the time were we actually hitting GCS hard. The three machines naturally staggered their loading phases, so peak concurrent load was manageable.
Result: GCS reading time dropped from ~70 minutes to ~2 minutes per machine.
The Complete gcsfuse Configuration
After testing and benchmarking, here’s what worked for our 96-core, 756GB RAM machines:
gcsfuse \
--implicit-dirs \
--stat-cache-max-size-mb 51200 \
--metadata-cache-ttl-secs 600 \
--metadata-cache-negative-ttl-secs 10 \
--type-cache-max-size-mb 16 \
--kernel-list-cache-ttl-secs 1800 \
--cache-dir /local/fast/disk \
--file-cache-max-size-mb 512000 \
--file-cache-enable-parallel-downloads \
--file-cache-max-parallel-downloads 200 \
--file-cache-parallel-downloads-per-file 16 \
--enable-buffered-read \
--read-global-max-blocks 100 \
--max-conns-per-host 80 \
--max-idle-conns-per-host 100 \
--http-client-timeout 30m \
--max-retry-sleep 30s \
--log-file /var/log/gcsfuse.log \
bucket-name /mount/point
This configuration turned GCS from a bottleneck into nearly a non-issue. Reading time became well under 2% of total preprocessing time.
The Real Bottleneck: Disk I/O
With GCS optimized, the actual bottleneck emerged: writing to local disk. Our preprocessing pipeline was:
- Read from GCS (2 minutes with optimization)
- Process in memory (5 minutes)
- Write to disk (25-30 hours!) ← The real bottleneck
No amount of gcsfuse optimization would help here. The writing phase was pure local I/O, and we were writing zarr arrays with millions of chunks. Each variable took 7-8 minutes to write, and we had 200+ variables.
This is a critical lesson: optimize the right thing. We spent significant time optimizing GCS access, which gave us a 97% speedup on reads. But reads were only 5% of total time! The real win would have been from optimizing disk writes, but that’s fundamentally limited by hardware.
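You don't need a heavyweight profiler to catch this kind of imbalance; coarse stage-level timing is enough. Here's a minimal sketch of the kind of stage timer that would have surfaced it immediately (the helper names are illustrative, not our actual code):

import time
from contextlib import contextmanager

@contextmanager
def stage(name):
    # Time a pipeline stage and print its wall-clock duration
    start = time.time()
    yield
    print(f"[{name}] {(time.time() - start) / 60:.1f} min")

with stage("read from GCS"):
    merged = load_and_merge_inputs()  # hypothetical helper
with stage("process"):
    processed = process_data(merged)
with stage("write to disk"):
    processed.to_zarr('/local/output')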
Distributed Architecture
We split the work across three machines based on data locality and feature sets; a sketch of the assignment follows below:
Machine 1: Land-based features (226 variables)
Machine 2: Ocean-based features (227 variables)
Machine 3: Ocean-based features (227 variables)
Each machine:
- Preprocessed its assigned feature set
- Wrote to local disk (~17TB per machine)
- Would later train models on its data
This approach had several benefits:
- Data locality: Each machine had its preprocessed data locally for training
- Network efficiency: Minimized data transfer between machines
- Storage balance: Each machine used ~80% of available 20TB
- Failure isolation: One machine failing didn’t block others
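The assignment itself can be trivial because the work is embarrassingly parallel across variables. A sketch of the idea, with a hypothetical MACHINE_ID environment variable set per machine at launch (our actual split followed the feature sets above):

import os

# Hypothetical assignment table; in practice the split followed feature sets
ASSIGNMENTS = {
    1: land_variables,         # 226 land-based features
    2: ocean_variables[:227],  # first half of ocean features
    3: ocean_variables[227:],  # second half of ocean features
}
machine_id = int(os.environ['MACHINE_ID'])  # hypothetical, set per machine
my_variables = ASSIGNMENTS[machine_id]
print(f"Machine {machine_id}: processing {len(my_variables)} variables")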
Testing Philosophy: Test Small, Deploy Large
Before running 30-hour preprocessing jobs, we implemented a test mode:
if TEST_MODE:
    # Use only 5 variables instead of 200+
    variables = variables[:5]
This let us:
- Validate the entire pipeline in 40 minutes
- Test gcsfuse configurations quickly
- Catch bugs early (in minutes, not days)
- Benchmark and extrapolate to full scale
The test run became our source of truth for performance estimates. When test mode ran successfully in 36-42 minutes for 5 variables, we knew production would take approximately:
- (226 variables / 5) × 42 minutes ≈ 32 hours (Machine 1)
- (227 variables / 5) × 36 minutes ≈ 27 hours (Machine 2/3)
And that’s exactly what happened in production.
Monitoring and Observability
With preprocessing running for 30+ hours, monitoring was critical:
# Watch preprocessing progress
tail -f preprocess.log
# Monitor cache usage
watch -n 60 'du -sh /cache/directory'
# Check GCS rate limiting
tail -f /var/log/gcsfuse.log | grep -i "rate\|429"
# Monitor output size
watch -n 60 'du -sh /output'
We also built automated progress tracking that printed updates every 20 variables, showing:
- Variables processed so far
- Current output size
- Average time per variable
- ETA for completion
Key insight: For long-running jobs, good logging is as important as the code itself.
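For the logging itself, Python's standard logging module with timestamped lines goes a long way: timestamps let you reconstruct per-stage durations after the fact. A sketch of a reasonable setup (not our literal configuration):

import logging

logging.basicConfig(
    filename='preprocess.log',
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s',  # timestamps make durations recoverable
)
log = logging.getLogger(__name__)
log.info("Starting preprocessing: %d variables", len(variables))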
Lessons Learned
1. Understand Your Storage System
GCS is not a filesystem. It’s an object store with:
- High latency for metadata (100-200ms per operation)
- High bandwidth for data (multiple GB/s when parallelized)
- Rate limits that seem scary but are rarely hit in practice
- No server-side caching (you must cache client-side)
Default gcsfuse settings are conservative. For batch processing on large machines, aggressive caching and parallelization are safe and dramatically faster.
2. Optimize the Bottleneck
We spent days optimizing GCS access, achieving a 97% speedup on reads. But reads were only 5% of runtime. The real bottleneck was disk writes, which we couldn’t optimize much.
Takeaway: Profile first, optimize the slowest part second.
3. Test Small, Deploy Big
Our test mode strategy saved us countless hours of debugging. When preprocessing takes 30 hours, you don’t want to discover a bug at hour 28.
4. Design for Failure
Long-running jobs will fail. Our preprocessing was designed to be restartable (see the sketch after this list):
- Check if output exists before starting
- Ask user to confirm before deleting existing output
- Save progress periodically
- Log everything for post-mortem analysis
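Here's a sketch of the restartability check, assuming a hypothetical one-Zarr-store-per-variable output layout (our actual layout differed, but the pattern is the same):

import os

for var in variables:
    out_path = f'/local/output/{var}.zarr'  # hypothetical per-variable output layout
    if os.path.exists(out_path):
        print(f"Skipping {var}: output already exists")
        continue
    process_variable(var)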
5. Connection Limits Are Soft Limits
We worried about hitting GCS’s 5,000 req/sec limit with 240 concurrent connections across 3 machines. In practice:
- Our workload was bursty, not constant
- GCS auto-retries with exponential backoff
- Machines naturally staggered their load
- Even if we hit limits briefly, the impact was seconds of delay over 30 hours
Takeaway: Understand your workload pattern before being conservative with parallelization.
Final Results
After all optimizations:
Time per machine:
- Machine 1: 32 hours
- Machine 2: 27 hours
- Machine 3: 27 hours
Running in parallel: 32 hours total (1.3 days)
Storage per machine: ~17TB (out of 20TB available)
GCS reading time: ~2 minutes (< 2% of total time)
Bottleneck: Local disk I/O (93% of time)
Takeaways for ML Infrastructure
- Cloud storage is fast when configured correctly: Default settings are conservative; understand your workload and tune aggressively.
- Test mode is not optional: For long-running jobs, a fast test mode that validates the entire pipeline is essential.
- Cache everything you can: With 512GB cache on 756GB RAM machines, we could cache frequently accessed data at local disk speeds.
- Parallelize everywhere: From GCS connections to file downloads to processing workers, parallelization was key.
- Know your bottleneck: We spent time optimizing GCS when the real issue was disk I/O. Profile first.
- Design for observability: Long-running jobs need good logging, progress tracking, and monitoring.
Code Snippets
Optimal gcsfuse mounting:
#!/bin/bash
gcsfuse \
--implicit-dirs \
--stat-cache-max-size-mb 51200 \
--metadata-cache-ttl-secs 600 \
--type-cache-max-size-mb 16 \
--kernel-list-cache-ttl-secs 1800 \
--cache-dir /fast/local/disk \
--file-cache-max-size-mb 512000 \
--file-cache-enable-parallel-downloads \
--file-cache-max-parallel-downloads 200 \
--max-conns-per-host 80 \
--max-idle-conns-per-host 100 \
--http-client-timeout 30m \
bucket-name /mount/point
Test mode pattern:
import sys

# Always start with test mode
TEST_MODE = '--test' in sys.argv

if TEST_MODE:
    # Process small subset for validation
    variables = variables[:5]
    print(f"TEST MODE: Using {len(variables)} variables")
else:
    # Full production run
    print(f"PRODUCTION: Processing {len(variables)} variables")
Progress monitoring:
import time

start_time = time.time()
for i, var in enumerate(variables):
    process_variable(var)
    if (i + 1) % 20 == 0:
        elapsed = time.time() - start_time
        avg_time = elapsed / (i + 1)  # seconds per variable so far
        remaining = avg_time * (len(variables) - i - 1)
        print(f"Progress: {i + 1}/{len(variables)} | "
              f"Avg: {avg_time/60:.1f}min/var | "
              f"ETA: {remaining/3600:.1f}h")
Conclusion
Building high-performance ML pipelines isn’t just about the ML—it’s about understanding your infrastructure. In our case:
- 10-100x speedup from gcsfuse optimization
- 8 days → 1.3 days for preprocessing
- 512GB cache turning cloud storage into “local” storage
- 80 parallel connections saturating network bandwidth
But perhaps the biggest lesson: the right optimization in the wrong place is still wrong. We optimized GCS to near-perfection, reducing read time from 70 minutes to 2 minutes. But with reads being 5% of runtime, the real bottleneck was always disk I/O.
Know your bottleneck. Profile first. Optimize second. And always, always have a test mode.
Building ML infrastructure? Dealing with large-scale data preprocessing? Feel free to reach out—I’d love to hear about your challenges and solutions.
Appendix: Performance Breakdown
Preprocessing stages (for one machine):
Stage 1: Scan batches 3 min (0.2%)
Stage 2: Load model vars 10 min (0.5%)
Stage 3: Load features 2 min (0.1%) ← gcsfuse optimized!
Stage 4: Concatenate 5 min (0.3%)
Stage 5: Write to disk 1,770 min (98.9%) ← Actual bottleneck
Stage 6: Finalize 1 min (0.1%)
───────────────────────────────────────────────
Total: 1,791 min (29.9 hours)
GCS optimization impact:
- Before: 70 minutes reading from GCS
- After: 2 minutes reading from GCS
- Speedup: 35x faster
- Impact on total time: -68 minutes (4% faster overall)
Key insight: 35x speedup on 5% of work = 4% total speedup. Always optimize the bottleneck.
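This is just Amdahl's law in action. Plugging in the numbers from the breakdown above:

# Amdahl's law with the appendix numbers: a 35x speedup on the read stage
read_before, read_after = 70, 2  # minutes
total_before = 1791 - read_after + read_before  # total before optimization: 1,859 min
saved = read_before - read_after  # 68 minutes
print(f"Read fraction of runtime: {read_before / total_before:.1%}")  # ~3.8%
print(f"Overall speedup: {saved / total_before:.1%}")  # ~3.7%, i.e. ~4%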