Data Engineering · ETL · GCP · BigQuery
Avg Temp 17.89°C Humidity 0.68

Weather Data
A production ETL pipeline — from raw CSV to cloud analytics

96,000+ historical weather records. Five modular pipeline phases. Two storage targets. One live dashboard. This project demonstrates what production-grade data engineering looks like in practice — not just in theory.

EXTRACT
TRANSFORM
LOAD · SQLite
LOAD · BigQuery
VISUALIZE
96K+
Records Processed
historical weather readings
5
Pipeline Phases
Extract → Visualize
2
Storage Targets
SQLite + BigQuery / GCP
4
Dashboard Pages
live · interactive · Looker Studio
The Pipeline

Five phases. Every one production-ready.

Each phase is a distinct Python function — modular, testable, and independently deployable. The entire pipeline is orchestrated by a single run_pipeline() call with structured logging at every step.
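The orchestration layer can be sketched roughly as below. The phase names mirror the sections that follow, but the function bodies here are stand-in stubs for illustration, not the project's actual implementations:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("weather_etl")

# Stand-in phase stubs; in the real pipeline each is a full module-level function.
def extract():           return [{"temp": 17.9}]
def transform(rows):     return rows
def load_sqlite(rows):   pass
def load_bigquery(rows): pass
def verify():            pass

def run_pipeline():
    """Run all five phases in order with a log line per step.

    Any exception raised by a phase halts the run immediately, so bad
    data never flows past the phase that detected it.
    """
    log.info("Phase 1: extract")
    rows = extract()
    log.info("Phase 2: transform")
    rows = transform(rows)
    log.info("Phase 3a: load -> SQLite")
    load_sqlite(rows)
    log.info("Phase 3b: load -> BigQuery")
    load_bigquery(rows)
    log.info("Phase 4: verify + visualize")
    verify()
    log.info("Pipeline complete: %d rows", len(rows))
    return rows
```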

01
EXTRACT
Data Ingestion & Validation
extract()

Loads 96K+ rows from CSV using pandas, with a file-existence check, required-column validation against a predefined schema, and immediate logging of row counts and column inventory. Raises typed exceptions on failure — never silently passes bad data downstream.

→ 96,453 rows extracted · 9 columns validated · schema confirmed
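A minimal sketch of this phase is shown below; the exact required-column set is an illustrative assumption, not the project's full nine-column schema:

```python
import logging
import os

import pandas as pd

log = logging.getLogger("weather_etl")

# Columns the pipeline expects. These names are illustrative assumptions;
# the real schema validates all nine columns of the dataset.
REQUIRED_COLUMNS = {"Formatted Date", "Temperature (C)", "Humidity"}

def extract(csv_path: str) -> pd.DataFrame:
    """Load the raw CSV, failing loudly on a missing file or missing columns."""
    if not os.path.exists(csv_path):
        raise FileNotFoundError(f"Source file not found: {csv_path}")
    df = pd.read_csv(csv_path)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    log.info("Extracted %d rows, %d columns: %s",
             len(df), len(df.columns), list(df.columns))
    return df
```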
02
TRANSFORM
Cleaning · Normalization · Feature Engineering
transform()

Three sub-phases: 2.1 Cleaning — deduplication, null removal with row-count logging before and after. 2.2 Normalization — datetime parsing with coerce + invalid date detection, temperature rounding, humidity clipped to [0,1]. 2.3 Feature Engineering — derives Part_of_Day (Morning/Afternoon/Evening/Night), Weekday, and Hour from timestamps for temporal analysis.

→ 641 rows removed · 3 features engineered · Part_of_Day · Weekday · Hour
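The three sub-phases can be sketched as one function. The column names and the Part_of_Day hour bins follow the dashboard description; the specific rounding and clipping calls are this sketch's assumptions about the implementation:

```python
import logging

import pandas as pd

log = logging.getLogger("weather_etl")

def part_of_day(hour: int) -> str:
    # Bins as documented: Morning 5-11, Afternoon 12-16, Evening 17-20, Night 21-4.
    if 5 <= hour <= 11:
        return "Morning"
    if 12 <= hour <= 16:
        return "Afternoon"
    if 17 <= hour <= 20:
        return "Evening"
    return "Night"

def transform(df: pd.DataFrame) -> pd.DataFrame:
    before = len(df)
    # 2.1 Cleaning: deduplication and null removal.
    df = df.drop_duplicates().dropna()
    # 2.2 Normalization: coerce timestamps (invalid dates become NaT and are
    # dropped), round temperature, clip humidity to [0, 1].
    df["Formatted Date"] = pd.to_datetime(df["Formatted Date"], errors="coerce")
    df = df.dropna(subset=["Formatted Date"])
    df["Temperature (C)"] = df["Temperature (C)"].round(2)
    df["Humidity"] = df["Humidity"].clip(0, 1)
    # 2.3 Feature engineering: temporal features from the parsed timestamp.
    df["Hour"] = df["Formatted Date"].dt.hour
    df["Weekday"] = df["Formatted Date"].dt.day_name()
    df["Part_of_Day"] = df["Hour"].map(part_of_day)
    log.info("Rows: %d before cleaning, %d after", before, len(df))
    return df
```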
03a
LOAD · LOCAL
SQLite — Local Persistence
load_sqlite()

Persists the full cleaned dataset to weather_cleaned and a daily temperature aggregation to daily_avg_temp. Both tables include a post-load row-count assertion — if the database row count doesn't match the DataFrame, the pipeline raises an error before continuing.

→ weather_cleaned: 95,812 rows · daily_avg_temp: 3,652 rows · assert ✓
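The load-and-assert pattern might look like the following. Table names come from the description above; building the daily rollup with resample, and the column names, are assumptions of this sketch:

```python
import sqlite3

import pandas as pd

def load_sqlite(df: pd.DataFrame, db_path: str = "weather.db") -> None:
    """Persist the cleaned data plus a daily average-temperature rollup,
    then assert that the database row counts match the DataFrames."""
    daily = (
        df.set_index("Formatted Date")["Temperature (C)"]
          .resample("D").mean()
          .reset_index(name="avg_temp")
    )
    with sqlite3.connect(db_path) as conn:
        df.to_sql("weather_cleaned", conn, if_exists="replace", index=False)
        daily.to_sql("daily_avg_temp", conn, if_exists="replace", index=False)
        # Post-load assertion: fail fast if the write was partial.
        for table, expected in [("weather_cleaned", len(df)),
                                ("daily_avg_temp", len(daily))]:
            (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
            if count != expected:
                raise RuntimeError(f"{table}: wrote {expected} rows, found {count}")
```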
03b
LOAD · CLOUD
Google BigQuery — Cloud Analytics
load_bigquery()

Sanitizes all column names using regex (re.sub(r"[^a-zA-Z0-9_]", "_", name)) for BigQuery schema compatibility. Authenticates via GCP OAuth with graceful Colab/local fallback. Uploads both tables using pandas-gbq with if_exists="replace" to support re-runs.

→ etl-final-project.weather_data.weather_cleaned → BigQuery ✓
04
VERIFY + VISUALIZE
Integrity Checks + Dashboard Output
verify() + visualise()

SQL spot-check queries confirm both table row counts and run a Part_of_Day aggregation to validate feature engineering output. Matplotlib charts saved to outputs/. Looker Studio dashboards built from BigQuery data — 4 pages covering temperature trends, humidity, wind speed, and the Part_of_Day × Weekday heatmap.

→ both tables verified · 4-page Looker Studio dashboard live
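The SQL spot-check can be sketched against the local SQLite copy as below. The table names come from the load phase; keeping the unsanitized "Temperature (C)" column name locally is this sketch's assumption, since sanitization is described only for BigQuery:

```python
import sqlite3

def verify(db_path: str = "weather.db") -> dict:
    """Spot-check both table row counts and the Part_of_Day aggregation."""
    with sqlite3.connect(db_path) as conn:
        counts = {
            table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
            for table in ("weather_cleaned", "daily_avg_temp")
        }
        # Aggregating by the engineered feature validates the Transform output.
        rows = conn.execute(
            'SELECT Part_of_Day, AVG("Temperature (C)") '
            "FROM weather_cleaned GROUP BY Part_of_Day"
        ).fetchall()
    return {"counts": counts, "part_of_day": dict(rows)}
```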

10 years of weather — four insights

These visualizations are built directly from the BigQuery output — the same data that flows through the pipeline above. Each chart answers a specific analytical question about historical weather patterns.

Page 2 — Temperature Analysis by Time of Day & Weekday
Temperature analysis by part of day and weekday
Afternoon is consistently 8°C hotter than Night

The Part_of_Day feature engineered during transformation proves analytically valuable — afternoons average 22.7°C vs 14.7°C at night. Weekday analysis shows temperature is stable across days of the week, confirming the seasonal (not weekly) nature of temperature variation.

→ Part_of_Day engineered from Hour: Morning 5–11, Afternoon 12–16, Evening 17–20, Night 21–4
Page 3 — Humidity & Wind Speed Monthly Trends
Humidity and wind speed monthly trends
Wind speed drops as the year progresses; humidity rises

Wind speeds peak in early spring (~13 km/h) and decline through autumn (~9 km/h). Humidity shows an inverse pattern — relatively stable mid-year, rising toward year-end. These opposing trends suggest a seasonal weather cycle where spring brings drier, windier conditions.

→ Both metrics aggregated by Month using GROUP BY on BigQuery output — monthly averages from 96K+ rows
Page 4 — Temperature Heatmap: Part of Day × Weekday
Temperature heatmap part of day by weekday
The most analytical chart in the dashboard — and the most telling

This pivot table heatmap crosses Part_of_Day (rows) with Weekday (columns) — both features engineered during the Transform phase. The gradient from white (Night, ~14.6°C) to dark blue (Afternoon, ~23°C) is consistent across every single day of the week. Temperature variation is driven entirely by time of day, not day of week — a clean, analytically rigorous finding made possible by the feature engineering step.

→ This chart validates the feature engineering: Part_of_Day captures real temperature structure; Weekday confirms it's not a confound
Engineering Challenges

Real problems. Real solutions.

Every engineering project encounters unexpected constraints. These are the specific challenges encountered building this pipeline — and how each was solved with sound engineering judgment.

⚠ Challenge 1
BigQuery rejects column names containing spaces, parentheses, and special characters. The weather dataset had columns like "Temperature (C)" and "Wind Speed (km/h)" — none of which are valid BigQuery schema names.
✓ Solution
Implemented a clean_column_name() utility using re.sub(r"[^a-zA-Z0-9_]", "_", name) to sanitize every column name before upload. Added a digit-prefix guard for columns starting with numbers. Applied consistently via list comprehension — all 9 columns sanitized in one step before every BigQuery operation.
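The utility might look like this. The regex is quoted from the description above; prepending an underscore as the digit-prefix guard is this sketch's assumption about that behavior:

```python
import re

def clean_column_name(name: str) -> str:
    """Sanitize a column name for BigQuery: letters, digits, and
    underscores only, and no leading digit."""
    cleaned = re.sub(r"[^a-zA-Z0-9_]", "_", name)
    if cleaned and cleaned[0].isdigit():  # digit-prefix guard (assumed behavior)
        cleaned = "_" + cleaned
    return cleaned

# Applied consistently via list comprehension before every BigQuery operation:
# df.columns = [clean_column_name(c) for c in df.columns]
```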
⚠ Challenge 2
Docker container specs were insufficient for local PostgreSQL: the container environment couldn't be configured to run a local database server reliably, which was the original target for local storage.
✓ Solution
Pivoted to a dual-target architecture that turned the constraint into a feature: SQLite (Python built-in, zero config) for local persistence + BigQuery for cloud-scale analytics. This is actually a stronger design — each storage target serves a distinct purpose, and the pipeline works identically in both local and cloud environments.
⚠ Challenge 3
Raw datetime strings in the source data were inconsistently formatted and needed conversion to pandas datetime objects before any temporal feature extraction could proceed.
✓ Solution
pd.to_datetime(df["Formatted Date"], errors="coerce") with a downstream NaN count check and conditional drop. Any unparseable dates are logged with WARNING level before removal — the pipeline never silently drops data. This approach is production-safe: it handles edge cases without failing on partial bad data.
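Wrapped as a function, the pattern looks roughly like this; the function name and WARNING-level message wording are illustrative, while the coerce-then-check structure follows the description above:

```python
import logging

import pandas as pd

log = logging.getLogger("weather_etl")

def parse_dates(df: pd.DataFrame, column: str = "Formatted Date") -> pd.DataFrame:
    """Coerce raw datetime strings; log and drop rows that fail to parse."""
    df[column] = pd.to_datetime(df[column], errors="coerce")
    bad = int(df[column].isna().sum())
    if bad:
        # Never drop data silently: record how many rows were unparseable.
        log.warning("Dropping %d rows with unparseable dates", bad)
        df = df.dropna(subset=[column])
    return df
```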
⚠ Challenge 4
GCP authentication needed to work securely in Google Colab without exposing service account credentials in code, while also supporting local development runs outside Colab.
✓ Solution
Wrapped google.colab.auth.authenticate_user() in a try/except ImportError block — in Colab it triggers the built-in OAuth flow; outside Colab it gracefully falls back to GOOGLE_APPLICATION_CREDENTIALS env var. Same pipeline code, same result, two environments — no credentials hardcoded anywhere.
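A sketch of the dual-environment pattern is below. Returning an environment label, and failing hard when the local credentials variable is unset, are choices of this sketch rather than documented behavior:

```python
import os

def authenticate() -> str:
    """Authenticate to GCP in Colab or locally, with no hardcoded credentials."""
    try:
        from google.colab import auth  # only importable inside Colab
        auth.authenticate_user()       # triggers Colab's built-in OAuth flow
        return "colab"
    except ImportError:
        # Local/CI fallback: google-cloud clients and pandas-gbq pick up the
        # service-account key from this environment variable automatically.
        if not os.environ.get("GOOGLE_APPLICATION_CREDENTIALS"):
            raise RuntimeError(
                "Set GOOGLE_APPLICATION_CREDENTIALS to a service-account key file"
            )
        return "local"
```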

Built with

Core Language
Python 3.10 pandas NumPy
Cloud & Storage
Google BigQuery GCP pandas-gbq SQLite
Visualization
Looker Studio matplotlib
Infrastructure
Google Colab GitHub GitHub Pages

Production-pattern engineering on real data.

Modular functions. Structured logging. Typed exception handling. Row-count validation. Regex schema sanitization. Dual-target loading. Live cloud dashboards. This pipeline does what production analytics engineering does — it just happens to be about weather.

MS Data Science & Analytics · Grand Valley State University · Python · BigQuery · GCP

View Live Dashboard ↗ View Code on GitHub ↗ ← Back to Portfolio