Data Engineering · ETL · GCP · BigQuery
Avg Temp 17.89°C Humidity 0.68

Weather Data
A production ETL pipeline — from raw CSV to cloud analytics

96,000+ historical weather records. Five modular pipeline phases. Two storage targets. One live dashboard. This project demonstrates what production-grade data engineering looks like in practice — not just in theory.

EXTRACT
TRANSFORM
LOAD · SQLite
LOAD · BigQuery
VISUALIZE
96K+
Records Processed
historical weather readings
5
Pipeline Phases
Extract → Visualize
2
Storage Targets
SQLite + BigQuery / GCP
4
Dashboard Pages
live · interactive · Looker Studio
The Pipeline

Five phases. Every one production-ready.

Each phase is a distinct Python function — modular, testable, and independently deployable. The entire pipeline is orchestrated by a single run_pipeline() call with structured logging at every step.
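The orchestration layer can be sketched roughly as below. The phase names mirror the sections that follow, but the function bodies here are stand-in stubs for illustration, not the project's actual implementations:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("weather_etl")

# Stand-in phase stubs; in the real pipeline each is a full module-level function.
def extract():           return [{"temp": 17.9}]
def transform(rows):     return rows
def load_sqlite(rows):   pass
def load_bigquery(rows): pass
def verify():            pass

def run_pipeline():
    """Run all five phases in order with a log line per step.

    Any exception raised by a phase halts the run immediately, so bad
    data never flows past the phase that detected it.
    """
    log.info("Phase 1: extract")
    rows = extract()
    log.info("Phase 2: transform")
    rows = transform(rows)
    log.info("Phase 3a: load -> SQLite")
    load_sqlite(rows)
    log.info("Phase 3b: load -> BigQuery")
    load_bigquery(rows)
    log.info("Phase 4: verify + visualize")
    verify()
    log.info("Pipeline complete: %d rows", len(rows))
    return rows
```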

01
EXTRACT
Data Ingestion & Validation
extract()

Loads 96K+ rows from CSV using pandas, with a file-existence check, required-column validation against a predefined schema, and immediate logging of row counts and column inventory. Raises typed exceptions on failure — never silently passes bad data downstream.

→ 96,453 rows extracted · 9 columns validated · schema confirmed
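A minimal sketch of this phase is shown below; the exact required-column set is an illustrative assumption, not the project's full nine-column schema:

```python
import logging
import os

import pandas as pd

log = logging.getLogger("weather_etl")

# Columns the pipeline expects. These names are illustrative assumptions;
# the real schema validates all nine columns of the dataset.
REQUIRED_COLUMNS = {"Formatted Date", "Temperature (C)", "Humidity"}

def extract(csv_path: str) -> pd.DataFrame:
    """Load the raw CSV, failing loudly on a missing file or missing columns."""
    if not os.path.exists(csv_path):
        raise FileNotFoundError(f"Source file not found: {csv_path}")
    df = pd.read_csv(csv_path)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    log.info("Extracted %d rows, %d columns: %s",
             len(df), len(df.columns), list(df.columns))
    return df
```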
02
TRANSFORM
Cleaning · Normalization · Feature Engineering
transform()

Three sub-phases: 2.1 Cleaning — deduplication, null removal with row-count logging before and after. 2.2 Normalization — datetime parsing with coerce + invalid date detection, temperature rounding, humidity clipped to [0,1]. 2.3 Feature Engineering — derives Part_of_Day (Morning/Afternoon/Evening/Night), Weekday, and Hour from timestamps for temporal analysis.

→ 641 rows removed · 3 features engineered · Part_of_Day · Weekday · Hour
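The three sub-phases can be sketched as one function. The column names and the Part_of_Day hour bins follow the dashboard description; the specific rounding and clipping calls are this sketch's assumptions about the implementation:

```python
import logging

import pandas as pd

log = logging.getLogger("weather_etl")

def part_of_day(hour: int) -> str:
    # Bins as documented: Morning 5-11, Afternoon 12-16, Evening 17-20, Night 21-4.
    if 5 <= hour <= 11:
        return "Morning"
    if 12 <= hour <= 16:
        return "Afternoon"
    if 17 <= hour <= 20:
        return "Evening"
    return "Night"

def transform(df: pd.DataFrame) -> pd.DataFrame:
    before = len(df)
    # 2.1 Cleaning: deduplication and null removal.
    df = df.drop_duplicates().dropna()
    # 2.2 Normalization: coerce timestamps (invalid dates become NaT and are
    # dropped), round temperature, clip humidity to [0, 1].
    df["Formatted Date"] = pd.to_datetime(df["Formatted Date"], errors="coerce")
    df = df.dropna(subset=["Formatted Date"])
    df["Temperature (C)"] = df["Temperature (C)"].round(2)
    df["Humidity"] = df["Humidity"].clip(0, 1)
    # 2.3 Feature engineering: temporal features from the parsed timestamp.
    df["Hour"] = df["Formatted Date"].dt.hour
    df["Weekday"] = df["Formatted Date"].dt.day_name()
    df["Part_of_Day"] = df["Hour"].map(part_of_day)
    log.info("Rows: %d before cleaning, %d after", before, len(df))
    return df
```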
03a
LOAD · LOCAL
SQLite — Local Persistence
load_sqlite()

Persists the full cleaned dataset to weather_cleaned and a daily temperature aggregation to daily_avg_temp. Both tables include a post-load row-count assertion — if the database row count doesn't match the DataFrame, the pipeline raises an error before continuing.

→ weather_cleaned: 95,812 rows · daily_avg_temp: 3,652 rows · assert ✓
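The load-and-assert pattern might look like the following. Table names come from the description above; building the daily rollup with resample, and the column names, are assumptions of this sketch:

```python
import sqlite3

import pandas as pd

def load_sqlite(df: pd.DataFrame, db_path: str = "weather.db") -> None:
    """Persist the cleaned data plus a daily average-temperature rollup,
    then assert that the database row counts match the DataFrames."""
    daily = (
        df.set_index("Formatted Date")["Temperature (C)"]
          .resample("D").mean()
          .reset_index(name="avg_temp")
    )
    with sqlite3.connect(db_path) as conn:
        df.to_sql("weather_cleaned", conn, if_exists="replace", index=False)
        daily.to_sql("daily_avg_temp", conn, if_exists="replace", index=False)
        # Post-load assertion: fail fast if the write was partial.
        for table, expected in [("weather_cleaned", len(df)),
                                ("daily_avg_temp", len(daily))]:
            (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
            if count != expected:
                raise RuntimeError(f"{table}: wrote {expected} rows, found {count}")
```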
03b
LOAD · CLOUD
Google BigQuery — Cloud Analytics
load_bigquery()

Sanitizes all column names using regex (re.sub(r"[^a-zA-Z0-9_]", "_", name)) for BigQuery schema compatibility. Authenticates via GCP OAuth with graceful Colab/local fallback. Uploads both tables using pandas-gbq with if_exists="replace" to support re-runs.

→ etl-final-project.weather_data.weather_cleaned → BigQuery ✓
04
VERIFY + VISUALIZE
Integrity Checks + Dashboard Output
verify() + visualise()

SQL spot-check queries confirm both table row counts and run a Part_of_Day aggregation to validate feature engineering output. Matplotlib charts saved to outputs/. Looker Studio dashboards built from BigQuery data — 4 pages covering temperature trends, humidity, wind speed, and the Part_of_Day × Weekday heatmap.

→ both tables verified · 4-page Looker Studio dashboard live
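The SQL spot-check can be sketched against the local SQLite copy as below. The table names come from the load phase; keeping the unsanitized "Temperature (C)" column name locally is this sketch's assumption, since sanitization is described only for BigQuery:

```python
import sqlite3

def verify(db_path: str = "weather.db") -> dict:
    """Spot-check both table row counts and the Part_of_Day aggregation."""
    with sqlite3.connect(db_path) as conn:
        counts = {
            table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
            for table in ("weather_cleaned", "daily_avg_temp")
        }
        # Aggregating by the engineered feature validates the Transform output.
        rows = conn.execute(
            'SELECT Part_of_Day, AVG("Temperature (C)") '
            "FROM weather_cleaned GROUP BY Part_of_Day"
        ).fetchall()
    return {"counts": counts, "part_of_day": dict(rows)}
```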

10 years of weather — four insights

These visualizations are built directly from the BigQuery output — the same data that flows through the pipeline above. Each chart answers a specific analytical question about historical weather patterns.

Page 2 — Temperature Analysis by Time of Day & Weekday
Temperature analysis by part of day and weekday
Afternoon is consistently 8°C hotter than Night

The Part_of_Day feature engineered during transformation proves analytically valuable — afternoons average 22.7°C vs 14.7°C at night. Weekday analysis shows temperature is stable across days of the week, confirming the seasonal (not weekly) nature of temperature variation.

→ Part_of_Day engineered from Hour: Morning 5–11, Afternoon 12–16, Evening 17–20, Night 21–4
Page 3 — Humidity & Wind Speed Monthly Trends
Humidity and wind speed monthly trends
Wind speed drops as the year progresses; humidity rises

Wind speeds peak in early spring (~13 km/h) and decline through autumn (~9 km/h). Humidity shows an inverse pattern — relatively stable mid-year, rising toward year-end. These opposing trends suggest a seasonal weather cycle where spring brings drier, windier conditions.

→ Both metrics aggregated by Month using GROUP BY on BigQuery output — monthly averages from 96K+ rows
Page 4 — Temperature Heatmap: Part of Day × Weekday
Temperature heatmap part of day by weekday
The most analytical chart in the dashboard — and the most telling

This pivot table heatmap crosses Part_of_Day (rows) with Weekday (columns) — both features engineered during the Transform phase. The gradient from white (Night, ~14.6°C) to dark blue (Afternoon, ~23°C) is consistent across every single day of the week. Temperature variation is driven entirely by time of day, not day of week — a clean, analytically rigorous finding made possible by the feature engineering step.

→ This chart validates the feature engineering: Part_of_Day captures real temperature structure; Weekday confirms it's not a confound
Engineering Challenges

Real problems. Real solutions.

Every engineering project encounters unexpected constraints. These are the specific challenges encountered building this pipeline — and how each was solved with sound engineering judgment.

⚠ Challenge 1
BigQuery rejects column names containing spaces, parentheses, and special characters. The weather dataset had columns like "Temperature (C)" and "Wind Speed (km/h)" — none of which are valid BigQuery schema names.
✓ Solution
Implemented a clean_column_name() utility using re.sub(r"[^a-zA-Z0-9_]", "_", name) to sanitize every column name before upload. Added a digit-prefix guard for columns starting with numbers. Applied consistently via list comprehension — all 9 columns sanitized in one step before every BigQuery operation.
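The utility might look like this. The regex is quoted from the description above; prepending an underscore as the digit-prefix guard is this sketch's assumption about that behavior:

```python
import re

def clean_column_name(name: str) -> str:
    """Sanitize a column name for BigQuery: letters, digits, and
    underscores only, and no leading digit."""
    cleaned = re.sub(r"[^a-zA-Z0-9_]", "_", name)
    if cleaned and cleaned[0].isdigit():  # digit-prefix guard (assumed behavior)
        cleaned = "_" + cleaned
    return cleaned

# Applied consistently via list comprehension before every BigQuery operation:
# df.columns = [clean_column_name(c) for c in df.columns]
```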
⚠ Challenge 2
Docker container specs were insufficient for local PostgreSQL: the container environment couldn't be configured to run a local database server reliably, which was the original target for local storage.
✓ Solution
Pivoted to a dual-target architecture that turned the constraint into a feature: SQLite (Python built-in, zero config) for local persistence + BigQuery for cloud-scale analytics. This is actually a stronger design — each storage target serves a distinct purpose, and the pipeline works identically in both local and cloud environments.
⚠ Challenge 3
Raw datetime strings in the source data were inconsistently formatted and needed conversion to pandas datetime objects before any temporal feature extraction could proceed.
✓ Solution
pd.to_datetime(df["Formatted Date"], errors="coerce") with a downstream NaN count check and conditional drop. Any unparseable dates are logged with WARNING level before removal — the pipeline never silently drops data. This approach is production-safe: it handles edge cases without failing on partial bad data.
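Wrapped as a function, the pattern looks roughly like this; the function name and WARNING-level message wording are illustrative, while the coerce-then-check structure follows the description above:

```python
import logging

import pandas as pd

log = logging.getLogger("weather_etl")

def parse_dates(df: pd.DataFrame, column: str = "Formatted Date") -> pd.DataFrame:
    """Coerce raw datetime strings; log and drop rows that fail to parse."""
    df[column] = pd.to_datetime(df[column], errors="coerce")
    bad = int(df[column].isna().sum())
    if bad:
        # Never drop data silently: record how many rows were unparseable.
        log.warning("Dropping %d rows with unparseable dates", bad)
        df = df.dropna(subset=[column])
    return df
```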
⚠ Challenge 4
GCP authentication needed to work securely in Google Colab without exposing service account credentials in code, while also supporting local development runs outside Colab.
✓ Solution
Wrapped google.colab.auth.authenticate_user() in a try/except ImportError block — in Colab it triggers the built-in OAuth flow; outside Colab it gracefully falls back to GOOGLE_APPLICATION_CREDENTIALS env var. Same pipeline code, same result, two environments — no credentials hardcoded anywhere.
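A sketch of the dual-environment pattern is below. Returning an environment label, and failing hard when the local credentials variable is unset, are choices of this sketch rather than documented behavior:

```python
import os

def authenticate() -> str:
    """Authenticate to GCP in Colab or locally, with no hardcoded credentials."""
    try:
        from google.colab import auth  # only importable inside Colab
        auth.authenticate_user()       # triggers Colab's built-in OAuth flow
        return "colab"
    except ImportError:
        # Local/CI fallback: google-cloud clients and pandas-gbq pick up the
        # service-account key from this environment variable automatically.
        if not os.environ.get("GOOGLE_APPLICATION_CREDENTIALS"):
            raise RuntimeError(
                "Set GOOGLE_APPLICATION_CREDENTIALS to a service-account key file"
            )
        return "local"
```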

Built with

Core Language
Python 3.10 pandas NumPy
Cloud & Storage
Google BigQuery GCP pandas-gbq SQLite
Visualization
Looker Studio matplotlib
Infrastructure
Google Colab GitHub GitHub Pages

Production-pattern engineering on real data.

Modular functions. Structured logging. Typed exception handling. Row-count validation. Regex schema sanitization. Dual-target loading. Live cloud dashboards. This pipeline does what production analytics engineering does — it just happens to be about weather.

MS Data Science & Analytics · Grand Valley State University · Python · BigQuery · GCP

View Live Dashboard ↗ View Code on GitHub ↗ ← Back to Portfolio