96,000+ historical weather records. Five modular pipeline phases. Two storage targets. One live dashboard. This project demonstrates what production-grade data engineering looks like in practice — not just in theory.
Each phase is a distinct Python function — modular, testable, and independently deployable. The entire pipeline is orchestrated by a single run_pipeline() call with structured logging at every step.
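The orchestration pattern described above can be sketched roughly as follows. The phase-chaining convention and names here are illustrative, not the project's actual signatures:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("weather_pipeline")

def run_pipeline(phases):
    """Run ordered (name, callable) phases; each callable receives the
    previous phase's output and returns the input for the next one."""
    data = None
    for name, fn in phases:
        logger.info("Starting phase: %s", name)
        data = fn(data)
        logger.info("Finished phase: %s", name)
    logger.info("Pipeline complete")
    return data
```

Because each phase is just a callable, any phase can be run (or tested) in isolation by calling it directly with a prepared input.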
Loads 96K+ rows from CSV using pandas, with a file-existence check, required-column validation against a predefined schema, and immediate logging of row counts and the column inventory. Raises typed exceptions on failure — never silently passes bad data downstream.
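A minimal sketch of this extract phase, assuming illustrative column names (the project's actual schema may differ):

```python
import logging
import pandas as pd

logger = logging.getLogger("weather_pipeline")

# Assumed schema for illustration only
REQUIRED_COLUMNS = {"Date_Time", "Temperature_C", "Humidity", "Wind_Speed_kmh"}

class ExtractionError(Exception):
    """Typed exception: raised when the source file is missing or malformed."""

def extract(csv_path: str) -> pd.DataFrame:
    try:
        df = pd.read_csv(csv_path)
    except FileNotFoundError as exc:
        raise ExtractionError(f"Source file not found: {csv_path}") from exc
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ExtractionError(f"Missing required columns: {sorted(missing)}")
    logger.info("Extracted %d rows, columns: %s", len(df), list(df.columns))
    return df
```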
Three sub-phases: 2.1 Cleaning — deduplication and null removal, with row-count logging before and after. 2.2 Normalization — datetime parsing with errors="coerce" and invalid-date detection, temperature rounding, humidity clipped to [0,1]. 2.3 Feature Engineering — derives Part_of_Day (Morning/Afternoon/Evening/Night), Weekday, and Hour from timestamps for temporal analysis.
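The three sub-phases can be sketched together as one transform function. Column names and the Part_of_Day hour boundaries below are assumptions, not the project's exact choices:

```python
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # 2.1 Cleaning: deduplicate and drop nulls
    df = df.drop_duplicates().dropna()

    # 2.2 Normalization: coerce timestamps, drop unparseable dates,
    # round temperature, clip humidity into [0, 1]
    df = df.copy()
    df["Date_Time"] = pd.to_datetime(df["Date_Time"], errors="coerce")
    df = df[df["Date_Time"].notna()].copy()
    df["Temperature_C"] = df["Temperature_C"].round(1)
    df["Humidity"] = df["Humidity"].clip(0, 1)

    # 2.3 Feature engineering: temporal features from the parsed timestamp
    df["Hour"] = df["Date_Time"].dt.hour
    df["Weekday"] = df["Date_Time"].dt.day_name()
    df["Part_of_Day"] = pd.cut(          # hour boundaries are assumed
        df["Hour"],
        bins=[-1, 5, 11, 17, 23],
        labels=["Night", "Morning", "Afternoon", "Evening"],
    )
    return df
```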
Persists the full cleaned dataset to weather_cleaned and a daily temperature aggregation to daily_avg_temp. Both loads include a post-load row-count assertion — if the database row count doesn't match the DataFrame, the pipeline raises an error before continuing.
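A sketch of this load-and-verify step, assuming a SQLite local target (the helper names and aggregation shape are illustrative):

```python
import sqlite3
import pandas as pd

def load_table(df: pd.DataFrame, table: str, conn: sqlite3.Connection) -> None:
    """Persist df, then assert the database row count matches the DataFrame."""
    df.to_sql(table, conn, if_exists="replace", index=False)
    (db_rows,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    if db_rows != len(df):
        raise RuntimeError(
            f"Row-count mismatch for {table}: db={db_rows} vs frame={len(df)}"
        )

def daily_avg_temp(df: pd.DataFrame) -> pd.DataFrame:
    """Daily average temperature aggregation for the second table."""
    return (
        df.assign(Date=pd.to_datetime(df["Date_Time"]).dt.strftime("%Y-%m-%d"))
          .groupby("Date", as_index=False)["Temperature_C"]
          .mean()
    )
```

In this shape the pipeline would call `load_table(cleaned, "weather_cleaned", conn)` followed by `load_table(daily_avg_temp(cleaned), "daily_avg_temp", conn)`, so a count mismatch on either table halts the run.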
Sanitizes all column names using regex (re.sub(r"[^a-zA-Z0-9_]", "_", name)) for BigQuery schema compatibility. Authenticates via GCP OAuth with graceful Colab/local fallback. Uploads both tables using pandas-gbq with if_exists="replace" to support re-runs.
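The sanitization step uses exactly the regex quoted above; the upload call is sketched alongside it. The dataset, table, and project names in the commented section are hypothetical:

```python
import re
import pandas as pd

def sanitize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Replace any character BigQuery disallows in column names with '_'."""
    return df.rename(columns=lambda name: re.sub(r"[^a-zA-Z0-9_]", "_", name))

# Upload sketch (requires pandas-gbq and GCP credentials):
# import pandas_gbq
# pandas_gbq.to_gbq(
#     sanitize_columns(df),
#     destination_table="weather.weather_cleaned",  # hypothetical dataset.table
#     project_id="my-gcp-project",                  # hypothetical project id
#     if_exists="replace",                          # idempotent re-runs
# )
```

`if_exists="replace"` is what makes re-runs safe: each upload overwrites the table rather than appending duplicate rows.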
SQL spot-check queries confirm both table row counts and run a Part_of_Day aggregation to validate feature engineering output. Matplotlib charts saved to outputs/. Looker Studio dashboards built from BigQuery data — 4 pages covering temperature trends, humidity, wind speed, and the Part_of_Day × Weekday heatmap.
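One way to structure those spot checks — shown here against a local SQLite connection so it runs anywhere, though the same SQL works from the BigQuery console. The query names and exact SQL are illustrative:

```python
import sqlite3

# Hypothetical spot-check queries mirroring the validation described above
SPOT_CHECKS = {
    "weather_cleaned_count": "SELECT COUNT(*) FROM weather_cleaned",
    "daily_avg_temp_count": "SELECT COUNT(*) FROM daily_avg_temp",
    "part_of_day_check": (
        "SELECT Part_of_Day, AVG(Temperature_C) AS avg_temp "
        "FROM weather_cleaned GROUP BY Part_of_Day"
    ),
}

def run_spot_checks(conn: sqlite3.Connection) -> dict:
    """Execute each spot-check query and collect its result rows."""
    return {name: conn.execute(sql).fetchall() for name, sql in SPOT_CHECKS.items()}
```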
These visualizations are built directly from the BigQuery output — the same data that flows through the pipeline above. Each chart answers a specific analytical question about historical weather patterns.
The regular oscillation between ~9°C and ~24°C repeats predictably across all 10 years, confirming strong seasonal patterns in the dataset. The KPI scorecards at the bottom show dataset-wide averages directly from BigQuery: 17.89°C average temperature, 0.68 humidity, 10.14 km/h wind speed. These are the numbers that power the full dashboard.
The Part_of_Day feature engineered during transformation proves analytically valuable — afternoons average 22.7°C vs 14.7°C at night. Weekday analysis shows temperature is stable across days of the week, confirming the seasonal (not weekly) nature of temperature variation.
Wind speeds peak in early spring (~13 km/h) and decline through autumn (~9 km/h). Humidity shows an inverse pattern — relatively stable mid-year, rising toward year-end. These opposing trends suggest a seasonal weather cycle where spring brings drier, windier conditions.
This pivot table heatmap crosses Part_of_Day (rows) with Weekday (columns) — both features engineered during the Transform phase. The gradient from white (Night, ~14.6°C) to dark blue (Afternoon, ~23°C) is consistent across every single day of the week. Temperature variation is driven entirely by time of day, not day of week — a clean, analytically rigorous finding made possible by the feature engineering step.
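The pivot behind that heatmap is a one-call aggregation over the two engineered features. A sketch, assuming the column names used elsewhere in this writeup:

```python
import pandas as pd

def part_of_day_heatmap(df: pd.DataFrame) -> pd.DataFrame:
    """Average temperature pivot: Part_of_Day rows × Weekday columns."""
    order = ["Night", "Morning", "Afternoon", "Evening"]
    pivot = df.pivot_table(
        index="Part_of_Day",
        columns="Weekday",
        values="Temperature_C",
        aggfunc="mean",
    )
    return pivot.reindex(order)  # fix row order for a readable gradient

# The pivot can then be rendered with matplotlib, e.g.:
# import matplotlib.pyplot as plt
# plt.imshow(pivot, cmap="Blues"); plt.colorbar(); plt.savefig("outputs/heatmap.png")
```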
Every engineering project runs into unexpected constraints. These are the specific challenges that surfaced while building this pipeline — and how each was solved with sound engineering judgment.
Modular functions. Structured logging. Typed exception handling. Row-count validation. Regex schema sanitization. Dual-target loading. Live cloud dashboards. This pipeline does what production analytics engineering does — it just happens to be about weather.
MS Data Science & Analytics · Grand Valley State University · Python · BigQuery · GCP