Time series models will silently overfit if you use random train/test splits

#python #ml #xgboost #data-science

Here’s a fun way to feel like a genius for 20 minutes: train a time series model with train_test_split(shuffle=True) — which is the scikit-learn default — and marvel at your incredible metrics.

Then deploy it and watch it predict like a drunk weatherman.

Random splits leak future data into training. Your model literally learns from tomorrow to predict today. Of course it looks great on paper.

# 🚨 This is time travel, not machine learning
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ✅ Respect the arrow of time
split_idx = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

For proper cross-validation, use TimeSeriesSplit which rolls forward through time:

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    # ...

I learned this building badi-predictor — a pool occupancy prediction model. The random-split version had suspiciously good metrics. “Suspiciously good” in ML is never a compliment.