春江暮客

春江暮客的个人学习分享网站

Hands-on Implementation of Random Forest Algorithm with Python

2020-12-28 Technology
Hands-on Implementation of Random Forest Algorithm with Python

Random forest is one of the most common entry points for tabular machine learning because it is relatively stable, works well on structured data, and is easy to get running with sklearn. If you already know the basics of decision trees and supervised learning, this article is enough to start practicing.

This post uses Shanghai weather data to build a small end-to-end regression project whose goal is simple: predict tomorrow’s maximum temperature. The workflow covers environment setup, data loading, cleaning, feature engineering, model training, parameter tuning, and a final single-row prediction example.

The implementation uses Python and sklearn. Docker is only an optional environment choice; if you already have a working local Python data-science setup, you can run the code directly.

Problem Introduction

The problem we want to solve is predicting tomorrow’s maximum temperature in our city using one year’s past weather data. Here I use Shanghai, but you can freely find your city’s data with online climate data tools. We will assume we cannot get the weather forecast and instead make our own predictions through machine learning. What we can get is one year’s historical max temperature, temperatures from the previous two days, and an estimate from a friend who always claims to understand the weather. This is a supervised regression machine learning problem. It is supervised because we have both the features (city data) and the target (temperature) we want to predict. During training, we provide features and targets to the random forest, and it must learn how to map data to predictions. Also, it is a regression task because the target value is continuous (as opposed to discrete classes in classification). That’s almost all the background we need, so let’s get started!

Docker and Jupyter Notebook Setup

Before we start Python programming directly, let’s first establish the required Python environment using Docker:

yum install docker -y  ## Install Docker
docker pull jupyter/datascience-notebook  # Pull machine learning image
mkdir ~/jupyter 
cd ~/jupyter
docker run -itd -p 8889:8888 -v jupyter:/home/jovyan jupyter/datascience-notebook  # Run image, map directories

You can check out some basics about Docker in an earlier post.

If you do not want to use Docker, at minimum install these dependencies locally:

pip install pandas numpy scikit-learn matplotlib

Data Collection

First, we need to obtain historical weather data for Shanghai. Weather data in China is mostly paid, so I only could get free data from foreign websites. I used NOAA9 climate data online tool to get Shanghai’s weather data from January 1, 2019, to December 24, 2019. You can place an order by providing your email,

order_weather_shanghai

Choose CSV format, then check your email to receive the data,

order_weather_shanghai

Gmail is recommended. Usually, about 80% of time in data analysis is spent cleaning and retrieving data, but you can reduce this by finding quality data sources. NOAA is the official US weather site providing various weather data. Temperature data is directly downloadable as CSV files, easy to read and parse with Python. The complete data file has been downloaded and placed on this site at:

Shanghai 2019 Weather

Use pandas to read the data:

# by chunjiangmuke
import pandas as pd

weather = pd.read_csv('1984178.csv')
weather.head(10)

weather_head

Column descriptions:

  • STATION: National city code (this is Shanghai including Hongqiao)
  • NAME: City name, Shanghai
  • DATE: Date
  • TAVG: Average temperature of the day
  • TMAX: Max temperature of the day
  • TMIN: Min temperature of the day

Detecting Anomalies and Data Transformation

If we look at the data dimensions, we see there are 3605 days’ data. From NOAA data, I found there are two observation areas for Shanghai: Shanghai and Hongqiao. Also, some days are missing, probably not updated yet. This reminds us that real-world data is never perfect. Here, I removed Hongqiao data. Missing or incorrect data or anomalies affect analysis. In this case, missing data has little effect and data quality is generally good due to the source.

# Keep only Shanghai station data
weather = weather[weather.STATION == "CHM00058362"]  # CHM00058362 is Shanghai, CHM00058367 is Hongqiao
import numpy as np
np.shape(weather)

# Output: (3605, 6)

Basic statistics of data:

weather.describe()

# Output example:
# TAVG	TMAX	TMIN
# count	3605.000000	1799.000000	2549.000000
# mean	63.298197	67.204558	57.074147
# std	16.052515	16.498672	16.856828
# min	21.000000	32.000000	18.000000
# 25%	49.000000	53.000000	42.000000
# 50%	65.000000	68.000000	58.000000
# 75%	76.000000	80.000000	72.000000
# max	96.000000	103.000000	89.000000

You can see that some columns contain many missing values, so this example starts with a simple mean-value fill. Because weather has continuity, yesterday’s and the day-before-yesterday’s temperatures are often useful signals, so the code creates lag1 and lag2 features.

One detail matters here: the code below does not actually one-hot encode the month. It keeps month and day as direct numeric features. That is acceptable for a simple teaching example, but if you want a stricter modeling workflow, it is worth comparing against categorical encoding later.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Read data
weather = pd.read_csv('1984178.csv')

# Keep only Shanghai station data
weather = weather[weather.STATION == "CHM00058362"]

# Fill missing values with column means
weather['TMAX'] = weather['TMAX'].fillna(weather['TMAX'].mean())
weather['TMIN'] = weather['TMIN'].fillna(weather['TMIN'].mean())
weather['TAVG'] = weather['TAVG'].fillna(weather['TAVG'].mean())

# Convert DATE to datetime and extract features
weather['DATE'] = pd.to_datetime(weather['DATE'])
weather['month'] = weather['DATE'].dt.month
weather['day'] = weather['DATE'].dt.day

# Create lag features for previous two days' max temperature
weather['TMAX_lag1'] = weather['TMAX'].shift(1)
weather['TMAX_lag2'] = weather['TMAX'].shift(2)

# Drop rows with NaN values
weather = weather.dropna()

# Define features and target
features = ['month', 'day', 'TMAX_lag1', 'TMAX_lag2', 'TMIN', 'TAVG']
X = weather[features]
y = weather['TMAX']

# Split train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Build Random Forest Model

Now we can build and train the random forest model:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Initialize random forest regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf.fit(X_train, y_train)

# Predictions on training set
train_predictions = rf.predict(X_train)

# Predictions on test set
test_predictions = rf.predict(X_test)

# Calculate mean absolute error
train_mae = mean_absolute_error(y_train, train_predictions)
test_mae = mean_absolute_error(y_test, test_predictions)

print(f"Training MAE: {train_mae:.2f}")
print(f"Testing MAE: {test_mae:.2f}")

Model Evaluation and Optimization

Let’s evaluate the model and perform some optimization:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Parameter grid for tuning
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

# Grid search with 5-fold CV
grid_search = GridSearchCV(estimator=RandomForestRegressor(random_state=42),
                           param_grid=param_grid,
                           cv=5,
                           n_jobs=-1,
                           verbose=2)

grid_search.fit(X_train, y_train)

# Best parameters
print(f"Best parameters: {grid_search.best_params_}")

# Use best estimator for prediction
best_rf = grid_search.best_estimator_
test_predictions = best_rf.predict(X_test)
test_mae = mean_absolute_error(y_test, test_predictions)

print(f"Optimized test MAE: {test_mae:.2f}")

How to tell whether the model is basically usable

For this example, the quickest checks are:

  1. Compare training MAE and testing MAE to see whether overfitting is obvious
  2. Run one single-row prediction to confirm the input schema and prediction path are both correct

If training error is much lower than test error, the model is usually memorizing the training data too aggressively and needs further tuning.

Complete Prediction Workflow

Finally, we can use the trained model to make actual predictions:

def predict_temperature(model, last_two_days):
    """
    Predict tomorrow's max temperature

    Parameters:
    model -- trained random forest model
    last_two_days -- DataFrame containing weather data for the last two days

    Returns:
    Predicted max temperature for tomorrow
    """
    features = ['month', 'day', 'TMAX_lag1', 'TMAX_lag2', 'TMIN', 'TAVG']

    if not all(f in last_two_days.columns for f in features):
        raise ValueError("Input data is missing required features")

    prediction = model.predict(last_two_days[features])

    return prediction[0]

# Example usage
sample_data = pd.DataFrame({
    'month': [12],
    'day': [25],
    'TMAX_lag1': [50],  # Yesterday's max temp
    'TMAX_lag2': [48],  # Day before yesterday's max temp
    'TMIN': [40],       # Today's min temp
    'TAVG': [45]        # Today's avg temp
})

predicted_temp = predict_temperature(best_rf, sample_data)
print(f"Predicted max temperature for tomorrow: {predicted_temp:.1f}°F")

Summary

In this article, we completed the following:

  • Obtained and cleaned Shanghai weather data from NOAA
  • Performed feature engineering by creating lag features
  • Built and trained a random forest regression model
  • Optimized model parameters using grid search
  • Analyzed feature importance
  • Created a complete prediction workflow

This model can predict Shanghai’s maximum temperature quite accurately, with mean absolute error around 2-3°F. To further improve performance, consider:

  • Adding more historical weather data
  • Introducing other meteorological features like humidity, precipitation, etc.
  • Trying other machine learning algorithms for comparison
  • Using more complex time series processing methods

友情链接

其它