Creating an end-to-end machine learning model involves several key steps, from data preprocessing to model training and model evaluation. In this blog, we'll walk through each of these steps to help you implement a complete end-to-end ML project. We'll use a common dataset, the California housing dataset, to illustrate the process.
Table of Contents:
- Understand the problem
- Data Collection
- Download the Data
- Take a quick look at the dataset
- Create a test set
- Data Visualization
- Visualizing Geographical Data
- Looking for Correlations
- Experiment with Attribute Combinations
- Prepare the Data for Machine Learning Algorithms
- Data Cleaning
- Handling Text and Categorical Attributes
- Feature Scaling
- Transformation Pipelines
- Select and Train a Model
- Training and Evaluating the Training Set
- Better Evaluation using Cross-Validation
- Fine-Tuning the Model
- Grid Search
- Randomized Search
- Evaluating the model on the Test Set
- Conclusion
1. Understand the Problem
The first step in any ML project is to understand the problem you're trying to solve. In our case, we want to build a model that predicts the median housing price in California districts based on various features.
This data includes metrics such as the population, median income, and median housing price for each block group in California.
The model that we are going to create should learn from this data and be able to predict the median housing prices in any district, given all the other metrics.
This is a multiple regression problem since the system will use multiple features to make a prediction (it will use the district's population, the median income, etc.). It is also a univariate regression problem since we only try to predict a single value for each district.
Finally, there is no continuous flow of data coming into the system, and the data is small enough to fit in memory, so plain batch learning should work fine.
2. Data Collection
Download the Data
The full Jupyter notebook is available here.
from pathlib import Path
import pandas as pd
import tarfile
import urllib.request

def load_housing_data():
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))
housing = load_housing_data()
In the above code, we define a function load_housing_data(). It creates a Path object, tarball_path, representing the file path "datasets/housing.tgz". If that compressed file is not already present, the function creates the datasets directory, downloads the tarball from the GitHub URL stored in the url variable via urllib.request.urlretrieve(url, tarball_path), and extracts it.
Finally, the function reads the extracted CSV file with pandas's read_csv() method and returns the resulting DataFrame, which we store in the housing variable.
Take a quick look at the dataset
Let's take a look at the top five rows using the DataFrame's head() method.
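In code:
housing.head()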
Each row represents one district. There are 10 attributes (columns) - longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, and ocean_proximity.
The info() method is useful to get a quick description of the data.
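In code:
housing.info()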
There are 20,640 instances in the dataset, which means it is fairly small by ML standards, but it's perfect to get started.
Observations:
i. Notice that the total_bedrooms attribute has only 20,433 nonnull values, meaning that 207 districts are missing this feature.
ii. All attributes are numerical, except the ocean_proximity field. Its type is object, so it could hold any kind of Python object. When you look at the top five rows, you will probably notice that the values in this column are repetitive, which means it is probably a categorical attribute.
Let's look at the other fields. The describe() method shows a summary of the numerical attributes.
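In code:
housing.describe()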
Another quick way to get a feel of the type of data you are dealing with is to plot a histogram for each numerical attribute. A histogram shows the number of instances (on the vertical axis) that have a given value range (on the horizontal axis).
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(12, 8))
plt.show()
Figure 1. A histogram for each numerical value
Create a test set
A test dataset is a distinct subset of the main dataset that is reserved exclusively for evaluating the performance and generalization capabilities of a machine learning model. It serves as an independent dataset that the model has never seen during its training phase, which helps us identify potential issues like overfitting or underfitting.
Creating a test set is simple: pick some instances randomly, typically 20% of the dataset (or less if the dataset is very large), and set them aside.
Scikit-Learn's train_test_split() function is used to split a dataset into a training set and a test set. We pass the argument test_size=0.2, along with random_state=42 for reproducibility.
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
The concept of Stratified Sampling.
Stratified sampling is a valuable sampling technique used in statistics and data analysis to ensure representative and unbiased samples. In this method, a population is divided into subgroups or strata based on certain characteristics or attributes that are relevant to the study. Then, random samples are independently drawn from each stratum, ensuring that each subgroup is proportionally represented in the final sample.
This approach helps reduce the risk of sampling bias and ensures that the sample reflects the diversity and characteristics of the entire population, making it particularly useful when dealing with imbalanced datasets.
housing["income_cat"] = pd.cut(housing["median_income"],
bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
labels=[1, 2, 3, 4, 5])
housing["income_cat"].hist()
The above code uses the pd.cut() function to create a new categorical feature, housing['income_cat'], based on the median_income feature in the housing DataFrame. The function breaks median_income into five bins: (0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], and (6.0, infinity). These bins define different income ranges, labeled 1, 2, 3, 4, and 5, respectively.
Now you are ready to do stratified sampling based on the income category. For this, you can use Scikit-Learn's StratifiedShuffleSplit class:
from sklearn.model_selection import StratifiedShuffleSplit

splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
strat_splits = []
for train_index, test_index in splitter.split(housing, housing["income_cat"]):
    strat_train_set_n = housing.iloc[train_index]
    strat_test_set_n = housing.iloc[test_index]
    strat_splits.append([strat_train_set_n, strat_test_set_n])
strat_train_set, strat_test_set = strat_splits[0]  # use the first stratified split
Now you should remove the income_cat attribute so the data is back to its original state:
for set_ in (strat_train_set, strat_test_set):
set_.drop("income_cat", axis=1, inplace=True)
3. Data Visualization
So far we have only taken a quick glance at the data to get a general understanding. Now the goal is to go a little deeper into it.
First, make sure you have put the test set aside and you are only exploring the training dataset.
We first create a copy of the original dataset so that we can tweak it without disturbing the original one.
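A one-line sketch of this step, assuming the stratified training set created earlier:
housing = strat_train_set.copy()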
Visualizing Geographical Data
housing.plot(kind="scatter", x="longitude", y="latitude", grid=True, alpha=0.2)
plt.show()
Figure 3. A geographical scatterplot that highlights high-density areas
Setting the alpha=0.2 makes it easier to visualize the places where there is a high density of data points.
housing.plot(kind="scatter", x="longitude", y="latitude", grid=True,
             s=housing["population"] / 100, label="population",
             c="median_house_value", cmap="jet", colorbar=True,
             legend=True, sharex=False, figsize=(10, 7))
plt.show()
Figure 4. California housing prices: red is expensive, blue is cheap, larger circles indicate areas with a larger population
The above graph depicts the housing prices. The radius of each circle represents the district's population (parameter s), and the color represents the price (parameter c). We use a predefined color map (parameter cmap) called jet, which ranges from blue (low prices) to red (high prices).
The graph tells us that the housing prices are very much related to the location and to the population density.
Figure 5. Visualizing the plot superimposed on a map of California
Looking for Correlations
Pearson Correlation Coefficient is a statistical measure of the linear association between two variables. It's the most common way to measure linear correlation.
We can calculate Pearson's correlation coefficient between every pair of attributes using the DataFrame's corr() method (see the snippet after this list).
- When it is close to 1, it means that there is a strong positive correlation. E.g., the median house value tends to go up when the median income goes up.
- When the coefficient is close to -1, it means that there is a strong negative correlation.
- Coefficients close to 0 means that there is no linear correlation.
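A short sketch of that computation; passing numeric_only=True skips the text attribute ocean_proximity:
corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)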
Correlation analysis usually starts with a scatter diagram that graphically represents the relation of data pairs. The more closely the scatter plot resembles a straight line, the stronger the association.
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
plt.show()
Figure 6. This scatter matrix plots every numerical attribute against every other numerical attribute, plus a histogram of each numerical attribute
Another way to check for correlation between attributes is to use the pandas scatter_matrix() function, which plots every numerical attribute against every other numerical attribute.
In the above figure, we include 4 attributes (median_house_value, median_income, total_rooms, housing_median_age), so there are a total of 16 plots (4 x 4).
The most promising attribute for predicting the median house value is the median income, so let's plot their correlation scatterplot:
housing.plot(kind="scatter", x="median_income", y="median_house_value",
alpha=0.1, grid=True)
plt.show()
Experiment with Attribute Combinations
One last thing you may want to do before preparing the data for Machine Learning algorithms is to try out various attribute combinations.
For example, the population attribute by itself is not very useful, but a derived attribute such as population_per_house, obtained by dividing population by households, is more informative. Below we create a few such combinations:
housing["rooms_per_house"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_house"] = housing["population"] / housing["households"]
Now let's look at the correlation matrix again:
corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)
The new bedrooms_ratio attribute is much more correlated with the median house value than the total number of rooms or bedrooms. Apparently, houses with a lower bedrooms-to-rooms ratio tend to be more expensive.
This round of exploration does not have to be absolutely thorough; the point is to start off on the right foot and quickly gain insights.
4. Prepare the Data for Machine Learning Algorithms
Data Cleaning
Most ML algorithms cannot work with missing features. We saw earlier that the total_bedrooms attribute has some missing values. We have three options to fix this:
- Get rid of the corresponding districts (or rows).
- Get rid of the whole attribute (total_bedrooms).
- Set the values to some value (zero, mean, median, etc.).
Scikit-Learn provides a class to take care of missing values: SimpleImputer.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
Since the median can only be computed on numerical attributes, we need to create a copy of the dataset without the text attribute ocean_proximity.
housing_num = housing.select_dtypes(include=[np.number])
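You can then fit the imputer on the numerical data and use it to replace missing values with the learned medians (a standard use of the fit()/transform() API; housing_tr is just an illustrative name):
imputer.fit(housing_num)
X = imputer.transform(housing_num)

# Put the resulting NumPy array back into a DataFrame (optional, but convenient)
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)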
Handling Text and Categorical Attributes
Now let's handle the text attribute, ocean_proximity, by looking at its first 10 instances.
Each of these text values represents a category, so ocean_proximity is a categorical attribute. Most ML algorithms prefer to work with numbers, so let's convert these categories from text to numbers. For this, we can use Scikit-Learn's OrdinalEncoder class.
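A minimal sketch of this conversion (housing_cat and the other variable names here are illustrative):
from sklearn.preprocessing import OrdinalEncoder

housing_cat = housing[["ocean_proximity"]]   # keep it as a 2D DataFrame
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
ordinal_encoder.categories_                  # the learned categories

One caveat: ordinal codes imply an ordering that ocean_proximity does not really have, so a one-hot encoding (Scikit-Learn's OneHotEncoder) is often a better choice for unordered categories.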
Feature Scaling
Feature Scaling involves transforming the numerical features of a dataset to ensure that they all have a similar scale. The goal is to prevent features with larger scales from dominating those with smaller scales, as this can lead to bias in the learning process.
Common techniques include Normalization (scaling features to a range between 0 and 1) and Standardization (scaling features to have a mean of 0 and a standard deviation of 1).
In Min-Max scaling (normalization), values are shifted and scaled so that they end up ranging from 0 to 1. Scikit-Learn provides a transformer called MinMaxScaler for this.
Standardization first subtracts the mean value (so standardized values have a zero mean), then divides by the standard deviation so that the resulting distribution has unit variance. Unlike min-max scaling, standardization does not bound values to a specific range. Scikit-Learn provides a transformer called StandardScaler for standardization.
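A short sketch applying both transformers to the numerical attributes (the variable names are illustrative):
from sklearn.preprocessing import MinMaxScaler, StandardScaler

min_max_scaler = MinMaxScaler(feature_range=(0, 1))
housing_num_min_max_scaled = min_max_scaler.fit_transform(housing_num)

std_scaler = StandardScaler()
housing_num_std_scaled = std_scaler.fit_transform(housing_num)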
Transformation Pipelines
A transformation pipeline is a series of data preprocessing and transformation steps that are applied to the raw data before it is fed into a machine-learning model. It provides a systematic way to organize and automate these data preparation tasks.
Transformation pipelines typically include steps such as feature scaling, handling missing values, encoding categorical variables, and feature engineering. By structuring these steps into a pipeline, you ensure that the same transformations are consistently applied to both the training and test datasets, avoiding data leakage and reducing the risk of errors.
The Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers (i.e., they must have a fit_transform() method).
When you call the pipeline's fit() method, it calls fit_transform() sequentially on all transformers, passing the output of each call as the parameter to the next call until it reaches the estimator, for which it calls the fit() method.
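The full preprocessing pipeline itself is not shown above, so here is a minimal sketch of what it could look like. The names full_pipeline, housing_prepared, and housing_labels are chosen to match the code used later in this post, but the exact steps are an assumption:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Separate the predictors from the labels (the target attribute)
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

num_attribs = housing.drop(columns=["ocean_proximity"]).columns.tolist()
cat_attribs = ["ocean_proximity"]

# Numerical attributes: fill missing values with the median, then standardize
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Apply the numerical pipeline to numeric columns and one-hot encode the text column
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)

Calling fit_transform() on the training set learns the imputation medians, the scaling statistics, and the one-hot categories; the same fitted pipeline's transform() method is later applied to the test set.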
5. Select and Train a Model
Finally, we get to the interesting part of selecting and training a Machine Learning model.
Training and Evaluating the Training Set
Let's start with a simple model: a Linear Regression model.
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
It is as simple as that. You now have a working Linear Regression model. Let's try it out on a few instances from the training set:
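For example, a small sketch comparing predictions with the actual labels for the first few districts:
some_data_prepared = housing_prepared[:5]
some_labels = housing_labels.iloc[:5]

print("Predictions:", lin_reg.predict(some_data_prepared))
print("Labels:", list(some_labels))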
Let's measure this regression model's RMSE on the whole training set using Scikit-Learn's mean_squared_error() function:
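A minimal sketch of that measurement:
import numpy as np
from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse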
Underfitting is a common issue in ML problems when a model is too simplistic to capture the underlying patterns in the data. It occurs when a model is overly generalized and lacks the capacity to fit the training data adequately. It cannot capture the complexities and nuances in the data.
It can happen for various reasons, including using a model that is too simple, not having enough data for the model to learn from, or applying excessive regularization.
Addressing underfitting often involves selecting more complex models, increasing the model's capacity, collecting more data, or reducing regularization to enable the model to better represent the underlying relationships in the dataset.
Let's train a DecisionTreeRegressor. This is a powerful model, capable of finding complex nonlinear relationships in the data.
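A minimal sketch of training it and measuring its RMSE on the training set:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)

housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse   # comes out as 0.0 on the training set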
A zero error indicates that the model is perfect, making no mistakes in its predictions. But in machine learning, no model is 100% perfect; it means the model is badly overfitting the data.
Overfitting condition:
Overfitting occurs when a model becomes overly complex and starts fitting the training data noise or random fluctuations rather than the underlying patterns. When a model overfits, it performs exceptionally well on the training data but poorly on unseen data.
It can be mitigated by using simpler models, adding regularization techniques, increasing the amount of training data, or applying feature selection methods.
The goal is to strike a balance that allows the model to generalize well to new data while avoiding excessive complexity that leads to overfitting.
Better Evaluation using Cross-Validation
Cross-validation is a crucial machine-learning technique for assessing a model's performance and generalization. It involves dividing a dataset into multiple subsets, or folds, training the model on some of the folds, and testing it on the remaining fold(s). This process is repeated several times, with different folds used as the test set in each iteration.
Cross-validation provides a more robust estimate of a model's performance by reducing the risk of overfitting to a single train-test split.
Common variants include k-fold cross-validation, where the dataset is divided into k subsets (folds), and leave-one-out cross-validation, where each data point serves as the test set in turn. The results are averaged to give a more reliable assessment of how well a model is likely to perform on unseen data.
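For example, a sketch of 10-fold cross-validation for the decision tree model:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)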
Scikit-Learn's cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is actually the opposite of the MSE (i.e., a negative value), which is why the preceding code computes -scores (negative scores) before calculating the square root.
6. Fine-Tuning the Model
Fine-tuning a model means optimizing its hyperparameters to achieve the best possible performance once you have a shortlist of promising models. Tuning the hyperparameters helps improve a model's accuracy, generalization, or efficiency.
Hyperparameters are settings or configurations that are not learned from the data but must be set before training.
Grid search and randomized search are two common techniques for hyperparameter tuning.
Grid Search
GridSearch involves specifying a predefined set of hyperparameter values to exhaustively search through. It systematically tests all possible combinations, which can be time-consuming but ensures comprehensive exploration of the hyperparameter space.
Scikit-Learn's GridSearchCV does the work for you.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
The grid search will explore 12 + 6 = 18 combinations of RandomForestRegressor hyperparameter values, and it will train each model 5 times (we are using 5-fold cross-validation). So, multiplied by the number of folds, we have 18 x 5 = 90 rounds of training.
Once it is done, we can get the best combination of parameters from grid_search.best_params_, or access the best model directly:
grid_search.best_estimator_
Randomized Search
The grid search approach is fine when we are exploring relatively few combinations. But when the hyperparameter search space is large, it is often preferable to use RandomizedSearchCV instead.
Randomized Search selects hyperparameter values randomly from predefined distributions. It offers a more efficient search, especially when the hyperparameter space is large, and can often find good hyperparameter values faster than GridSearch.
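A minimal sketch of a randomized search over the same random forest model (the distributions and n_iter value below are illustrative):
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}

rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5,
                                scoring='neg_mean_squared_error',
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)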
Evaluating the model on the Test Set
After tweaking your models for a while, you eventually have a system that performs sufficiently well. Now is the time to evaluate the final model on the test set.
This process is nothing different from what we have implemented before. Run your full_pipeline to transform the data (call transform(), not fit_transform() - you do not want to fit the test set!), and evaluate the final model on the test set:
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop('median_house_value', axis=1)
y_test = strat_test_set['median_house_value'].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
If you did a lot of hyperparameter tuning, the performance will usually be slightly worse than what you measured using cross-validation (because your system ends up fine-tuned to perform well on the validation data and will likely not perform as well on unknown datasets). Therefore, you must resist the temptation to tweak the hyperparameters to make the numbers look good on the test set; the improvements would be unlikely to generalize to new data.
If you have reached here, congratulations! We have successfully built an end-to-end ML model using the California dataset.
Conclusion
In conclusion, building an end-to-end machine learning model is a multifaceted process that involves several critical stages, from understanding the problem and collecting data to preprocessing, model selection, evaluation, and deployment. Each step plays a pivotal role in the success of your machine learning project. By following this comprehensive guide, you're well-equipped to embark on your machine-learning journey and tackle real-world problems effectively.
Remember, learning from practice and embracing the iterative nature of machine learning are keys to mastery in this ever-evolving field.
Stay tuned for more interesting topics on Machine Learning!


