Modeling with Hyperopt and XGBoost, LightGBM, Catboost: Clickstream price prediction

Posted on Fri 11 March 2022 in regression • 8 min read

So far in the FailSafe series (500 Data Science projects failed safely), I have combined working through projects on my own with working through top-performing Kaggle solutions. I've really learned a lot and have enjoyed this combination. I love working through the amazing work of people with various talents, skills and insights. On the other hand, taking a dataset and working through it on my own has allowed me to develop a modeling pipeline I've come to be quite happy with for now (this is ever-evolving). I started with working through simple regression tasks and have learned many other elements along the way, many of which I am recording here.

My objective for project 9: Clickstream Price Prediction from the the UCI repository was to have two separate modeling experiments. The first one is straightforward modeling with minimal feature and model engineering (for downstream interpretation and analysis) and the second one to work through some techniques used in the winning Amazon Access Kaggle Solution. This post explains the former modeling experiment. This is a surface-level explanation of what the modeling pipeline looks like without delving into the details behind how all the 'black-boxes' work. I will cover the inner-workings of the black-boxes in future posts. This is following the breadth-first approach I speak about in my first post

Dataset

As a reminder from the previous post, project 9 of the series deals with the price prediction of an e-clothing shop based on a few online-shopping attributes. The data and the description can be found at the UCI repository.

Modeling

We focus on exmaining only modeling and hyperparameter tuning. In a future post, we will explore more in-depth model evaluations and related plots. The structure of the main plot is as follows:

Definition of the pipeine
Baseline
XGBoost defaults
Simple Grid Search with XGBoost
Tuning with regularization (Hyperopt)
Comparison across models using Hyperopt

One of the first things I do is define my results Dataframe which contains all the different models and their results. I have defined a separate Utils file which handles the calculation of all of the metrics. This allows me to abstract this and pick and choose metrics quite easily.

In [4]:

# define results
results_df = pd.DataFrame(columns = ['model','rmse','r2','mape'])

The next few cells handle some definitions. I typically split my categorical features into groups where there are many unique values and few unique values. The categorical features which have fewer unique values than max_levels can easily be one-hot encoded. This is really conventient for analysis purposes. The features with many levels, we find a good categorical encoder for (I have a post explaining categorical encoder selection) Its also useful to keep track of these things in order to compile the feature names (again for analysis). Next, we do some very simple feature, dataset and model definitions.

In [5]:

# many levels can't be one-hot encoded - use encoder for these
max_levels = round(clothing_data_df.columns.shape[0]*0.5)
categorical_features = clothing_data_df.select_dtypes(exclude=[np.number]).columns
cats_many = []
cats_few = []
for ft in categorical_features:
    levels = clothing_data_df[ft].unique().shape[0]
    if levels > max_levels:
        cats_many.append(ft)
    else:
        cats_few.append(ft)

In [6]:

# definitions
numeric_features = clothing_data_df.select_dtypes([np.number]).drop(['price'], axis=1).columns
numeric_features

categorical_features = clothing_data_df.select_dtypes(exclude=[np.number]).columns
categorical_features

X = clothing_data_df.drop('price', axis=1)
y = clothing_data_df['price']

y = np.log(y)

X_train, X_test_tmp, y_train, y_test_tmp = train_test_split(X, y, test_size=0.3)
X_val, X_test, y_val, y_test = train_test_split(X_test_tmp, y_test_tmp, test_size=0.5)

del X_test_tmp, y_test_tmp

In [ ]:

# first model
selected_model = XGBRegressor(tree_method = "gpu_hist",single_precision_histogram=True, gpu_id=0)

All of the standard preprocessing stuff is kept in a Pipeline as I find it quite neat and nice to work with. I won't go into the particulars of the pipeline for this post, but please reach out with any questions!

In [ ]:

# define pipeline using BackwardDifferenceEncoder for categorical

categorical_transformer_many_level = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('encoder', encoders['BackwardDifferenceEncoder']())
    ]
)    

categorical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('encoder', encoders['OneHotEncoder']())
    ]
) 

numeric_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', numeric_transformer, numeric_features),
        ('categorical_many', categorical_transformer_many_level, cats_many),
        ('categorical', categorical_transformer, cats_few)
    ]
)

pipe = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('regressor', selected_model)
    ]
)

Now let's examine our mean baseline for good practice.

In [8]:

pred = [y_train.mean()]*len(y_val)
pred = np.exp(pred)
y_val_s = np.exp(y_val)
res_row_obj = reg_metrics.calc_results(pred, y_val_s,'Baseline')
row = res_row_obj.calc_results_row()
results_df = results_df.append(row,ignore_index=True)
results_df

Out[8]:

	model	rmse	r2	mape
0	Baseline	12.650522	-7.924584e+29	23.99252

If we look at the distribution of the target price, we see that the modal price is around 40, so these results are not ideal. Another note is that even though I've scaled the target, I always inverse transform it in order to made sense of the results and for them to be closer to reality.

In my process, I usually look to an XGBoost model trained with the default parameters to be a baseline. The default parameters are well chosen. Naturally, it is really easy to see that yes, indeed, a machine learning model performs much better than a simple average and it is useful for this particular problem. I also use it to look at what the learning curves are doing which helps me adjust early_stopping, iterations, evaluations etc further on down the line.

In [9]:

preprocessor.fit(X_train, y_train)
X_train_prc = preprocessor.transform(X_train)
X_val_prc = preprocessor.transform(X_val)
X_test_prc = preprocessor.transform(X_test)
evalset = [(X_train_prc, y_train), (X_val_prc,y_val)]
selected_model.fit(X_train_prc, y_train, eval_metric='rmse', eval_set=evalset, verbose=0)

Out[9]:

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=0, importance_type=None,
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=100, n_jobs=12,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, single_precision_histogram=True,
             subsample=1, tree_method='gpu_hist', validate_parameters=1,
             verbosity=None)

In [17]:

results = selected_model.evals_result()

plt.figure(figsize=(8,6))
# plot learning curves
plt.plot(results['validation_0']['rmse'], label='train')
plt.plot(results['validation_1']['rmse'], label='test')
plt.legend()
plt.show()

In [10]:

pred = selected_model.predict(X_test_prc)
pred = np.exp(pred)
y_val_s = np.exp(y_test)
res_row_obj = reg_metrics.calc_results(pred, y_val_s,'defaults XGBReg')
row = res_row_obj.calc_results_row()
results_df = results_df.append(row,ignore_index=True)
results_df

Out[10]:

	model	rmse	r2	mape
0	Baseline	12.650522	-7.924584e+29	23.992520
1	defaults XGBReg	1.773988	9.805227e-01	0.135421

Already, the default XGBoost is performing really well. On closer inspection, we can see that it is overfitting slightly. This becomes obvious when we look at a plot of actual vs predicted price. This we can remedy with using regularization or by selecting a different model.

In [26]:

g =sns.scatterplot(y_val_s, pred)
g.set(ylim=(0, 200))

Out[26]:

[(0.0, 200.0)]

Before we start looking at slight overfitting, we look at the parameters that typically take the longest to tune and are like the 'base' parameters. Getting an idea of the best performing tree depth, learning_rate and number of estimators (equivalent to number of boosting trees) is easy with a straightforward GridSearch and also gives a good idea about what parameter ranges to use.

In [11]:

# BASIC TUNE PARAMS 

start_time=time.time()

# grid search
xgb_reg = XGBRegressor(tree_method = "gpu_hist",single_precision_histogram=True, gpu_id=0)

param_grid = {
        'max_depth': [6, 9, 12],
        'learning_rate': [0.03, 0.05, 0.07],
        'n_estimators':[1500, 1800, 2100],
                               }

kfold = KFold(n_splits=3, shuffle=True, random_state=10)
grid_search = GridSearchCV(xgb_reg, param_grid, scoring="neg_root_mean_squared_error", cv=kfold)
grid_result = grid_search.fit(X_train_prc, y_train, verbose=0)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_[ 'mean_test_score' ]
stds = grid_result.cv_results_[ 'std_test_score' ]
params = grid_result.cv_results_[ 'params' ]

print(time.time()-start_time)

Best: -0.003677 using {'learning_rate': 0.05, 'max_depth': 6, 'n_estimators': 1500}
1219.6475224494934

In [12]:

model = XGBRegressor(tree_method = "gpu_hist",single_precision_histogram=True, gpu_id=0, **grid_result.best_params_)
model.fit(X_train_prc,y_train)
y_pred = model.predict(X_test_prc)

In [13]:

pred = np.exp(y_pred)
y_val_s = np.exp(y_test)
res_row_obj = reg_metrics.calc_results(pred, y_val_s,'GridSearchCV XGBReg')
row = res_row_obj.calc_results_row()
results_df = results_df.append(row,ignore_index=True)
results_df

Out[13]:

	model	rmse	r2	mape
0	Baseline	12.650522	-7.924584e+29	23.992520
1	defaults XGBReg	1.773988	9.805227e-01	0.135421
2	GridSearchCV XGBReg	1.228305	9.906251e-01	0.041728

We can see the average percentage error of residuals has dropped substantially for the GridSearch-set parameters. However, the Root Mean Squared Error which penalizes outliers more, is still close to the original. The distribution of the residuals is actually a tiny bit more dispersed but there are no super huge outlier predictions.

Since this post is growing in length, I will leave out XGBoost tuning with HyperOpt for finding regularization parameters. If this is detail that interests you, feel free to check out my notebook on Github. Instead, I will go right ahead to comparing the models we've been waiting for. I always assumed that finding a good model was secondary to problem definition and feature selection because the models we are using are all really good. However; after looking at these results, I realised what a huge difference model selection really makes. As I've previously mentioned, this post is not about delving deeply into the inner workings of the models or the optimization algorithms. To tune the models, I use HyperOpt. This post has really great information and the way that the author has put together all his code was a big learning to me. It forms the base of how I decided to keep my code together for hyperparameter tuning using this library and method.

First, we define the parameter spaces for model tuning. HyperOpt has several choices for distributions over which it will search specific parameters. When using hp.choice, the specification we give to HyperOpt is that we would like to select from a given list, set or nested list. It therefore returns indices. In order to select the correct hyperparameters with these indices, I define dictionaries with the hp.choice arrays. The heavy lifter is the class HPOpt which takes in a string description of which model method to use and processes the model training with the HyperOpt minimization.

In [27]:

from sklearn.metrics import mean_squared_error


# XGB parameters
xgb_reg_params = {
    'learning_rate':    hp.choice('learning_rate',    np.arange(0.05, 0.31, 0.05)),
    'max_depth':        hp.choice('max_depth',        np.arange(5, 16, 1, dtype=int)),
    'min_child_weight': hp.choice('min_child_weight', np.arange(1, 8, 1, dtype=int)),
    'colsample_bytree': hp.choice('colsample_bytree', np.arange(0.3, 0.8, 0.1)),
    'subsample':        hp.uniform('subsample', 0.8, 1),
    'n_estimators':     200,
}
xgb_fit_params = {
    'eval_metric': 'rmse',
    'early_stopping_rounds': 10,
    'verbose': False
}
xgb_para = dict()
xgb_para['reg_params'] = xgb_reg_params
xgb_para['fit_params'] = xgb_fit_params
xgb_para['loss_func' ] = lambda y, pred: np.sqrt(mean_squared_error(y, pred))


# LightGBM parameters
lgb_reg_params = {
    'learning_rate':    hp.choice('learning_rate',    np.arange(0.05, 0.31, 0.05)),
    'max_depth':        hp.choice('max_depth',        np.arange(5, 16, 1, dtype=int)),
    'min_child_weight': hp.choice('min_child_weight', np.arange(1, 8, 1, dtype=int)),
    'colsample_bytree': hp.choice('colsample_bytree', np.arange(0.3, 0.8, 0.1)),
    'subsample':        hp.uniform('subsample', 0.8, 1),
    'n_estimators':     100,
}
lgb_fit_params = {
    'eval_metric': 'l2',
    'early_stopping_rounds': 10,
    'verbose': False
}
lgb_para = dict()
lgb_para['reg_params'] = lgb_reg_params
lgb_para['fit_params'] = lgb_fit_params
lgb_para['loss_func' ] = lambda y, pred: np.sqrt(mean_squared_error(y, pred))


# CatBoost parameters
ctb_reg_params = {
    'learning_rate':     hp.choice('learning_rate',     np.arange(0.05, 0.31, 0.05)),
    'max_depth':         hp.choice('max_depth',         np.arange(5, 16, 1, dtype=int)),
    'colsample_bylevel': hp.choice('colsample_bylevel', np.arange(0.3, 0.8, 0.1)),
    'n_estimators':      100,
    'eval_metric':       'RMSE',
}
ctb_fit_params = {
    'early_stopping_rounds': 10,
    'verbose': False
}
ctb_para = dict()
ctb_para['reg_params'] = ctb_reg_params
ctb_para['fit_params'] = ctb_fit_params
ctb_para['loss_func' ] = lambda y, pred: np.sqrt(mean_squared_error(y, pred))

In [17]:

import lightgbm as lgb
import xgboost as xgb
import catboost as ctb
from hyperopt import fmin, tpe, STATUS_OK, STATUS_FAIL, Trials


class HPOpt(object):

    def __init__(self, x_train, x_test, y_train, y_test):
        self.x_train = x_train
        self.x_test  = x_test
        self.y_train = y_train
        self.y_test  = y_test

    def process(self, fn_name, space, trials, algo, max_evals):
        fn = getattr(self, fn_name)
        try:
            result = fmin(fn=fn, space=space, algo=algo, max_evals=max_evals, trials=trials)
        except Exception as e:
            return {'status': STATUS_FAIL,
                    'exception': str(e)}
        return result, trials

    def xgb_reg(self, para):
        reg = xgb.XGBRegressor(**para['reg_params'])
        return self.train_reg(reg, para)

    def lgb_reg(self, para):
        reg = lgb.LGBMRegressor(**para['reg_params'])
        return self.train_reg(reg, para)

    def ctb_reg(self, para):
        reg = ctb.CatBoostRegressor(**para['reg_params'])
        return self.train_reg(reg, para)

    def train_reg(self, reg, para):
        reg.fit(self.x_train, self.y_train,
                eval_set=[(self.x_train, self.y_train), (self.x_test, self.y_test)],
                **para['fit_params'])
        pred = reg.predict(self.x_test)
        loss = para['loss_func'](self.y_test, pred)
        return {'loss': loss, 'status': STATUS_OK}

In [29]:

obj = HPOpt(X_train_prc, X_val_prc, y_train, y_val)
# the dataset is not so complex that it needs tones of evals
xgb_opt = obj.process(fn_name='xgb_reg', space=xgb_para, trials=Trials(), algo=tpe.suggest, max_evals=50)
lgb_opt = obj.process(fn_name='lgb_reg', space=lgb_para, trials=Trials(), algo=tpe.suggest, max_evals=50)
ctb_opt = obj.process(fn_name='ctb_reg', space=ctb_para, trials=Trials(), algo=tpe.suggest, max_evals=50)

<bound method HPOpt.xgb_reg of <__main__.HPOpt object at 0x000002389EEE8790>>
100%|██████████████████████████████████████████| 50/50 [2:52:02<00:00, 206.44s/trial, best loss: 5.039651774326599e-05]
<bound method HPOpt.lgb_reg of <__main__.HPOpt object at 0x000002389EEE8790>>
100%|████████████████████████████████████████████| 50/50 [02:16<00:00,  2.74s/trial, best loss: 0.00040253329178474475]
<bound method HPOpt.ctb_reg of <__main__.HPOpt object at 0x000002389EEE8790>>
100%|█████████████████████████████████████████████| 50/50 [33:13<00:00, 39.86s/trial, best loss: 9.554781700581013e-06]

In [78]:

xgb_tuned_params = {x[0]:xgb_choice_params[x[0]][x[1]] if x[0] in xgb_choice_params else x[1] for x in xgb_opt[0].items()}
lgb_tuned_params = {x[0]:lgb_choice_params[x[0]][x[1]] if x[0] in lgb_choice_params else x[1] for x in lgb_opt[0].items()}
ctb_tuned_params = {x[0]:ctb_choice_params[x[0]][x[1]] if x[0] in ctb_choice_params else x[1] for x in ctb_opt[0].items()}

After obtaining all the best parameters and training models, the final results look really interesting.

In [91]:

ctb_reg = ctb.CatBoostRegressor(**ctb_tuned_params)
ctb_reg.fit(X_train_prc, y_train, eval_set=[(X_train_prc, y_train), (X_val_prc, y_val)],**ctb_para['fit_params'])
y_pred = ctb_reg.predict(X_test_prc)
pred = np.exp(y_pred)
y_val_s = np.exp(y_test)
res_row_obj = reg_metrics.calc_results(pred, y_val_s,'Hyperopt CTBReg')
row = res_row_obj.calc_results_row()
results_df = results_df.append(row,ignore_index=True)
results_df

Out[91]:

	model	rmse	r2	mape
0	Baseline	12.650522	-7.924584e+29	23.992520
1	defaults XGBReg	1.773988	9.805227e-01	0.135421
2	GridSearchCV XGBReg	1.228305	9.906251e-01	0.041728
3	Hyperopt XGBReg	21.725783	2.558151e-01	0.248813
4	Hyperopt LGBReg	0.066331	9.999724e-01	0.016224
5	Hyperopt CTBReg	0.001424	1.000000e+00	0.000392

Previous Post Next Post