Selecting categorical encoding: Clickstream price prediction

Posted on Mon 07 March 2022 in regression • 4 min read

Welcome to the FailSafe experiment series, where the tag-line is that there's no such thing as failure if the objective is to fail, and if you fail at failing, then you've succeeded. In this post, I'll be exploring a method for selecting a categorical encoder for the Clickstream price prediction project.

Dataset

As a reminder from the previous post, project 9 of the series deals with price prediction for an online clothing shop based on a few online-shopping attributes. The data and its description can be found at the UCI repository. For the purposes of this post, it is not too important to look closely into the EDA or the variable descriptions.

Most variables are already encoded for us - as described in the data description of the project.

In [5]:
clothing_data_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165474 entries, 0 to 165473
Data columns (total 12 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   month                    165474 non-null  int64 
 1   day                      165474 non-null  int64 
 2   order                    165474 non-null  int64 
 3   country                  165474 non-null  int64 
 4   session ID               165474 non-null  int64 
 5   page 1 (main category)   165474 non-null  int64 
 6   page 2 (clothing model)  165474 non-null  object
 7   colour                   165474 non-null  int64 
 8   location                 165474 non-null  int64 
 9   model photography        165474 non-null  int64 
 10  price                    165474 non-null  int64 
 11  page                     165474 non-null  int64 
dtypes: int64(11), object(1)
memory usage: 15.1+ MB

As we can see from the output above, there is only one 'object'-type variable, page 2 (clothing model), and I will use it as practice for the encoder selection. A disclaimer: some models, such as CatBoost, do not need prior encoding of categorical variables; this method is mostly useful for models that require numerical inputs.

Encoder Selection

When we want to perform model explanation and interpretation, we typically want to encode in a way that preserves the original structure of the levels as much as possible. For example, the first choice would be a One-Hot encoding, as it retains each feature level as a separate, individual column. This runs into trouble as soon as the number of levels starts to grow: in our particular example, we have 217 unique clothing model categories. Adding that many columns to our dataset is not very practical and is unlikely to aid the modeling, as the One-Hot encoding would be quite sparse.
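
For reference, the cardinality is a one-liner to check (the column name is taken from the info() output above):

In [ ]:
#count the distinct clothing models - 217 in this dataset
clothing_data_df['page 2 (clothing model)'].nunique()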

Just as with any choice we make in Machine Learning, it is best to test the possibilities. We can simply loop through various styles of encoding using an XGBRegressor with default parameters and keep track of the results. My go-to for this is the category_encoders library, which bundles many encoding methods together as scikit-learn-style transformers. First, I set up some definitions. The code below detects which features are categorical and which are numeric without having to type column names manually.

In [ ]:
#identify numeric features, dropping the target 'price' so it isn't used as a predictor
numeric_features = clothing_data_df.select_dtypes(include=[np.number]).drop(['price'], axis=1).columns
numeric_features

#everything that is not numeric is treated as categorical
categorical_features = clothing_data_df.select_dtypes(exclude=[np.number]).columns
categorical_features
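
As an aside, the cells in this post rely on a handful of imports that aren't shown. Something along these lines is assumed, with the aliases ce and mse inferred from how they are used further down:

In [ ]:
#assumed setup - not shown in the original cells
import time
import numpy as np
import pandas as pd
import category_encoders as ce
from xgboost import XGBRegressor
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error as mse, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler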

We do some cute lil data prep:

In [ ]:
#prep data
X = clothing_data_df.drop('price', axis=1)
y = clothing_data_df['price']

y = np.log(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
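
One thing to keep in mind: because the target is log-transformed, the rmse and r2 values reported further down are on the log scale. To recover actual prices from a fitted pipeline's predictions you would invert the transform, roughly like this (model here stands for any of the fitted pipelines from the loop below):

In [ ]:
#predictions are on the log scale, so exponentiate to get back to prices
price_pred = np.exp(model.predict(X_test))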

We define a big-ass dictionary of encoders:

In [ ]:
#define encoders
encoders = {
    'BackwardDifferenceEncoder': ce.backward_difference.BackwardDifferenceEncoder,
    'BaseNEncoder': ce.basen.BaseNEncoder,
    'BinaryEncoder': ce.binary.BinaryEncoder,
    'CatBoostEncoder': ce.cat_boost.CatBoostEncoder,
    #'HashingEncoder': ce.hashing.HashingEncoder,  # takes too long
    'HelmertEncoder': ce.helmert.HelmertEncoder,
    'JamesSteinEncoder': ce.james_stein.JamesSteinEncoder,
    'OneHotEncoder': ce.one_hot.OneHotEncoder,
    'LeaveOneOutEncoder': ce.leave_one_out.LeaveOneOutEncoder,
    'MEstimateEncoder': ce.m_estimate.MEstimateEncoder,
    'OrdinalEncoder': ce.ordinal.OrdinalEncoder,
    'PolynomialEncoder': ce.polynomial.PolynomialEncoder,
    'SumEncoder': ce.sum_coding.SumEncoder,
    'TargetEncoder': ce.target_encoder.TargetEncoder,
    # 'WOEEncoder': ce.woe.WOEEncoder,  # requires a binary target
}

The functionality of the individual encoders might deserve their own posts.
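
To give a quick flavour of how differently these treat the same column, here is a toy example on made-up data:

In [ ]:
#toy example: one three-level column, encoded two different ways
toy = pd.DataFrame({'model': ['A1', 'B4', 'A1', 'C17']})

print(ce.ordinal.OrdinalEncoder(cols=['model']).fit_transform(toy))   #a single integer column
print(ce.one_hot.OneHotEncoder(cols=['model']).fit_transform(toy))    #one 0/1 column per level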

We define the model and the results dataframe to collect the scores in:

In [ ]:
selected_model = XGBRegressor(tree_method="gpu_hist", single_precision_histogram=True, gpu_id=0)
clothing_data_df_results = pd.DataFrame(columns=['encoder', 'rmse', 'r2'])
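
If you don't have a GPU to hand, a CPU-only variant along these lines should work just as well, only more slowly (gpu_hist, single_precision_histogram and gpu_id are the GPU-specific parts):

In [ ]:
#CPU-only alternative to the GPU-accelerated model above
selected_model = XGBRegressor(tree_method="hist")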

We iterate through the dictionary of encoders, defining a pipeline at each iteration. For neatness, we could (should heh) actually define a function that builds the pipeline and accepts the encoder as an argument.
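
Something along those lines would look like this (just a sketch; the build_pipeline name and signature are mine, not from the original notebook):

In [ ]:
#hypothetical helper: build the preprocessing + model pipeline for a given encoder class
def build_pipeline(encoder_cls, model):
    categorical_transformer = Pipeline(
        steps=[
            ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
            ('encoder', encoder_cls())
        ]
    )
    numeric_transformer = Pipeline(
        steps=[
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler())
        ]
    )
    preprocessor = ColumnTransformer(
        transformers=[
            ('numerical', numeric_transformer, numeric_features),
            ('categorical', categorical_transformer, categorical_features)
        ]
    )
    return Pipeline(
        steps=[
            ('preprocessor', preprocessor),
            ('regressor', model)
        ]
    )

In this post I keep everything inline instead: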

In [ ]:
for key in encoders:
    
    time_0 = time.time()
    categorical_transformer = Pipeline(
        steps=[
            ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
            ('encoder', encoders[key]())
        ]
    )    

    numeric_transformer = Pipeline(
        steps=[
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler())
        ]
    )

    preprocessor = ColumnTransformer(
        transformers=[
            ('numerical', numeric_transformer, numeric_features),
            ('categorical', categorical_transformer, categorical_features)
        ]
    )

    pipe = Pipeline(
        steps=[
            ('preprocessor', preprocessor),
            ('regressor', selected_model)
        ]
    )

    model = pipe.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    row = {
        'encoder': key,
        'rmse': np.sqrt(mse(y_test, y_pred)),
        'r2': r2_score(y_test, y_pred),
    }

    #DataFrame.append is deprecated in recent pandas, so concatenate a one-row frame instead
    clothing_data_df_results = pd.concat([clothing_data_df_results, pd.DataFrame([row])], ignore_index=True)
    #print(key, 'time taken:', time.time()-time_0)
    

Now we see the fruits of the machine's hard labour:

In [7]:
clothing_data_df_results.head(20).sort_values(by='rmse')
Out[7]:
    encoder                     rmse      r2
0   BackwardDifferenceEncoder   0.007809  0.999273
9   OrdinalEncoder              0.008245  0.999189
1   BaseNEncoder                0.014554  0.997474
2   BinaryEncoder               0.014554  0.997474
5   JamesSteinEncoder           0.016508  0.996750
12  TargetEncoder               0.021040  0.994721
8   MEstimateEncoder            0.023136  0.993616
4   HelmertEncoder              0.032937  0.987063
6   OneHotEncoder               0.042753  0.978201
11  SumEncoder                  0.042821  0.978133
7   LeaveOneOutEncoder          0.076102  0.930933
10  PolynomialEncoder           0.109938  0.855862
3   CatBoostEncoder             0.149905  0.732010

Based on the results, it's fair to say that this is a useful exercise, and I have it as a standard step in my modeling process. The way a single variable is treated by different techniques can have a significant impact: here the RMSE ranges from roughly 0.008 with the BackwardDifferenceEncoder to roughly 0.15 with the CatBoostEncoder.