Clickstream Prediction: Data Science project 9 of 500

Posted on Fri 04 March 2022 in Prediction • 3 min read

In my last post, I wrote a high-level summary explaining how I would (and did) approach learning Data Science if I had to do it over again. You can find it in this post.

Introduction to the FailSafe experiment

In the FailSafe experiment, I'm 'failing' 500 Data Science projects safely. This means I'm not judging myself for anything I know or don't know. The thinking is that there's no such thing as failure if the objective is to fail, and if you fail at failing then you've succeeded. That's kind of the tag-line for this whole thing, and it's becoming my motto. The aim across all of the projects is quantity over quality, which, it turns out (for me at least), is a much better way to learn. There are many projects, each of which is intended to teach me at least one small thing. My aim is to never 'get stuck': the opportunity cost of getting stuck on one project is the practice, and the new lessons, I would have gained from a few others. I've learned more this way than I have at any Data Science job I've held or MOOC I've taken.

Each project has objectives which I set out as the thing I'd like to learn. More often than not, I end up learning something completely different, which is okay! I've removed all stigma from getting things wrong, so I can just get on with it. Setting an objective really does help to set the tone and direction for the project. For example, sometimes I will choose a simpler target to predict because I would like to practice a new technique. It was important in this whole process for me to be systematic. For example, I don't tackle a completely different field with every project. Even though I plan to cover most sub-fields within Data Science, it would be haphazard and shallow learning to jump from Linear Regression to Computer Vision to Time Series. It feels like most things can be formulated as an explore-exploit trade-off, and this is no different. I started off with simple prediction modeling, and over a few projects I ended up with a Machine Learning pipeline I'm becoming happy with.

If you're interested in following along, here is the link to the repository on GitHub.

Clickstream Prediction

This dataset is available at the UCI Machine Learning Repository. It covers the year 2008 for an online store, specifically one offering clothing to pregnant women.
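
If you'd like to load the data yourself, here is a minimal sketch. It assumes the file has already been downloaded from the UCI page; the filename and the semicolon delimiter are assumptions about how the archive ships, so adjust them to match your download.

```python
import pandas as pd

# Assumed filename and delimiter for the UCI download; adjust as needed.
df = pd.read_csv("e-shop clothing 2008.csv", sep=";")

print(df.shape)   # rows x columns
print(df.head())  # peek at the first few click records
```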

For my previous project, I worked through the winning Kaggle solution to the Amazon Access challenge, and I wanted to incorporate some of my learnings here. They form part of this project's objectives:

  1. Have explanatory modeling as well as performance modeling
  2. In the performance modeling, try stacking models (sketched below)
  3. Create the polynomial feature sets, the engineered feature sets, and the greedy feature selection (also sketched below)
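
To make objectives 2 and 3 concrete, here is a rough scikit-learn sketch of what such a pipeline might look like. Everything here is illustrative, not the final pipeline: the placeholder data, the chosen base learners, and parameters like `n_features_to_select` are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Placeholder data standing in for the prepared clickstream features.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Polynomial feature set: pairwise interaction terms (objective 3).
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)

# Greedy forward feature selection, scored by cross-validation (objective 3).
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=10,
    direction="forward",
)

# Stacking: two base learners feed a logistic-regression meta-learner (objective 2).
stack = StackingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)

# Chain the three steps: expand features, select greedily, then stack.
model = make_pipeline(poly, selector, stack)
model.fit(X, y)
```

Expressing everything as a single pipeline keeps the feature expansion and selection inside any outer cross-validation loop, so held-out folds don't leak into the feature selection.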

In this series of posts, I will examine the clickstream prediction modeling process end to end.

The next post will look at some data exploration.