Predicting Game Play Time on Steam
PART 1 — DATA
My final project for the General Assembly Data Science Immersive program was predicting average game playtime on Steam, and I wanted to share a little about my experience and progress on it.
Steam is currently the largest online distributor of PC games, and it constantly accumulates user and game data. One of the metrics it collects is each user’s playtime, both per game and overall. I chose this metric for my project because it is ultimately a measure of customer engagement. Using various regression models, I wanted to see if I could predict the average time played based on game features.
Data was acquired using the Steam Web API and the SteamSpy API. The Steam Web API provides public access to player and game data, and SteamSpy is a third-party site that aggregates statistics on Steam games. While acquiring the data, I ran into some potential issues. The distribution of playtimes was very skewed, and looking into it further, the gaps in the data were most likely caused by the private account settings many users have enabled. If a user makes their profile private, their data is restricted from external access.
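To give a feel for the acquisition step, here is a minimal sketch of pulling one game's record from SteamSpy. It is not the exact script I used: the endpoint and field names (`owners`, `average_forever`) are assumptions based on SteamSpy's public `appdetails` request, and the helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed SteamSpy endpoint; check the SteamSpy API page before relying on it.
STEAMSPY_URL = "https://steamspy.com/api.php"


def steamspy_url(appid: int) -> str:
    """Build the request URL for a single game's details."""
    return f"{STEAMSPY_URL}?{urlencode({'request': 'appdetails', 'appid': appid})}"


def parse_owners(owners: str) -> float:
    """Turn an owner range such as '1,000,000 .. 2,000,000' into its midpoint."""
    low, high = (int(part.replace(",", "").strip()) for part in owners.split(".."))
    return (low + high) / 2


def to_row(payload: dict) -> dict:
    """Keep only the fields needed for modelling (field names are assumed)."""
    return {
        "appid": payload.get("appid"),
        "name": payload.get("name"),
        "owners_mid": parse_owners(payload["owners"]),
        "avg_playtime_min": payload.get("average_forever"),
    }


def fetch_game(appid: int, timeout: float = 10.0) -> dict:
    """Download and flatten one game's record (requires network access)."""
    with urlopen(steamspy_url(appid), timeout=timeout) as resp:
        return to_row(json.load(resp))
```

Calling `fetch_game` in a loop over app IDs (with a polite delay between requests) and collecting the rows into a table gives roughly the kind of raw dataset described above.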
After acquiring the data, I ended up with 30,000 rows and 50 columns altogether. After dropping rows with mostly null values and columns that were duplicated or irrelevant (such as media and images), I was left with 25,000 rows and 32 columns. I filtered out games with little to no owners and kept only paid games to balance out the data. For my numerical features (required age, ratings, number of owners, price, discount), I filled any remaining null values with the median and dropped any extreme outliers. The categorical features (Category, Developer, Genre) were one-hot encoded. Now that the data cleaning and initial EDA are complete, the next step is to apply linear regression to the features as a baseline model.
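The cleaning steps above can be sketched in pandas roughly as follows. This is a simplified illustration rather than my actual notebook: the column names (`price`, `owners`), the `min_owners` threshold, and the `clean_games` helper are all assumptions, and the outlier step is left out for brevity.

```python
import pandas as pd


def clean_games(df, numeric_cols, categorical_cols, min_owners=1000):
    """Filter, impute, and encode a games table (column names are assumed)."""
    # Keep only paid games that have a meaningful number of owners.
    out = df[(df["price"] > 0) & (df["owners"] >= min_owners)].copy()
    # Fill remaining nulls in numeric features with each column's median.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    # One-hot encode the categorical features.
    return pd.get_dummies(out, columns=categorical_cols)
```

For example, `clean_games(games, ["required_age", "price"], ["genre"])` would return a frame where `genre` has been replaced by one indicator column per genre value.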
I’ll go further into detail on EDA and the models in part 2 of my capstone project blog. Stay tuned!