March Madness in October?
This post is a long time coming. Honestly, I meant to get it out last March, but better late than never. This notebook covers the method I used to cleanse and explore data for the 2017 Kaggle March Machine Learning Mania competition. You could apply these techniques to any data where you are trying to predict a winner based on prior information. If you are interested in competing in a Kaggle competition, I highly recommend this one, especially if you are getting tired of Titanic datasets and flower petal measurements. Reading the message boards from past competitions was a huge help. I'll try to sprinkle in my thought process at different points to clarify things for readers. Enjoy!
import pandas as pd
from fuzzywuzzy import process, fuzz
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
Data Exploration
At this point I just wanted to see what kind of data I was dealing with. Luckily, Kaggle had already assigned each team a Team ID, which saved me from having to do any kind of name normalization.
teams = pd.read_csv("./Teams.csv", index_col=0)
teams.head()
Of course I had to find out the team id of the home team. :)
print(teams[teams['Team_Name']=='Georgia Tech'])
We can also look up teams by team id.
print(teams[teams['Team_Id']==1455])
rscr = pd.read_csv("./RegularSeasonCompactResults.csv")
rsdr = pd.read_csv("./RegularSeasonDetailedResults.csv")
tcr = pd.read_csv("./TourneyCompactResults.csv")
The kenpom.csv file is an external dataset that I obtained by scraping https://kenpom.com/. I highly recommend looking at this site and considering its data. Ken Pomeroy runs a college basketball ratings site, and his data was recommended in previous March Madness competitions.
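I won't reproduce my exact scraper here, but a minimal sketch of the idea follows, assuming the ratings table on kenpom.com can be parsed with pandas.read_html; the year parameter, the request headers, and the two-row header flattening are assumptions about the page, not guarantees.

import requests

# Hedged sketch, not the scraper I actually ran: fetch one season's ratings page
# and let pandas pull out the first HTML table it finds.
resp = requests.get('https://kenpom.com/index.php?y=2017',
                    headers={'User-Agent': 'Mozilla/5.0'})  # some sites reject the default agent
kp_year = pd.read_html(resp.text)[0]            # assumes the ratings table is the first table
kp_year.columns = kp_year.columns.droplevel(0)  # assumes a two-row header that needs flattening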
ss = pd.read_csv("./sample_submission.csv")
kp = pd.read_csv("./kenpom.csv")
rscr.head()
Sometimes it’s easier to view all the features by transposing the dataframe.
rsdr.head().transpose()
rscr.tail()
rsdr.tail()
No surprise here on the top 10 teams by total wins.
top_10_reg = teams.join(pd.DataFrame(rscr['Wteam'].value_counts()[:10]), how='right')
top_10_reg.columns = ['Team_Name', 'Total_Wins']
top_10_reg
ss.head()
The target we are trying to predict is the probability that team1 will win. To do this effectively we need to choose or engineer numerical features that we know prior to the game being played. The makeup of each team varies greatly from year to year, so with that in mind I decided to isolate my training and test data by year.
rs2013 = rsdr[rsdr['Season']==2013]
rs2014 = rsdr[rsdr['Season']==2014]
rs2015 = rsdr[rsdr['Season']==2015]
rs2016 = rsdr[rsdr['Season']==2016]
rs2017 = rsdr[rsdr['Season']==2017]
# Each game yields two mirrored rows: one with the winner as team1 (pred=1)
# and one with the loser as team1 (pred=0), so both orderings are represented.
box_stats = ['fgm', 'fga', 'fgm3', 'fga3', 'ftm', 'fta', 'ast', 'dr', 'stl', 'pf', 'blk']
t_cols = ['team1', 'team2', 'Daynum'] + ['t1_' + s for s in box_stats] + ['t2_' + s for s in box_stats]
w_cols = ['Wteam', 'Lteam', 'Daynum'] + ['W' + s for s in box_stats] + ['L' + s for s in box_stats]
l_cols = ['Lteam', 'Wteam', 'Daynum'] + ['L' + s for s in box_stats] + ['W' + s for s in box_stats]

def mirror(season, src_cols, dst_cols, label):
    d = season[src_cols].copy()
    d.columns = dst_cols
    d['pred'] = label
    return d

rs2013_train = pd.concat([mirror(rs2013, w_cols, t_cols, 1), mirror(rs2013, l_cols, t_cols, 0)]).reset_index(drop=True)
rs2014_train = pd.concat([mirror(rs2014, w_cols, t_cols, 1), mirror(rs2014, l_cols, t_cols, 0)]).reset_index(drop=True)
rs2015_train = pd.concat([mirror(rs2015, w_cols, t_cols, 1), mirror(rs2015, l_cols, t_cols, 0)]).reset_index(drop=True)
rs2016_train = pd.concat([mirror(rs2016, w_cols, t_cols, 1), mirror(rs2016, l_cols, t_cols, 0)]).reset_index(drop=True)

# 2017 only needs the matchups themselves; its features come from the Ken Pomeroy data later on.
id_cols = ['team1', 'team2', 'Daynum']
rs2017_train = pd.concat([mirror(rs2017, ['Wteam', 'Lteam', 'Daynum'], id_cols, 1),
                          mirror(rs2017, ['Lteam', 'Wteam', 'Daynum'], id_cols, 0)]).reset_index(drop=True)
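As a quick sanity check, every game now appears twice with opposite labels, so each training frame should have twice the original game count and a label mean of exactly 0.5.

print(rs2013_train.shape)
print(rs2013_train['pred'].mean())  # 0.5 by construction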
I did a little feature engineering by computing the standard basketball shooting percentages: field goal, three-point, and free throw.
for df in (rs2013_train, rs2014_train, rs2015_train, rs2016_train):
    for t in ('t1', 't2'):
        df[t + '_fg_pct'] = df[t + '_fgm'] / df[t + '_fga']
        df[t + '_fg3_pct'] = df[t + '_fgm3'] / df[t + '_fga3']
        df[t + '_ft_pct'] = df[t + '_ftm'] / df[t + '_fta']
I added a few more features by calculating each team's trailing averages for the shooting percentages and counting stats, to try to capture how a team was playing at that point in the season.
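The core trick is expanding().mean() followed by shift(): the expanding mean averages everything seen so far, and the shift pushes that value down one row so each game only sees prior games. On a toy series of per-game percentages:

toy = pd.Series([0.50, 0.40, 0.60])
print(toy.expanding().mean().shift())
# 0     NaN  -> first game has no history (filled with 0 below)
# 1    0.50  -> mean of game 1
# 2    0.45  -> mean of games 1-2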
rs2013_train = rs2013_train.sort_values('Daynum')
rs2014_train = rs2014_train.sort_values('Daynum')
rs2015_train = rs2015_train.sort_values('Daynum')
rs2016_train = rs2016_train.sort_values('Daynum')
# With each season in chronological order, compute every stat's expanding mean
# shifted back one game, so a row only reflects games played before tip-off.
avg_stats = ['fg_pct', 'fg3_pct', 'ft_pct', 'ast', 'dr', 'stl', 'pf', 'blk']
for df in (rs2013_train, rs2014_train, rs2015_train, rs2016_train):
    for t, key in (('t1', 'team1'), ('t2', 'team2')):
        for stat in avg_stats:
            col = t + '_' + stat
            trailing = df.groupby(key, as_index=False)[col].apply(lambda x: x.expanding().mean().shift())
            df[col + '_avg'] = trailing.fillna(0).reset_index(level=0, drop=True)
Training the Model
Now that my features were how I wanted them, it was time to train my model. I split my target variable, in this case whether team1 would win or lose (indicated by a 1 or 0 respectively), into its own array.
target13 = rs2013_train['pred'].values
target14 = rs2014_train['pred'].values
target15 = rs2015_train['pred'].values
target16 = rs2016_train['pred'].values
I created my feature arrays separated by year.
feature_cols = ['t1_fg_pct_avg', 't1_fg3_pct_avg', 't1_ft_pct_avg', 't2_fg_pct_avg', 't2_fg3_pct_avg',
                't2_ft_pct_avg', 't1_ast_avg', 't1_dr_avg', 't1_stl_avg', 't1_pf_avg', 't1_blk_avg',
                't2_ast_avg', 't2_dr_avg', 't2_stl_avg', 't2_pf_avg', 't2_blk_avg']
features_array13 = rs2013_train[feature_cols].values
features_array14 = rs2014_train[feature_cols].values
features_array15 = rs2015_train[feature_cols].values
features_array16 = rs2016_train[feature_cols].values
logreg = LogisticRegression(C=1)
lrscores13 = cross_val_score(logreg, features_array13, target13, cv=5)
lrscores14 = cross_val_score(logreg, features_array14, target14, cv=5)
lrscores15 = cross_val_score(logreg, features_array15, target15, cv=5)
lrscores16 = cross_val_score(logreg, features_array16, target16, cv=5)
lrscores13.mean()
lrscores14.mean()
lrscores15.mean()
lrscores16.mean()
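As an aside, the Pipeline import at the top never gets used in this run. One natural use would be standardizing the features before the logistic regression, since raw counts and percentages live on very different scales. A sketch (StandardScaler is my addition here, not part of the original run):

from sklearn.preprocessing import StandardScaler

# Scaling inside the pipeline means the scaler is fit only on each CV training fold.
lr_pipe = Pipeline([('scale', StandardScaler()),
                    ('logreg', LogisticRegression(C=1))])
lr_pipe_scores13 = cross_val_score(lr_pipe, features_array13, target13, cv=5)
print(lr_pipe_scores13.mean())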
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
rfscores13 = cross_val_score(rf, features_array13, target13, cv=5, n_jobs=4, scoring='accuracy')
rfscores14 = cross_val_score(rf, features_array14, target14, cv=5, n_jobs=4, scoring='accuracy')
rfscores15 = cross_val_score(rf, features_array15, target15, cv=5, n_jobs=4, scoring='accuracy')
rfscores16 = cross_val_score(rf, features_array16, target16, cv=5, n_jobs=4, scoring='accuracy')
print('Random Forest CV Scores:')
print('min: {:.3f}, mean: {:.3f}, max: {:.3f}'.format(
    rfscores13.min(), rfscores13.mean(), rfscores13.max()))
print('min: {:.3f}, mean: {:.3f}, max: {:.3f}'.format(
    rfscores14.min(), rfscores14.mean(), rfscores14.max()))
print('min: {:.3f}, mean: {:.3f}, max: {:.3f}'.format(
    rfscores15.min(), rfscores15.mean(), rfscores15.max()))
print('min: {:.3f}, mean: {:.3f}, max: {:.3f}'.format(
    rfscores16.min(), rfscores16.mean(), rfscores16.max()))
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, subsample=.8, max_features=.5)
gbscores13 = cross_val_score(gb, features_array13, target13, cv=10, n_jobs=4, scoring='accuracy')
gbscores14 = cross_val_score(gb, features_array14, target14, cv=10, n_jobs=4, scoring='accuracy')
gbscores15 = cross_val_score(gb, features_array15, target15, cv=10, n_jobs=4, scoring='accuracy')
gbscores16 = cross_val_score(gb, features_array16, target16, cv=10, n_jobs=4, scoring='accuracy')
print("Gradient Boosted Trees CV scores:")
print("min: {:.3f}, mean: {:.3f}, max: {:.3f}".format(
gbscores13.min(), gbscores13.mean(), gbscores13.max()))
print("min: {:.3f}, mean: {:.3f}, max: {:.3f}".format(
gbscores14.min(), gbscores14.mean(), gbscores14.max()))
print("min: {:.3f}, mean: {:.3f}, max: {:.3f}".format(
gbscores15.min(), gbscores15.mean(), gbscores15.max()))
print("min: {:.3f}, mean: {:.3f}, max: {:.3f}".format(
gbscores16.min(), gbscores16.mean(), gbscores16.max()))
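Accuracy makes for quick comparisons, but the competition itself is scored on log loss, so it's worth checking the chosen model under that metric too. A quick look at one season's folds:

gb_logloss16 = cross_val_score(gb, features_array16, target16,
                               cv=10, n_jobs=4, scoring='neg_log_loss')
print('mean log loss: {:.3f}'.format(-gb_logloss16.mean()))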
For now I'll stick with gradient boosting as my classifier, and I'll start munging the test data I'll use to make my predictions. First I need to get the sample submission CSV that Kaggle provides into a format I can use with scikit-learn. Each id encodes a matchup as year_team1_team2, which is why I split on underscores below.
ss['year'] = ss.id.str.split('_').str.get(0)
ss['team1'] = ss.id.str.split('_').str.get(1)
ss['team2'] = ss.id.str.split('_').str.get(2)
ss.drop('pred', axis=1, inplace=True)
ss[['year', 'team1', 'team2']] = ss[['year', 'team1', 'team2']].apply(pd.to_numeric)
ss13 = ss[ss['year']==2013]
ss14 = ss[ss['year']==2014]
ss15 = ss[ss['year']==2015]
ss16 = ss[ss['year']==2016]
# Since each season is sorted by Daynum, .last() grabs each team's final row,
# i.e. its end-of-season trailing averages.
rs2013_end = rs2013_train.groupby('team1').last().reset_index(drop=False)
rs2014_end = rs2014_train.groupby('team1').last().reset_index(drop=False)
rs2015_end = rs2015_train.groupby('team1').last().reset_index(drop=False)
rs2016_end = rs2016_train.groupby('team1').last().reset_index(drop=False)
t1_avg_cols = ['team1', 't1_fg_pct_avg', 't1_fg3_pct_avg', 't1_ft_pct_avg', 't1_ast_avg',
               't1_dr_avg', 't1_stl_avg', 't1_pf_avg', 't1_blk_avg']
t2_avg_cols = ['team2', 't2_fg_pct_avg', 't2_fg3_pct_avg', 't2_ft_pct_avg', 't2_ast_avg',
               't2_dr_avg', 't2_stl_avg', 't2_pf_avg', 't2_blk_avg']

def add_season_end_stats(sub, season_end):
    # Merge each matchup with both teams' end-of-season trailing averages.
    t1_end = season_end[t1_avg_cols]
    t2_end = t1_end.copy()
    t2_end.columns = t2_avg_cols
    sub = pd.merge(sub, t1_end, on='team1')
    sub = pd.merge(sub, t2_end, on='team2')
    return sub

ss13 = add_season_end_stats(ss13, rs2013_end)
ss14 = add_season_end_stats(ss14, rs2014_end)
ss15 = add_season_end_stats(ss15, rs2015_end)
ss16 = add_season_end_stats(ss16, rs2016_end)
test_13 = ss13[feature_cols].values
test_14 = ss14[feature_cols].values
test_15 = ss15[feature_cols].values
test_16 = ss16[feature_cols].values
gb.fit(features_array13, target13)
t1_pred_prob_2013 = gb.predict_proba(test_13)
gb.fit(features_array14, target14)
t1_pred_prob_2014 = gb.predict_proba(test_14)
gb.fit(features_array15, target15)
t1_pred_prob_2015 = gb.predict_proba(test_15)
gb.fit(features_array16, target16)
t1_pred_prob_2016 = gb.predict_proba(test_16)
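One note on predict_proba: it returns one column per class, ordered by gb.classes_, so with 0/1 labels the second column is the probability that team1 wins. That's what the slicing below grabs.

print(gb.classes_)  # [0 1] -> column index 1 holds P(pred == 1)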
preds13 = t1_pred_prob_2013[:,1]
ss13['pred'] = preds13
preds14 = t1_pred_prob_2014[:,1]
ss14['pred'] = preds14
preds15 = t1_pred_prob_2015[:,1]
ss15['pred'] = preds15
preds16 = t1_pred_prob_2016[:,1]
ss16['pred'] = preds16
ss13 = ss13[['id', 'pred']]
ss14 = ss14[['id', 'pred']]
ss15 = ss15[['id', 'pred']]
ss16 = ss16[['id', 'pred']]
ss13 = ss13.sort_values('id').reset_index(drop=True)
ss14 = ss14.sort_values('id').reset_index(drop=True)
ss15 = ss15.sort_values('id').reset_index(drop=True)
ss16 = ss16.sort_values('id').reset_index(drop=True)
#ss_3_10_1635 = pd.concat([ss13, ss14, ss15, ss16], ignore_index=True)
#ss_3_10_1635.to_csv('ss_3_10_1635.csv', index=False)
The following section uses data from Ken Pomeroy. Various publications have indicated that these features are helpful in predicting NCAA matchups.
kp.tail()
teams.reset_index(level=0, inplace=True)
teams.tail(5)
At this point I only needed to concentrate on results from 2017. The second stage of the competition involved predicting outcomes, with associated probabilities, for every possible pairing of the 68 teams selected for the NCAA tournament (68 * 67 / 2 = 2278 matchups).
kp_2017 = kp[kp['Year']==2017]
# kp_2017['Team'] = kp_2017['Team'].apply(lambda x: difflib.get_close_matches(x, teams['Team_Name']))
# kp_id17 = pd.merge(teams, kp_2017, how='left', left_on='Team_Name', right_on='Team')
# kp_id17
kp_2017.tail()
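The step that actually builds kp_id2017 didn't make it into the notebook. As the write-up further down explains, I ended up mapping Pomeroy's team names onto Kaggle Team_Ids using a spellings file someone posted on the competition forum. A hypothetical reconstruction of that step (the file name and its name_spelling/Team_Id columns are assumptions):

# Hypothetical reconstruction -- the spellings file name and columns are assumed.
spellings = pd.read_csv('./team_spellings.csv')  # columns: name_spelling, Team_Id (assumed)
kp_2017 = kp_2017.copy()
kp_2017['Team_Id'] = kp_2017['Team'].str.lower().map(
    spellings.set_index('name_spelling')['Team_Id'])
kp_id2017 = pd.merge(teams, kp_2017, on='Team_Id')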
kp_id2017 = kp_id2017.sort_values('Team_Id')
kp_id2017.tail()
kp_feat_17 = pd.merge(kp_id2017, rs2017_train, left_on='Team_Id', right_on='team1')
kp_feat_17.head()
kp_feat_17 = kp_feat_17.rename(columns={'Team_Name':'t1_name', 'Rank':'t1_rank', 'Conference':'t1_conf',
'Wins':'t1_wins', 'Losses':'t1_losses','Pyth':'t1_pyth', 'AdjustO':'t1_adjusto',
'AdjustD':'t1_adjustd', 'AdjustT':'t1_adjustt', 'Luck':'t1_luck', 'SOS Pyth':'t1_sospyth',
'SOS OppO':'t1_sos_oppo', 'SOS OppD':'t1_sos_oppd', 'NCSOS Pyth':'t1_ncsos_pyth'})
kp_feat_17.drop(['Team_Id', 'Team', 'Daynum'], axis=1, inplace=True)
kp_feat_17 = pd.merge(kp_id2017, kp_feat_17, left_on='Team_Id', right_on='team2')
kp_feat_17 = kp_feat_17.rename(columns={'Team_Name':'t2_name', 'Rank':'t2_rank', 'Conference':'t2_conf',
'Wins':'t2_wins', 'Losses':'t2_losses','Pyth':'t2_pyth', 'AdjustO':'t2_adjusto',
'AdjustD':'t2_adjustd', 'AdjustT':'t2_adjustt', 'Luck':'t2_luck', 'SOS Pyth':'t2_sospyth',
'SOS OppO':'t2_sos_oppo', 'SOS OppD':'t2_sos_oppd', 'NCSOS Pyth':'t2_ncsos_pyth'})
kp_feat_17.drop(['Seed_x', 'Team', 'AdjustO Rank_x', 'AdjustD Rank_x', 'AdjustT Rank_x', 'Luck Rank_x',
'SOS Pyth Rank_x', 'SOS OppO Rank_x', 'SOS OppD Rank_x', 'NCSOS Pyth Rank_x', 'Year_y', 'Seed_y',
'AdjustO Rank_y', 'AdjustD Rank_y', 'AdjustT Rank_y', 'Luck Rank_y', 'SOS OppO Rank_y',
'SOS OppD Rank_y', 'NCSOS Pyth Rank_y'], axis=1, inplace=True)
kp_feat_17.drop('Team_Id', axis=1, inplace=True)
target17 = kp_feat_17['pred'].values
features_array17 = kp_feat_17[['t1_ncsos_pyth', 't1_sos_oppd', 't1_sos_oppo', 't1_sospyth', 't1_luck', 't1_adjustt',
't1_adjusto', 't1_adjustd', 't1_pyth', 't2_ncsos_pyth', 't2_sos_oppd', 't2_sos_oppo',
't2_sospyth', 't2_luck', 't2_adjustt', 't2_adjusto', 't2_adjustd', 't2_pyth']].values
gbscores17 = cross_val_score(gb, features_array17, target17, cv=10, n_jobs=4, scoring='accuracy')
print("Gradient Boosted Trees CV scores:")
print("min: {:.3f}, mean: {:.3f}, max: {:.3f}".format(
gbscores17.min(), gbscores17.mean(), gbscores17.max()))
ssub = pd.read_csv("./SampleSubmission.csv")
ssub.head()
ssub['year'] = ssub.Id.str.split('_').str.get(0)
ssub['team1'] = ssub.Id.str.split('_').str.get(1)
ssub['team2'] = ssub.Id.str.split('_').str.get(2)
ssub.drop('Pred', axis=1, inplace=True)
ssub[['year', 'team1', 'team2']] = ssub[['year', 'team1', 'team2']].apply(pd.to_numeric)
ss17 = ssub
ss17 = ss17.merge(teams, how='left', left_on='team1', right_on='Team_Id')
ss17.tail()
For my final submission I decided to train my model on the 2017 regular season using features from Ken Pomeroy's NCAA statistics. The hardest part was aligning the team names Ken Pomeroy uses with the team names Kaggle provided in its data. After playing with multiple fuzzy string matching modules, I found that someone on Kaggle had already posted a file mapping multiple spellings of each team name to the official team IDs Kaggle wanted in the submission file. This was a huge help.
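For reference, this is the kind of fuzzy matching I was experimenting with before finding that file: fuzzywuzzy's process.extractOne scores a Pomeroy-style name against Kaggle's Team_Name column (the query string here is just an illustration).

# Illustrative only: find the closest Kaggle team name for one Pomeroy spelling.
best = process.extractOne('North Carolina St.', teams['Team_Name'],
                          scorer=fuzz.token_sort_ratio)
print(best)  # (matched name, similarity score, row index)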
kp2017_end = kp_feat_17.groupby('team1').last().reset_index(drop=False)
kp2017_end['team1'] = kp2017_end['team1'].apply(pd.to_numeric)
t1_2017_end = kp2017_end[['team1', 't1_ncsos_pyth', 't1_sos_oppd', 't1_sos_oppo', 't1_sospyth', 't1_luck',
't1_adjustt','t1_adjusto', 't1_adjustd', 't1_pyth']]
ss17 = ss17.merge(t1_2017_end, how='left', left_on='team1', right_on='team1')
t2_2017_end = t1_2017_end.copy()
t2_2017_end.columns = ['team2', 't2_ncsos_pyth', 't2_sos_oppd', 't2_sos_oppo', 't2_sospyth', 't2_luck',
                       't2_adjustt', 't2_adjusto', 't2_adjustd', 't2_pyth']
ss17 = ss17.merge(t2_2017_end, how='left', left_on='team2', right_on='team2')
ss17.tail()
ss17.transpose()
test_17 = ss17[['t1_ncsos_pyth', 't1_sos_oppd', 't1_sos_oppo', 't1_sospyth', 't1_luck',
't1_adjustt','t1_adjusto', 't1_adjustd', 't1_pyth', 't2_ncsos_pyth',
't2_sos_oppd', 't2_sos_oppo', 't2_sospyth', 't2_luck',
't2_adjustt','t2_adjusto', 't2_adjustd', 't2_pyth']].values
gb.fit(features_array17, target17)
t1_pred_prob_2017 = gb.predict_proba(test_17)
preds17 = t1_pred_prob_2017[:,1]
ss17['pred'] = preds17
ss17 = ss17[['Id', 'pred']]
ss17 = ss17.sort_values('Id').reset_index(drop=True)
ss17.tail()
# ss17.to_csv('kp_gb_2017.csv', index=False)
In the end, I completely screwed up: I thought I had extra time to submit my result because I assumed the deadline was in EST, when in fact it was in UTC, so I missed it entirely. Still, I had a lot of fun and plan on competing in the 2018 competition. In retrospect, I probably need to find a way to incorporate Vegas odds into my model; most of the research I read implied that Vegas odds are very helpful in predicting winners. I also need to figure out how to introduce a certain amount of randomness into the outcomes. Research also indicates that the best any model can do at predicting the outcome of any sport is around 80%. I guess that's why they say sports is the ULTIMATE reality TV.