How to do modular coding

Recently, the Data Science Team at AccuWeather has been promoting modular coding to sharpen our coding skills and make our work more efficient. So what is modular coding? Per Wikipedia, modular programming is a software design technique that emphasizes separating the functionality of a program into independent, interchangeable modules, such that each contains everything necessary to execute only one aspect of the desired functionality. Simply put, anything you need to use repeatedly should be written into a function, or even a module, that you can easily call or import. At the very beginning, this idea wasn't so welcome within the team: it drastically increased the time we spent on data science coding, and it felt cumbersome to adopt. However, after the Director of Data Science kept asking us to showcase our modular coding at a code showcase meeting, I finally got the hang of it, and I'd like to walk through a simple example of modular coding: MAKE A MODULE AND IMPORT IT!
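To make the idea concrete before the real example, here it is in miniature. This is a minimal sketch with hypothetical names (the file helpers.py and the clean_columns function are made up purely for illustration): you write the reusable piece once in its own file, then import it anywhere.

# helpers.py -- a tiny hypothetical module holding one reusable function
def clean_columns(df):
    """Lower-case and strip whitespace from a DataFrame's column names."""
    df.columns = [col.strip().lower() for col in df.columns]
    return df

# any_script.py -- reuse the function instead of copy-pasting it
import pandas as pd
import helpers

df = pd.DataFrame({' CPI ': [1.2, 0.9]})
df = helpers.clean_columns(df)
print(df.columns)  # Index(['cpi'], dtype='object')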

STEP1: Find the repeated work you would need to do

Often, we come across a certain process or block of code that we'd like to use many times, and copy-and-paste makes that feel sweet and easy. However, it still takes time to copy and paste, and you first have to dig that code out. What if you could call it as easily as import pandas as pd? Of course, a couple of lines of code aren't worth reinventing the wheel for. But what about a massive modeling process, where you have to cycle through each state to build a model, and each model has to be iterated over 5,000 times? That sounds like a project worthy of modular coding. Therefore, I chose the modeling process below to start my modular coding with.

import pandas as pd
from sklearn.model_selection import train_test_split
import xgboost as xgb

LABEL = 'cpi'
# model each category and check # impressions
# create feature list for modeling
varimporder = []
varaddunit = []
# merged_df1_m is the prepared modeling DataFrame
FEATURES = [feat for feat in list(merged_df1_m) if feat not in ['date', 'metro', 'imps', 'clks', 'cpi', 'state']]
idv_list = []
metros = set(merged_df1_m['metro'])
for metro in metros:
    tmp_df = merged_df1_m[merged_df1_m['metro'] == metro]

    print(metro)
    idv_list.append(metro)
    x_train, x_test, y_train, y_test = train_test_split(tmp_df, tmp_df[LABEL],
                                                        test_size=0.33, random_state=42)
    dtrain = xgb.DMatrix(data=x_train[FEATURES], label=y_train)
    dval = xgb.DMatrix(data=x_test[FEATURES], label=y_test)
    param = {'max_depth': 5,
             'eta': 0.001,
             'silent': 1,
             "objective": "reg:linear",
             "eval_metric": "rmse",
             'subsample': 0.8,
             'maximize': False,
             'colsample_bytree': 0.8}
    evals = [(dtrain, 'train'), (dval, 'test')]
    clf = xgb.train(params=param,
                    dtrain=dtrain,
                    num_boost_round=50000,
                    verbose_eval=100,
                    early_stopping_rounds=500,
                    evals=evals)
    importances = clf.get_fscore()
    importance_frame = pd.DataFrame({'Importance': list(importances.values()), 'Feature': list(importances.keys())})
    importance_frame.sort_values(by='Importance', inplace=True, ascending=False)
    print(importance_frame)
    varaddunit.append(metro)
    varimporder.append(importance_frame['Feature'].values)

STEP2: Spot the repetitive parts of the code

In the code snippet above, I was fitting a separate XGBoost regression model for each metro's data (and the same pattern applies per state). From it, I knew I would need to repeat the steps below (a minimal sketch of the pattern follows the list):

  • train_test_split each group's data

  • every model shares the same DV (dependent variable) and IVs (independent variables)

  • the algorithm parameters will be more or less the same

  • each model outputs the same kind of DataFrame showing the important features
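In other words, only the data and the grouping column change between runs; everything else stays fixed. That refactoring pattern looks like this in miniature (run_per_group and its arguments are hypothetical names, not the final function):

# whatever stays fixed lives inside the function;
# whatever varies between runs becomes an argument
def run_per_group(data, group_col, non_features, label='cpi'):
    features = [feat for feat in list(data) if feat not in non_features]
    for value in set(data[group_col]):
        subset = data[data[group_col] == value]
        print(value, len(subset))  # the split/train/report steps go here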

Knowing that, I replaced the repeated parts with universal variables and wrapped everything into a function, with the parts subject to change exposed as arguments. See below:

  • I made three variables to fill in when calling the function: non_features, data, loop_feat

  • substituted the actual names with those variables inside the function

  • added an if __name__ == '__main__': block in case I also want to call the function in the current module or test it out

  • saved it as xg_boost.py and put it in the site-packages directory of your current Python interpreter. THIS IS VERY IMPORTANT! If the module isn't somewhere on the interpreter's path, the import will fail. (The snippet right after this list shows how to check that path.)
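Before the module itself, a quick sanity check: you can see exactly where your interpreter searches for modules by inspecting sys.path (this is standard Python, not part of the module):

import sys

# every directory in this list is a place `import` will look,
# including the interpreter's site-packages directory
print(sys.path)

# if you'd rather keep xg_boost.py elsewhere, add its directory instead:
# sys.path.append('/path/to/your/modules')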

import pandas as pd
from sklearn.model_selection import train_test_split
import xgboost as xgb


def xgboost_model(non_features, data, loop_feat):
    LABEL = 'cpi'
    # model each category and check # impressions
    # create feature list for modeling
    varimporder = []
    varaddunit = []
    feat_list = [feat for feat in list(data) if feat not in non_features]
    idv_list = []
    target = set(data[loop_feat])
    for item in target:
        tmp_df = data[data[loop_feat] == item]
        if len(tmp_df) > 1000:  # skip groups with too few rows to model
            print(item)
            idv_list.append(item)
            x_train, x_test, y_train, y_test = train_test_split(tmp_df, tmp_df[LABEL],
                                                                test_size=0.33, random_state=42)
            dtrain = xgb.DMatrix(data=x_train[feat_list], label=y_train)
            dval = xgb.DMatrix(data=x_test[feat_list], label=y_test)
            param = {'max_depth': 5,
                     'eta': 0.001,
                     'silent': 1,
                     "objective": "reg:linear",
                     "eval_metric": "rmse",
                     'subsample': 0.8,
                     'maximize': False,
                     'colsample_bytree': 0.8}
            evals = [(dtrain, 'train'), (dval, 'test')]
            clf = xgb.train(params=param,
                            dtrain=dtrain,
                            num_boost_round=5000,
                            verbose_eval=50,
                            early_stopping_rounds=500,
                            evals=evals)
            importances = clf.get_fscore()
            importance_frame = pd.DataFrame({'Importance': list(importances.values()), 'Feature': list(importances.keys())})
            importance_frame.sort_values(by='Importance', inplace=True, ascending=False)
            print(importance_frame)
            varaddunit.append(item)
            varimporder.append(importance_frame['Feature'].values)
    # return the modeled groups and their feature-importance orders
    return varaddunit, varimporder


if __name__ == '__main__':
    # quick test hook: define non_features, data, and loop_feat before running
    results = xgboost_model(non_features, data, loop_feat)
    print(results)

STEP3: Call the module and fill in the variables

As a final step, we just use import xg_boost to load the module, then call the function and swap in the different names for the variables. Voilà, you've got it: a simple piece of modular coding. Note that xg_boost is the name of the module I created, and xgboost_model is the function within the module.

import xg_boost

non_features0 = ['date', 'metro', 'imps', 'clks', 'cpi', 'state', 'cur_skycode']
# label0 = 'cpi'
loop_feat0 = 'state'
# duracell is the DataFrame to model, prepared elsewhere
results = xg_boost.xgboost_model(non_features0, duracell, loop_feat0)
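If you'd rather not prefix every call with the module name, Python's from ... import form works just as well:

from xg_boost import xgboost_model

# identical call, just without the module prefix
results = xgboost_model(non_features0, duracell, loop_feat0)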
