Kaggle.com and HackerRank.com Competitions (2012-2017)
- Worked on various competitions using nearest neighbor, Naïve Bayes, decision trees, regression, xgboost, and sklearn (in Python and R).
- For Data Exploration, used statistical distributions and plots in Jupyter with Python (matplotlib.pyplot) and R packages.
Online Data Case Studies:
Quora Question Pairs:
- Preprocessing - Built a classification model to identify whether two questions are duplicates; Data Exploration, variable identification, and data cleaning (stop-word removal, de-duplication, HTML tag removal, and stemming)
- Feature extraction - word counts, TF-IDF, word vectorization, and image-feature creation
- Modeling Techniques and Training - Cross Validation Methodology (StratifiedKFold indices split); trained a Gradient Boosting model and generated the submission from the test set
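The TF-IDF similarity signal mentioned above can be sketched as follows; the question pairs here are made-up examples, not the Quora data:

```python
# Minimal sketch of a TF-IDF cosine-similarity feature for question pairs
# (hypothetical data; the real pipeline also used stemming and stop-word removal).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

q1 = ["how do i learn python", "what is machine learning"]
q2 = ["how can i learn python quickly", "who won the world cup"]

# Fit one vocabulary over both question columns, then score each pair.
vec = TfidfVectorizer().fit(q1 + q2)
sims = [cosine_similarity(vec.transform([a]), vec.transform([b]))[0, 0]
        for a, b in zip(q1, q2)]
# Duplicate-like pairs score higher than unrelated ones.
```

A similarity score like this becomes one column of the feature matrix fed to the gradient boosting model.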
Liberty Mutual Group:
- Preprocessing - Property Inspection Prediction; Data Exploration and variable identification; each row in the dataset corresponds to a property that was inspected and given a hazard score ("Hazard")
- Modeling Technique - Implemented stacking (stacked generalization): three regressors (ExtraTreesRegressor, RandomForestRegressor, and GradientBoostingRegressor) are stacked with a RidgeCV meta-model
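A minimal sketch of that stacking setup, with synthetic data standing in for the competition set: out-of-fold predictions from the three base regressors become the features of the ridge meta-model.

```python
# Stacked generalization sketch (synthetic data, not the Liberty Mutual set).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

base_models = [
    ExtraTreesRegressor(n_estimators=50, random_state=0),
    RandomForestRegressor(n_estimators=50, random_state=0),
    GradientBoostingRegressor(random_state=0),
]
# Out-of-fold predictions avoid leaking each base model's training labels
# into the meta-model's inputs.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5) for m in base_models
])
stacker = RidgeCV().fit(meta_features, y)
```

The learned ridge coefficients play the role of blending weights over the three base models.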
Caterpillar Tube Pricing:
- Preprocessing - About 60% of my time went to Feature Engineering; there are two main categories of features: tube features and component features
- Modeling Technique - Used linear regression as the major model, ensembled with other models such as nearest neighbor, xgboost, RandomForestRegressor, and ExtraTreesRegressor
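One way to realize such an ensemble is a weighted blend of each model's test-set predictions; the sketch below uses synthetic data and illustrative weights, not the competition values:

```python
# Weighted-blend ensemble sketch (hypothetical data and weights).
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

models = {
    "linear": LinearRegression(),
    "rf": RandomForestRegressor(n_estimators=50, random_state=1),
    "et": ExtraTreesRegressor(n_estimators=50, random_state=1),
}
preds = {name: m.fit(X_tr, y_tr).predict(X_te) for name, m in models.items()}
# Linear regression gets the largest weight, mirroring its "major model" role.
blend = 0.5 * preds["linear"] + 0.25 * preds["rf"] + 0.25 * preds["et"]
```

In practice the weights would be tuned on a validation fold rather than fixed by hand.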
CrowdFlower Search Results Relevance:
- Preprocessing - Predicted the relevance of search results from e-commerce sites; Data Exploration, variable identification, and data cleaning (stop-word removal, de-duplication, HTML tag removal, and stemming)
- Feature extraction - word counts, position and statistical distance, TF-IDF, SVD-reduced features, cosine similarity
- Modeling Techniques and Training - Cross Validation Methodology (StratifiedKFold indices split), scored with the quadratic weighted kappa, which measures the agreement between two ratings. Since the relevance score takes values in {1, 2, 3, 4}, it is straightforward to treat the problem as multi-class classification (using softmax loss), although plain classification ignores the ordering and magnitude of the ratings. Finally, ensemble selection: first, the model library is built, with each model's parameters guided by a parameter-search algorithm; second, model weights are optimized during the ensemble-selection procedure; third, random weights were used for the ensemble model, similar in spirit to ExtraTreesRegressor.
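The quadratic weighted kappa can be computed with scikit-learn's `cohen_kappa_score`; the ratings below are illustrative:

```python
# Quadratic weighted kappa: agreement between two raters on an ordinal scale,
# penalizing large disagreements more than near misses (illustrative ratings).
from sklearn.metrics import cohen_kappa_score

y_true = [1, 2, 3, 4, 4, 2]
y_pred = [1, 2, 4, 4, 3, 2]
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
```

A score of 1 means perfect agreement, 0 means chance-level agreement, and off-by-one errors are penalized far less than off-by-three errors, which is why this metric suits ordinal relevance grades better than plain accuracy.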
Taxi Trip Time Prediction:
- Preprocessing - Built a predictive framework to infer the trip time of taxi rides; Data Exploration, variable identification, track sampling based on track length and coordinates, removal of trips that do not follow the general distributions, and correction of misread GPS coordinates
- Feature Extraction - Interestingly, most of the metadata had little to no predictive power, so in the end I used only the timestamp.
- Modeling and Training - All models were trained with 5-fold cross-validation. I used RandomForestRegressor (RFR) and GradientBoostingRegressor (GBR) from sklearn with default settings except for the number of trees, which was set to 200.
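That training setup can be sketched as below, with synthetic data in place of the taxi features:

```python
# 5-fold CV for RFR and GBR with defaults except n_estimators=200,
# as described above (synthetic stand-in data).
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=5, noise=3.0, random_state=0)

results = {}
for Model in (RandomForestRegressor, GradientBoostingRegressor):
    scores = cross_val_score(Model(n_estimators=200, random_state=0), X, y, cv=5)
    results[Model.__name__] = scores.mean()  # mean R^2 across the 5 folds
```

The default `cross_val_score` scoring for regressors is R²; the competition itself used RMSLE, so a custom `scoring` argument would be passed in practice.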
Drawbridge Cross-Device Connections:
- Preprocessing - Identified individual users across their digital devices; explored the data and joined all the basic information about each device, cookie, and IP address
- Feature Extraction - Generated features based on the interactions between devices, cookies, and IPs
- Modeling Technique - On this reduced dataset, built a learning-to-rank model, a modified version of xgboost's "rank:pairwise" objective, partitioned by device
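The per-device partitioning that a pairwise ranking objective requires can be sketched with plain Python; xgboost's ranking interface expects candidates grouped per query (here, per device) with an array of group sizes. The candidate pairs below are hypothetical:

```python
# Build the group-size array a pairwise learning-to-rank objective needs:
# all cookie candidates for one device form one ranking group.
from itertools import groupby

# (device_id, cookie_id) candidate pairs, pre-sorted by device (hypothetical).
candidates = [
    ("dev1", "cookieA"), ("dev1", "cookieB"), ("dev1", "cookieC"),
    ("dev2", "cookieD"), ("dev2", "cookieE"),
]
group_sizes = [len(list(rows))
               for _, rows in groupby(candidates, key=lambda r: r[0])]
# group_sizes would then be passed to the ranker alongside the feature matrix.
```

Ranking within each device's group is what lets the model pick the best-matching cookie per device instead of scoring all pairs globally.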
Indeed.com - Tagging Raw Job Descriptions:
- Preprocessing - Data Exploration, variable identification, and data cleaning (stop-word removal, de-duplication, HTML tag removal, and stemming)
- Feature Extraction - word counts, TF-IDF, cosine similarity
- Modeling Techniques and Training - Cross Validation Methodology (StratifiedKFold indices split); trained a Gradient Boosting model and generated the submission from the test set
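The StratifiedKFold training loop described above can be sketched as follows, with a synthetic matrix standing in for the TF-IDF features:

```python
# StratifiedKFold CV around a Gradient Boosting classifier
# (synthetic features stand in for the real TF-IDF matrix).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, valid_idx in skf.split(X, y):
    clf = GradientBoostingClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    fold_scores.append(clf.score(X[valid_idx], y[valid_idx]))
```

Stratification keeps each fold's tag distribution close to the full dataset's, which matters when some job-description tags are rare.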
Stock Predictions: Applied sklearn neural-network algorithms to predict stock prices, with the goal of generating profitable trading signals.
