Himanshu Gamit

I earned my Bachelor's degree in Computer Engineering from Sardar Vallabhbhai National Institute of Technology, Surat (India), and have kept learning new trends in the market ever since. Today, I understand the importance of understanding data to solve real-world problems. To illustrate: just as the human brain learns problem solving from accumulated experience (learn and apply), data-driven analysis can learn from data and provide more efficient solutions to real-world problems.

  • Mail #6841, University of St. Thomas, 2115 Summit Ave, Saint Paul, MN 55105 (USA)
  • +1 (651)-214-4908
  • himanshu.gamit@stthomas.edu
  • https://datascienceusage.blogspot.com/
Me

My Professional Skills

Data analysis with Python and R, plus application development; Django web development with an analytical backend.

Python, R 90%
Web Development 70%
App Development 95%
Django Web 60%

Statistical Analysis

Familiar with statistical tests, distributions, maximum likelihood estimators, etc.

Data Intuition

Understanding which things are important, and which aren't.

Communication

Describing your findings, and the way techniques work, to both technical and non-technical audiences.

Machine Learning

k-nearest neighbors, random forests, ensemble methods, and more (R and Python libraries).

Data Visualization

Data visualization tools such as matplotlib, ggplot, and d3.js.

Software Engineering

Data logging, and potentially the development of data-driven products.

  • OpenIntro Statistics Case Study


    General Process of Statistical Investigation:
    1. Identify a question or problem.
    2. Collect relevant data on the topic.
    3. Analyze the data.
    4. Form a conclusion. 
    Case study: using stents to prevent strokes
    An experiment that studied the effectiveness of stents in treating patients at risk of stroke.
    • Stents are devices put inside blood vessels that assist in patient recovery after cardiac events and reduce the risk of an additional heart attack or death.
    • Does the use of stents reduce the risk of stroke? 
    Treatment and Control Group
    Each of the 451 volunteer patients was randomly assigned to one of two groups:
    Treatment group (224) - Patients in the treatment group received a stent and medical management. The medical management included medications, management of risk factors, and help with lifestyle modification.
    Control group (227) - Patients in the control group received the same medical management as the treatment group, but they did not receive stents.

    A Data Table
    Researchers studied the effect of stents at two time points: 30 days after enrollment and 365 days after enrollment.


    A Data Summary
    Considering data from each patient individually would be a long, cumbersome path towards answering the original research question. Instead, performing a statistical data analysis allows us to consider all of the data at once.


    Summary Statistics
    A summary statistic is a single number summarizing a large amount of data. For instance, the primary results of the study after 1 year could be described by two summary statistics: the proportion of people who had a stroke in the treatment group and in the control group.
    Proportion who had a stroke in the treatment (stent) group: 45/224 = 0.20 = 20%.
    Proportion who had a stroke in the control group: 28/227 = 0.12 = 12%.
    These two summary statistics are useful in looking for differences between the groups, and we are in for a surprise: an additional 8% of patients in the treatment group had a stroke! This is important for two reasons. First, it is contrary to what doctors expected, which was that stents would reduce the rate of strokes. Second, it leads to a statistical question: do the data show a “real” difference between the groups?
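    The two proportions above can be reproduced in a few lines of Python (a minimal sketch; the counts come straight from the study):

```python
# Stroke counts after 1 year, from the stent study.
treatment_strokes, treatment_total = 45, 224
control_strokes, control_total = 28, 227

# Summary statistics: the proportion who had a stroke in each group.
p_treatment = treatment_strokes / treatment_total
p_control = control_strokes / control_total

print(f"Treatment: {p_treatment:.2f}")           # 0.20
print(f"Control:   {p_control:.2f}")             # 0.12
print(f"Difference: {p_treatment - p_control:.2f}")  # 0.08
```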

    Random Fluctuation
    "Do the data show a “real” difference between the groups?" 
    Suppose you flip a coin 100 times. While the chance a coin lands heads in any given coin flip is 50%, we probably won’t observe exactly 50 heads. This type of fluctuation is part of almost any type of data generating process.
    It is possible that the 8% difference in the stent study is due to this natural variation. However, the larger the difference we observe (for a particular sample size), the less believable it is that the difference is due to chance. So what we are really asking is the following: is the difference so large that we should reject the notion that it was due to chance?
    Be careful: do not generalize the results of this study to all patients and all stents. This study looked at patients with very specific characteristics who volunteered to be a part of this study and who may not be representative of all stroke patients.
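    The chance-variation question can be explored with a simple permutation simulation (a sketch, not part of the OpenIntro text: group labels are reshuffled to see how large a difference arises by chance alone):

```python
import random

# Pool all 451 patients: 73 strokes (45 + 28) and 378 without,
# then repeatedly reassign them at random to "treatment" (224)
# and "control" (227) to measure chance-only differences.
random.seed(0)
outcomes = [1] * 73 + [0] * (451 - 73)

def simulated_diff():
    random.shuffle(outcomes)
    treatment, control = outcomes[:224], outcomes[224:]
    return sum(treatment) / 224 - sum(control) / 227

diffs = [simulated_diff() for _ in range(10_000)]

observed = 45 / 224 - 28 / 227  # about 0.08, as in the study
p_value = sum(d >= observed for d in diffs) / len(diffs)
print(f"Share of shuffles with a difference this large: {p_value:.3f}")
```

    A small share here means an 8% gap is unlikely to arise from natural variation alone, which is exactly the question the study asks.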
  • Case Studies



    Kaggle.com and HackerRank.com Competitions (2012–2017)
    • Worked on various competitions using nearest neighbor, Naïve Bayes, decision trees, regression, xgboost, and sklearn (in Python and R).
    • For data exploration, used statistical distributions and plots in Jupyter with Python (matplotlib.pyplot) and R packages.

    Online Data Case Studies:  

    Quora Question Pairs:
    • Preprocessing - Built a classification model to identify whether two questions are similar: data exploration, variable identification, and data cleaning (stopword and duplicate removal, HTML tag removal, stemming).
    • Feature extraction - Word counts, TF-IDF, word vectorization, and image-feature creation.
    • Modeling techniques and training - Cross-validation (StratifiedKFold index splits); trained a gradient boosting model and generated a submission on the test set.
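    A minimal sketch of this cross-validation and training setup, with synthetic data standing in for the Quora features (the real feature matrix and model parameters differed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss

# Placeholder feature matrix / binary labels (question-pair similarity).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# StratifiedKFold keeps the class balance in each fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, valid_idx in skf.split(X, y):
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[valid_idx])[:, 1]
    scores.append(log_loss(y[valid_idx], proba))

print(f"Mean CV log loss: {np.mean(scores):.3f}")
```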
    Liberty Mutual Group:
    • Preprocessing - Property inspection prediction: data exploration and variable identification. Each row in the dataset corresponds to a property that was inspected and given a hazard score ("Hazard").
    • Modeling technique - Implemented stacking (stacked generalization): three regressors (ExtraTreesRegressor, RandomForestRegressor, and GradientBoostingRegressor) are stacked by a RidgeCV meta-model.
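    A minimal stacked-generalization sketch following that structure, with sklearn's RidgeCV as the meta-model and synthetic data standing in for the Liberty Mutual features:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (ExtraTreesRegressor, RandomForestRegressor,
                              GradientBoostingRegressor)
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

# Placeholder for the inspection features / "Hazard" target.
X, y = make_regression(n_samples=300, n_features=8, noise=10, random_state=0)

base_models = [
    ExtraTreesRegressor(n_estimators=50, random_state=0),
    RandomForestRegressor(n_estimators=50, random_state=0),
    GradientBoostingRegressor(random_state=0),
]

# Out-of-fold predictions from each base model become the
# meta-model's training features (this is the "stacking" step).
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5) for m in base_models
])

stacker = RidgeCV().fit(meta_features, y)
print("Meta-model weights per base model:", stacker.coef_)
```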
    Caterpillar Tube Pricing:
    • Preprocessing - About 60% of my time was spent on feature engineering. There are two main categories of features: tube features and component features.
    • Modeling technique - Used linear regression as the major model, in an ensemble with nearest neighbor, xgboost, RandomForestRegressor, and ExtraTreesRegressor.
    Crowd flower search results relevance:
    • Preprocessing - Predicted the relevance of search results from e-commerce sites: data exploration, variable identification, and data cleaning (stopword and duplicate removal, HTML tag removal, stemming).
    • Feature extraction - Word counts, positional and statistical distances, TF-IDF, SVD-reduced features, and cosine similarity.
    • Modeling techniques and training - Cross-validation (StratifiedKFold index splits), evaluated with the quadratic weighted kappa, which measures the agreement between two ratings. Since the relevance score takes values in {1, 2, 3, 4}, it is straightforward to treat the problem as multi-class classification (softmax loss), though plain classification ignores the weight and magnitude of the ratings. Finally, ensemble selection: first, a model library is built, with each model's parameters guided by a parameter-search algorithm; second, model weights are optimized during ensemble selection; third, random weights were used for the ensemble, similar to ExtraTreesRegressor.
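    The TF-IDF and cosine-similarity features mentioned above can be sketched like this (toy strings stand in for the actual query/title pairs):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder (query, product-title) pairs to score for relevance.
queries = ["wireless bluetooth headphones", "stainless steel water bottle"]
titles  = ["bluetooth over-ear wireless headphones", "insulated steel bottle"]

# Fit one vocabulary over both sides, then vectorize each side.
vec = TfidfVectorizer().fit(queries + titles)
q, t = vec.transform(queries), vec.transform(titles)

# One similarity score per (query, title) pair, used as a model feature.
sims = cosine_similarity(q, t).diagonal()
print(sims)
```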
    Taxi Trip Time Prediction:
    • Preprocessing - Built a predictive framework to infer the trip time of taxi rides: data exploration, variable identification, track sampling based on track length and coordinates, removal of trips that did not follow the general distributions, and correction of misread GPS coordinates.
    • Feature Extraction - Interestingly, most of the metadata seemed to have little to no predictive power, so in the end, I only used the time stamp. 
    • Modeling and training - All models were trained with 5-fold cross-validation. I used RandomForestRegressor (RFR) and GradientBoostingRegressor (GBR) from sklearn with default settings, except for the number of trees, which was set to 200.
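    A sketch of that training setup, with synthetic data replacing the real trip features (200 trees, 5-fold cross-validation, as described):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Placeholder for the trip features and trip-time target.
X, y = make_regression(n_samples=300, n_features=6, noise=5, random_state=0)

for model in (RandomForestRegressor(n_estimators=200, random_state=0),
              GradientBoostingRegressor(n_estimators=200, random_state=0)):
    # 5-fold cross-validation, as in the competition setup.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(type(model).__name__, round(scores.mean(), 3))
```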
    Drawbridge Cross-Device Connections:
    • Preprocessing - Identified individual users across their digital devices; explored the data and joined all the basic information about each device, cookie, and IP address.
    • Feature Extraction - Generate a few features based on the interaction between device, cookie, and IP.
    • Modeling technique - On this reduced dataset we built a learning-to-rank model, a modified version of xgboost's "rank:pairwise" objective, partitioned by device.
    Indeed.com -Tagging Raw Job Descriptions:
    • Preprocessing - Data exploration, variable identification, and data cleaning (stopword and duplicate removal, HTML tag removal, stemming).
    • Feature Extraction - word counting, TF-IDF, cosine similarity
    • Modeling Techniques and Training - Cross Validation Methodology (StratifiedKFold indices split), Train Gradient Boosting model and generate submission based on the test set.
    Stock Predictions: Applied machine learning (sklearn neural-network algorithms) to predict stock price movements.




