Himanshu Gamit

I earned my Bachelor's degree in Computer Engineering from Sardar Vallabhbhai National Institute of Technology, Surat (India), and have kept learning new trends in the market ever since. Today, I understand the importance of understanding data to solve real-world problems. To illustrate: just as the human brain learns problem solving from accumulated experience (learn and apply), data-driven analysis can learn from data and provide more efficient solutions to real-world problems.

  • Mail #6841, University of St. Thomas, 2115 Summit Ave, Saint Paul, MN 55105 (USA)
  • +1 (651)-214-4908
  • himanshu.gamit@stthomas.edu
  • https://datascienceusage.blogspot.com/
Me

My Professional Skills

Data analysis with Python and R, plus application development; Django web development with an analytical backend.

Python, R 90%
Web Development 70%
App Development 95%
Django Web 60%

Statistical Analysis

Familiar with statistical tests, distributions, maximum likelihood estimators, etc.

Data Intuition

Understanding which things are important, and which aren't.

Communication

Describing your findings, and the way techniques work, to both technical and non-technical audiences.

Machine Learning

k-nearest neighbors, random forests, ensemble methods, and more (R and Python libraries).

Data Visualization

Data visualization tools such as matplotlib, ggplot, and d3.js.

Software Engineering

Data logging, and potentially the development of data-driven products.

  • OpenIntro Statistics Case Study


    General Process of Statistical Investigation:
    1. Identify a question or problem.
    2. Collect relevant data on the topic.
    3. Analyze the data.
    4. Form a conclusion. 
    Case study: using stents to prevent strokes
    An experiment that studied the effectiveness of stents in treating patients at risk of stroke.
    • Stents are devices put inside blood vessels that assist in patient recovery after cardiac events and reduce the risk of an additional heart attack or death.
    • Does the use of stents reduce the risk of stroke? 
    Treatment and Control Group
    Each of the 451 volunteer patients was randomly assigned to one of two groups:
    Treatment group (224) - Patients in the treatment group received a stent and medical management. The medical management included medications, management of risk factors, and help with lifestyle modification.
    Control group (227) - Patients in the control group received the same medical management as the treatment group, but they did not receive stents.

    A Data Table
    Researchers studied the effect of stents at two time points: 30 days after enrollment and 365 days after enrollment.


    A Data Summary
    Considering data from each patient individually would be a long, cumbersome path towards answering the original research question. Instead, performing a statistical data analysis allows us to consider all of the data at once.


    Summary Statistics
    A summary statistic is a single number summarizing a large amount of data. For instance, the primary results of the study after 1 year could be described by two summary statistics: the proportion of people who had a stroke in the treatment group and in the control group.
    Proportion who had a stroke in the treatment (stent) group: 45/224 = 0.20 = 20%.
    Proportion who had a stroke in the control group: 28/227 = 0.12 = 12%.
    These two summary statistics are useful in looking for differences between the groups, and we are in for a surprise: an additional 8% of patients in the treatment group had a stroke! This is important for two reasons. First, it is contrary to what doctors expected, which was that stents would reduce the rate of strokes. Second, it leads to a statistical question: do the data show a “real” difference between the groups?
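    The two proportions above can be reproduced in a few lines of Python (a minimal sketch; the counts come straight from the study):

```python
# Stroke counts after 1 year, from the stent study.
treatment_strokes, treatment_total = 45, 224
control_strokes, control_total = 28, 227

# Summary statistics: the proportion who had a stroke in each group.
p_treatment = treatment_strokes / treatment_total
p_control = control_strokes / control_total

print(f"Treatment: {p_treatment:.2f}")           # 0.20
print(f"Control:   {p_control:.2f}")             # 0.12
print(f"Difference: {p_treatment - p_control:.2f}")  # 0.08
```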

    Random Fluctuation
    "Do the data show a “real” difference between the groups?" 
    Suppose you flip a coin 100 times. While the chance a coin lands heads in any given coin flip is 50%, we probably won’t observe exactly 50 heads. This type of fluctuation is part of almost any type of data generating process.
    It is possible that the 8% difference in the stent study is due to this natural variation. However, the larger the difference we observe (for a particular sample size), the less believable it is that the difference is due to chance. So what we are really asking is the following: is the difference so large that we should reject the notion that it was due to chance?
    Be careful: do not generalize the results of this study to all patients and all stents. This study looked at patients with very specific characteristics who volunteered to be a part of this study and who may not be representative of all stroke patients.
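    The chance-variation question can be explored with a simple permutation simulation (a sketch, not part of the OpenIntro text: group labels are reshuffled to see how large a difference arises by chance alone):

```python
import random

# Pool all 451 patients: 73 strokes (45 + 28) and 378 without,
# then repeatedly reassign them at random to "treatment" (224)
# and "control" (227) to measure chance-only differences.
random.seed(0)
outcomes = [1] * 73 + [0] * (451 - 73)

def simulated_diff():
    random.shuffle(outcomes)
    treatment, control = outcomes[:224], outcomes[224:]
    return sum(treatment) / 224 - sum(control) / 227

diffs = [simulated_diff() for _ in range(10_000)]

observed = 45 / 224 - 28 / 227  # about 0.08, as in the study
p_value = sum(d >= observed for d in diffs) / len(diffs)
print(f"Share of shuffles with a difference this large: {p_value:.3f}")
```

    A small share here means an 8% gap is unlikely to arise from natural variation alone, which is exactly the question the study asks.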
  • Case Studies



    Kaggle.com and HackerRank.com Competitions (2012–2017)
    • Worked on various competitions using nearest neighbor, Naïve Bayes, decision trees, regression, xgboost, and sklearn (in Python and R).
    • For data exploration, used statistical distributions and plots in Jupyter with Python (matplotlib.pyplot) and R packages.

    Online Data Case Studies:  

    Quora Question Pairs:
    • Preprocessing - Built a classification model to identify whether two questions are similar: data exploration, variable identification, and data cleaning (stopword and duplicate removal, HTML tag removal, stemming).
    • Feature extraction - Word counts, TF-IDF, word vectorization, and image-feature creation.
    • Modeling techniques and training - Cross-validation (StratifiedKFold index splits); trained a gradient boosting model and generated a submission on the test set.
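    A minimal sketch of this cross-validation and training setup, with synthetic data standing in for the Quora features (the real feature matrix and model parameters differed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss

# Placeholder feature matrix / binary labels (question-pair similarity).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# StratifiedKFold keeps the class balance in each fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, valid_idx in skf.split(X, y):
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[valid_idx])[:, 1]
    scores.append(log_loss(y[valid_idx], proba))

print(f"Mean CV log loss: {np.mean(scores):.3f}")
```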
    Liberty Mutual Group:
    • Preprocessing - Property inspection prediction: data exploration and variable identification. Each row in the dataset corresponds to a property that was inspected and given a hazard score ("Hazard").
    • Modeling technique - Implemented stacking (stacked generalization): three regressors (ExtraTreesRegressor, RandomForestRegressor, and GradientBoostingRegressor) are stacked by a RidgeCV meta-model.
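    A minimal stacked-generalization sketch following that structure, with sklearn's RidgeCV as the meta-model and synthetic data standing in for the Liberty Mutual features:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (ExtraTreesRegressor, RandomForestRegressor,
                              GradientBoostingRegressor)
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

# Placeholder for the inspection features / "Hazard" target.
X, y = make_regression(n_samples=300, n_features=8, noise=10, random_state=0)

base_models = [
    ExtraTreesRegressor(n_estimators=50, random_state=0),
    RandomForestRegressor(n_estimators=50, random_state=0),
    GradientBoostingRegressor(random_state=0),
]

# Out-of-fold predictions from each base model become the
# meta-model's training features (this is the "stacking" step).
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5) for m in base_models
])

stacker = RidgeCV().fit(meta_features, y)
print("Meta-model weights per base model:", stacker.coef_)
```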
    Caterpillar Tube Pricing:
    • Preprocessing - About 60% of my time was spent on feature engineering. There are two main categories of features: tube features and component features.
    • Modeling technique - Used linear regression as the major model, in an ensemble with nearest neighbor, xgboost, RandomForestRegressor, and ExtraTreesRegressor.
    Crowd flower search results relevance:
    • Preprocessing - Predicted the relevance of search results from e-commerce sites: data exploration, variable identification, and data cleaning (stopword and duplicate removal, HTML tag removal, stemming).
    • Feature extraction - Word counts, positional and statistical distances, TF-IDF, SVD-reduced features, and cosine similarity.
    • Modeling techniques and training - Cross-validation (StratifiedKFold index splits), evaluated with the quadratic weighted kappa, which measures the agreement between two ratings. Since the relevance score takes values in {1, 2, 3, 4}, it is straightforward to treat the problem as multi-class classification (softmax loss), though plain classification ignores the weight and magnitude of the ratings. Finally, ensemble selection: first, a model library is built, with each model's parameters guided by a parameter-search algorithm; second, model weights are optimized during ensemble selection; third, random weights were used for the ensemble, similar to ExtraTreesRegressor.
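    The TF-IDF and cosine-similarity features mentioned above can be sketched like this (toy strings stand in for the actual query/title pairs):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder (query, product-title) pairs to score for relevance.
queries = ["wireless bluetooth headphones", "stainless steel water bottle"]
titles  = ["bluetooth over-ear wireless headphones", "insulated steel bottle"]

# Fit one vocabulary over both sides, then vectorize each side.
vec = TfidfVectorizer().fit(queries + titles)
q, t = vec.transform(queries), vec.transform(titles)

# One similarity score per (query, title) pair, used as a model feature.
sims = cosine_similarity(q, t).diagonal()
print(sims)
```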
    Taxi Trip Time Prediction:
    • Preprocessing - Built a predictive framework to infer the trip time of taxi rides: data exploration, variable identification, track sampling based on track length and coordinates, removal of trips that did not follow the general distributions, and correction of misread GPS coordinates.
    • Feature Extraction - Interestingly, most of the metadata seemed to have little to no predictive power, so in the end, I only used the time stamp. 
    • Modeling and training - All models were trained with 5-fold cross-validation. I used RandomForestRegressor (RFR) and GradientBoostingRegressor (GBR) from sklearn with default settings, except for the number of trees, which was set to 200.
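    A sketch of that training setup, with synthetic data replacing the real trip features (200 trees, 5-fold cross-validation, as described):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Placeholder for the trip features and trip-time target.
X, y = make_regression(n_samples=300, n_features=6, noise=5, random_state=0)

for model in (RandomForestRegressor(n_estimators=200, random_state=0),
              GradientBoostingRegressor(n_estimators=200, random_state=0)):
    # 5-fold cross-validation, as in the competition setup.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(type(model).__name__, round(scores.mean(), 3))
```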
    Drawbridge Cross-Device Connections:
    • Preprocessing - Identified individual users across their digital devices; explored the data and joined all the basic information about each device, cookie, and IP address.
    • Feature Extraction - Generate a few features based on the interaction between device, cookie, and IP.
    • Modeling technique - On this reduced dataset we built a learning-to-rank model, a modified version of xgboost's "rank:pairwise" objective, partitioned by device.
    Indeed.com -Tagging Raw Job Descriptions:
    • Preprocessing - Data exploration, variable identification, and data cleaning (stopword and duplicate removal, HTML tag removal, stemming).
    • Feature Extraction - word counting, TF-IDF, cosine similarity
    • Modeling Techniques and Training - Cross Validation Methodology (StratifiedKFold indices split), Train Gradient Boosting model and generate submission based on the test set.
    Stock Predictions: Applied machine learning (sklearn neural-network algorithms) to predict stock price movements.




