Graduate Student

Shi Pu
------------ New York, NY 10025 ------------/in/spu20 ------------
SUMMARY
Master's student seeking data analyst opportunities. Experienced in data visualization and machine learning projects, with knowledge of statistics and data analytics and programming skills in Python, R, and SQL.
EDUCATION
Columbia University, New York, NY December 2021
Master of Science in Applied Analytics
Courses: Applied Analytics Frameworks and Methods, Research Design, Storytelling with Data
Boston University, Boston, MA May 2020
Bachelor of Arts in Mathematics, specialization in Statistics; GPA: 3.41/4.0
Dean's List: Spring 2017, Fall 2018, Spring 2019, Spring 2020
SUMMARY OF SKILLS
Programming: Python (scikit-learn, pandas, NumPy), SQL, R, SAS
Machine Learning: Classical and Penalized Regression Methods (LASSO, Ridge), Regularization, Decision Trees, Clustering, K-means, K-nearest Neighbors, Principal Component Analysis
Statistical Analysis: Hypothesis Testing (A/B Testing), Text Mining, Time Series Analysis
PROFESSIONAL EXPERIENCE
Bank J. Safra Sarasin Ltd, Branch - Analytics Assistant; Hong Kong, China July 2019 - August 2019
• Examined a list of equities of interest to the Financial Advisory team and created a table sorted by growth rate using Bloomberg. Filtered the list by removing companies that could not be incorporated in the sustainability matrix
• Analyzed fund performance and created lists of 5 types of funds, ranked from best to worst YTD performance, using Morningstar
• Made portfolio construction suggestions to the Financial Advisory team based on net growth, volatility, and fund type
Renaissance Era Investment - Analytics Assistant; Beijing, China May 2019 - July 2019
• Developed linear and time series models in R to predict a masked index Y from the data provided
• Selected an AR model over the multiple linear regression model favored by the supervisor, based on its higher coefficient of determination
• Examined the investment performance of other companies and compared the characteristics of their investment strategies, generating tables and mind maps to assist the analytics team
• Modified and implemented a Wide & Deep model to predict stock performance at the team's request
RESEARCH EXPERIENCE
Racism and Internet
• Research Assistant (Supervisor: Dr. Robert D. Eschmann); Boston, MA; September 2018 - February 2020
• Explored the correlation between racist comments on the internet and other factors by conducting hypothesis testing and building linear models using SAS
• Collected data on speech dates. Performed exploratory data analysis. Cleaned sample data through missing-value handling, duplicate removal, format editing, and feature encoding. Generated tables in Excel
• Applied linear models and hypothesis testing to find factors that may trigger discussions about racism in materials such as Trump's Twitter posts or Obama's speeches
• Found no correlation between Trump's or Obama's speeches and racist comments on Twitter
Predicting Association between MicroRNAs and Diseases
• Research Assistant (Supervisor: Dr. Xing Chen); Beijing, China; August 2018
• Implemented an improved algorithm to predict associations between miRNAs and diseases using MATLAB
• Reviewed research papers in bioinformatics and explored how random walk, the PMBDA model, and heterogeneous graphs could be applied to the research subject
• Adopted a gravity model to improve the PMBDA prediction model
• Built a heterogeneous network and applied the modified algorithm using MATLAB. Compared its prediction accuracy with the results from the PMBDA model
• The modified algorithm offers a wider range of possible selections at the cost of lower time efficiency. The supervisor commended the ingenuity of the idea
RELEVANT PROJECTS
Bank Customer Attrition Prediction
• Developed machine learning models in Python to predict bank customer loss and analyzed key features based on the labeled data provided
• Preprocessed the data set through data cleaning, categorical feature transformation, and standardization
• Trained supervised machine learning models including Logistic Regression, Random Forest, and K-Nearest Neighbors, with regularization and tuned parameters to prevent overfitting
• Evaluated classification performance (accuracy or F1) via k-fold cross-validation. Analyzed feature importance to identify factors with relatively high influence on the results
Customer Review Clustering and Topic Modeling
• Performed customer review clustering and examined the latent semantic structures using Python
• Preprocessed review text through tokenization, stemming, and stop-word removal, and extracted features using Term Frequency - Inverse Document Frequency (TF-IDF)
• Trained unsupervised learning models: K-means clustering and Latent Dirichlet Allocation
• Identified and clustered potential topics and keywords of each review
Amazon Prime Movie View Time Prediction
• Built a machine learning model in Python to predict view time of Amazon Prime movies
• Performed exploratory data analysis. Preprocessed raw data of sales and products through handling missing values, categorical feature encoding, and feature scaling
• Built linear regression and random forest models and evaluated them by splitting the cleaned data into train, validation, and test sets
• Selected the final model based on prediction accuracy. Identified important factors for view time
Daily Online Video Game Player Count Prediction
• Built time series models in R to predict the daily player count of Counter-Strike: Global Offensive
• Performed exploratory data analysis. Preprocessed raw player count data through handling missing values and feature scaling. Split the data into train and test sets
• Developed SARIMA models based on the results of Box-Jenkins analysis and the Augmented Dickey-Fuller test. Selected the final model based on AICc
• Evaluated the model on the test set and compared it with the results of an additive Holt-Winters forecast. The SARIMA model performed adequately at first, but its MAPE gradually worsened, possibly due to the epidemic