Hi! I'm Henry

Swarthmore College  •  Sushi Chef  •  PNW  •  Musician  •  Data Scientist

Portfolio

Fake Bananas


Fake Bananas is a fake news detector web app based on stance detection, natural language processing, and machine learning. At HackMIT 2017, Fake Bananas finished in the top 10 out of over 400 teams and 1250 hackers. Fake Bananas also won Best AI/Hack for Social Good from Baidu and the prize for the Most Interesting Use of Data from Hudson River Trading.

Our fake news detection is based on the concept of stance detection. Fake news is tough to identify. Many 'facts' are highly complex and difficult to check, exist on a 'continuum of truth', or are compound sentences where fact and fiction overlap. The best way to attack this problem is not through fact checking, but by comparing how reputable sources feel about a claim.

How FakeBananas works:

  1. Users input a claim like "The Afghanistan war was bad for the world".
  2. Our program searches thousands of global and local news sources for their 'stance' on that topic.
  3. We run the sources through our Reputability Algorithm. If many reputable sources agree with the claim, it's probably true.
  4. Then we cite our sources so users can click through and read more about the topic!
I was primarily in charge of web scraping for articles. Given a user URL or claim, I used Microsoft's Azure Cognitive Services and IBM's natural language processing tools to parse the article or claim and perform keyword extraction. I then used combinations of the keywords to collect up to a few thousand articles from Event Registry's database to pass on to the machine learning model. Here I erred on the side of collecting more rather than fewer articles, because the machine learning model determines relevancy further down the pipeline.
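
For the curious, here is a minimal sketch of that collection step. `extract_keywords` and `search_articles` are hypothetical stand-ins for the Azure/IBM keyword extraction and the Event Registry query, not our actual hackathon code:

```python
from itertools import combinations

def collect_articles(claim, extract_keywords, search_articles, max_per_query=500):
    """Gather candidate articles for a claim.

    extract_keywords and search_articles are hypothetical placeholders
    for the Azure/IBM keyword extraction and the Event Registry client.
    """
    keywords = extract_keywords(claim)  # e.g. ["Afghanistan", "war", "world"]
    articles, seen_urls = [], set()
    # Query with every pair of keywords, erring toward over-collection:
    # the downstream model filters for relevance anyway.
    for pair in combinations(keywords, 2):
        for article in search_articles(pair, limit=max_per_query):
            if article["url"] not in seen_urls:
                seen_urls.add(article["url"])
                articles.append(article)
    return articles
```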

Moving forward, we hope to launch a public-facing web application and potentially even a browser plug-in that detects news articles and displays what our pipeline returns.

-- Link to Github --

Lotus Journal

Lotus Journal is a self-writing journal. At HackMIT 2018, Lotus Journal finished in the top 10 out of over 400 teams and 1250 hackers. Lotus Journal also won the Azure Champ prize from Microsoft and Most Interesting Use of Text-Based Machine Learning from Quora.

(Screenshot of a generated Lotus Journal entry)

How Lotus Journal works:

  1. Users upload pictures into a journal entry.
  2. We use image2text libraries and the pictures' metadata to generate a caption of what is going on in each picture and, if provided, its location and date.
  3. We use this information to generate headings and prompts. In the screenshot above, the user provided only the picture of a man climbing. Our algorithm extracted the activity (rock climbing), the location (Rumney), and the time of day (morning), and generated a heading and prompt from them.
  4. As users type, the text is fed to my algorithm in real time. It extracts keywords and matches them against the corpus of previous journal entries, looking for similar temporal, locational, and linguistic keywords.
  5. This keyword information is then passed through my rule-based system to generate grammatically correct, open-ended questions geared toward reflection.
    Example of a generated question: "How was climbing today compared to 2 weeks ago in Wenatchee, WA?"
I was primarily in charge of the real-time question generation based on user input. My algorithm extracts keywords from the entry and matches them against the corpus of previous journal entries, looking for similar temporal, locational, and linguistic keywords; a rule-based system then turns the matches into grammatically correct, open-ended questions geared toward reflection. Question generation was particularly difficult because if a keyword was an activity, for example, I needed to account for the tense of the verb used. Climb, climbed, will climb, climbing, etc. all need to be handled independently, with tense inferred from context.
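
Here is a simplified sketch of that step; the gerund table and the question template are illustrative stand-ins rather than the exact rules I used:

```python
from datetime import date

# Illustrative gerund table; the real system inferred verb forms from context.
GERUNDS = {"climb": "climbing", "run": "running", "write": "writing"}

def reflection_question(activity, prev_date, prev_location, today=None):
    """Fill a question template from the keywords of a matched past entry."""
    today = today or date.today()
    weeks = max(1, (today - prev_date).days // 7)
    when = f"{weeks} week{'s' if weeks != 1 else ''} ago"
    gerund = GERUNDS.get(activity, activity + "ing")
    return f"How was {gerund} today compared to {when} in {prev_location}?"

# Reproduces the example question from above.
print(reflection_question("climb", date(2018, 9, 1), "Wenatchee, WA",
                          today=date(2018, 9, 15)))
# -> How was climbing today compared to 2 weeks ago in Wenatchee, WA?
```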

Moving forward, we hope to turn this into a startup. We want to incorporate additional data streams as users are able to provide them: Fitbit data, sleep data, fitness and nutrition data, moods, progress tracking from apps like Duolingo, and more. Over time there will be enough user data to run machine learning that finds relationships and correlations, and as we build expertise in wellbeing we can surface suggestions that give users insight into their lives.

-- Link to Github --

NBA Statistics Scraper

I built a web scraper that gathered statistical information from every active player in the NBA in the 2016-2017 season, as well as offensive and defensive team stats for each team. The data was pulled, cleaned, and displayed using a combination of the Python libraries NumPy, pandas, Bokeh, and BeautifulSoup. Hover over the graph to see a player's statistics. The legend in the bottom right is also interactive; click on a position to try it out!
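
A minimal sketch of the pull-and-clean step, assuming Basketball-Reference's 2016-17 per-game table as the source (the table id and the exact sources the original scraper used may differ):

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Assumed source: Basketball-Reference's 2016-17 per-game stats page.
URL = "https://www.basketball-reference.com/leagues/NBA_2017_per_game.html"

html = requests.get(URL, timeout=30).text
table = BeautifulSoup(html, "html.parser").find("table", id="per_game_stats")
df = pd.read_html(str(table))[0]

# Clean: drop the header rows the site repeats mid-table,
# then coerce the stat columns to numeric.
df = df[df["Player"] != "Player"].apply(pd.to_numeric, errors="ignore")
print(df[["Player", "Pos", "PTS", "AST", "TRB"]].head())
```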

Moving forward, I would like to analyze the data that I compiled and create a machine learning model that can predict roughly how well a player is going to do based on recent performances, offensive/defensive team metrics, and other statistics.

-- Link to Github --

NBA Shot Chart ShinyR App

For my data science class, my partner Alex Mandel and I created this project to explore NBA players' shot charts. We were interested in how well the top 25 NBA players perform from different areas of the court; the app shows where each shot was taken, filtered by quarter or over an entire season. We downloaded and cropped our court images from Sports Illustrated, pulled our player data from NBA Savant, and downloaded the NBA schedule from Kaggle. Finally, we scraped NBA team abbreviations from Wikipedia, which helped us match up much of our data. The app is hosted here.
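
The app itself is written in R with Shiny, but the core idea, filtering shot events by player and quarter and scattering them over a court image, looks roughly like this in Python. The file names and column names below are assumptions about the NBA Savant export:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names for an NBA Savant shot-log export.
shots = pd.read_csv("nba_savant_shots.csv")
player, quarter = "Stephen Curry", 4
sel = shots[(shots["name"] == player) & (shots["period"] == quarter)]

fig, ax = plt.subplots()
court = plt.imread("court.png")                     # cropped half-court image
ax.imshow(court, extent=[-250, 250, -47.5, 422.5])  # assumed shot coordinates
made = sel["shot_made_flag"] == 1
ax.scatter(sel.loc[made, "x"], sel.loc[made, "y"], c="green", s=10, label="made")
ax.scatter(sel.loc[~made, "x"], sel.loc[~made, "y"], c="red", s=10, label="missed")
ax.legend()
plt.show()
```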

Moving forward, I would like to analyze the data that we compiled and create a machine learning model that can predict what a player's shot chart will be on a particular night. I have read about existing projects that have been particularly successful with this kind of prediction using deep neural nets.

-- Link to Github --
-- Link to App --

Analysis of Feature Selection Across Different Machine Learning Algorithms

For my machine learning class, my partner and I chose to do an analysis of how different machine learning algorithms handle feature selection and dimensionality reduction, implicitly or explicitly. We chose four machine learning models: support vector machines, Bayesian networks, and logistic regression with L1 and L2 norms. We compared the results to existing preprocessing techniques used for dimensionality reduction: Principal Component Analysis and Linear Discriminant Analysis. The linear transformation using PCA is shown below.

Abstract

Feature importance and dimensionality reduction are important for effectively visualizing and interpreting real-world datasets, as well as improving prediction accuracy. Using the Students Academic Performance Dataset from Kaggle, we implemented support vector machines, Bayesian networks, and logistic regression with L1 and L2 norms. These algorithms were run with various hyperparameters to determine the feature importance and predictive power of the models. The results were then compared to the important features determined by the preprocessing techniques Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). Using R2 values, which measure the percentage of variation in the response variable explained by the model, we found that L1 regularized logistic regression and Bayesian networks best fit the data. Their higher R2 values indicate that these algorithms had the most predictive power, allowing them to identify the most important features in the dataset. SVM, LDA, and L2 regularized logistic regression fit the data least accurately, with SVM having the next highest R2 value and L2 having the lowest. The important features identified by Bayesian networks, PCA, and LDA were compared, and the irrelevant features determined by SVM and L1/L2 logistic regression were also evaluated.
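
Here is a small sketch of the comparison on synthetic stand-in data (the real analysis used the Kaggle dataset), showing how L1 regularization zeroes out irrelevant features while L2 only shrinks them, next to PCA and LDA loadings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 200 students, 8 features, only the first 3 informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, :3].sum(axis=1) > 0).astype(int)

# L1 performs implicit feature selection; L2 only shrinks coefficients.
l1 = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
l2 = LogisticRegression(penalty="l2").fit(X, y)
print("features zeroed by L1:", np.where(l1.coef_[0] == 0)[0])
print("smallest |coef| under L2:", np.argsort(np.abs(l2.coef_[0]))[:3])

# Loadings from the preprocessing baselines, for comparison.
pca = PCA(n_components=2).fit(X)
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
print("top PCA component:", np.round(pca.components_[0], 2))
print("LDA coefficients: ", np.round(lda.coef_[0], 2))
```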

Moving forward, I would like to do more analysis in the sci-kit learn library feature_selection. I would also like to use multiple datasets and have more baselines to compare results to.

-- Link to paper --

Why You Need 12 Bananas Worth of Potassium Every Day

Potassium is the third most abundant mineral in the body, so it is no surprise that a potassium-deficient diet can lead to a slew of problems. The most common symptoms of potassium deficiency include fatigue, muscle weakness, and brain fog. In addition, supplementing potassium has been shown to stimulate neural activities like memorization and learning, help lower blood pressure, and reduce stress and anxiety. Potassium plays a vital role in maintaining water balance in the body, and a sufficient concentration is also required for regular contraction and relaxation of muscles. To an extent, a state of potassium deficiency can be thought of as being intoxicated by alcohol - both are characterized by poor muscle coordination, lapses in judgement, and potentially even poor memory. The daily recommended value is 4700mg of potassium - at roughly 400mg per medium banana, that's 12 bananas' worth - and I’m willing to bet that you’re not getting enough of it on a consistent basis.

Read more about why potassium is so important here: -- Link to paper --

Superpixels!

For my Parallel and Distributed Systems class, my partners and I chose to parallelize a superpixel segmentation algorithm called Simple Linear Iterative Clustering, or SLIC. We compared the runtime and boundary recall of our parallel implementation to the sequential implementation and found some interesting results!
Below is an example of an image with superpixel boundaries superimposed onto it (shown in yellow).

Abstract

Superpixelation involves grouping pixels in a way that captures some of the perceptual and intuitive meaning of an image. Superpixels are a useful tool for many computer vision problems including image segmentation and object recognition because superpixelation can be used as a preprocessing step that both reduces dimensionality and identifies more meaningful features. A good superpixel algorithm efficiently produces superpixels that respect object boundaries, are of approximately equal size, and are compact -- this means that superpixel edges fall along the edges of objects in the image, there is a roughly constant number of pixels per superpixel, and each superpixel is relatively round. We implement a parallelized version of the simple linear iterative clustering (SLIC) algorithm for superpixelation with the goal of improving time efficiency and possibly scaling to larger image sizes. SLIC uses a k-means clustering approach that iteratively updates the assignment of pixels to superpixels based on the distance between each pixel and the superpixel centers. It is a good candidate algorithm for GPU parallelization because it has subparts that can be computed independently by pixel or by superpixel. Although our results show that our parallelized implementation is 4-5 times slower than the sequential SLIC, we achieve nearly the same accuracy on metrics calculated using UC Berkeley's Segmentation Benchmarks - especially as the number of superpixels increases.
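
Scikit-image ships a reference (sequential) SLIC, which is a quick way to reproduce a superpixelation like the one in the figure above; this is not our parallel implementation:

```python
import matplotlib.pyplot as plt
from skimage import data
from skimage.segmentation import mark_boundaries, slic

# Sequential SLIC from scikit-image on a sample image.
image = data.astronaut()
segments = slic(image, n_segments=200, compactness=10, start_label=1)

# Superimpose the superpixel boundaries in yellow, as in the example above.
plt.imshow(mark_boundaries(image, segments, color=(1, 1, 0)))
plt.axis("off")
plt.show()
```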

-- Link to paper --

Paces Cafe

Paces is a student-run cafe that is open Sunday through Wednesday nights on Swarthmore College's campus. As a freshman, I joined the cafe as a prep chef. I was in charge of making guacamole and salsa for the week as well as prepping vegetables and special dishes. I was promoted to Kitchen Director at the end of freshman year and proceeded to revamp the menu to provide healthier and tastier options, while cutting the dishes that had one-dimensional ingredients. I worked closely with the dining staff to conduct inventory and sales analysis to further increase our profit margins and reduce waste. In my first 4 months on the team of directors, revenue increased 400% and we reached the cafe's nightly capacity; Swarthmore College is currently looking to renovate a space to accommodate our sudden growth. I also interviewed, hired, and trained a team of 40 students, and I work closely with the staff to facilitate nightly operations so that we can all have a good time as both students and employees. Below is the menu that we began the second semester with.

Reiki OM

Produced by Soundings of the Planet in 2011. I was the featured artist playing the GuZheng.
COVR Visionary Award Double Winner - Best Innerspace, Meditation, Healing Music and Music of the Year!

I'm verified on Spotify! Find my profile on the left sidebar and experience the benefits of Earth Resonance Frequency (ERF)! ERF helps entrain brainwaves to the alpha state.


Traveling and Photography

I love to travel! Some of my most memorable trips have been to Banff, Canada; Dubai, UAE; Harare, Zimbabwe; Kunming, China; and Helsinki, Finland.

I'm also getting into photography - check out my VSCO to get a glimpse at what I've been up to!

-- Link to VSCO --

Late Nite Swarthmore

Within 2 months of starting my freshman year at Swarthmore College, I started a food delivery service that tackled the shortage of late-night grub options on campus. Check out the menu here.

Scouted.io included me in their List of Top 5 Scouted Student Entrepreneurs: -- Scouted.io Blog --

Swarthmore College interviewed me for the article "24 Ways to Look at a Fish"

DJai

Using Spotify, DJai generates playlists based on user inputs like desired energy level and tempo. The web app automatically syncs the beats of consecutive songs and crossfades them as if a real DJ mixed the tracks! Generated playlists can also be exported to Spotify to be played in the future.
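
A rough sketch of the playlist-generation step, assuming the spotipy client and Spotify's recommendations endpoint; the beat-syncing and crossfading happen in the web app's player and aren't shown here:

```python
import spotipy
from spotipy.oauth2 import SpotifyOAuth

# Authenticate with permission to create private playlists.
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope="playlist-modify-private"))

# Ask Spotify for tracks near the user's desired energy and tempo.
tracks = sp.recommendations(seed_genres=["house"], limit=20,
                            target_energy=0.8, target_tempo=124)["tracks"]

# Export the generated mix back to the user's Spotify account.
user_id = sp.current_user()["id"]
playlist = sp.user_playlist_create(user_id, "DJai mix", public=False)
sp.playlist_add_items(playlist["id"], [t["id"] for t in tracks])
```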

Coming soon!
-- Link to Github --

About Me

I am a computer science and math double major at Swarthmore College graduating in the Spring of 2020.
I am also passionate about psychology, neuroscience, and nutrition. I like to stay fit through basketball, tennis, hiking, and bodybuilding. I play many instruments, but mainly the piano and GuZheng. My favorite pieces include Liszt's Grandes études de Paganini, Rachmaninoff's piano concertos, and Chopin's Scherzo No. 3 in C-sharp minor, Op. 39.

On a journey to quench my thirst for knowledge