Learning to Crawl (the Internet)

Most projects start with wanting to know something: how many dinosaurs would be necessary to re-take over the world, could cats become world-famous scientists, etc. In my case, I just wanted to know what famous people were born on a particular day of the year. Google does a great job with their doodles, but they are very selective with who gets one. You probably think this is a non-issue — there are tons of sites that have this information, right? And you would be correct, except for one problem: they look awful. 

If I wanted to use a website with the UI/UX from the early 2000s, I would not be writing this article. Like most people, I really enjoy using the Internet, and I want it to look good. I also believe in minimal, unintrusive design without ads for hoverboards in my face. Based on these principles, I had the requirements for my minimum viable product. I would develop a website that:

  • Shows you a famous person who was born on the current date.
  • Displays information about that person (birthday, a short biography, a picture, and a link to learn more about them).
  • Loads a different person every time you refresh (or click a button).
  • Allows you to choose any day of the year.
  • Is available for offline use.

Seems simple. Except, as I got started, I ran into problem one: which people would be on this website? It would be a bit biased to put myself on it, and I could not find any good APIs that solved this problem. After some quick Google-fu, I learned that Wikipedia was my answer (as it is the answer to everything else). It turns out that Wikipedia has a page for every day of the year. Each page lists significant events, births, and deaths. Awesome!

Except this led me to problem two: I had no clue how to get the information off that page, unless I wanted to embed an iframe of that unstyled list into my site. Take a second to visualize that. Sorry about that, next time I will include a trigger warning. Another option would be to manually copy the list from every page into a giant JSON file. There are only 366 days (do not forget February 29th!), after all, and this is an MVP. I decided not to do that, however, as I was reminded of a quote.

“I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it.” 

— Bill Gates

Save every list myself? That would be madness and probably take longer than writing the actual program. And that is how I found the solution to problem two: learn how to crawl. Or, more specifically, learn how to crawl every one of these wiki pages to get the information I needed.

The code of my project went through three major revisions. I want to take some time to thank Jonathan E. Magen for inspiring the second two revisions, lessons learned, and other major aspects of this application. Without his advice and guidance, we may still be looking at famous birthdays on a list that looks like this: 

Birthday List

We want to avoid sites that look like this. Source: http://www.onthisday.com/today/birthdays.php

Anyway, let us look at the major takeaways from each of these revisions and lessons learned.

0.1: Everything is in an Angular service.

Version 0.1 of the site had everything in an Angular service. Every request meant multiple sub-requests to the Wikipedia API. That’s right, all parsing, formatting, and querying was done on the fly. As you can imagine, this was not very efficient. The code was extremely messy and unorganized.

Lessons learned: 

  • Use test driven development. I had no idea what worked, what did not, and what would break if I changed something.
  • Break code into logical classes via the single responsibility principle. I like to think of this as the “do-not-have-a-function-be-a-hero philosophy”. Functions should have one responsibility, they should do it well, and they should rely on their fellow functions to do the other parts.
  • Stop doing the same actions more than once! Why query, parse, and format this data on the fly when it only has to happen once?

0.2: Use the latest, most hipster technology ever.

Version 0.2 had two goals: break the application in half, and use the latest, coolest web frameworks. One of these was accomplished.

In order to make the app faster, Jonathan suggested breaking the application into two parts: the web app, and the crawler. The crawler would be responsible for:

  • Parsing every Wikipedia page and grabbing the list of names.
  • Querying every name to get additional information about them (short biography, birthday, picture).
  • Saving the information into a data store.

The web-app would be responsible for:

  • Serving the birthday data.

The first part went great. Using my newly learned lessons, I created a Node.JS application called WikiCrawler. Using the MediaWiki API and Wikipedia, it makes the necessary requests and saves the data into JSON.
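The actual WikiCrawler is a Node.js application and its code is not shown here. As a rough illustration of the first crawler step, here is a minimal Java sketch (Java being the language used for the code elsewhere on this blog). It assumes the day pages are fetched as wikitext through the MediaWiki `action=parse` endpoint, and it assumes a births entry looks like `* [[1912]] – [[Alan Turing]], English mathematician`. The class name, the method names, and that line format are all hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not the actual WikiCrawler code (which is Node.js). */
final class BirthLineParser {

    // Assumed line shape: "* [[1912]] – [[Alan Turing]], English mathematician"
    // Group 1 = the year (wiki-linked or plain).
    // Group 2 = the target of the first link after the dash, i.e. the article title.
    private static final Pattern BIRTH_LINE = Pattern.compile(
            "^\\*\\s*(?:\\[\\[)?(\\d{1,4})(?:\\]\\])?\\s*[\u2013-]\\s*"
            + "\\[\\[([^\\]|]+)(?:\\|[^\\]]*)?\\]\\]");

    /**
     * Builds the MediaWiki API URL that returns the wikitext of a day page.
     * The title is assumed to be URL-safe already, e.g. "January_1".
     */
    static String dayPageUrl(String pageTitle) {
        return "https://en.wikipedia.org/w/api.php"
                + "?action=parse&prop=wikitext&format=json&page=" + pageTitle;
    }

    /** Returns {year, articleTitle} for a births entry, or empty if the line does not match. */
    static Optional<String[]> parse(String line) {
        Matcher m = BIRTH_LINE.matcher(line.trim());
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new String[] {m.group(1), m.group(2).trim()});
    }
}
```

Each article title returned by `parse` could then feed the second crawler step, the per-person query for a biography and picture. Section headings and other lines that do not look like a births entry simply come back empty.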

It takes ~100 seconds to gather the data for ~90,000 famous people. Not too bad when you factor in rate-limits and me trying not to get IP-banned from using the API. Note: at this time, this does not include the requests needed to get images.

The final file size wound up being 42.8 megabytes. Not too bad, but not something you want to load every time you use a website.

The web-app was supposed to be a super-sleek Angular 2 app, built with TypeScript, componentized, and compiled with Babel. Angular 2 just hit beta, and it is awesome! It also has a fraction of the documentation that Angular 1.x has, which can be a problem when you cannot tell whether something is actually broken or you are just doing something wrong. After internally struggling with how to build my web-app, I was reminded of another quote.

“There are two kinds of start-ups. Those with beautiful code, and those that ship.”

— Silvio Galea

This quote is what inspired the next (and current) version.

0.3: Keep it simple, get it working, get it public.

The goal of version 0.3 was to use my (mostly working) crawler and get this app shipped! This version solved my two main problems:

  • Writing code that makes sense.
  • Loading the data quickly and efficiently.

I ditched Angular 2, for now, and went back to the stable and reliable Angular generator. I have experience with it, it works, and there is tons of support for it.

When I talked to Jonathan, he had a simple solution for the data problem: “Why not just break the data up by day?” So I did, and created 366 JSON files. Boom, instant loads. Completely offline. Now that my code was in a workable format, I was able to easily create services to handle my data. And Famous Birthdays was born.
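To make the "one file per day" idea concrete, here is a small Java sketch of how a single large data set could be bucketed by month and day. It is not the project's actual code: the `Person` class, the `DayBuckets` class, and the `MM-DD.json` naming scheme are all assumptions made for illustration.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical record of one crawled person; only the fields needed for bucketing. */
final class Person {
    final String name;
    final int month;
    final int day;

    Person(String name, int month, int day) {
        this.name = name;
        this.month = month;
        this.day = day;
    }
}

/** Hypothetical sketch of splitting one big data set into per-day buckets. */
final class DayBuckets {

    /** File name for a month and day, e.g. (2, 29) -> "02-29.json" (the naming is an assumption). */
    static String fileNameFor(int month, int day) {
        return String.format(Locale.ROOT, "%02d-%02d.json", month, day);
    }

    /** All 366 file names, produced by walking a leap year so that February 29 is included. */
    static List<String> allFileNames() {
        List<String> names = new ArrayList<>();
        LocalDate date = LocalDate.of(2016, 1, 1); // 2016 is a leap year
        while (date.getYear() == 2016) {
            names.add(fileNameFor(date.getMonthValue(), date.getDayOfMonth()));
            date = date.plusDays(1);
        }
        return names;
    }

    /** Groups people by the file they belong in; each list would be written out as one JSON file. */
    static Map<String, List<Person>> group(List<Person> people) {
        Map<String, List<Person>> buckets = new TreeMap<>();
        for (Person p : people) {
            buckets.computeIfAbsent(fileNameFor(p.month, p.day), k -> new ArrayList<>()).add(p);
        }
        return buckets;
    }
}
```

With this layout the web app only has to load the one bucket for the selected day, instead of the whole 42.8 megabyte file.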

Let’s wrap it up. 

Famous Birthdays is currently live on the web and is hosted on GitHub Pages. At the time of writing, we do not have images yet, but they will be added soon.

I learned to crawl. This project may not change the world, but it was an amazing learning experience that I can carry into future projects. It was not about the problem of finding out which famous people were born on a given day, but about solving the surrounding problems:

  • Programmatically using Wikipedia to quickly and easily gather the required data.
  • Learning to use the right tools for the job.
  • Writing a program that is readable and reliable (or, the importance of building a solid foundation on important concepts, such as test-driven development and the single responsibility principle).
  • Approaching a problem that seems easy on the outside, but is in fact a lot more complicated on the inside. 

Thank you for taking the time to follow the journey of how I learned to crawl! I have learned a lot on this project, and there is still more to go. Next stop, maybe the Chrome Web Store. Whatever comes after that is out of scope for this post.

You can check out the project here.

Hackathon @Google


This was the hashtag used by close to 100 student hackers, Googlers, and guest speakers at Google NYC’s recent Hack4Humanity hackathon! From Friday night to early Sunday morning, we coded and brainstormed with intermittent breaks for talks, food, coffee, and sometimes even sleeping.

The Google Logo

This was my first official hackathon, so I had no idea what to expect. On top of that, it was at Google, so I really did not want to embarrass myself too much.

Upon arriving, and after being given an unhealthy amount of Google swag, we found our table and listened to some guest talks. The speakers described real-life problems they had encountered and empowered us to find solutions. This hackathon was not about making something cool for yourself, or something technically challenging; it was about making something useful and even life-changing for some part of the world.

Baris Yuksel (@baris_wonders), a tech lead at Google and one of the hackathon’s organizers, described it best by saying our goal was not to create a “scheduling app” or one of the other dozen ideas that people always try. “How many people in the world do you think would benefit from a website that enables you to double major?” he asked us today, right after complimenting every group on a job well done. The answer is not too many.

Everyone demoed a practical app today that could easily improve the quality of life of some people and possibly save others.

Our app may not have won anything, but I still learned a lot and was very grateful for the opportunity. Many valuable lessons were learned, and not just on the coding side, and I hope to apply them at future hackathons.


Prime numbers.

Hopefully, by now, we know what they are. Do we, however, know an efficient way to add up every single one between 1 and 2,000,000?

This question, also known as problem 10 on Project Euler, seems pretty simple at face value. For beginners, the simplest and most obvious solution is to solve it by brute force:

Exhibit A

long sumOfPrimes = 0;
boolean isPrime;

for (int x = 2; x < 2000000; x++) {
    isPrime = true;
    // Brute force: try every possible divisor below x.
    for (int y = 2; y < x; y++) {
        if (x % y == 0) {
            isPrime = false;
        }
    }
    if (isPrime) {
        sumOfPrimes += x;
    }
}


This way would be fine if our upper limit was not two million. Running this code works, but it takes over 10 minutes to get the answer on my Acer Aspire S7-392 Ultrabook (a great computer, but it is not Watson).

XKCD 303 - Compiling

I am pretty sure Project Euler has a “one-minute rule” where a well-designed solution should produce its answer in under a minute of running time, so this solution was obviously not acceptable.

The next thing I tried was putting the numbers in an ArrayList, from two to two million. Then, I ran them through a loop that deleted the numbers from the list if they were not prime. You can imagine this was not very efficient either - in fact, it was faster to brute force. After messing around with this (very messy) code for an extended period of time, I did some research on algorithms and stumbled upon the Sieve of Eratosthenes.

Sieve of Eratosthenes

The Sieve of Eratosthenes works on the idea that if a number n is prime, every larger multiple of n (2n, 3n, and so on) cannot be prime, so those multiples can be ruled out without testing them individually.

It took me a few tries to implement it correctly, but I believe I did it fairly efficiently here:

Exhibit B

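import java.util.ArrayList;

// Every prime found so far, in ascending order.
static ArrayList<Integer> primesList = new ArrayList<Integer>();

// A number is prime if none of the previously found primes divides it.
public static boolean isPrime(int testDigit) {
    for (int primeVal : primesList) {
        if (testDigit % primeVal == 0)
            return false;
    }
    return true;
}

long sumOfPrimes = 0;

for (int x = 2; x < 2000000; x++) {
    if (isPrime(x))
        primesList.add(x);
}

for (int primeVal : primesList) {
    sumOfPrimes += primeVal;
}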


Instead of comparing each number to every possible factor, we now only compare it to known prime numbers. Starting with 2, there are no numbers in our array list of primes yet, so 2 is added to the prime list. The number 3 is compared to the array list of primes, which only consists of 2; 2 does not divide it evenly, so 3 is added to the list of primes. When 4 is compared against the list of primes, it divides evenly by 2, so it is not added to the list and the check returns early since the number cannot be prime. And so on.

This, of course, makes our program take a fraction of the time to find a solution, because we are comparing each number against far fewer candidates. When the upper limit is 2 million, the efficiency gains are much more noteworthy.

Time to find the solution in exhibit B: ~46 seconds.
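For comparison, Exhibit B tests each number by dividing it by the primes found so far, while the textbook form of the sieve never divides at all: it keeps an array of flags and crosses out the multiples of each prime as it goes. The following is a minimal sketch of that array-based version, not the code I ran for the timing above. The class and method names are made up for illustration.

```java
/** Textbook array-based Sieve of Eratosthenes; a sketch for comparison, not Exhibit B. */
final class PrimeSieve {

    /** Returns the sum of all primes strictly below limit. */
    static long sumPrimesBelow(int limit) {
        if (limit < 3) {
            return 0; // the smallest prime is 2, so nothing lies below a limit of 2 or less
        }
        // composite[i] == true means i has been crossed out as a multiple of a smaller prime.
        boolean[] composite = new boolean[limit];
        long sum = 0;
        for (int n = 2; n < limit; n++) {
            if (composite[n]) {
                continue;
            }
            sum += n; // n was never crossed out, so it is prime
            // Cross out multiples starting at n * n; smaller multiples of n were
            // already crossed out by smaller primes. long avoids int overflow.
            for (long m = (long) n * n; m < limit; m += n) {
                composite[(int) m] = true;
            }
        }
        return sum;
    }
}
```

A quick sanity check: `PrimeSieve.sumPrimesBelow(10)` adds 2, 3, 5, and 7 and returns 17. The Project Euler question is then `PrimeSieve.sumPrimesBelow(2000000)`.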


Why does everything need a specific category?

Let's replace classified advertisements with a more modern approach. I could post a wanted ad on Craigslist, or pay for one in my local newspaper, but there is no guarantee anyone will ever see it. Even if the right person does, who is to say it will be in time?

It is almost 2015: we live in an age where instant feedback is expected. We could hit refresh every 10 seconds on Craigslist but that does not seem very time efficient.

The idea (and a rough example).

Let's create a platform where people can post their requests and display them to relevant, applicable people. For example, pretend I want to play frisbee right now:

I post a request (except let's call them buzzes) saying “Looking for people to play some frisbee.” Sounds great, but this could potentially reach millions of people. So let's narrow it down a bit.

In my buzz criteria, I will put “Drexel University students in a one mile range”. I can also add “able to play sometime this afternoon”.

By adding specific criteria, we can avoid spamming random people and make sure our buzz is seen by people who want to see it! If all went well, Drexel students currently in the area who like frisbee and are not marked as busy would see this post and be able to respond to it. If someone was really into frisbee, they could set up push notifications for whenever a buzz was posted with the keyword frisbee.
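As a very rough sketch of how the criteria in this example might be checked, here is a bit of Java. Nothing like this exists yet, so every name below is hypothetical. It assumes each user has an affiliation, a busy flag, and a current location, and it uses the standard haversine formula for the "within one mile" part.

```java
/** Hypothetical matcher for the buzz idea; every name here is illustrative. */
final class BuzzMatcher {

    // Mean radius of the Earth in miles (about 6,371 km).
    static final double EARTH_RADIUS_MILES = 3958.8;

    /** Great-circle distance in miles between two latitude/longitude points (haversine formula). */
    static double distanceMiles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
    }

    /** A user sees a buzz only if every criterion holds: affiliation, availability, and range. */
    static boolean shouldSee(String requiredAffiliation, double maxMiles,
                             double buzzLat, double buzzLon,
                             String userAffiliation, boolean userBusy,
                             double userLat, double userLon) {
        return requiredAffiliation.equals(userAffiliation)
                && !userBusy
                && distanceMiles(buzzLat, buzzLon, userLat, userLon) <= maxMiles;
    }
}
```

The point of the sketch is that criteria are just independent filters combined with "and", so a buzz can mix any of them without ever being assigned to a category.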

That still sounds like categories.

Sure, it may seem like that example falls in the sports or frisbee category, but the point is not to limit people. What if you wanted to eat toast while skateboarding by a river while it's raining?

I want to create the declassifieds, because no experience ever falls into one category. Something like playing frisbee may seem basic at face value, but it will always be a unique experience!

Buzz at 10:21pm: Looking to bake a cake; I can provide eggs, flour, and an oven. 
               	 Open to all friends of friends within a mile.
               	 This buzz will expire in one hour if not answered.

This is the first post in my new Ideas series, where I will be blogging about different ideas that come to mind.


Everyone has ideas.

Some are better than others. Ordering a pizza at 3am on a Tuesday may seem like a good idea at the time, but you will probably regret it the next morning.

I like ideas because there are very few parameters. Ideas do not have to be a certain size, conform to a certain standard, or be executed in a specific way. The best idea in the world may be something people use every day but do not even think about. Having a cell phone alert you by vibrating instead of playing a loud, obnoxious ringtone was a great idea.

Anyway, having ideas is a good idea. And everyone has them, all the time. That is why I want to dedicate part of my blog to discussing my own ideas! If I think of something during the day, I will blog about it. It may be a horrible idea, like using your oven to heat up bath towels in the winter, but the point is to find out what makes it so bad. (Or good. Sometimes I have good ideas too.)

Thanks for reading!

Hello Jekyll

Hello, internet.

I have decided to rebuild my website and blog using the Jekyll framework. My old site was built on WordPress, which is a bit heavy for what I need to do. Hosting this website on my GitHub page allows me to do the following:

  • Securely update it from anywhere.
  • Avoid using an unnecessary database.
  • Save money by using GitHub's hosting.

Plus, since the site is a git project, I can easily grab the latest version from whatever workstation I am using, and work offline. Over the next few weeks I plan on giving the site a facelift and adding more content, including links to my current web projects.

Stay tuned!