Written by Susan Shu Chang
on October 02, 2019

How to choose a data science project - and maximize your motivation to complete it

I’ve often been asked the question: How do I pick a good data science side project? Here is my approach to coming up with an idea shortlist, planning out concrete steps, and most importantly, executing the project to completion.

Coming up with a shortlist of ideas
- Your “Why”
- Your skillset wishlist
Execution plan
Parting words
- Bonus: My talk for how to choose a data science (capstone) project

Coming up with a shortlist of ideas

Your “Why”

First of all, let’s examine why you are interested in doing a data science side project in the first place.

Be honest with yourself! Understanding your motivation can make the biggest difference when it comes to executing your projects. When you’re not honest with yourself, ideas can easily be abandoned because you can’t sustain the motivation.

Reasons could be:

Forcing myself to learn [x] programming language or technology
I want to improve my portfolio (for school or job applications)
Everyone’s doing it… why shouldn’t I?

I want you to genuinely think about your reasons. When I look back at the projects I succeeded in completing versus those I didn’t, the why and the meaning of the project to me made the biggest difference. It was a better predictor of project completion than programming difficulty, unfamiliarity with the tech stack, and so on.

Now that I hope you’ve had a conversation with yourself, note down your reason for doing a data science project. There is no right or wrong. It could be something like “I just want to look cool to my classmates”. You don’t have to tell me; you don’t have to tell anyone, just be honest with yourself.

Your skillset wishlist

The second thing I’d like you to write down is: What skillset(s) do you want to learn and demonstrate with this side project?

You can start with writing down your wishlist of skills - feel free to write down multiple, as long as you can clearly articulate what you want to learn, and what the outcome is. Since this is a wishlist - I encourage you to even write down things you think are wildly out of your reach at the moment! (I wishlist things that I can’t afford on Amazon, all the time…)

An example:

[x] language, [y] package: for example, “I want to learn Flask (Python web framework)”
A concrete vision of the outcome: “I want to demonstrate I can make a web app that can take in user input and output something back to the user” (we can refine the details later, for now focus on a brain dump)

Now, you should have a wishlist of skills you want to learn and some ideas of what the outcome would look like. Here’s an example of what my wishlist might have looked like, when I was starting out in data science:

I want to learn how to do all the things I was doing in Stata (a statistical software that was required in undergrad), but in Python
- Outcome: replicate an assignment (linear and logistic regression) that I had done in Stata, but using only Python

You might note that this example doesn’t name any specific Python packages (e.g. NumPy, pandas, etc.). That’s because at the time, I was just starting out, and didn’t quite know the specifics of what goes into Python for data science. That is precisely the purpose of that side project - to learn the details and be able to do it! This could be your case as well - perhaps you want to build a web app, but don’t know what frameworks or technologies to use…
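A first pass at an outcome like “replicate a linear regression” doesn’t have to wait until you know the package names. Here is a minimal sketch, assuming a single predictor and made-up data: the function name `fit_line`, the variable names, and the numbers are all illustrative assumptions, not the original assignment. It uses plain Python only, so you can see what a regression routine computes before reaching for a library.

```python
# Minimal sketch: one-variable ordinary least squares in plain Python.
# The data is invented for illustration; it is not from any real assignment.

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


hours_studied = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical predictor
exam_score = [52.0, 54.0, 56.0, 58.0, 60.0]  # hypothetical outcome, exactly 50 + 2 * hours

intercept, slope = fit_line(hours_studied, exam_score)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")  # intercept=50.00, slope=2.00
```

Once something this small works, swapping in a library version and comparing the coefficients against your old results is a natural next step.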

Which brings us to the next major section, creating your execution plan.

Execution plan

Refining your wishlist

Now, here comes one of the most difficult parts. In my case, there were so many things I wanted to learn, right now!

I was trying to come up with a way to mash all the skills I wanted to learn into one project. “How do I do something that involves TensorFlow, Flask, and [x], [y], [z], all at once?”

After some trial and error, I found that this is a good thing! But I suggest you cap it once it reaches the point where it would cause severe inconvenience for the project. For example, I’d wanted to toss learning Go (commonly referred to as Golang) into a project idea. After some more research, it didn’t seem to “play nice” with the other skills I had in mind to learn. While creatively getting multiple parts to work together could be a fun and challenging side project in itself, it wasn’t on my skillset wishlist.

If you ever encounter scope creep (e.g. “I want to cram just one more thing to learn into this project!”), I suggest cutting scope based on a combination of the highest priorities on your skillset wishlist, and research on what is achievable in the time frame you can afford for that side project. Start small - this side project isn’t going to be your life’s work or anything. (That’s why it’s called a side project!) Its purpose is to help you go somewhere higher and further.

In this step it is common to encounter analysis paralysis - I suggest using this method to save 20+ hours of research to decide skills to learn in data science.

Designing the project

Remember the question “why am I doing this data science side project” from the first section?

Here’s where I really want to suggest a one size fits all solution, which you don’t have to accept, but hear me out here…

Whether your real reason is to look cool, show off, “so my GitHub isn’t empty”, to learn [x], or whatever - I’m suggesting you design the side project with one criterion.

Do something that is meaningful to you.

Yes, I know it’s cheesy, but this is what I find makes the biggest difference between me completing a project or not.

When I was developing a video game, I could have thought about “what is most popular; what sells?” and tried to make a multi-player battle royale first person shooter. (Mashups of genres popular with the general public.)

Well, if I had only considered external factors such as buzzwords (or “employability”), I would have been doing a side project so generic and boring to me that I would probably have been miserable and started finding excuses to stop. And if you don’t carry a project to completion, you won’t be able to demonstrate your [skillset wishlist], which you wanted to learn because of [your reason for doing a data science side project]!

So, the way I focus on designing my projects is: What will let me have fun? What is meaningful to me?

It’s rare that you’re going to have a project all to yourself, with full control; cherish it and do something that’s distinct!

Feel free to do something silly! Do something that only you and your friends would understand!

I know that you might have questions now:

What if it’s something I want to impress a [recruiter], [other external party] with? Maybe they won’t care about [topic x] that I care about - I shouldn’t make my side project about [topic x]!
What if it’s safer to showcase my code from doing assignment [n] from week [m] from Andrew Ng’s Deep Learning course, to show that I’ve really done that course on Coursera?

Well, I’m not saying that you can’t do the above, but I feel that the more unique the project is, the more a [recruiter], [other external party] will remember it. As someone who’s seen quite a few resumes and interviewed quite a lot of candidates, I find that side projects that look like coursework or assignments are difficult to remember, and thus don’t make the candidate stand out.

Meanwhile, those that truly look like passion projects tend to stand out to me, and as an interviewer, I love to ask the candidate about them!

Another question I encourage you to ask yourself, as a thought exercise, is: Would I talk about this at a conference?

When you think about it that way - no one talks about their code from assignment [n] from [Coursera course], [x course from 4th year that 100s of other people also did] at a conference (in general). When I think of conference talks, I usually think about a unique product the speaker created, or unique problem they solved, that means something to them.

Now, spend some time designing your side project through that lens, focusing on the outcome:

What will the side project look like when it’s “complete”? (That is, define an end point for the project.)

Keep in mind that all of this is variable: things may change once you get started with the project, and that’s okay. The key is to get started, and not get stuck in planning limbo.

Once you have some details written down, you’ll be ready to move on to the next step.

Taking your first step

Alright! Here we are, at the last step of planning, which will help you take your first step on your side project!

To recap, you’ll have the following components ready from the previous sections:

Your skillset wishlist
- Example: TensorFlow, Keras, Flask…
A design of your expected outcome of the side project
- Example: A web app that will take in images that the user uploads, feed the image through an image recognition neural network (which you trained), and output the result (i.e. which character from the video game, League of Legends, is in the image?) to the user.

At this phase, I often encountered some paralysis - maybe I’d get frustrated or stuck because the tech stack was unfamiliar to me. That’s fine, and I want to help you get past it. Your side project will often be your first time implementing a certain skillset, which is totally the point, so it’s natural to encounter some difficulties that push you outside your comfort zone.

Here, what works best for me is breaking up the problem. In the above example, it turns out that it’s like two projects in one:

Training a neural network to classify characters from the video game, League of Legends
Having a web app (with a front end) that can take the user-uploaded image, pass it through the neural network, and get the result
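One way to keep these two parts independent is to have the web side depend only on “a function that turns image bytes into a character name”. The sketch below is a minimal illustration under that assumption: `Classifier`, `stub_classifier`, and `handle_upload` are hypothetical names, and no real web framework or trained model is involved.

```python
from typing import Callable

# Hypothetical type: a classifier takes raw image bytes and returns a character name.
Classifier = Callable[[bytes], str]


def stub_classifier(image_bytes: bytes) -> str:
    """Placeholder for the trained network; always gives the same answer."""
    return "unknown character"


def handle_upload(image_bytes: bytes, classify: Classifier) -> dict:
    """The web-app side in miniature: check the upload, call the model, build a response."""
    if not image_bytes:
        return {"ok": False, "error": "empty upload"}
    return {"ok": True, "character": classify(image_bytes)}


print(handle_upload(b"fake image bytes", stub_classifier))
print(handle_upload(b"", stub_classifier))
```

With a stub in place, the web app can be built and tested before the network exists, and the real model can later be passed in as `classify` without changing `handle_upload`.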

At this point, I usually start with the portion I find slightly easier, in this case the first one (training the neural network), and break it down even further into tasks.

Training a neural network to classify characters from the video game, League of Legends

Collecting training images
Selecting a neural network architecture
Setting up environment to train the neural network

Then, recursively breaking down the easiest, or sequentially earliest step into yet smaller steps:

Collecting training images

Using Beautiful Soup or Scrapy to get training images (from where? Find out)
Cropping or resizing the images into a form that’s suitable as input for the neural network

Using Beautiful Soup or Scrapy to get training images

Find a website to scrape from (Google image search?)
Write code! But I’ve never used Scrapy before!

Now, these two steps don’t look as daunting as the larger step we had, “Training a neural network to classify characters from the video game, League of Legends”. In fact, these two steps are fairly easy. The first step doesn’t require coding, just browsing some image hosting sites for something that works for the project. The second step can start with some googling, and at this point it’s easy to take your first actual steps into this side project - even if it’s with a Stack Overflow question about “how to use Scrapy?”
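If “write code” still feels too big, a first experiment can be as small as pulling image URLs out of a page you already have. The sketch below deliberately uses Python’s standard-library `html.parser` rather than Beautiful Soup or Scrapy, so it runs without installing anything; the class name `ImageSrcCollector`, the sample HTML, and the example.com URLs are made up for illustration.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


# Made-up page content; a real project would download this from the chosen site.
sample_html = """
<html><body>
  <img src="https://example.com/champion_a.png" alt="A">
  <p>Some text</p>
  <img src="https://example.com/champion_b.png" alt="B">
</body></html>
"""

collector = ImageSrcCollector()
collector.feed(sample_html)
print(collector.sources)
```

The same idea carries over once you pick a scraping library: find the image tags, collect their URLs, then download the files.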

Once you complete a small step, that’s great! You just got started with your side project!

Next up, you work your way up the branches, till the bigger steps are done as well. Then, you’ll slowly see the outcome you envisioned coming to life. And not only that - you’ll be learning and demonstrating your [skillset wishlist] as well as fulfilling [your reason for doing a data science side project].
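The breakdown above is really a tree: big steps contain smaller steps, and you always work on the first unfinished leaf. As a rough sketch of that idea (the `Task` class and its method names are hypothetical, and the task names are paraphrased from the example above), it could look like this:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    name: str
    done: bool = False
    subtasks: List["Task"] = field(default_factory=list)

    def is_complete(self) -> bool:
        """A leaf is complete when marked done; a parent when all its subtasks are."""
        if not self.subtasks:
            return self.done
        return all(sub.is_complete() for sub in self.subtasks)

    def next_step(self) -> Optional["Task"]:
        """Return the first unfinished leaf, i.e. the smallest step to do next."""
        if not self.subtasks:
            return None if self.done else self
        for sub in self.subtasks:
            step = sub.next_step()
            if step is not None:
                return step
        return None


project = Task("Train a neural network to classify League of Legends characters", subtasks=[
    Task("Collect training images", subtasks=[
        Task("Get training images with Beautiful Soup or Scrapy", subtasks=[
            Task("Find a website to scrape from"),
            Task("Write the scraping code"),
        ]),
        Task("Crop or resize the images"),
    ]),
    Task("Select a neural network architecture"),
    Task("Set up an environment to train the network"),
])

print(project.next_step().name)   # Find a website to scrape from
project.next_step().done = True   # finish that small step
print(project.next_step().name)   # Write the scraping code
print(project.is_complete())      # False
```

Marking leaves as done is what “working your way up the branches” means here: a bigger step counts as complete only once everything beneath it is.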

Parting words

During my side projects, there were many times when I got stuck or ran out of motivation. I’ve learned so much from overcoming those challenges, and this article is written with all that experience in mind - to help you create your action plan and get started!

If you have any questions or feedback, feel free to reach out. If this guide has helped you, I’d love to hear about it.

Now go get started!