
How do I become a data scientist?

Strictly speaking, there is no such thing as “data science” (see What is data science?). See also: Vardi, Science has only two legs: http://portal.acm.org/ft_gateway…

Here are some resources I’ve collected about working with data; I hope you find them useful (note: I’m an undergrad student, so this is not an expert opinion in any way).

1) Learn about matrix factorizations

  • Take a Computational Linear Algebra course (sometimes called Applied Linear Algebra, Matrix Computations, Numerical Analysis or Matrix Analysis; it can be either a CS or an Applied Math course). Matrix decomposition algorithms are fundamental to many data mining applications and are usually underrepresented in a standard “machine learning” curriculum. With terabytes of data, traditional tools such as Matlab are no longer suitable for the job: you cannot just run eig() on Big Data. Distributed matrix computation packages such as those included in Apache Mahout [1] are trying to fill this void, but you need to understand how the numerical algorithms and LAPACK/BLAS routines [2][3][4][5] work in order to use them properly, adjust them for special cases, build your own, and scale them up to terabytes of data on a cluster of commodity machines.[6] Numerics courses are usually built upon undergraduate algebra and calculus, so you should be fine with the prerequisites. I’d recommend these resources for self-study and reference material:
  • What are some good resources for learning about numerical analysis?
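To make the "you cannot just run eig()" point a bit more concrete, here is a minimal sketch of power iteration, one of the simplest eigenvalue methods. It only ever touches the matrix through matrix-vector products, which is the kind of operation that can be split across machines. The sparse row-dictionary format and the function names are my own assumptions for illustration; this is not how Mahout or LAPACK implement it, and it only finds the dominant eigenpair of a symmetric matrix.

```python
import math
import random


def matvec(rows, x):
    """Multiply a sparse matrix by a dense vector.

    `rows` is a list of {column_index: value} dicts, one per matrix row
    (a hypothetical storage format chosen for this sketch).
    """
    return [sum(v * x[j] for j, v in row.items()) for row in rows]


def power_iteration(rows, num_iters=200, tol=1e-10, seed=0):
    """Estimate the dominant eigenvalue/eigenvector of a symmetric matrix."""
    rng = random.Random(seed)
    n = len(rows)
    # Random positive start vector, normalized to unit length.
    x = [rng.random() + 0.1 for _ in range(n)]
    norm = math.sqrt(sum(xi * xi for xi in x))
    x = [xi / norm for xi in x]
    eigenvalue = 0.0
    for _ in range(num_iters):
        y = matvec(rows, x)
        norm = math.sqrt(sum(yi * yi for yi in y))
        if norm == 0.0:
            break  # x lies in the null space; nothing more to learn
        y = [yi / norm for yi in y]
        # Rayleigh quotient y^T A y gives the current eigenvalue estimate.
        new_eigenvalue = sum(yi * zi for yi, zi in zip(y, matvec(rows, y)))
        converged = abs(new_eigenvalue - eigenvalue) < tol
        x, eigenvalue = y, new_eigenvalue
        if converged:
            break
    return eigenvalue, x


if __name__ == "__main__":
    # The symmetric matrix [[2, 1], [1, 2]] stored row by row.
    A = [{0: 2.0, 1: 1.0}, {0: 1.0, 1: 2.0}]
    value, vector = power_iteration(A)
    print(value, vector)
```

On a cluster, each machine would hold a block of rows and compute its slice of the product; only the vector has to be shared between iterations. Convergence can be slow when the two largest eigenvalues are close in magnitude, which is one reason the numerics course is worth taking.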

2) Learn about distributed computing

  • If you want to work with big data, it is important to learn how to work with a Linux cluster and how to design scalable distributed algorithms (see Why the current obsession with “big” data?).
  • Crays and Connection Machines of the past can now be replaced with farms of cheap cloud instances; computing costs dropped to less than $1.80/GFlop in 2011 vs $15M in 1984: http://en.wikipedia.org/wiki/FLOPS .
  • If you want to squeeze the most out of your (rented) hardware, it is also becoming increasingly important to be able to utilize the full power of multicore processors (see http://en.wikipedia.org/wiki/Moo… ).
  • Note: this topic is not part of a standard Machine Learning track but you can probably find courses such as Distributed Systems or Parallel Programming in your CS/EE catalog.
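As a small illustration of what "scalable distributed algorithm" can mean, here is a sketch that computes a mean and variance from data split into chunks. Each chunk is reduced to a tiny summary independently (the "map" step), and the summaries are then merged pairwise (the "reduce" step) using the update formula commonly attributed to Chan, Golub and LeVeque. The function names are my own, and the chunks are plain Python lists standing in for data that would live on different machines.

```python
from functools import reduce


def summarize(chunk):
    """Map step: reduce one chunk to (count, mean, sum of squared deviations)."""
    n = len(chunk)
    if n == 0:
        return (0, 0.0, 0.0)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return (n, mean, m2)


def merge(a, b):
    """Reduce step: combine two partial summaries into one."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)


def distributed_mean_variance(chunks):
    """Return (mean, population variance) of all values across the chunks."""
    n, mean, m2 = reduce(merge, map(summarize, chunks), (0, 0.0, 0.0))
    variance = m2 / n if n else float("nan")
    return mean, variance


if __name__ == "__main__":
    # Three "machines", each holding part of the data 1..6.
    chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
    print(distributed_mean_variance(chunks))
```

The useful property is that every summary has a fixed size no matter how large its chunk is, and the merge can happen in any grouping, so the work parallelizes without moving the raw data around.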

3) Learn about statistical analysis 

  • See what interests you more and do your market research. Would you prefer working with vendor tools and doing mostly modeling and reporting, or building data mining systems yourself and writing a lot of code? Do you see yourself as a corporate employee, a researcher in academia or a startup founder in the future? What data interests you? Structure your studies based on that.

4) Learn about optimization

5) Learn about machine learning

6) Learn about information retrieval

7) Learn about signal detection and estimation

8) Master algorithms and data structures

9) Practice

If you do decide to go for a Master’s degree:

10) Study Engineering

I’d go for CS with a focus on either IR or Machine Learning, or a combination of both, and take some systems courses along the way. As a “data scientist” you will have to write a ton of code and probably develop distributed algorithms and systems to process massive amounts of data. An MS in Statistics will teach you how to do modeling, regression analysis and so on, but not how to build systems. I think the latter is more urgently needed these days, as the old tools become obsolete with the avalanche of data. There is a shortage of engineers who can build a data mining system from the ground up. You can pick up statistics from books and experiments with R (see item 3 above) or take some statistics classes as part of your CS studies.
