Blue Collar Data Science

Written on October 26, 2017

[ misc wwe ]

Automation: it’s what I do best.

From web scraping and API calls to statistical analyses, report generation, and automated emails, a large part of being a data scientist is automation. Of course, the role varies from company to company, but I’ve found that many other data scientists describe similar experiences (e.g., “80% of my time is spent cleaning data!”).

But what about the fancy stuff?

Yes, there are also the fancy-sounding things that data scientists get to work on and apply, like machine learning models, deep neural networks, recommendation systems, and so on. These things get all the press in magazine articles and newspapers (if and where those things exist). But, from a project perspective, that stuff is like the last mile of a marathon!

At the beginning of this year, I was enrolled in Udacity’s Deep Learning Nanodegree. Each morning, I’d learn all this cool stuff and come away with a ton of ideas to play with for the day. But going from the classroom to the workplace, as always, was much more involved than that.

The dirty stuff

It often seems to me that the uglier parts of a data-centric job are glossed over by the countless online courses and schools dedicated to data science and/or machine learning. People pay some lip service to data collection, cleansing, and preprocessing, but little else.

That’s one of the reasons I write these posts: in the real world of data, your employer will expect SO MUCH MORE from you.

For example,

I required access to a GPU, which meant
- pitching the idea to my direct manager and selling it,
- and possibly negotiating further with IT, the data engineering team, etc.
I was provided with a Linux g2.2xlarge EC2 instance on AWS, but a corporate asset requires much more security than an EC2 instance spun up for a course
- this required dealing with SSH keys and .pem files just to log in
- I had to learn about and set up SSH tunnels to use Jupyter notebooks
Also, I had to set up and maintain my Python environment, which involved some intricacies
- e.g., getting TensorFlow installed properly
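
To make the SSH tunnel step a bit more concrete, here is a minimal sketch of the kind of command involved. It is not my exact setup: the function name `jupyter_tunnel_command`, the host name, the key path, and the `ubuntu` login user are all hypothetical placeholders, and port 8888 is assumed only because it is Jupyter’s default. The sketch just builds the `ssh` arguments that forward a local port to the notebook server on the remote instance.

```python
import os
import shlex


def jupyter_tunnel_command(host, key_path, user="ubuntu",
                           local_port=8888, remote_port=8888):
    """Build the ssh argument list for a local port forward.

    All defaults here are assumptions for illustration: "ubuntu" is a
    guess at the login user, and 8888 is Jupyter's default port.
    """
    return [
        "ssh",
        "-i", os.path.expanduser(key_path),  # private key (.pem) for the instance
        "-N",                                # no remote command, only forward ports
        "-L", f"{local_port}:localhost:{remote_port}",  # local -> remote forward
        f"{user}@{host}",
    ]


if __name__ == "__main__":
    # Hypothetical host and key path; substitute your own.
    cmd = jupyter_tunnel_command("my-gpu-box.example.com", "~/.ssh/my-key.pem")
    print(shlex.join(cmd))
```

With the tunnel open and Jupyter running on the instance, the notebook is reachable in a local browser at localhost on the chosen local port. Returning a list rather than a single string means the command can be handed straight to `subprocess.run` without shell quoting issues.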

Kevin Urban
