Data Science Outside The White Collar

Written on April 16, 2019

There will be some overlap in the subsections. That’s ok. I’m just trying in general to come up with some lessons learned over the past ~13 years or so.

The Drone Risk

This bit here is applicable to job applications, interviews, and 
presentations in general.

Let’s say you apply for a job that is interested in the “popular craze”, and say that you have been asked to come present on something you’ve done that can showcase your skill set.

The thing with a “popular craze” is that it’s like a GPA or GRE scores when applying for graduate school: more of a checkbox item than a differentiator. Yes, a graduate school will expect a GPA to be above some threshold, but other than that – well, you have to actually be interesting to get the spot. That might be something that comes out in your essay or cover letter, or a personality quirk that manifests during the interview. One thing is for sure: a lot of other candidates with a high GPA will not get selected.

You bet your ass that in the data science field of the 20-teens, the “popular craze” thing is neural networks. Like in the example above, your interviewer will certainly expect your familiarity with this to be above a threshold: you should be aware of it, and even particularly skilled at it, but these things won’t necessarily differentiate you, and your interviewer also must optimize for throwing out any potential false positives! There are going to be a bunch of candidates who can do the “neural network” thing well enough that it will be hard to differentiate them on an application, or even in a short presentation. If you come in and have 10 minutes to talk, and if you choose to talk about the broad strokes of a neural network project, you are going to sound like a billion other candidates who came in and did the same.

You’re better off presenting on a linear regression you’ve done! At the very least, this will wake up the interviewers and their pals in the room.

Interviewer: (thinking): "Who is this ridiculously crazy,
rebel bad ass coming into my room and showing me a goddam 
linear regression!"

Others in room: (smirking, whispering)

Interviewer:  (aloud):  "Interesting.  Wonder why you chose 
to do it that way?"

YOU:  (aloud):  <...excellent explanation about how this worked 
nearly as well as other methods, is highly interpretable, was easy 
to explain to the marketing team and help deploy with the engineering 
team, and helped increase revenue streams, etc, etc... (all the stuff your 
interviewers truly care about on a day-to-day level)...>

Interviewer's Crony:  (aloud):  "Surely, a hidden layer or two would 
have made this better."

You:  <...take opportunity to show that you are fully aware of neural 
networks, hidden layers, activation functions, 
and whatever...>

Interviewer:  (aloud):  "All good points."

Others in room:  (nodding in agreement)

Avoid the Sales Pitch, Learn to Listen

This bit mostly focuses on working with other people, after you've 
already landed a job, but have not yet solidified the scope, strategy, 
and objectives of a particular project.

Do you know when pitching a “deep neural net” works really well?

One example is when you’re talking to a bunch of data science enthusiasts that are currently seeking out their first job. Another great place to pitch this might be on Twitter (#saveTheWorldWithAI #cureCancerWithDeepLearning).

Prospects rapidly decline from there.

In the business world, you might gain some traction when casually discussing new, exciting things with the department head (say 3-4 notches above you on the totem pole) who is sincerely interested in all the buzzwords. Other members of your data science team might also share a lament session about how awesome it would be to do X, Y, and Z with Backwards ConvNetical RNN-ified ResNets.

But when push comes to shove, you need to work with other people from other departments. And, unless you are working on a problem classifying a bunch of images or text where DL truly excels, most people outside of the hype and buzz won’t be nearly as excited about black boxes and best predictive models as you might think. Seriously. Sometimes people really want the better understanding that comes along with a proper statistical analysis – and will take it from there, thank you!

In my experience, if you are on a data science team that interfaces with other departments, like marketing, advertising, or legal, then take heed: they do not particularly care whether you can make a cat look like a dog, or a monkey-parrot hybrid give a Shakespearean soliloquy generated by multiple GANs. These definitely serve as great ice breakers during a first meeting… But, if you keep pushing it… Unless you have a really solid idea of how it translates to the business/domain issue at hand, you’re not going to win hearts or get the applause or adulation you might be looking for.

Sometimes you get pushback because no one believes the idea is feasible by the end of Q2. Other times, your partner from the other department has no idea how they’d pitch the idea to their ultra conservative manager. Or they just don’t understand your vision because you speak too technically! And other times yet, you will be working with someone who has great Excel chops and has thrown together a bunch of sophisticated regressions and survival analyses over the years – and though they are busy doing other things, they think they could do the job better than you…if given the chance, when they get the time.

The reality is that the folks you are partnered with in other departments might think much less of you and data science than you expect. They might think they know better (and they might) and might not want or need your help, but there is pressure on them from on high to work with the data science team.

Point is, a lot of “doing data science” is working with other people. There will be debate. There will be compromise. There will be pushback and quarrels over who owns what. You might have to fight for what you believe in, you might have to accept that your vision was suboptimal, and/or you might have to just do whatever dirty work needs to get done – and to quit your bitchin’!

Automate this. Analyze that. Model X. Build a pipeline for Y.

Expect to be questioned at every move before you build trust and common language… Expect people to disbelieve the hype… Expect yourself to be reasonable, humble, to listen, to understand where they are coming from… Do not foolishly push your own agenda without good reason!

Blue Collar Data Science

This bit of advice assumes you've developed some working 
relationships with various departments and stakeholders across 
your company, and that you've defined a project that you will be 
working on (or at the least, got access to some data).  

I wrote another blog post a while back called “blue collar data science.” I like the phrase because I think it speaks to the reality of a data science job, in contrast to what you might be sold in an online course or a hyped up buzz article. It’s also why I called this post “Data Science Outside the White Collar.”

Be aware: you will not have the data required to directly solve the problems these departments are having, and you will often have messy data that appears almost nonsensical to you. You’re going to get your hands dirty. Data science is not for the squeamish.

Everyone talks about the end of “feature engineering”, but at the same time you’ll hear about how 80% of data science work is feature engineering and data clean-up. The end of feature engineering is mostly hype. What you hear from the trenches is what matters!

So be the best damn data cleaning, feature engineering maniac you can be!

Get comfortable starting on unfamiliar data sets, the reshaping process (e.g., transforming categorical variables), the imputation-or-not process, and so on. Get comfy asking questions and coming up with strategies with the domain experts. Hell, get them to think everything was all their idea. Leave ego at home.
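As a tiny illustration of the reshaping and imputation-or-not decisions, here is a minimal Python sketch. The column names (`plan`, `hours`) and the rows are made up for the example, and keeping an explicit "missing" indicator instead of imputing is just one possible choice, not a recommendation for every data set.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per observed level.

    Missing values (None) are not imputed here; they get their own
    indicator column so the "imputation-or-not" decision stays visible.
    """
    levels = sorted({r[column] for r in rows if r[column] is not None})
    encoded = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != column}
        for level in levels:
            new[f"{column}={level}"] = int(r[column] == level)
        new[f"{column}_missing"] = int(r[column] is None)
        encoded.append(new)
    return encoded


# Hypothetical rows, purely for illustration.
rows = [
    {"plan": "basic", "hours": 2.0},
    {"plan": "premium", "hours": 0.0},
    {"plan": None, "hours": 1.5},
]
encoded = one_hot(rows, "plan")
for row in encoded:
    print(row)
```

Whether a missing value should become its own category, be imputed, or cause the row to be dropped is exactly the kind of question to take to the domain experts.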

You might also have to find or buy the data. Scrape it from the web. Learn some JavaScript. Fix a database. Manage data in the cloud. Learn Excel better because those guys from Finance keep asking if you could send your final analyses in Excel!

In real world (“blue collar”) data science, you are a renaissance (wo)man! Yes, you may predict like a prophet, model like a mighty warrior, and even teach machines to learn like a god – but you’ll also have to grep a file, do a regex search, and maybe even count with your fingers :-p

Does it mean what you think it means?

This is where language, understanding, and mathematical 
representations converge.

Always ask yourself if a metric means what you think it means… Forget models for a second. If you are given a data set, the first thing you will do is try to understand it and explain it to stakeholders… You will inevitably hear someone say something like, “On average, people watch 2 hours of TV a day.” What does that mean? I can guarantee there are people in the room who think it means that people more-or-less watch 2 hours of TV a day. But if you look at the data distribution, it’s not normal – log-normal at best! Maybe power-law distributed.

Median is a little bit easier to understand: “On median, people watch 0 hours of TV a day.” This means that the bottom 50% of customers watch 0 hours a day, and the top 50% watch 0 or more hours a day. This will run into some problems though: (1) the boss won’t want to hear it, and (2) you need to be careful about integration time. Both are kind of related: the boss knows TV is being watched because the average is 2 hours per day, and so the median just sounds wrong to them – and more importantly, doesn’t help tell the story that needs to be told to their manager! As for integration time: the daily median might be 0, but the weekly median might be 1 hour, and the monthly might be 8 hours. The daily median makes it sound and feel like people don’t watch TV, but it might be better interpreted that people do not watch every day, but they do typically watch at some point every week, and in general have heavier watching at certain points during the month!
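To make the integration-time point concrete, here is a minimal Python sketch on simulated data. Everything in it is an assumption for illustration: the panel size, the per-person chance of watching on a given day, and the log-normal hours on a watching day are invented, not real viewing data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

N_PEOPLE, N_DAYS = 1000, 28  # hypothetical panel: 1000 viewers, four weeks


def simulate_person():
    """Hours watched per day: most days are zero, some days are long sessions."""
    p_watch = random.choice([0.1, 0.3, 0.5])  # assumed share of days with any viewing
    return [
        min(24.0, random.lognormvariate(1.5, 0.8)) if random.random() < p_watch else 0.0
        for _ in range(N_DAYS)
    ]


panel = [simulate_person() for _ in range(N_PEOPLE)]

# Same data, three integration windows.
daily = [hours for person in panel for hours in person]
weekly = [sum(person[i:i + 7]) for person in panel for i in range(0, N_DAYS, 7)]
monthly = [sum(person) for person in panel]

daily_mean = statistics.mean(daily)
daily_median = statistics.median(daily)
weekly_median = statistics.median(weekly)
monthly_median = statistics.median(monthly)

print(f"daily mean:     {daily_mean:.2f} h")
print(f"daily median:   {daily_median:.2f} h")
print(f"weekly median:  {weekly_median:.2f} h")
print(f"monthly median: {monthly_median:.2f} h")
```

With these made-up parameters, most person-days are zero, so the daily median is zero even though the daily mean is well above zero, while the weekly and monthly medians are positive. The viewers are the same; only the window changed.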

Point is, don’t fool yourself, don’t fool your boss, don’t believe the too-good-to-be-true metrics, but also don’t believe the too-negative-to-be-true metrics either… Metrics help tell a story: your job is to look at enough metrics to ensure you are telling the right story. Or, importantly, you need to help tell a persuasive story that helps your manager and, ultimately, your company. So, seriously, if you have to opt between the “0 hours on median per day” story and the “2 hours on average per day” story, choose the latter – for starters. Then move into the daily, weekly, and monthly medians when people are ready to listen – and discuss how this is an opportunity: the company can target resources during optimal times, at the aggregated and individualized levels!

Avoid Stupid Mistakes

If your model is awesome on the first go, then you’ve done something wrong – or at least, you should convince yourself of the likelihood that you’ve done something wrong, then investigate!

The newbiest of newb mistakes is leaving in IDs – customer ID in marketing, patient ID in healthcare, and so on. If you use a deep learning network with enough parameters, it will simply learn to become an ID/outcome dictionary.
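Here is a minimal Python sketch of that ID/outcome dictionary, under loud assumptions: the data is synthetic, the outcome is a coin flip that has nothing to do with the ID, and a plain `dict` stands in for an over-parameterized network (no actual network is trained here).

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical data: (customer_id, outcome) pairs. The outcome is a coin flip,
# so the ID carries no real signal about it.
rows = [(customer_id, random.randint(0, 1)) for customer_id in range(2000)]
random.shuffle(rows)
train, test = rows[:1500], rows[1500:]

# The "model": a lookup table from ID to outcome, which is what a big enough
# network can end up approximating when the ID is left in as a feature.
lookup = dict(train)
majority = round(sum(outcome for _, outcome in train) / len(train))


def predict(customer_id):
    # IDs never seen in training fall back to the majority class.
    return lookup.get(customer_id, majority)


def accuracy(data):
    return sum(predict(cid) == outcome for cid, outcome in data) / len(data)


train_acc = accuracy(train)
test_acc = accuracy(test)
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

Training accuracy is perfect because every training ID is simply looked up, while accuracy on held-out IDs hovers around chance. An awesome score that collapses on data the model has not seen is the signature to go looking for.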

Second up is something like area code or zip code. Unless you have multiple units/customers/patients per area/zip code (say 30+), you are likely going to run into the dictionary principle again.
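A quick sanity check along these lines can be sketched in Python. The threshold of 30 comes from the rule of thumb above; the zip codes and counts are made up, and collapsing sparse codes into a single `"OTHER"` bucket is only one possible way to handle them (dropping the column or rolling up to a coarser region are others).

```python
from collections import Counter

MIN_ROWS_PER_LEVEL = 30  # the "say 30+" rule of thumb


def sparse_levels(values, min_rows=MIN_ROWS_PER_LEVEL):
    """Return {level: row_count} for levels with fewer than min_rows rows."""
    counts = Counter(values)
    return {level: n for level, n in counts.items() if n < min_rows}


def collapse_sparse(values, min_rows=MIN_ROWS_PER_LEVEL, other="OTHER"):
    """Replace sparse levels with one catch-all bucket (an assumed strategy)."""
    sparse = sparse_levels(values, min_rows)
    return [other if value in sparse else value for value in values]


# Hypothetical column of zip codes, one entry per customer.
zip_codes = ["10001"] * 45 + ["94105"] * 31 + ["60614"] * 3 + ["73301"] * 1

print(sparse_levels(zip_codes))
print(Counter(collapse_sparse(zip_codes)))
```

Running the check before modeling shows how many levels are effectively acting as IDs.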

Choose Interesting Over New

If you so happen to be on the more research-oriented side of data science, then some other advice might be in order. The basics here are maintaining a sense of self and following your intuition over your ego… There will be hypes and crazes, and you’ll likely feel drawn to them, but my generic advice here would be to avoid playing catch up, skip the funding fight, and respect the law of limiting originality!

(Actually, after reading through them a few times, I would say that this advice is very similar and applicable to the more business-oriented discussions above as well.)

Avoid Playing Catch-Up

One of my PhD advisors would always warn against working on what’s popular – unless you just so happened to be the original, or in on the topic early enough. It’s inevitably a game of catching up, and trying to stand out in a crowd of very homogeneous work. There are countless other related problems that everyone is ignoring because they’re not cool at the moment: work on them, get all the low-hanging fruit, and if/when the topic suddenly gets caught in a hype spotlight, then watch your citation counts skyrocket. Either way: avoid stagnation or a feeding frenzy, and move on to something else interesting and ignored.

Skip the Funding Fight

Another reason to choose interesting over new is pretty straightforward: the currently hot topic is almost certainly overblown, and if not, there are almost certainly several research groups focused on the issue and any related issue, and so the funding is difficult to get… More importantly, because of these hype clusters, there is often a huge assortment of interesting problems ignored and unsolved…

In data science circles these days, it’s often something to do with neural networks.

My advisor’s advice here would be to run.

Respect the Law of Limiting Originality

So the latest craze inspires you, and you think you have some great ideas! Well, be honest with yourself: how original are those ideas likely to be? At least treat yourself to a thorough Google search: how many blog posts or StackExchange comments come dangerously close to your idea? Or on Google Scholar: how many papers look like they basically cover your idea? The point is to be rigorously honest with yourself: yes, you can work on your idea and push the envelope further, but will it be an incremental push in a long succession of similar, incremental pushes? Will your voice be heard in the raging sea of similar voices? Not everyone will, can, or should be working on original or unpopular ideas, but at least be honest with yourself about the expectations: for example, at a conference, if your presentation is yet another “popular craze” piece, expect general interest, but some yawns, eye rolls, and stupid questions from people who are also excitedly trying to keep up with the craze and (a) want to prove they know things too, or (b) want to prove that you do not know things – thus shouldn’t be trying to hang with the cool kids.

Point is, once you stop following the craze, the crowd and mentality will shift: how many times have you been at a conference of nearly everyone doing “popular craze” pieces, except that one presentation you sat in on where the presenter talked about something completely unrelated, but so damn insightful – and rebellious! That person stood out, right?