A Few Thoughts on Enabling the Data Scientist

Just some thoughts on what a DS-enabled environment should look like, at least from the perspective of needs I’ve had and projects I’ve worked on. The needs are staged as successive approximations of my ideal:

  • 1st Order: having access to a higher-powered machine than a laptop
  • 2nd Order: having access to multiple higher-powered machines that fit different needs
  • 3rd Order: having access to multiple machines from the same central location
Read More

Intel at the Edge (Deploying an Edge App)

In this lesson, we focus more on the operational aspects of deploying an edge app: we dive deeper into analyzing video streams, particularly those streaming from a webcam; we dig into the pre-trained landscape, learning to stack models into complex, beautiful pipelines; and we learn about a low-resource telemetry transport protocol, MQTT, which we employ to send image and video statistics to a server.

Read More

Intel at the Edge (The Inference Engine)

In this lesson, we go over the basics of the Inference Engine (IE): what devices are supported, how to feed an Intermediate Representation (IR) to the IE, how to make inference requests, how to handle IE output, and how to integrate this all into an app.

Read More

Intel at the Edge (The Model Optimizer)

This lesson starts off describing what the Model Optimizer is, which feels redundant at this point, but here goes: the Model Optimizer is used to (i) convert deep learning models from various frameworks (TensorFlow, Caffe, MXNet, Kaldi, and ONNX, which can support PyTorch and Apple ML models) into a standard vernacular called the Intermediate Representation (IR), and (ii) optimize various aspects of the model, such as size and computational efficiency, by using lower precision, discarding layers only needed during training (e.g., a Dropout layer), and merging layers that can be computed as a single layer (e.g., a multiplication, convolution, and addition can all be merged). The OpenVINO suite actually performs hardware optimizations as well, but that aspect is owed to the Inference Engine.

Read More

Intel at the Edge (Leveraging Pre-Trained Models)

In this part of the course, we go over various computer vision models, specifically focusing on pre-trained models from the Open Model Zoo available for use with OpenVINO. We think about how a pre-trained model, or a pipeline of them, can aid in designing an app – and how we might deploy that app.

Read More

Intel at the Edge (OpenVINO on a Linux Docker)

My ultimate goal is to get OpenVINO working with my Neural Compute Stick 2 (NCS2). This is a little trickier than getting OpenVINO working on macOS, primarily because you need Windows, Linux, or Raspbian to use the NCS2. This is despite the OpenVINO Installation Guide for macOS ending with a small note on hooking up the Mac with the NCS2. At the time of writing, that note basically just says, “You’ll need to brew install libusb.” Nothing more.

Read More

Pitting Pandas vs Postgres (a Refresher)

Oftentimes when working with a database, it is convenient to simply connect to it via Python’s Pandas interface (read_sql_query) and a connection established through SQLAlchemy’s create_engine – then, for small enough data sets, just query all the relevant data right into a DataFrame and fudge around using Pandas lingo from there (e.g., df.groupby('var1')['var2'].sum()). However, as I’ve covered in the past (e.g., in Running with Redshift and Conditional Aggregation in {dplyr} and Redshift), it’s often not possible to bring very large data sets onto your laptop – you must do as much in-database aggregation and manipulation as possible. It’s good to know how to do both, though since straight-up SQL covers both scenarios, that’s the more important skill to master.
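
To make that concrete, here’s a minimal sketch of the two styles side by side (the connection string, table, and column names are all made up):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:pass@host:5432/db')  # made-up DSN

# small data: pull it all into a DataFrame and aggregate in Pandas
df = pd.read_sql_query('SELECT var1, var2 FROM some_table', engine)
pandas_way = df.groupby('var1')['var2'].sum()

# big data: aggregate in-database and pull back only the (small) result
sql_way = pd.read_sql_query(
    'SELECT var1, SUM(var2) AS var2_sum FROM some_table GROUP BY var1',
    engine,
)
```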

Read More

Intel at the Edge (Getting Started)

The term “edge” is funny in a way because it’s literally defined as “local computing”, or “computing done nearby, not in the cloud.” Ten years ago this was just called “doing computer stuff.”

Read More

Intel at the Edge (Udacity Scholarship)

If you’re just starting out in data science, machine learning, deep learning (DL), etc., then I can’t recommend Udacity enough. Years ago, after I graduated with my PhD in physics, I wanted to get in on AI research in industry, say at Google or Facebook. As a stepping stone, I took a job as a data scientist at WWE developing predictive models for various types of customer behaviors on the WWE Network. Early on at this job, I serendipitously stumbled upon and enrolled in Udacity’s first offering of their DL nanodegree.

Read More

Preserving History While Merging Two Repos

At work we have this predictive modeling project that has gone through multiple iterations, the project starting and stopping over time and different individuals taking it on after others moved on to other things.

Read More

Wearables Weekly (W1)

A lot of my job is working with wearables in one way or another: wearing them, reading about them, using their data. Sometimes, though, I lose sight of what’s going on while I get lost in the particulars of a specific project, or the mundanities of the work week. So, in an attempt to keep current, I present to you (me) – Wearables Weekly.

Read More

Ai4 Healthcare

Some notes and anecdotes about my experience at the Ai4 Healthcare conference in NYC, Nov 11-12.

Read More

Hello, Static Duck! (A Pythonic Type Tale.)

Let me just start out by saying: I’m proud of this title! Oftentimes I get a little lazy with the titles. Other times I think a boring, straightforward title is simply the way to go. But not this time! No, this time I let the morning coffee do the talking. (Thanks, morning coffee!)

Read More

Paper References from "Deep Learning School 2016"

I’ve been watching through the video lectures of the “Deep Learning School 2016” playlist on Lex Fridman’s YouTube account. While doing so, I found it useful to collect and collate all the references in each lecture (or as many as I could distinguish and find).

Read More

Experimenting with Random Forests on UCI ML Data Sets

For certain types of data, random forests work the best. Maybe more accurately, I should say a tree-based modeling approach works best – because sometimes the RF is beaten out by gradient-boosted trees, and so on. That said, I’ve been kind of obsessively vetting random forests as of late, so this is in that vein.

Read More

Time Series Forecasting with a Random Forest (1 of N)

I’ve been playing with random forests, experimenting with hyperparameters, and throwing them at all kinds of datasets to test their limitations… But prior to this month, I’d never used a random forest for any kind of time series analysis or forecasting. In a conversation with my brother about recurrent neural nets, I began to wonder if you can get any traction out of a random forest.

Read More

Keeping Your SSH Session Alive

Strap a wearable on your wrist. Your other wrist. Your finger. Transmit the data through a pipeline ending in an S3 bucket on AWS. Now, SSH into a GPU-powered EC2 instance because we’re about to flex the power of a recurrent + convolutional deep neural network hybrid on this sucka.

Read More

Creating and Modifying CookieCutter Project Templates

A while back, I wrote about CookieCutter Data Science, which is a project templating scheme for homogenizing data science projects. The data science cookiecutter was a great idea, I think, and my team uses it for all our projects at work. The key is in encouraging/enforcing a certain level of standards and structure. It takes a little cognitive load to take on the data science cookiecutter your first time, but ever thereafter it will lighten the cognitive load for you and your team. This is essential when working on multiple projects at once, leaving projects idle for weeks to months at a time, and frequently swapping projects with others on the team.

Read More

Karabiner + Kinesis Update

A little while back, I bought a new, ergonomic keyboard called the Kinesis Freestyle2. In tandem, I installed Karabiner so that I could really customize the keyboard. Ultimately, I found a great setup where my hands could relax on the keys all day – I even had the mouse motion and clicks mapped!

Read More

Memory-Efficient Windowing of Time Series Data in Python: 3. Memory Strides in Pandas

In the first post in this series, we covered memory strides for default NumPy arrays – or, more generally, for C-like, “row major” arrays. In the second post, we showed that a NumPy array can also be F-like (column major). More importantly, for those who like to switch back and forth between Pandas and NumPy, we found that a typical DataFrame is F-like (not always, but often – and in important cases). We also found that if one builds a windowing function based on NumPy’s as_strided assuming a C-like array, but instead uses F-like arrays in production, one shall be screwed.
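
For a quick illustration of the difference (a toy sketch, not the series’ windowing code):

```python
import numpy as np

a = np.arange(6, dtype=np.int64).reshape(2, 3)  # C-like (row major) by default
f = np.asfortranarray(a)                        # F-like (column major) copy

print(a.strides)  # (24, 8): moving to the next row jumps 3 elements * 8 bytes
print(f.strides)  # (8, 16): moving to the next row jumps just 1 element
```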

Read More

More Notes on Missing Data for Statistical Inference

As I’ve mentioned in previous posts, many of the references one will encounter when looking up methods for dealing with missing values will be oriented towards statistical inference and obtaining unbiased estimates of population parameters, such as means, variances, and covariances. The most mentioned of these techniques is multiple imputation. I saw value in digging deeper into this area in general, despite it not being optimized reading for developing predictive models – especially those that might run in real time in an app or at a clinic of some sort. The reason is that, in tandem with developing a great predictive model, I generally like to develop corresponding models that focus on interpretability. This allows me to learn from both inferential and predictive approaches, and to deploy the predictive model while using the interpretable model to help explain and understand the predictions. However, once one becomes interested in interpretability, one becomes interested in inference – or, importantly, unbiased estimates of population parameters, etc. That is, I’m actually very interested in unbiased estimates of means, variances, and covariances – but in parallel to prediction, not in place of it.

Read More

Karabiner for your Keyboard

My thumb joints have been bothering me… And my wrists. Turns out, I’m probably plugging away at the keyboard way too much, teaching machines to learn and whatnot.

Read More

On the Fly Neo4j Exercise

It’s been over a month since I last worked on a Neo4j-oriented project. On another project today, some set data was provided to me in a CSV file. The goal was to come up with a decent visualization of all the sets and their intersections… I thought it might be helpful to build a graph visualization, then realized it would be a perfect excuse to make sure I’m not getting rusty with Neo4j and Cypher.

Read More

Missing Values in a Live Prediction Model (Take 2)

Ok, so it’s been about a week of reading, thinking, toying around… My original objective was to look into various ways to treat missing values in categorical variables with an eye towards deploying the final predictive model. Reading over several ideas that included continuous variables (e.g., Perlich’s missing-indicator and clipping techniques), I’ve re-scoped it a bit to missing values in general, specifically for predictive models.

Read More

deeplearning.ai's Intro to TensorFlow (Week 2)

This week’s content got a little more into actual machine learning models, namely simple multilayer-perceptron-style networks – i.e., going from a linear regression to a network with hidden layers and non-identity activation functions. Instead of using MNIST as a starting point, the course creators buck that trend and dive into Fashion MNIST. The course very briefly mentions that implicit biases may be inherent in a data set, and points out that such biases can unknowingly leak into machine learning models and cause downstream issues. However, most of this content was optional: one either explores the provided reference and follows down the rabbit hole via references therein, or not. The week’s quiz doesn’t even mention it. My 2 cents: take the detour, at least briefly.

Read More

Cookiecutter Data Science

Way back, circa 2012, I was heavily into R and got introduced to ProjectTemplate, which is an R package that allows you to begin data science projects in a similar way – with a similar directory structure. It was very useful, and navigating projects became intuitive. Later, when digging a bit into single-page web apps, I was introduced to this idea of maintaining a level of homogeneity in the file/folder structure across projects. And from my limited forays into Django and Flask, I know a similar philosophy is adopted. But what about various data engineering and data science projects primarily developed in Python? For my Python-oriented projects at WWE, I kind of came up with my own ideas on how each project should be structured…but this changed over time as I worked on many distinct, but related projects. Part of this leaks into the concept of how Git repos should be organized (subtrees? submodules?) for similar projects, but in this post I’ll stick with just deciding on a good project structure.

Read More

Relearning LaTeX

Haven’t used LaTeX much since I finished my PhD dissertation. I did try using it when I first began at WWE to write up mini papers on the projects we were working on… My boss said, “This is nice, but unnecessary.” Haha – and that was that.

Read More

Exploring Google Colab (Part 1 of N)

One: this is called “Part 1 of N” b/c I know I’ll be posting more on this, and don’t want to have to come up w/ crazy, catchy titles for each, and I don’t know how many there will be! Two: on a side note, I originally dated this file as 2018 instead of 2019, which I’ve done for almost every 2019 post so far. It’s 2019, Kev! Get with it! Three: it’s Pi Day. I love math and physics so much, but Pi Day? Meh.

Read More

deeplearning.ai's Intro to TensorFlow (Week 1)

Just started working on a new-to-me TensorFlow-oriented project at work. The project is dusty, having been on the shelf for a year or so. I’m also dusty, having worked on other, non-TF-y things for the past 6 months. For the past few days, I’ve waded through another man’s Python code, editing, googling, and finally getting things to run – and that’s when I signed onto LinkedIn and saw Andrew Ng’s post about this new course.

Read More

Refresher on AWS CLI

Straight to the point: working on a new project and needed to start using AWS CLI again. Figured I’d write down a few useful commands as I go.

Read More

Laughing Like Hyde

I’m in a room filled with impatient, frustrated citizens awaiting their civic fate: to be called for jury duty, or not?! People are chortling maniacally at their own jokes (“I should be honored to be here – chortle, chortle.”)

Read More

Emfit QS Data Streams

Learning about a lot of different wearables and home sensors these days. Notes and links clutter my screen, sprinkled across a host of applications. Scribblings of tables and graphs charting out the landscape lay scattershot atop my desk, alongside printouts and to-do lists…at work…at home…and in my bookbag. Madness and mental mutiny is just around the corner!

Read More

Accuracy is not so Accurate

Here’s a weird thing: I could have sworn I wrote a post about this already, and that I called it the same: “Accuracy is not so Accurate.” But I’ve looked and looked: it’s not on my blog, it’s not in my email, it’s not…well, actually I gave up looking after that. But it probably would not have been in the next spot either! Like many of my writings, it likely exists somewhere – but out there in a nebulous quantum state, or somewhere in the aether.

Read More

Crash Course in Causality (Take 1)

Wright [1921] and Neyman [1923] were early pioneers of causal modeling, though the field did not fully mature until the 1970s with Rubin’s causal model (Rubin [1974]). In the 1980s, Greenland and Robins [1986] introduced causal diagrams, Rosenbaum and Rubin [1983] introduced propensity scores, and Robins [1986] introduced time-dependent confounding (followed up in Robins [1997]). Then there’s the later work on causal diagrams by Pearl [2000] (side note: Pearl is from NJIT!). Post-2000, there are the methods of optimal dynamic treatment strategies (Murphy [2003]; Robins [2004]) and a machine learning approach called targeted learning (van der Laan [2009]).

These methods are primarily for observational studies and natural experiments – things as they are in the world. The ideal is to use randomized controlled experiments… But you just cannot do that for many things (think weather, astronomy, disease, etc.).

This post is based on watching through the first week’s lectures from Coursera’s Crash Course in Causality. Stay tuned for more…

Read More

Query Performance in Neo4j

I was going through the Cypher Manual today, and started playing around with EXPLAIN and PROFILE to learn more about how Neo4j formulates execution plans. The most important piece of advice to extract is: Do not be lazy while writing your queries!

Read More

Data Structures in Neo4j

What’s cool about Neo4j is that you can output the data almost any way you want: you can return nodes (which are basically shallow JSON documents), or anything from flat tables to full-on JSON document structures.

Read More

Cypher Syntax Highlighting for Vim

Not going to lie to you: this post is not any better than googling “cypher syntax highlighting for vim.” In fact, that might be a better thing for you to do! But since you’re already here–

Read More

Printing Pretty Paths in Neo4j

Neo4j is awesome for working with relationship-heavy data – things that might be considered JOIN nightmares in a relational DB. For example, say you have designed a knowledge graph that maps lab tests to disease states, where a pathway might look like:

Read More

Navigating the NoSQL Landscape

At work, we want to have a database detailing various wearable devices and home sensors. In this database, we would want to track things like the device’s name, its creator/manufacturer, what sensors it has, what data streams it provides (raw sensor data? derived data products? both?), what biological quantities it purports to measure (e.g., heart rate, heart rate variability, electrodermal activity), whether or not its claims have been verified/validated/tested, whether or not it’s been used in published articles, and so on. For much of this, a standard relational database would be just fine (though there can be some tree-like or recursive relationships that begin to crop up when mapping raw sensor data to derived data streams).

Read More

Accessing CloudWatch with Boto3

At my last job, my main cloud was a Linux EC2 instance – there, I used Python and cron to automate almost everything (data ETL, error logs, email notifications, etc). The company at large commanded many other cloud features, like Redshift and S3, but I was not directly responsible for setting up and maintaining those things: they were merely drop-off and pick-up locations my scripts traversed in their pursuit of data collection, transformation, and analysis. So, while it was all cloud heavy, it felt fairly grounded and classical (e.g., use a server to execute scripts on a schedule).

Read More

First Foray into Data Modeling with Graphs

In a previous post, we used the mighty pen (remember: fuck pencils) and rock-crushing paper to model data using entity relationship diagrams (ERDs). Today, we will use a pencil… Just kidding: today, we will use the pen-and-paper approach to model our data with graphs. Importantly, we will discuss how to translate between ERD and graph models.

Read More

First Foray into Data Modeling with Entity Relationship Diagrams

There are a lot of options out there for software to design these diagrams in, and I’ll get to using MySQL Workbench in another post (or who knows: maybe at the end of this post). But right now, I want to emphasize the good ol’ fashioned pen-and-paper approach (pen, b/c fuck pencils).

Read More

Getting Started with MySQL Software

At work, we have data assets – both potential and in-hand – and a set of business interests and use cases for these assets. How complex is this data? How much of it do we have, or will we potentially manage in the future? What is the best way to model and store this data? These are the types of questions that I’m now confronted with.

Read More

What's What When Selenium Crashes

Recently, the Selenium component of my Facebook Graph code broke. This is important b/c we need Facebook data during and after live events streamed on Facebook, as well as insights data collected from 400+ Facebook Pages at a regular, daily cadence.

Read More

Metrics on the Instagraph

In the last article on the Instagram graph, I covered some fields and edges on the Instagram Account, Story, and Media nodes. In this post, I talk a little bit about the available metrics on the Instagraph.

Read More

Investigating the Instagram Graph

If you’re working for a media juggernaut, the Facebook Graph API is helpful in collecting engagement/consumption data on Facebook pages, posts, videos, and so on. Of course, its utility is based on how well you can automate the process of obtaining a user token and swapping that user token for a page token (or 100’s of page tokens!). With a page token, obtaining data is simple (strategizing how to best collect, clean up, and store all the data is another story): just issue a GET request, whether from the browser, with a Python package like requests, and so on.

Read More

AWS: Identity and Access Management

Long story short: in Redshift at work, the default when you create a table is that the table is private to you unless you decide to grant permissions to other users. However, for those of us using Hive, this is not the case: instead, we all use the same user account, which has damn-nigh admin privileges, and we store the external table data in an S3 bucket that is accessible for reading/writing by a much larger, extended team. Why such sophisticated user and group permissions in Redshift, but not in S3 or Hive? My goal is to figure out how to develop more granular permissions for S3 and Hive users such that the defaults are more like we have in Redshift (and ultimately make the case to our AWS admins/overlords).

Read More

Linux Killswitch

Recently, I’ve been administering a Linux server for some folks on the Content Analytics and Digital Analytics teams. With this great power comes–you guessed it–great responsibility. And one such responsibility is ensuring that the system’s resources are available and in good use. A few of our team members are developing Selenium scripts, which employ a Chrome webdriver and Xvfb virtual display, both of which seem to have a problem of hanging around long after they’ve stopped being used. This seems to happen when a function or script someone wrote crashes, or if they don’t properly close things (e.g., display.stop(); driver.quit()). I mean, that’s somewhat of a hypothesis on my part – the main point is that these processes build up into the 100’s and just hang around.

Read More

No Market, No Cry

Imagine that the Marketing Team at your company is designing an email campaign to ~1-2 million customers. Their goals are to increase customer retention and decrease subscription cancellation. Your data science team has been asked to help Marketing achieve these goals and, importantly, to quantify that success. How would you do this?

Read More

Updating ChromeDriver on Linux Ubuntu EC2 Instance

About 2 weeks ago I returned to an old Selenium scraping project on my MacBook Pro, only to find errors strewn across the screen like the dead bodies of a recent war my script did not win… The solution turned out to be updating the chromedriver. Last week, the same thing happened to my coworker’s Selenium scripts on Windows. He independently arrived at the same conclusion and solution, which he brought up this morning, wondering aloud if we need to check the chromedriver on our Linux server. I had already been monitoring the situation: for whatever reason, the few daily automated scripts had not crashed, and in fact are still working as I write this. However! (Dramatic suspense: see next paragraph.)

Read More

Hello, Node.js!

There are no DOM or window objects in Node; no webpage you are working with! The essence of Node is “JavaScript for other things.” Get out of the browser and onto the server!

Read More

SQLAlchemy LIKE '%%WTF%%'

This is just a public service announcement. If you are using SQLAlchemy to connect to Redshift and you issue a LIKE statement using the % wildcard, you will confront some difficulty. This is because the % symbol is special to both SQLAlchemy (escape symbol) and Redshift’s LIKE statement (wildcard). In short, your LIKE statements should look more like this:

con.execute("""
  SELECT * FROM table WHERE someVar LIKE '%%wtf%%'
""")
Read More

VirtualBox Randomly Stopped Working

I have no idea why, but yesterday when I tried opening VirtualBox to play around, my MacBook Pro told me that it failed to open. Huh? Whatever, let me reboot and try again… Failed to open! Ok, ok – let me google this. What do I write? “VirtualBox won’t open.” Too general… Oh, I know! Try opening it in Terminal.

Read More

Gittn on the Diff Train

I have literally rebelled against using diff and git diff forever. Thought it all looked intimidating a long time ago, and learned to cope without them. By cope, I mean, I’ve survived diff’n files the brutish, caveman way line-by-line. The reality is, before this past year, it wasn’t even a huge concern: I worked on a few, mostly one-off coding/research efforts at a time. But nowadays? I’m constantly switching between projects at work, and working on new ones that are extremely similar to old ones. This means I have a lot more need for organization, intelligent reuse of code, and the ability to quickly determine changes I made to projects I worked on last week, or last month, or just 1 of the 3 I worked on yesterday.

Read More

How to Import a Python Module from an Arbitrary Path

Time and again in my various data collection and reporting automations, I use similar functions to connect to Redshift or S3, or to make a request from the YouTube Data API or Facebook Graph. I keep these automations in a single repository with subdirectories largely based on platform (e.g., youtube, facebook, instagram, etc). Within each subdirectory, projects are further organized by cadence (e.g., hourly or daily) or some other defining characteristic (e.g., majorEvents). The quickest/dirtiest way to re-use a function I’ve already developed in another project is to copy-and-paste the function into a script in the current project’s directory… But then something changes: maybe it’s an update to the API, or I figure out a more efficient way of doing something. Or maybe I just develop a few extra related functions each time I work on a new project, and eventually start forgetting what I’ve already created and which project it’s housed under.
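
For reference, here is one standard-library way to pull a module in from an arbitrary path – the path and module name are made up, and this isn’t necessarily where the post lands:

```python
import importlib.util

# made-up path: load redshift_utils.py from a sibling project's directory
spec = importlib.util.spec_from_file_location(
    'redshift_utils', '/path/to/other_project/redshift_utils.py'
)
redshift_utils = importlib.util.module_from_spec(spec)
spec.loader.exec_module(redshift_utils)

# now call it like any imported module, e.g., redshift_utils.some_function()
```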

Read More

How to Install a Linux Ubuntu Virtual Box

What if you work on a Windows machine, but actively develop Python scripts that run on Linux Ubuntu? Installing a Linux virtual machine on your computer might sound pretty good! However, this isn’t my problem: I work on a MacBook Pro and rarely have an issue switching scripts between the two OS’s. So my problem is that I don’t have a problem, but still think the idea of running a Linux Ubuntu virtual machine on a non-Linux Ubuntu computer is cool AF.

Read More

Gittn' a Grip on Merging Git Repositories

Let’s say we have two Git repositories: team_automations and side_project. The team_automations repository houses many of our more recent, more maturely designed and versioned automation projects, while the side_project repository contains one of our older, scrappier automations. The more recent projects are all housed together, and respect an easy-to-understand directory structure. This layout and its corresponding best (better?) practices weren’t obvious to us when we were working on side_project (or on other_side_project and experimental_project, for that matter). But over time, we’ve gained experience with project organization, Git version control, and maintaining our own sanity. For a long time, we’ve respected the “if it ain’t broken, don’t fix it” rule, but our idealist tendencies are overcoming our fear (and possibly our wisdom). Our goal is to merge side_project into team_automations while minimizing the things we break and the bugs we introduce – and hopefully maintaining Git histories across both repositories.

Read More

Facebook Graph: Page/Album Edge Recap

In the previous post, the Page/Albums Edge allowed us to get a list of all Album nodes associated with a given Page node. From this, we created a simple page_id/album_id mapping table (called page_album_map). Now, if you’re only working with a single Facebook Page, then this table might seem kind of silly. But! The second you begin working on a second Facebook Page, it starts to take on some meaning. If you work for something like a record label, publishing company, or sports empire, then it’s likely you’re tracking tens to hundreds of Facebook Pages, easy. No question here: the mapping table becomes pretty dang important!

Read More

Crontab Backups

Let’s say you have a server: it’s yours and yours only. Your life’s work is on it. Now let’s make it interesting: say that, suddenly, other people were given the same credentials to log in and out.

Read More

Scraping Instabilities

Last year I developed a lot of Selenium/BeautifulSoup scripts in Python to scrape various social media and data collection platforms. It was a lot of fun, and certainly impressive (try showing someone how your program opens up a web browser, signs in to an account, navigates to various pages, clicks on buttons, and scrolls around without impressing them!).

Read More

Querying Hive or Presto Remotely in Python

Using the YouTube Reporting API several months ago, I “turned on” any and every daily data report available. That’s a lot of damn data. Putting it into Redshift would be a headache, so our team decided to keep it in S3 and finally give Hive and/or Presto a shot.

Read More

Facebook Instant Articles

Over the past several months, I’ve been on several reconnaissance missions to uncover what data is available from Facebook, how one might get it, and whether it can be automated.

Read More

Selenium Still Hangin' Round!

At work, I have a particular python Selenium script that is scheduled in Crontab to scrape some data every two hours. Little did I know that somehow, at one point, the code broke… Assuming everything was working, I wondered:

Why the hell are there so many chromedriver processes hanging around in the process list?

Read More

HTML Reports: Highlight on Hover

While trying to figure out how to ensure that the HTML report’s numeric values had commas in select columns, I stumbled across a neat little tidbit in the documentation for Styler.set_table_styles() that gives the email an interactive vibe: highlight the row that the cursor is resting on.
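
The gist of the trick, as a minimal sketch (the toy data, selector, and color are mine, not necessarily the post’s):

```python
import pandas as pd

df = pd.DataFrame({'views': [1200, 3400], 'likes': [56, 78]})

# one CSS rule: whichever table row the cursor rests on gets highlighted
styled = df.style.set_table_styles(
    [{'selector': 'tr:hover',
      'props': [('background-color', '#ffff99')]}]
)
# rendering `styled` to HTML for the emailed report carries the rule along
```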

Read More

Pandas Remote Crash Prevention

The Issue: When I SSH into my AWS/EC2 instance at work and romp around in iPython, the command line completion functionality works just fine… That is, until I import pandas and dare hit the tab key! Then it pukes its guts out and dies.

Read More

Google Developer Scholarship

A few months back I saw that Google was offering scholarships for a mobile web development nanodegree they sponsor on Udacity. Seemed a little outside of my day-to-day, but well within my span of interest. Mostly, it just seemed too cool to pass up, and without any real risk. Only the potential for growth and opportunity!

Read More

Beep! Script Complete.

Sometimes you need to run a script that takes long enough to warrant web surfing, but not long enough to really get too involved in another task. Problem is, sometimes the script completes and you’re still reading about whether or not Harry Potter should have ended up with Hermione instead of Ron. (Good arguments on both sides, I must say!)

Anyway, point is, it would be nice to have some kind of notification that the script completed its task. It’s actually pretty simple!

python myScript.py && echo -e "\a"

On this StackExchange page, some people said they had issues with this when remotely logged in via SSH. I did not have this problem when remotely logged into my EC2 instance on AWS, but in case you do, the recommendation is to “redirect the output to any of the TTY devices (ideally one that is unused).”

./someOtherScript && echo -en "\a" > /dev/tty5
Read More

Crontab Script Sequences

This is a quickie, but a goodie! Let’s say during an ETL process, you only want to run a second script if a first one completes successfully. For example, say the first script uses an API to extract some cumulative data from a social media site (e.g., views, likes, whatever), transforms it into some tabular form, and loads it into S3 on a daily basis; and say the second script computes deltas on this data to derive a daily top 10 table to be used by some interested end user in Redshift (e.g., daily top 10 viewed YouTube videos, daily top 10 most-liked posts, daily top 10 whatever).
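
In the crontab itself, that boils down to chaining the two commands with && – the second runs only if the first exits with status 0. A made-up entry (paths and schedule are placeholders):

```
0 6 * * * python /path/to/extract_to_s3.py && python /path/to/daily_top10.py
```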

Read More

AWS/Boto3 Encryption

So you have a personal AWS account, do ya?! And you’ve used the boto3 python package to transfer files from your laptop to S3, you say?!

Read More

Arguments with Python

So you use crontab to automate some data collection for one of some big media company’s many YouTube channels. They think it’s awesome: can you do it for all the channels?
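
Given the title, you can guess the punchline: parameterize the script from the command line. A minimal sketch (the flag name is made up):

```python
import argparse

parser = argparse.ArgumentParser(description='Collect stats for one channel')
parser.add_argument('--channel-id', required=True)  # made-up flag
args = parser.parse_args()

print('collecting data for channel', args.channel_id)
```

Then crontab can run the same script once per channel, each line passing a different --channel-id.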

Read More

The Skeletal Structure of HTML Reports

Let’s say you want to automate some HTML reports to be emailed daily. The reports should have a consistent look, but clearly need to be dynamically generated to take into account newly collected data and observations. This little tidbit should get you started!

Read More

Comments on GraphQL

At work, I’ve been diving deep on everything and anything Facebook:

  • What is the Open Graph Protocol?
  • How do I use the Graph API?
  • What data can we get from Facebook Insights or Analytics?
  • How about Audience Insights? Automated Insights? Facebook IQ?
Read More

Some CUDA AWS/Ubuntu Notes

Been cleaning up my work email and found these notes on installing CUDA on an EC2 instance from back in April. Figured some of it could potentially come in handy one day.

Read More

Hive-Minded Big Data

We are starting to store some of our data in Hive tables, which we have access to via a Hive or Presto connection. To dive right in, I turned to a few short courses on Lynda.com. In this log, I document a few things from the course Analyzing Big Data with Hive.

Read More

A 2nd Foray into the FB API

In a previous post, I covered a little bit about accessing Facebook’s Graph API through the browser-based Graph Explorer tool, as well as how to access it using the facebook Python module. Though the facebook module made things appear straightforward, it did seem to have some setbacks (e.g., it only supported up to version 2.7 of the API).

Read More

The Indefinitive Guide to Email in Python

The function below captures bits and pieces of the things that I’ve learned over the past several days. It is usable, for sure! But it might not be ideal; e.g., I’m working on an object version of it that provides a little more flexibility (no need to specify everything up front!). Also, there seem to be better ways to include tables in an HTML email (e.g., pandas). That said, here is the state of my art at this time. It certainly puts together a lot of pieces that you can take or leave in your own code (attaching images, using inline images, including data tables, etc).
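
As a teaser, the core moving parts look something like this bare-bones sketch (addresses and SMTP host are placeholders; the full function in the post does much more):

```python
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

msg = MIMEMultipart('alternative')
msg['Subject'] = 'Daily Report'
msg['From'] = 'me@example.com'   # placeholder
msg['To'] = 'you@example.com'    # placeholder

msg.attach(MIMEText('Plain-text fallback.', 'plain'))
msg.attach(MIMEText('<h3>Daily Report</h3><p>Table goes here.</p>', 'html'))

server = smtplib.SMTP('smtp.example.com', 587)  # placeholder host
server.starttls()
server.login('user', 'password')
server.sendmail(msg['From'], [msg['To']], msg.as_string())
server.quit()
```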

Read More

Multi-Processing in Python (the Multiprocessing Module)

Here’s the gist: by default, the Python interpreter is single-threaded (i.e., a serial processor). This is technically a safety feature known as the Global Interpreter Lock (GIL): by maintaining a single thread, Python avoids conflict. Each computation waits in line to take its turn. Nobody cuts the line. There is no name calling, spitting, or all-out brawls when things are taking too long. Everyone’s friendly, but no one is happy!
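
The multiprocessing module sidesteps the line entirely by handing work to separate processes, each with its own interpreter and its own GIL. A minimal sketch:

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    with Pool(processes=4) as pool:         # 4 worker processes
        print(pool.map(square, range(10)))  # work is split across them
```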

Read More

Multithreading in Python (Take One)

Here’s the situation:

  • I have a selenium script that scrapes and collates data from a javascript-heavy target source
  • There are multiple target sources of interest
  • The data collection from each source should happen at about the same time
    • e.g., more-or-less within same ~20 minute span is good enough
  • If each source scrape were quick, one might target them in sequence; however, this is not the case!
    • Several of the targets are large and take way too much time to meet the “quasi-simultaneous” requirement

Some solutions include:

  • create separate python script for each target and use crontab to schedule all at the same time
    • however, any changes in the codebase’s naming schemes, etc, could make upkeeping each script a nightmare
    • ultimately want one script
  • create single python script that takes parameters from the commandline, and use crontab to schedule all at the same time
    • this is a much better solution than the first
    • however, I’m dealing with 30+ targets and would ideally like to keep my crontab file clean, e.g., can this be done with one row?
  • create single python script that takes parameters from the commandline, make a bash script and run multiple iterations of the python script simultaneously by forking (&), and use crontab to schedule bash script
    • this is getting pretty good!
  • learn how to use multithreading in python (see the sketch below)
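
That last option, as a minimal sketch (the scrape function and target names stand in for the real selenium code):

```python
import threading

def scrape(target):
    print(f'scraping {target}')  # stand-in for the real selenium scrape

targets = ['source_a', 'source_b', 'source_c']  # made-up target list
threads = [threading.Thread(target=scrape, args=(t,)) for t in targets]
for t in threads:
    t.start()
for t in threads:
    t.join()  # all targets finish within roughly the same window
```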
Read More

Sending Email Notifications from the Command Line

In a python program, I can use a try/except statement to avoid a crash. In the case of a would-be crash, I can save some data locally and email myself about the failure. What if the program is supposed to log data every hour and something outside the try/except statement fails? Though it may crash, we can minimize losses by having the program run on a schedule using cron…

Read More

Revisiting YouTube's Data API for Content Owners

Let’s face it: my first post on the Data API is just brain spew and chicken scratch! It’s certainly been a helpful reference for me as I’ve further played with the Reporting and Data APIs – but it’s time for an update!

Read More

The Google Client API

Previously, I wrote about how to use the YouTube Reporting API from Python. My hope was that by dissecting and restructuring the provided code snippets into a more procedural format, it would be easier for a newcomer to get a sense of what each line of code is doing… Or at the least, allow them to use the code in an interactive python session to see what each line does (whether or not they gain any further sense of it).

Read More

The YouTube Reporting API

For most content, YouTube provides daily estimated metrics at a 2-3 day lag. If you are working on a project that requires recency or metric estimates at a better-than-daily cadence, scraping is probably the way to go, and will allow you to obtain estimates of total views, likes, dislikes, comments, and sometimes even a few other quantities.

Read More

Downloading Files with Selenium

Many platforms you want to extract data from will provide CSV or Excel files. Manual download is easy, but doing it every day is laborious. Furthermore, if you work for a media company, you might have 100’s of Facebook Pages, YouTube Channels, and so on. At some point, an automated solution becomes beneficial!

Read More

Data Shepherd

Data doesn’t always come neatly packaged in a table or streaming through some API.
Oftentimes it’s just out there — free range flocks of it on the Wild, Wild Web, just waiting for a cowboy to come by and herd it to the meat factory for slaughter.

Read More

Signing in with Selenium

There are many services we use that provide data from their website… You just have to sign in! Usually there is a dashboard, an Excel/CSV file, or both. Scraping a dashboard is incredibly specific to the platform you are interested in, depending on how that dashboard is coded. Scraping a file is something I need to figure out and will cover in a future post.
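
The sign-in step itself usually amounts to a few lines of Selenium (the URL and element names here are made up and entirely site-specific):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://example.com/login')  # made-up login page

driver.find_element(By.NAME, 'username').send_keys('me@example.com')
driver.find_element(By.NAME, 'password').send_keys('s3cret')
driver.find_element(By.NAME, 'login').click()
```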

Read More

The Beautiful Soup

I use this package all the time, but only figure it out as I go along…and I often forget what I did the last time. Here are a few refresher commands:

```python
import requests
import bs4
```

Read More

Solarizing the Terminal

In RStudio, I use the “Material” color theme, which has a midnight blue background similar to the popular “solarized” theme. In my never-ending quest to transform my Terminal/Vim/iPython set up into something more like RStudio, I wanted to figure out how to do this in Vim.

Read More

Jupyter Notebook + Console

In my previous post, I was trying to figure out how to use tmux to integrate a remote Vim and iPython session, while displaying images locally. For example, I wondered, “Is it possible to create a tmux session that allows one to place QtConsole side-by-side with Vim, Bash, etc?”

Read More

Notes on Autoencoders (Decoded into English)

An autoencoder is designed to reconstruct its input. In a sense, a perfect autoencoder would learn the identity function perfectly. However, this is actually undesirable in that it indicates extreme overfitting to the training data set. That is, though the autoencoder might learn to represent a faithful identity function on the training set, it will fail to act like the identity function on new data — especially if that new data looks different than the training data. Thus, autoencoders are not typically used to learn the identity function perfectly, but to learn useful representations of the input data. In fact, learning the identity function is actively resisted using some form of regularization or constraint. This ensures that the learned representation of the data is useful — that it has learned the salient features of the input data and can generalize to new, unforeseen data.
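
For instance, here’s the idea in tf.keras terms (a toy sketch with arbitrary dimensions): an undercomplete 32-unit bottleneck plus an L1 activity penalty, both of which push back against learning a trivial identity map.

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))
code = tf.keras.layers.Dense(
    32, activation='relu',                                # bottleneck
    activity_regularizer=tf.keras.regularizers.l1(1e-5),  # sparsity pressure
)(inputs)
outputs = tf.keras.layers.Dense(784, activation='sigmoid')(code)

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='mse')  # reconstruction loss
```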

Read More

Customizing R with .Rprofile

Is there a custom function you always use that is too specific to really create a library around, but that you use so frequently that doing something like source(path_to_file) gets annoying? For me, it’s one I call rsConnect(), which allows me to connect to Amazon Redshift without having to remember the specifics every single time.

Read More

The Perils of PCA

Need to reduce the dimensionality of your data? Principal Component Analysis (PCA) is often spouted as a go-to tool. No doubt, the procedure will reduce the dimensionality of your feature space, but have you inadvertently thrown out anything of value?

Read More

Artificial Funklord

As a kid, I loved to play Toe Jam & Earl with my brother. The game’s music was epic and inspirational, at least as far as we were concerned. Later on, in the future–now the past!–we would learn some musical instruments and jam out funk-rock style improvs largely inspired by TJ & E sounds.

Read More

Accessing Jupyter Notebooks and TensorBoard on AWS

Using a Jupyter Notebook or TensorBoard on AWS is straightforward on a personal account. It’s a little trickier if you are using a work account with restricted permissions. In this post, I will lay out my experience with both scenarios.

Read More

Better Jupyter Notebook

At one point, long ago, I was using MatLab for this project, IDL for that project, and R for yet another. For each language, I used a separate IDE, and this introduced a productivity bottleneck.

Read More

Conditional Aggregation in {dplyr} and Redshift

The responsibilities of my job and the projects I work on can vary from one day to the next. Turns out that a clever solution to a problem isn’t something I necessarily remember 100% when confronted with the same issue several weeks or months down the line.

Read More

Linear Regression in Tensorflow

Linear regression is a go-to example of supervised machine learning. Interestingly, as with many other types of well-known data analysis techniques, a linear regression model can be represented as a neural network: using known input and output data, the goal is to find the weights and bias that best represent the outputs/response as a linear function of the inputs/predictors. The neural network representation seamlessly integrates simple (one predictor), multiple (more than one predictor), and multivariate (more than one response variable) regression into one visual.
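
In code, the same idea is compact. Here’s a minimal tf.keras sketch (the shapes and optimizer are arbitrary, and this isn’t necessarily how the post builds it):

```python
import tensorflow as tf

# one Dense unit, no hidden layers, identity activation: y_hat = X @ w + b
model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),  # three predictors
    tf.keras.layers.Dense(1),    # weights w (3, 1) and bias b
])
model.compile(optimizer='sgd', loss='mse')
# model.fit(X, y, epochs=100)  # X: (n, 3) predictors, y: (n, 1) responses
```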

Read More

My Mac From Scratch

I’ve spent nearly a decade honing my .bash_profile, .vimrc, and other startup files on my personal laptop. But I can’t say that I think about them too much. It’s set-and-forget for long periods of time. To be clear, taking things for granted goes far beyond startup files: Xcode, Homebrew and hordes of brew-installed software, R and Python libraries, general preferences and customizations all around!

Read More

Jupyter Jumpstart

In Udacity’s Deep Learning nanodegree, we will be developing deep learning algorithms in Jupyter Notebooks, which promote literate programming. These Notebooks are a great way to develop how-to’s and present the flow of one’s data science logic. They are similar to notebooks in RStudio, but run in the browser and have a “finished feel” by default (whereas, with R notebooks, you have to compile them to get the finished look).

Read More

Conda Quick Guide

We will be using Python in the Deep Learning Nanodegree, specifically the Anaconda distribution. Conda is a package management and virtual environment tool for creating and managing various Python environments.

Read More

Running with Redshift

While working with some tables in Redshift, I was getting frustrated: “I wish I could just drag all this into R on my laptop, but I probably don’t have enough memory!” A devilish grin: “Or do I?”

Read More

Connect to Redshift from R with RPostgreSQL

At work, a lot of people use DbVisualizer on Windows computers… This was ok for getting up and running, e.g., learning about the various schemas and tables we have in Redshift. But at some point, its utility runs out: you can’t really do anything with the data without writing it to a CSV file and picking it up in R or Python. So why not cut out the middle man and just query data from R or Python?

Read More

What is CouchDB?

When choosing a database, there is MongoDB, Cassandra, MariaDB, PostgreSQL, and on and on. One question that plagues my mind when reading about all these possibilities is: How do you know which one to use? What use cases is a particular DB optimized for? Which ones is it terrible for?

Read More

Authorship

It is rare that a scientist works in isolation, despite the existence of single-author papers. The problem is there is only one label: author. And this fudges with everyone’s heads.

Read More

Wave Polarization: Some Basics

In a spatially 3D universe, one encounters two major types of waves: transverse and longitudinal. The waves are categorized into these two groups in response to the question: Does the wave vary parallel or perpendicular to its direction of propagation?

Read More

Google Trends

There’s no way around it: you’re not going to find the meaning of life in this blog post.

Read More

Riometer Movies with ImageMagick and FFMPEG

Let’s say there exists a server with a few years’ worth of daily riometer images. And suppose we want to download all of ‘em and make some time-lapse-like movies for each year of data. First, how do we quickly download 1000s of images? Second, how do we glue ‘em all together into a movie?
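
One plausible answer on each count (made-up URL and filenames; the post itself may go another route, e.g., ImageMagick for the stitching):

```
# -r recurse, -np don't ascend, -nd flatten, -A keep only PNGs
wget -r -np -nd -A png http://example.com/riometer/2016/
# glue the frames together, in glob order, into one movie per year
ffmpeg -framerate 24 -pattern_type glob -i '*.png' riometer_2016.mp4
```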

Read More