Pitting Pandas vs Postgres (a Refresher)
Tags: sql, python, database, postgres, pandas, easi
Oftentimes when working with a database, it is convenient to simply connect to it via Pandas (read_sql_query) and a connection established through SQLAlchemy’s create_engine – then, for small enough data sets, just query all the relevant data right into a DataFrame and fudge around using Pandas lingo from there (e.g., df.groupby('var1')['var2'].sum()). However, as I’ve covered in the past (e.g., in Running with Redshift and Conditional Aggregation in {dplyr} and Redshift), it’s often not possible to bring very large data sets onto your laptop – you must do as much in-database aggregation and manipulation as possible. It’s generally good to know how to do both, though since straight-up SQL skills cover both scenarios, SQL is the more important one to master in general.
In this tutorial, I look at a Spotify database, which is available online (code to access it below) via a shared Google Colab notebook… But! Make sure to MAKE A COPY of that notebook before you use it. (I began playing around with the original notebook, thinking my changes wouldn’t be saved… but I was wrong. Sorry to its original author!)
The questions below all come from the shared notebook. Just like my solutions below, the original notebook has its own solutions: if you want some practice, ignore them and try developing the queries yourself. (That’s what I did.)
NOTE: The data set has 74,200 rows and 6 columns, which amounts to ~3.6MB (74200*6*8/1e6) for 64-bit numbers. This easily fits into memory, so we can technically do all of our aggregations and transformations outside the database. For fun and instructional purposes, we’ll also look at how to do the same aggregations and transformations in SQL.
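For reference, that estimate is just rows times columns times eight bytes per 64-bit value. Keep in mind that once the data lands in a DataFrame, the true footprint will be larger because of the string columns; a minimal sketch (the memory_usage call assumes the df queried later on):

# Back-of-the-envelope size estimate, in MB
74200 * 6 * 8 / 1e6  # ~3.56

# Once a DataFrame is loaded (e.g., the df queried below), the true
# footprint (string columns included) can be checked with:
# df.memory_usage(deep=True).sum() / 1e6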
Connect to the Database
from sqlalchemy import create_engine
# Connect to Database
db = 'db_strata'
host = 'db-strata.stratascratch.com'
user = 'malabdullah'
psswd = 'nf64VlNxO'
conStr = f'postgresql://{user}:{psswd}@{host}:5439/{db}'
conn = create_engine(conStr)
# Query into DataFrames
from pandas import read_sql_query as qry
# Easier-to-Remember Table Name (use f strings in queries)
tbl = 'spotify_daily_rankings_2017_us'
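With the engine and the table alias in hand, a quick smoke test of the whole pattern (a sketch; it should come back with the ~74,200 row count from the note above):

# Smoke test: count rows using the f-string table alias
qry(f"SELECT COUNT(*) AS n_rows FROM {tbl}", conn)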
What percentage of total streams comes from the top 10 artists?
Assumption: I can pull entire table into memory
df = qry("SELECT * FROM spotify_daily_rankings_2017_us", conn)
# List table columns
df.columns
Index(['position', 'trackname', 'artist', 'streams', 'url', 'date'], dtype='object')
result = df.groupby('artist')[['streams']].sum().sort_values(by='streams', ascending=False)
round(100 * result.streams[:10].sum()/result.streams.sum(), 1)
29.1
Assumption: I cannot pull entire table into memory (in-db query)
# List table columns
qry("SELECT * FROM spotify_daily_rankings_2017_us LIMIT 0",conn)
position trackname artist streams url date
# Query time!
qry("""
WITH A AS (
SELECT artist, (100 * streams::numeric / SUM(streams) OVER ()) AS perc
FROM spotify_daily_rankings_2017_us ),
B AS (
SELECT artist, ROUND(SUM(perc), 2) as perc
FROM A
GROUP BY artist
ORDER BY perc DESC
LIMIT 10 )
SELECT SUM(perc)
FROM B
""",conn)
What percentage of total streams comes from the top 10 tracknames?
Note something important here: when aggregating, it’s important to group by artist AND trackname (not just trackname) – songs often have the same name!
Assumption: I can pull entire table into memory
df = qry("select * from spotify_daily_rankings_2017_us", conn)
# Reminder of what columns there are...
# Get streams by artist and trackname
# -- important to include artist since tracknames are often duplicated
streams_by_artist_track = df.groupby(['artist','trackname'])[['artist','trackname','streams']].\
sum().\
sort_values(by='streams', ascending=False)
# Answer
round(100 * streams_by_artist_track.streams[:10].sum()/streams_by_artist_track.streams.sum(), 1)
10.0
Assumption: I cannot pull entire table into memory (in-db query)
qry("""
WITH A AS (
SELECT artist,
trackname,
SUM(streams) AS streams
FROM spotify_daily_rankings_2017_us
GROUP BY artist, trackname
ORDER BY streams DESC
), B AS (
SELECT artist,
trackname,
(100*streams::numeric / SUM(streams) OVER ()) AS pc_streams
FROM A
LIMIT 10
)
SELECT ROUND(SUM(pc_streams), 1)
FROM B
""", conn)
9.9
I get 10.0 for the first answer and 9.9 for the second… Since both versions compute the same quantity, the gap is most likely round-off: Python’s round operates on a float while Postgres’ ROUND operates on an exact numeric, and the two can land on different sides of a rounding boundary. One genuine trap worth flagging: an ORDER BY inside a CTE is not guaranteed to carry over into the next CTE, so the LIMIT that picks the top 10 needs an ORDER BY at its own level, as in the query above. (If anyone sees another error, let me know!)
Which top 10 artists had the most streams in 2017?
First, if you don’t remember, look at the column names:
qry(f"SELECT * FROM {tbl} LIMIT 0")
Next, figure out what format the date is in:
qry(f"SELECT date FROM {tbl} LIMIT 3")
1/1/17
1/1/17
1/1/17
Unfortunately, we cannot simply use Postgres’ SUBSTRING function with fixed offsets to extract the year from dates in this format, since the prefix is variable length (e.g., 1/3/17 vs 10/12/17). Fortunately, there is another Postgres function for this, called RIGHT.
qry(f"SELECT RIGHT(date, 2) AS yy FROM {tbl} GROUP BY yy")
17
18
Cool! Let’s proceed.
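As an aside, if you would rather work with real dates than string suffixes, Postgres can parse this text format directly. A sketch, assuming the column really holds m/d/yy text (TO_DATE is lenient about unpadded months and days):

qry(f"""
SELECT EXTRACT(YEAR FROM TO_DATE(date, 'MM/DD/YY')) AS yr,
       COUNT(*) AS n
FROM {tbl}
GROUP BY yr
""", conn)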
This is such a straightforward query; no need to really munge around outside of the DB with it!
qry(f"""
SELECT artist,
SUM(streams) as streams
FROM {tbl}
WHERE RIGHT(date,2) = '17'
GROUP BY artist
ORDER BY streams DESC
LIMIT 10
""", conn)
|   | artist | streams |
|---|---|---|
| 0 | Drake | 1243212918 |
| 1 | Kendrick Lamar | 1142667504 |
| 2 | Post Malone | 949111981 |
| 3 | Lil Uzi Vert | 756277795 |
| 4 | Ed Sheeran | 711107331 |
| 5 | Migos | 676582502 |
| 6 | Future | 565462534 |
| 7 | The Chainsmokers | 552246036 |
| 8 | 21 Savage | 470721805 |
| 9 | Khalid | 459771711 |
List the artists by the number of track names in the top 200 in descending order
We are given an additional note: “You’ll need to take into account the number of regions the artist is in.” This note weirded me out at first since we do not have a “region” column in the table we’ve been using, where we only have position, trackname, artist, streams, url, date. Then I saw that the original author of the notebook was using a different table for this, which brings us to another important lesson:
How do you quickly see what tables are available in your database?
Well, for one, you can do this:
qry("""
SELECT tablename
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog'
AND schemaname != 'information_schema'
""",conn)
However, for the database we’re currently connected to, this brings up a few hundred tables corresponding to other SQL interview tutorials, so we have to be a little more specific.
qry("""
SELECT tablename
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog'
AND schemaname != 'information_schema'
AND tablename LIKE '%spotify%';
""",conn)
tablename
0 spotify_worldwide_daily_song_ranking
1 spotify_daily_rankings_2017_us
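The ANSI-standard route via information_schema works too, if you would rather not touch pg_catalog; a sketch:

qry("""
SELECT table_name
FROM information_schema.tables
WHERE table_name LIKE '%spotify%'
""", conn)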
The table we have not been using is spotify_worldwide_daily_song_ranking
, so
let’s check for a region variable:
qry("SELECT * FROM spotify_worldwide_daily_song_ranking LIMIT 1")
id position trackname artist streams url date region
0 0 1 Reggaetón Lento (Bailemos) CNCO 19272 https://open.spotify.com/track/3AEZUABDXNtecAO... 1/1/17 ec
This one DOES have a region variable. Nice!
One thing though: do we really need this table at all? The original table we were working with
has both position
and date
. Shouldn’t we be able to answer the question directly from
those variables?
I guess we can check both… But if the answers come up different, then I don’t know what to tell you (without looking much more deeply into where these tables come from, etc).
My first inclination was to add a WHERE clause to make sure we only count tracks that made it into the top 200… but then I thought to look at what the max position actually is, and it’s 200.
qry(f"""
SELECT MAX(position) as max
FROM {tbl}
""", conn)
200
That said, in a general rankings list, the rankings might go up into the 1000’s, so I’ll show how to do that below anyway.
Remember, we want to know how many tracks an artist has that made it into the top 200.
Assumption: I can pull entire table into memory
This query is actually so simple, it’s probably just worth doing the SQL before bringing anything into Python… But for consistency, here is the Pandas version of the query.
df = qry(f"SELECT artist, trackname FROM {tbl};", conn)
df.groupby('artist').\
nunique('trackname').\
sort_values(by='trackname', ascending=False)[:10].\
drop('artist',axis=1)
|   | artist | track_count |
|---|---|---|
| 0 | Future | 53 |
| 1 | Drake | 38 |
| 2 | Linkin Park | 28 |
| 3 | Big Sean | 27 |
| 4 | Taylor Swift | 25 |
| 5 | Lil Uzi Vert | 25 |
| 6 | Ed Sheeran | 25 |
| 7 | 21 Savage | 24 |
| 8 | Bryson Tiller | 23 |
| 9 | Eminem | 21 |
In a general rankings list, the rankings might go up into the 1000’s. This is not the case here, so I didn’t explicitly reference the position variable… but that almost feels like cheating, so here is the more general query, where we instead count each artist’s distinct tracks that reached the Top 10.
df = qry(f"SELECT artist, trackname, position FROM {tbl};", conn)
df.query('position <= 10').\
groupby('artist').\
nunique('trackname').\
sort_values(by='trackname', ascending=False)[:10].\
drop(['artist', 'position'],axis=1)
|   | artist | track_count |
|---|---|---|
| 0 | Drake | 14 |
| 1 | Kendrick Lamar | 12 |
| 2 | Huncho Jack | 5 |
| 3 | Lil Uzi Vert | 5 |
| 4 | Post Malone | 4 |
| 5 | Big Sean | 4 |
| 6 | Ed Sheeran | 4 |
| 7 | Eminem | 3 |
| 8 | The Chainsmokers | 3 |
| 9 | Migos | 3 |
Assumption: I cannot pull entire table into memory (in-db query)
qry(f"""
SELECT artist,
COUNT(DISTINCT trackname) as track_count
FROM {tbl}
GROUP BY artist
ORDER BY track_count DESC
LIMIT 10
""", conn)
And here is the more general form of the query for when the ranking variable is actually necessary to consider.
qry(f"""
SELECT artist,
COUNT(DISTINCT trackname) as track_count
FROM {tbl}
WHERE position <= 10
GROUP BY artist
ORDER BY track_count DESC
LIMIT 10
""", conn)
You might find that both of these SQL queries return slightly different tables than those shown above. This is because we only order by track_count, so when artists tie, their ordering is somewhat arbitrary. To ensure a consistent return, we must also order by artist. Notice, however, that this means some folks who tied for the last few spots won’t make it into the table… So depending on your business use case, you might have to get more technical, e.g., Top 10 including all ties for the 10th spot (making the list a bit longer than 10); a sketch of that follows.
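One way to do that is with RANK() (a sketch; RANK() assigns tied artists the same rank, so the result may contain more than 10 rows):

qry(f"""
WITH counts AS (
    SELECT artist,
           COUNT(DISTINCT trackname) AS track_count
    FROM {tbl}
    GROUP BY artist
), ranked AS (
    SELECT artist,
           track_count,
           RANK() OVER (ORDER BY track_count DESC) AS rnk
    FROM counts
)
SELECT artist, track_count
FROM ranked
WHERE rnk <= 10
ORDER BY track_count DESC, artist
""", conn)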
How does the number of streams in the top 10 differ from the number in the top 50? The top 100? The top 200? Find the average number of streams in the top 10, 50, 100, and 200.
In Python, this is easy, right? You can just use a for loop and get the numbers really quickly.
But what if you had to do everything in SQL? This is just for extra credit: let’s assume we’re using some SQL workbench GUI and don’t know Python. How could we return an answer?
Here’s one way using a bunch of CTEs and JOINs:
qry(f"""
WITH top10 as (
select 1 as joinvar,
avg(streams) as top10avg
from {tbl}
where position <= 10
), top50 as (
select 1 as joinvar,
avg(streams) as top50avg
from {tbl}
where position <= 50
), top100 as (
select 1 as joinvar,
avg(streams) as top100avg
from {tbl}
where position <= 100
), top200 as (
select 1 as joinvar,
avg(streams) as top200avg
from {tbl}
where position <= 200
)
SELECT top10avg,
top50avg,
top100avg,
top200avg
FROM top10
JOIN top50 on top10.joinvar = top50.joinvar
JOIN top100 on top100.joinvar = top50.joinvar
JOIN top200 on top200.joinvar = top100.joinvar
""",conn)
|   | top10avg | top50avg | top100avg | top200avg |
|---|---|---|---|---|
| 0 | 1.152252e+06 | 695913.378329 | 506541.295795 | 355588.68128 |
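Postgres (9.4+) also supports the FILTER clause on aggregates, which collapses all of this into a single pass over the table; a sketch:

qry(f"""
SELECT AVG(streams) FILTER (WHERE position <= 10)  AS top10avg,
       AVG(streams) FILTER (WHERE position <= 50)  AS top50avg,
       AVG(streams) FILTER (WHERE position <= 100) AS top100avg,
       AVG(streams) FILTER (WHERE position <= 200) AS top200avg
FROM {tbl}
""", conn)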
Another way could be with a bunch of CTEs (again) and a bunch of UNION operations.
qry(f"""
WITH top10 AS (
SELECT 'top10'::text AS top_x,
avg(streams) AS avg_streams
FROM {tbl}
WHERE position <= 10
), top50 AS (
SELECT 'top50'::text AS top_x,
avg(streams) AS avg_streams
FROM {tbl}
WHERE position <= 50
), top100 AS (
SELECT 'top100'::text AS top_x,
avg(streams) AS avg_streams
FROM {tbl}
WHERE position <= 100
), top200 AS (
SELECT 'top200'::text AS top_x,
avg(streams) AS avg_streams
FROM {tbl}
WHERE position <= 200
)
SELECT * FROM top10
UNION SELECT * FROM top50
UNION SELECT * FROM top100
UNION SELECT * FROM top200
ORDER BY avg_streams DESC
""",conn)
|   | top_x | avg_streams |
|---|---|---|
| 0 | top10 | 1.152252e+06 |
| 1 | top50 | 6.959134e+05 |
| 2 | top100 | 5.065413e+05 |
| 3 | top200 | 3.555887e+05 |
In Python, this all simplifies to a loop. This can be a more SQL-esque loop, or more Pandas-esque.
SQL-esque Loop
avg_streams = dict()
for num in [10, 50, 100, 200]:
    avg = qry(f"""
        SELECT avg(streams) AS avg_streams
        FROM {tbl}
        WHERE position <= {num}
    """, conn)
    avg_streams[num] = avg.avg_streams.iloc[0]  # extract the scalar from the 1-row result
Pandas-esque Loop
Sometimes Pandas feels more efficient and useful… Other times, it feels like overkill. To my mind, this is one of those cases. But whatever works!
df = qry(f"SELECT * FROM {tbl}", conn)
avg_streams = dict()
for num in [10, 50, 100, 200]:
avg_streams[num] = df.query(f'position <= {num}').streams.mean()
How many different artists are there in the top 100 vs. the top 101-200?
Assumption: I can pull entire table into memory
df = qry(f"SELECT * FROM {tbl}", conn)
data = {
'position <= 100': None,
'position > 100 & position <= 200': None,
}
for condition in data.keys():
data[condition] = df.query(condition).artist.nunique()
data
{'position <= 100': 291, 'position > 100 & position <= 200': 478}
Assumption: I cannot pull entire table into memory (in-db query)
qry(f"""
WITH A AS (
SELECT 'Rank <= 100'::text AS category,
COUNT(DISTINCT artist) AS distinct_artists
FROM {tbl}
WHERE position <= 100
), B AS (
SELECT '100 < Rank <= 200'::text AS category,
COUNT(DISTINCT artist) AS distinct_artists
FROM {tbl}
WHERE position BETWEEN 101 and 200
)
SELECT * FROM A
UNION SELECT * FROM B
""",conn)
category distinct_artists
0 100 < Rank <= 200 478
1 Rank <= 100 291
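The FILTER clause from earlier works here too, and does it in one pass; a sketch:

qry(f"""
SELECT COUNT(DISTINCT artist) FILTER (WHERE position <= 100) AS top_100,
       COUNT(DISTINCT artist) FILTER (WHERE position BETWEEN 101 AND 200) AS top_101_200
FROM {tbl}
""", conn)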
The author in the notebook just shows the number of distinct artists in the top 100 and top 200, which doesn’t quite answer the question… But as a sanity check, let’s just see if we get the same number for the top 200:
df.query('position <= 200').artist.nunique()
487
Ok, yes we do. Great!
Which artists should the marketing team invest in? Why? Support your answer with your analytical research.
References
https://www.stratascratch.com/spotify-case-study.html
https://colab.research.google.com/drive/1XG-TZbwU2oIZfZOIuX82cAqneK7-1ZSZ#scrollTo=l74CxV8UF4Mt