Querying Hive or Presto Remotely in Python

[ python  aws  hive  presto  databases  wwe  ]

Using the YouTube Reporting API several months ago, I “turned on” any and every daily data report available. That’s a lot of damn data. Putting it into Redshift would be a headache, so our team decided to keep it in S3 and finally give Hive and/or Presto a shot.

Here, I show how to connect to remote instances of Hive and Presto (we're on AWS, but I'd expect this to work elsewhere too). We keep all the daily reports in a Hive database; call it hiveYtReps. These reports are just CSV files in S3 buckets, but through some magic (a SerDe, short for serializer/deserializer), our DE team makes 'em appear as tables when accessed through Hive or Presto.

PyHive

I first installed PyHive and its various dependencies. Per my notes, the fiddly bit was sasl, a C extension that needs the system SASL development headers before pip can build it.
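Roughly, the installs looked like this (assuming pip; on Debian/Ubuntu, sasl also wants the system headers from libsasl2-dev):

pip install sqlalchemy
pip install 'pyhive[hive]'    # should pull in sasl, thrift, and thrift_sasl
pip install 'pyhive[presto]'  # should pull in requests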

Hive

from sqlalchemy import create_engine

# username / IP: your HiveServer2 user and host; 10000 is HiveServer2's default port
con = create_engine(
    'hive://' + username + '@' + IP + ':10000/' + hiveDbName,
    connect_args={'configuration': {'hive.exec.reducers.max': '123'}},
)
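
With the engine in hand, querying works like any other SQLAlchemy connection; for example, through pandas (the table name below is a stand-in; use one of your actual report tables):

import pandas as pd

# Stand-in query against a hypothetical report table
df = pd.read_sql('SELECT * FROM some_daily_report LIMIT 10', con)
print(df.head())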

Presto

from sqlalchemy import create_engine

# The URL path is <catalog>/<schema>, so the Hive database sits under Presto's 'hive' catalog
con = create_engine(
    'presto://' + username + '@' + IP + '/hive/' + hiveDbName,
    connect_args={'session_props': {'query_max_run_time': '1234m'}}
)
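
Downstream, the Presto engine is used exactly the same way; the only difference is that Presto, not HiveServer2, executes the query:

# Same call as before, now executed by Presto rather than HiveServer2
df = pd.read_sql('SELECT * FROM some_daily_report LIMIT 10', con)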

On the PyHive webpage, you'll see that connect_args takes another keyword, protocol:

connect_args={'protocol': 'https', 'session_props': {'query_max_run_time': '1234m'}}

In my case, this caused an error, and removing the protocol keyword made everything work fine! (We have other security measures in place…but it still feels a bit weird getting something to work by making it look less secure.)

Written on January 26, 2018