Facebook Graph: Data Collection Strategy (and Miscellaneous Updates)

[ facebook-graph  python  automation  etl  wwe  ]

Our ultimate goal is to treat the Page Node as the root node of a graph, and to strategize how to traverse that graph and what to collect along the way.

We have found that fields often hold managerial info, which can be uninteresting yet is still important for housekeeping. For example, it’s good to know the privacy setting of a video when people are wondering why it has generated no views; likewise, the engagement of a video or post is important by itself, but is better understood in the context of the object’s publication date. Sometimes, fields also contain high-level metrics, like fan count.

A Facebook Page Node has many types of Edges, e.g., /albums, /bc_sponsored_posts, /events, /insights, /instant_articles, /instant_articles_insights, /likes, /photos, /posts, /ratings, /tagged, /video_broadcasts, /videos, and /visitor_posts, among many others. Each edge contains related nodes, which also have fields and edges. Using field expansion and nested requests, you can start to envisage collecting everything and anything all at once… But this approach can actually get unwieldy if your goal is to “collect everything.”
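To make the "nested request" idea concrete, here is a minimal sketch of how a field-expansion string could be assembled. The helper name expand and the particular fields chosen are my own assumptions for illustration; only the field.limit(n){subfields} syntax comes from the Graph API itself.

```python
def expand(field, subfields=(), limit=None):
    """Build one field-expansion term, e.g. 'albums.limit(5){id,name}'."""
    term = field
    if limit is not None:
        term += ".limit({})".format(limit)
    if subfields:
        term += "{" + ",".join(subfields) + "}"
    return term


# Nest photos inside albums inside the page request (hypothetical field choices)
photos = expand("photos", ["id", "link"], limit=2)
albums = expand("albums", ["id", "name", photos], limit=5)
path = "me?fields=" + ",".join(["id", "name", albums])
print(path)
```

Even two levels deep the string is already hard to read, which is a decent hint at how quickly the "everything at once" request gets out of hand.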

So, importantly, we have already developed some strategy: do not collect everything at once!

Instead, I think it’s better to piece together the Page’s graph little by little. For each edge like (page_id, x_id), simply create a mapping table. Then for each x_id, create a high level (“managerial”) fields table, several (x_id, y_id) mapping tables for important edges of x_id, and the corresponding high level y_id tables.

There are some caveats, e.g., you have to choose where to stop this process along each graph path:

  • say you create the (page_id, album_id) mapping table
  • then you might create the high level album_fields table and the (album_id, photo_id) mapping table
  • STOP: you might next decide to create a high level album_photo_fields table, but this would only result in a subtable of the photo_fields table you would create in conjunction with the (page_id, photo_id) mapping table
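The piece-by-piece approach above can be sketched roughly as follows. This is only an illustration under assumptions: the function names are hypothetical, and the sample response is made up to look like an edge payload (a dict with a "data" list of nodes).

```python
def edge_to_mapping(parent_id, edge_response):
    """Return (parent_id, child_id) rows for one edge response."""
    return [(parent_id, node["id"]) for node in edge_response.get("data", [])]


def fields_rows(edge_response, fields):
    """Return one dict per child node, keeping only the chosen high-level fields."""
    return [
        {f: node.get(f) for f in fields}
        for node in edge_response.get("data", [])
    ]


# Made-up payload shaped like the response of a page's /albums edge
albums_response = {
    "data": [
        {"id": "a1", "name": "Timeline Photos", "type": "wall"},
        {"id": "a2", "name": "Cover Photos", "type": "cover"},
    ]
}

page_album_map = edge_to_mapping("p1", albums_response)
album_fields = fields_rows(albums_response, ["id", "name", "type"])
print(page_album_map)
print(album_fields)
```

The same two helpers would then be reused for (album_id, photo_id), and the STOP rule just means not calling fields_rows on that second response, since the photo fields already come from the page's own /photos edge.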

The Album Node Summaries

The album node’s engagement edges (/comments, /likes, and /reactions) allow one to get a total_count by setting the summary parameter, summary=true. By default, these edges return a list of the engaged user nodes. If you want to cut down on the noise and just return the summary, then set the limit parameter to zero.

token.get(album_id+'/likes?limit=0&summary=true')
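Here is a small sketch of how the summary call could be wrapped for all three engagement edges. It assumes the response carries the count under response['summary']['total_count'] and that the reactions edge is named reactions; FakeToken is a hypothetical stand-in for my fb_token object so the example runs without touching Facebook.

```python
def summary_path(node_id, edge):
    """Relative path that asks for the edge summary and no user nodes."""
    return "{}/{}?limit=0&summary=true".format(node_id, edge)


def total_count(response):
    """Pull total_count out of an edge response; None if no summary came back."""
    return response.get("summary", {}).get("total_count")


def album_engagement(token, album_id, edges=("comments", "likes", "reactions")):
    """Return {edge_name: total_count} for one album."""
    return {e: total_count(token.get(summary_path(album_id, e))) for e in edges}


class FakeToken:
    """Hypothetical stand-in for the fb_token object; returns canned responses."""

    def __init__(self, canned):
        self.canned = canned

    def get(self, path):
        # Unknown paths behave like an edge with no summary
        return self.canned.get(path, {"data": []})


fake = FakeToken({
    "123/likes?limit=0&summary=true": {"data": [], "summary": {"total_count": 42}},
    "123/comments?limit=0&summary=true": {"data": [], "summary": {"total_count": 7}},
})
counts = album_engagement(fake, "123")
print(counts)
```

Returning None (rather than 0) when the summary is missing keeps "no data came back" distinguishable from "nobody engaged."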

Side Note About Page Likes

I’ve looked into this a few times now… Whereas an album node has a likes edge, which ostensibly shows you all users (User Nodes) who have liked the album, the Page node does not seem to have this feature. At first, I thought I didn’t know what I was doing, but I’m not so sure I’m a dummy anymore!

I find that the fan_count field of a Page Node gives you the number of Facebook users who have liked your page, which (btw!) is different than the number of folks “following” the page. From what I can tell, “followers” are “likers who have not unfollowed the page.” On Facebook, you can unfollow a page (not receive updates on your feed), while still maintaining that you “like” the page.

The “likes” edge of a Page Node returns what other Facebook Pages have been “liked” by the Page Node, which departs from the behavior of the “likes” edge of most other nodes (albums, photos, posts, videos).

So, basically, if you want to keep tabs on Facebook users who are likely fans of your page, then you can collate Facebook users who have liked (or reacted to) any of your page’s posts, photos, videos, etc. This can include users who have not liked your page, and it might also not include all users who have liked your page. But what can you do? (Seriously, if you know – leave a comment! (Yes, I know I’m likely only speaking to Future Kevin. Hey, Future Kev: leave a comment, ya bum!))
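A rough sketch of that collation, assuming you have already pulled the likes edge for each object and that each liker comes back as a dict with an id and a name. The function name and the sample data are made up for illustration.

```python
def collate_likers(likes_by_object):
    """Map user_id -> {'name': ..., 'liked': set of object ids the user liked}."""
    fans = {}
    for object_id, likers in likes_by_object.items():
        for user in likers:
            entry = fans.setdefault(
                user["id"], {"name": user.get("name"), "liked": set()}
            )
            entry["liked"].add(object_id)
    return fans


# Made-up likes pulled from two different objects on the page
likes_by_object = {
    "post_1": [{"id": "u1", "name": "Ann"}, {"id": "u2", "name": "Bo"}],
    "photo_9": [{"id": "u1", "name": "Ann"}],
}

fans = collate_likers(likes_by_object)
# Users who liked the most objects first: a crude "likely fan" ranking
ranked = sorted(fans, key=lambda uid: (-len(fans[uid]["liked"]), uid))
print(ranked)
```

The ranking is only a proxy: it says nothing about whether u1 or u2 actually liked the Page itself.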

One possibility: if you really need to know whether or not the user has liked your page, you may be able to look at /{user-id}/likes to see if your page is in there. (Haven’t checked myself, so I’m not sure if you will run into a privacy barrier or not.)

Funnily enough, I’ve read that Facebook does this to “protect the privacy of its users.” But if that were the case, then why can I get the names and Facebook IDs of users who liked anything else that has to do with my Facebook Page? Basically, if you like a Facebook Page, you’re quasi-anonymous to that Page’s owner, but the second you like anything on that page, you are not. And let’s be clear: you can get some likers directly by manually going to your Facebook Page > Insights > People, which shows fans, followers, and more. However, this is apparently only a small subset if you have a large following.

Relevant: https://developers.facebook.com/bugs/147185208750426

Bugs in the Graph

Can’t believe I haven’t perused the Bugs Page before! So many interesting and relevant questions/answers (e.g., when I search “graph api” w/ no filter tags).

Here is a relevant bug that is open at the time of this writing:

I’ve been debugging an issue where the numbers that we get from the insights API sometimes change between different runs. It seems that the results of async jobs that we do are sometimes missing some fields and some numbers are 0 instead of the correct number. We’ve seen differences in spend and impressions previously, but currently it seems to be mainly some other fields, such as reach or inline_link_clicks, etc. that are sometimes 0.

This bug here also seems squeamishly relevant:

We’re trying to query post reactions, likes, and shares on specific page post (not a final link). However, Insights API are returning strange numbers. Also, on analysis window through Business UI, numbers do not match as well.

And another creepy crawler that makes my skin crawl a bit: the developer says that post_stories_by_action_type and post_story_adds_by_action_type return the same value, which doesn’t seem to make sense: “I think post_stories_by_action_type should return actions made only in the page, and post_story_adds_by_action_type in the page and in user shares.” A Facebook rep responds:

I enquired internally with the engineering team responsible for this and it looks like this is a valid bug that’s already been brought up. Unfortunately, the underlying issue is inherent of the current system design. We are working on a long term solution that would address this issue but due to the complexity of the issue, however, I don’t have an ETA at this time.

Well… That’s comforting, huh? Comforting or not, these things are good to know! I can imagine this bugs page saving me a lot of grief in the future, trying to explain why things aren’t as they seem or should be.

For example, this bug discusses why a certain insights metric isn’t populated every day. The page_fans_online_per_day metric is not populated unless the number meets a certain minimum threshold. What is that threshold? “Due to a policy issue, we cannot tell you what the threshold value is.”

Crazy!

Some Code Misc

Recap of SQL Tables Thus Far

CREATE TABLE page_high_level (
  page_id varchar,
  page_name varchar,
  fan_count int,
  talking_about_count int,
  as_on_date datetime
);

CREATE TABLE page_album_map (
  page_id varchar,
  album_id varchar
);

CREATE TABLE album_fields (
  album_id varchar,
  album_name varchar,
  album_type varchar,
  cover_photo_id varchar,
  description varchar,
  event varchar,
  link varchar,
  place_id varchar,
  privacy_setting varchar,
  created_time datetime,
  updated_time datetime
);

CREATE TABLE facebook_places (
  place_id varchar,
  name varchar,
  street varchar,
  city varchar,
  zip varchar,
  state varchar(2),
  country varchar, 
  latitude float,
  longitude float
);

CREATE TABLE album_photo_map (
  album_id varchar,
  photo_id varchar
);

CREATE TABLE album_sharedpost_mapping (
  album_id varchar,
  sharedpost_id varchar
);

CREATE TABLE album_likes_mapping (
  album_id varchar,
  user_id varchar
);

CREATE TABLE page_photo_map (
  page_id varchar,
  photo_id varchar
);

CREATE TABLE photo_fields (
  photo_id varchar,
  back_dated_time datetime,
  back_dated_time_granularity varchar,
  event varchar,
  link varchar,
  page_story_id varchar,
  place varchar,
  height int,
  width int,
  created_time datetime,
  updated_time datetime
);
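To show how the mapping tables are meant to chain together, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. SQLite is my assumption purely for the demo (the real tables live elsewhere), and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two of the mapping tables from the recap above
conn.execute("CREATE TABLE page_album_map (page_id varchar, album_id varchar)")
conn.execute("CREATE TABLE album_photo_map (album_id varchar, photo_id varchar)")

# Made-up rows: one page, two albums, three photos
conn.executemany(
    "INSERT INTO page_album_map VALUES (?, ?)",
    [("p1", "a1"), ("p1", "a2")],
)
conn.executemany(
    "INSERT INTO album_photo_map VALUES (?, ?)",
    [("a1", "ph1"), ("a1", "ph2"), ("a2", "ph3")],
)

# Walk page -> album -> photo by joining the two mapping tables
rows = conn.execute(
    """
    SELECT m.page_id, a.album_id, a.photo_id
    FROM page_album_map AS m
    JOIN album_photo_map AS a ON a.album_id = m.album_id
    ORDER BY a.photo_id
    """
).fetchall()
print(rows)
conn.close()
```

Because every hop is its own narrow table, any path through the graph is just a chain of joins, and the photo details can be picked up from photo_fields at the end without duplicating them per album.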

Some Python

# This uses a fb_token object I define elsewhere
import pandas as pd

def get_page_fields(
  token,
  me = None,
):
  if me is None:
    me = token.me
  # Pass `me` through so the request runs against the chosen page
  data = token.get('me?fields=id,name,fan_count,talking_about_count', me)
  return data


def get_page_albums(
  token,
  me = None,
  limit = 50,
):
  if me is None:
    me = token.me
  data = []
  next_url = 'me/albums?limit='+str(limit)
  # Keep following the paging links until there is no next page
  while next_url:
    response = token.get(next_url, me)
    data += response['data']
    try:
      # Strip the Graph base URL (token.fbg) to get a relative path
      next_url = response['paging']['next'].split(token.fbg)[1]
    except (KeyError, IndexError):
      next_url = None
  # One row per album; the page columns are filled in below
  rows = [{'album_id': item['id'], 'album_name': item['name']} for item in data]
  df = pd.DataFrame(rows, columns=['page_name', 'page_id', 'album_id', 'album_name'])
  page = token.get('me', me)
  df['page_id']   = page['id']
  df['page_name'] = page['name']
  return df

def get_page_albums_for_all_accounts(
  token,
  limit = 50,
):
  pages = token.page_to_token.copy()
  # Drop the 'user' entry so only pages remain
  _discard = pages.pop('user')
  frames = [get_page_albums(token, me=page, limit=limit) for page in pages.keys()]
  if not frames:
    return pd.DataFrame(columns=['page_name', 'page_id', 'album_id', 'album_name'])
  # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
  return pd.concat(frames, ignore_index=True)
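
Since the fb_token object lives elsewhere, here is a self-contained sketch of just the paging loop, run against a fake token. Everything about FakeToken is hypothetical: the fbg base URL (including the API version in it), the get(path, me) signature, and the canned pages are assumptions made so the loop can be exercised offline.

```python
class FakeToken:
    """Hypothetical stand-in for fb_token: serves canned pages keyed by path."""

    fbg = "https://graph.facebook.com/v2.12/"  # assumed base URL

    def __init__(self, pages):
        self.pages = pages

    def get(self, path, me=None):
        return self.pages[path]


def collect_edge(token, first_path, me=None):
    """Follow paging.next links and return every node from the edge."""
    data = []
    next_path = first_path
    while next_path:
        response = token.get(next_path, me)
        data += response.get("data", [])
        next_url = response.get("paging", {}).get("next")
        # Turn the absolute next URL back into a relative path, or stop
        next_path = next_url.split(token.fbg, 1)[1] if next_url else None
    return data


# Two made-up pages of albums; the second has no "next" link
fake = FakeToken({
    "me/albums?limit=2": {
        "data": [{"id": "a1"}, {"id": "a2"}],
        "paging": {"next": FakeToken.fbg + "me/albums?limit=2&after=XYZ"},
    },
    "me/albums?limit=2&after=XYZ": {"data": [{"id": "a3"}], "paging": {}},
})

album_ids = [node["id"] for node in collect_edge(fake, "me/albums?limit=2")]
print(album_ids)
```

Using .get() with defaults in place of try/except makes the stop condition explicit: the loop ends exactly when a response has no paging.next.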
Written on February 14, 2018