Scraping Instabilities

[ webscraping  python  automation  etl  wwe  ]

Last year I developed a lot Selenium/BeautifulSoup scripts in Python to scrape various social media and data collection platforms. It was a lot of fun, and certainly impressive (try showing someone how your program opens up a web browser, signs in to an account, navigates to various pages, clicks on buttons, and scrolls aroun without impressing them!).

Over time, I ran into various problems with these scripts. Most problems need a time.sleep(x), but not all. Some problems seem to be outside of my control altogether.

One such problem is on YouTube: recently, they changed the class name of the anchor tags I was scraping. This meant that BeatifulSoup’s select method began churning out empty lists, instead of the expected data elements.

For reasons like this, I’ve changed a lot of automated scripts, preferring available APIs over Selenium/BeautifulSoup when available. Don’t get me wrong: Selenium is a great tool, and excellent way to supplement the various APIs I now use. But the “stability” of your script is so dependent on little tweaks in the HTML, CSS, and JavaScript…

Now that I’m writing this, I’m thinking: “Yea, but APIs are known for abruptly changing as well.” So, maybe there is no rest for the automated!

Example:

## Formerly...
vlist = page.select('a[class="style-scope ytd-playlist-video-renderer"]')

## Now...
vlist = page.select('a[class="yt-simple-endpoint style-scope ytd-playlist-video-renderer"]')

As you can see, YouTube added an additional class name (“yt-simple-endpoint”) to the anchor tag. This is likely to happen again!

Written on January 29, 2018