Something's in the Soup!

Written on July 13, 2017
[ python webscraping wwe  ]

Here is a quick 1-2 about BeautifulSoup:

import requests
import bs4

# Scrape YouTube Main Page
page = requests.get("http://www.youtube.com")
page = bs4.BeautifulSoup(page.text, 'lxml')

# Look at the Page Title and HTML Head
page.title
page.head

# Look at first link tag in <head></head>
page.head.link

# Find all link tags
page.head.find_all('link')

# Find first link in body
page.body.a

# Look at first span tag in first link in body
page.body.a.span

Using this tag-chaining syntax is nice, but how do you know what children tags exist in a parent tag? To see what direct children a tag has, view ‘em all at once (e.g., page.body.a.contents) or access them in generator fashion (e.g., [tag for tag in page.body.a.children]). To see all descendants of a tag, generate them with .descendants. One can also look at the parent tag (tag.parent) and ascendants ([parent for parent in tag.parents]).

And don’t forget about siblings:

page.head.meta
page.head.meta.next_sibling
page.head.meta.next_sibling.next_sibling

More:

  • To look at all child strings associated with a tag: [tag_str for tag_str in tag.strings]
  • To strip the white space: [tag_str for tag_str in tag.stripped_strings]
  • The parent of a string is a tag: tag.string.parent is tag

Need to update this with my other notes.