Extracting URLs (faster) with Python
The usual recommendation for HTML parsing in Python is BeautifulSoup. It's a great library and easy to use, but it gets a bit slow when processing a lot of documents. In this blog post, I would like to highlight some alternative ways to extract URLs from HTML documents without using BeautifulSoup. I added a performance test at the end to compare each alternative.
First off, below is the source code to extract links using BeautifulSoup. We will use LXML as the parser implementation for BeautifulSoup because, according to the documentation, it's the fastest. The code uses the find_all function with an <a> tag filter to retrieve only the URLs.
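from bs4 import BeautifulSoup

def extract(content):
    links = []
    # The 'lxml' parser requires the lxml package to be installed
    soup = BeautifulSoup(content, 'lxml')
    # href=True skips <a> tags that have no href attribute
    for tag in soup.find_all('a', href=True):
        links.append(tag['href'])
    return links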
Just for the performance test, I added a slightly modified version below which doesn't use the <a> tag filter. Instead, it iterates over every tag and filters manually. Will there be any difference in execution time?
from bs4 import BeautifulSoup

def extract(content):
    links = []
    soup = BeautifulSoup(content, 'lxml')
    # No filter: find_all() returns every tag, so we check each one ourselves
    for tag in soup.find_all():
        if tag.name == 'a' and 'href' in tag.attrs:
            links.append(tag.attrs['href'])
    return links
One of the underlying parsers used by BeautifulSoup is LXML. While BeautifulSoup provides a lot of convenient functions on top of it, you can use LXML directly. For our URL extraction case, the code is no more complicated, and it strips away BeautifulSoup's overhead.
import lxml.html

def extract(content):
    links = []
    dom = lxml.html.fromstring(content)
    # The XPath selects the href attribute value of every <a> element
    for link in dom.xpath('//a/@href'):
        links.append(link)
    return links
The Python standard library has a built-in HTML parser, and the following snippet uses it to extract URLs. It's a bit more complicated because we need to define our own HTMLParser subclass.
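By the way, BeautifulSoup can use this built-in parser instead of LXML as its underlying parser (pass 'html.parser'). When no parser is specified, it picks the best one that is installed, preferring LXML. The built-in parser is great in case you need a Python-only implementation.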
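from html.parser import HTMLParser  # Python 3; the module was named HTMLParser in Python 2

class URLHtmlParser(HTMLParser):
    def __init__(self):
        super().__init__()
        # Per-instance list, so separate parsers don't share results
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) tuples
        for name, value in attrs:
            if name == 'href':
                self.links.append(value)
                break

def extract(content):
    parser = URLHtmlParser()
    parser.feed(content)
    return parser.links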
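To make it clearer what the handler has to work with, here is a minimal sketch that only records what handle_starttag receives for each <a> tag. The class name AttrDumpParser and the sample markup are made up for illustration. The point is that attrs arrives as a list of (name, value) tuples, so the URL is the second element of the tuple whose name is href.

```python
from html.parser import HTMLParser


class AttrDumpParser(HTMLParser):
    """Hypothetical helper: records the raw attrs of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.seen.append(attrs)


parser = AttrDumpParser()
parser.feed('<p><a class="nav" href="/home">Home</a> <a name="top">Top</a></p>')
print(parser.seen)
# [[('class', 'nav'), ('href', '/home')], [('name', 'top')]]
```

The second tag has no href at all, which is why an extractor should compare the attribute name rather than assume every <a> tag carries a link.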
During my research I found Selectolax. It's a super fast HTML parser. Under the hood, it uses the Modest engine to do the parsing.
from selectolax.parser import HTMLParser

def extract(content):
    links = []
    dom = HTMLParser(content)
    for tag in dom.tags('a'):
        # attributes is a dict of attribute names to values
        attrs = tag.attributes
        if 'href' in attrs:
            links.append(attrs['href'])
    return links
As a final alternative, the following code snippet uses a regular expression to parse HTML tags. Because it doesn't parse the actual HTML DOM it won't be a fit for every use case - especially when the document is malformed. At the same time, this can be a big plus because it will use less memory and regular expressions are very fast. In any case - use with caution.
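import re

# Group 1 captures the opening quote, group 2 the URL up to the matching quote
HTML_TAG_REGEX = re.compile(r'<a[^<>]+?href=([\'\"])(.*?)\1', re.IGNORECASE)

def extract(content):
    # findall returns (quote, url) tuples because the pattern has two groups
    return [url for _quote, url in HTML_TAG_REGEX.findall(content)]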
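To illustrate the "use with caution" part, here is a small sketch that runs the same pattern over a made-up snippet. The sample markup is an assumption for demonstration only. It contains a double-quoted, a single-quoted, an unquoted, and an upper-case link.

```python
import re

# Same pattern as in the article
HTML_TAG_REGEX = re.compile(r'<a[^<>]+?href=([\'\"])(.*?)\1', re.IGNORECASE)

# Hypothetical sample input
SAMPLE = (
    '<a href="https://example.com/a">A</a> '
    "<a class='x' href='/b'>B</a> "
    '<a href=/c>C</a> '
    '<A HREF="/d">D</A>'
)

urls = [url for _quote, url in HTML_TAG_REGEX.findall(SAMPLE)]
print(urls)
# ['https://example.com/a', '/b', '/d']
```

The unquoted link /c is silently skipped because the pattern requires a quote after href=, whereas an HTML parser would accept it. Whether that matters depends on the documents you process.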
For the performance test, I downloaded the Wikipedia page on HTML, which is around 353 KB in size and contained 1,839 links at the time of writing this article. I ran each extraction method 1,000 times on this file and used the average runtime as the result.
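The measurement described above could be sketched roughly as follows. This is not the exact script used for the article. The helper name average_runtime, the synthetic page, and the reduced number of runs are assumptions so that the example is self-contained, and only the regex variant is registered because it needs nothing beyond the standard library.

```python
import re
import timeit

LINK_REGEX = re.compile(r'<a[^<>]+?href=([\'\"])(.*?)\1', re.IGNORECASE)


def extract_regex(content):
    """Regex-based extraction, used here as the function under test."""
    return [url for _quote, url in LINK_REGEX.findall(content)]


def average_runtime(func, content, runs=1000):
    """Return the mean wall-clock time in seconds of one func(content) call."""
    total = timeit.timeit(lambda: func(content), number=runs)
    return total / runs


# Synthetic stand-in for the downloaded page (assumption: 1,839 simple links)
content = '<html><body>' + ''.join(
    '<p><a href="/page/%d">link %d</a></p>' % (i, i) for i in range(1839)
) + '</body></html>'

# Add the other extract functions here to compare them side by side
candidates = {'regex': extract_regex}

for name, func in candidates.items():
    seconds = average_runtime(func, content, runs=100)
    print('%-10s %.6f s per call' % (name, seconds))
```

Timing many calls in one timeit run and dividing by the run count smooths out per-call noise. The absolute numbers will vary by machine, so only the relative order between methods is meaningful.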
I didn't expect the differences in execution time between the methods to be so big. It matters a great deal which of them you use. Interestingly, doing the manual filtering with BeautifulSoup is faster than using the <a> tag filter, something I wouldn't have expected. While the regex implementation is the fastest, Selectolax is not far off and provides a complete DOM parser. If you don't want to use a regex or Selectolax, then LXML by itself can still offer decent performance.