Regex vs Libraries vs OCR - Which Parsing Approach is Right for You?
Collecting web data can be a challenging, multi-step process. One stage that often causes difficulties for those just starting out is parsing. But what exactly is parsing, and which tools and methodologies will best fit your needs? In this article, we’ll take a closer look at three different approaches to parsing HTML, and examine the advantages and disadvantages of each one in order to help you decide.
Parsing Defined
Before we dive into the different approaches, it’s important to first explain what parsing is, and what makes it challenging. Simply put, parsing is the process of dissecting data in order to extract only the useful parts, and then converting those parts into a format that can be easily stored and used in data analysis algorithms.
For example, consider the following HTML:
<!DOCTYPE html>
<html>
  <body>
    <h1>This is a list of planets.</h1>
    <ol>
      <li>Mercury</li>
      <li>Venus</li>
      <li>Earth</li>
      <li>Mars</li>
      <li>Jupiter</li>
      <li>Saturn</li>
      <li>Uranus</li>
      <li>Neptune</li>
    </ol>
  </body>
</html>
A web scraping bot will return something similar to this example. Although the list of planets has been returned, it’s contained within the tags and syntax of HTML, and needs to be extracted before the actual desired data can be used, displayed, or stored.
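To make this concrete, here is a minimal sketch of one way to parse out the planet names in Python, using the standard library's html.parser module; the class name and variables are just illustrative:

from html.parser import HTMLParser

# Collect the text inside every <li> element, leaving the surrounding
# tags and syntax behind.
class PlanetParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.inside_li = False
        self.planets = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.inside_li = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.inside_li = False

    def handle_data(self, data):
        if self.inside_li and data.strip():
            self.planets.append(data.strip())

html_doc = """<!DOCTYPE html>
<html><body>
<h1>This is a list of planets.</h1>
<ol><li>Mercury</li><li>Venus</li><li>Earth</li><li>Mars</li>
<li>Jupiter</li><li>Saturn</li><li>Uranus</li><li>Neptune</li></ol>
</body></html>"""

parser = PlanetParser()
parser.feed(html_doc)
print(parser.planets)
# ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']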
In the simple example above, it may be easy enough to select certain lines and find/replace the HTML tags, but in real-world use, HTML is far more complex. Being able to accurately and precisely select the data you need can be difficult, depending on how the website is structured and whether or not the data you need has unique identifiers that can be used to pick it out.
Furthermore, regardless of which approach is used, webpages are dynamic, living entities that change frequently. Although some parsing methods may be more durable than others, every method will require maintenance and monitoring to ensure smooth operation.
Parsing Method #1: Regular Expressions
Regular expressions, often called Regex or Regexp, are a commonly used method for parsing text by defining a search pattern. Regex assigns special meanings to certain characters in order to define the pattern. For example, the expression “^The” will match any string that begins with the word “The”, and the expression “The$” will match any string that ends with the word “The”.
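For instance, a quick Python sketch of those two anchors, plus a simple extraction; the strings are purely illustrative:

import re

# "^The" matches only strings that begin with "The".
print(bool(re.search(r"^The", "The quick brown fox")))    # True
print(bool(re.search(r"^The", "A fox called The Quick")))  # False

# "The$" matches only strings that end with "The".
print(bool(re.search(r"The$", "A band called The")))       # True

# In a pinch, Regex can also pull data out of simple HTML:
snippet = "<li>Mercury</li><li>Venus</li><li>Earth</li>"
print(re.findall(r"<li>(.*?)</li>", snippet))  # ['Mercury', 'Venus', 'Earth']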
Advantages
- Powerful matching capabilities - unlike some other methods we’ll discuss later, Regex can parse out data even when it has no distinguishing identifiers. Regex is precise and exact, which allows it to match many kinds of patterns, including ones that don’t derive from a structured language like HTML. However, this precision also leads us to one of Regex’s big disadvantages.
Disadvantages
- Unforgivingly sensitive - Regex matches only exactly what fits the requested pattern, so a single character out of place in the source will throw it off. This makes it less durable than other approaches.
- CPU Intensive - HTML pages can be very large (often dozens of kilobytes). Pushing that much text through Regex is hard on CPUs, which can lead to CPU hogging and poor performance.
This combination of advantages and disadvantages means that Regex is not often used as the primary method for parsing HTML, but in a pinch, or in combination with other methods, it can be very useful.
Parsing Method #2: HTML Parsing Libraries
HTML parsing libraries work by building a “tree” out of the elements within the HTML - much in the same way a web browser would. The branches of the tree reflect the nested HTML containers, and elements can be referred to by their tag, id, class, or other identifiable attributes. The parse tree is stored in memory, which makes it very fast and responsive, and due to the tree structure, when an element is requested the path to it is much shorter because irrelevant preceding elements can be skipped.
There are dozens of parsing libraries, and many reasons to choose one over another, but two important factors to consider are programming language and locator method. Which programming language you choose mostly comes down to personal preference and to which language the rest of your application uses. When it comes to locator method, there are two main locators to choose from.
CSS Locators
Cascading Style Sheets (CSS) is the language used to stylize HTML. It’s used almost universally in designing websites, and has a very powerful targeting system that allows developers to describe the sizes, colors, positions, and other visual attributes of webpage elements.
This powerful targeting system is also used by many parsing libraries to help select relevant elements and extract them from the HTML source. For example:
<!DOCTYPE html>
<html>
  <head>
  </head>
  <body>
    <h1 id="mainTitle">Page Main Title</h1>
    <div class="contentBody">
      <p class="subtitle">Attention Grabbing Subtitle</p>
      <p class="mainContent">This paragraph is the main content of the page.</p>
    </div>
  </body>
</html>
Notice the id and class attributes attached to the h1 and paragraph elements in the HTML above. A CSS designer would use these attributes to identify and style those elements, and a parsing library with CSS locators lets you select elements using the same attributes. CSS also has many advanced targeting features, including nesting, variables, regex (yes, within CSS!), and more.
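For example, here is a minimal sketch using BeautifulSoup, one popular Python parsing library that accepts CSS selectors; the variable names are ours:

from bs4 import BeautifulSoup

html_doc = """<html><head></head><body>
<h1 id="mainTitle">Page Main Title</h1>
<div class="contentBody">
<p class="subtitle">Attention Grabbing Subtitle</p>
<p class="mainContent">This paragraph is the main content of the page.</p>
</div></body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# "#" targets an id and "." targets a class - the same syntax a CSS
# designer would use in a stylesheet.
title = soup.select_one("#mainTitle")
content = soup.select_one("div.contentBody p.mainContent")

print(title.get_text())    # Page Main Title
print(content.get_text())  # This paragraph is the main content of the page.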
Advantages
- Fast - CSS locators are very fast, even amongst parsing libraries, and generally faster than XPath, the other locator we will cover.
- Easy - CSS selectors are very simple and straightforward, with a low learning curve.
- Effective - The granularity of CSS, combined with its ubiquity in modern websites, means it’s highly effective at parsing out even hard-to-reach elements.
Disadvantages
- Memory Intensive - like all parse-tree libraries, CSS locators need a lot of memory. Because the entire structure of the page is loaded into memory, memory requirements grow with page size, and a large page can require significant resources.
- Maintenance - although this is not unique to CSS locators, this method may require a lot of maintenance, as changes to a page’s CSS can break selectors. Purely stylistic changes to the webpage may therefore affect CSS locators more than other methods.
XPath Locators
XPath stands for XML Path Language, and is fundamentally a tool for parsing XML data. XML is a markup language similar to HTML, but instead of being designed to build visual pages, XML was designed for storing and transmitting data. It is most often used to share data over the web, such as in the popular RSS format.
Although XPath was made with XML in mind, this locator works just as effectively with HTML because of the similarities between the two markup languages. XPath works by tracing a hierarchical path (similar to a file path on your computer) that leads from the top of the webpage down to the target element. If we reuse the previous example:
<!DOCTYPE html>
<html>
  <head>
  </head>
  <body>
    <h1 id="mainTitle">Page Main Title</h1>
    <div class="contentBody">
      <p class="subtitle">Attention Grabbing Subtitle</p>
      <p class="mainContent">This paragraph is the main content of the page.</p>
    </div>
  </body>
</html>
The XPath string that would lead us to the contentBody div would be:
/html/body/div
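As an illustration, here is a minimal sketch with lxml, a common Python library with XPath support, reusing the markup above:

from lxml import html

html_doc = """<html><head></head><body>
<h1 id="mainTitle">Page Main Title</h1>
<div class="contentBody">
<p class="subtitle">Attention Grabbing Subtitle</p>
<p class="mainContent">This paragraph is the main content of the page.</p>
</div></body></html>"""

tree = html.fromstring(html_doc)

# Walk the hierarchical path from the root of the page down to the div.
div = tree.xpath("/html/body/div")[0]
print(div.get("class"))  # contentBody

# XPath can also match on attributes, much like a CSS locator.
subtitle = tree.xpath('//p[@class="subtitle"]')[0]
print(subtitle.text)  # Attention Grabbing Subtitle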
Like CSS locators, XPath also has more advanced features. Here is a brief comparison of the syntax and abilities of CSS and XPath locators:
- Select by id: #mainTitle in CSS; //*[@id="mainTitle"] in XPath
- Select by class: .contentBody in CSS; //*[@class="contentBody"] in XPath
- Select a direct child: div > p in CSS; //div/p in XPath
- Select by visible text: not possible in CSS; //p[text()="..."] in XPath
- Step from a child back to its parent: not possible in CSS; //p/.. in XPath
Advantages
- More Powerful - XPath has abilities not present in CSS locators, such as referencing up the tree (from child to parent elements), selecting previous elements, and more.
- Search - XPath can search for visible text using the text() function, which is impossible with CSS locators (see the sketch after this list).
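For instance, continuing from the lxml sketch above (tree is the parse tree built there):

# Select a <p> element by its visible text - something CSS locators
# cannot express.
matches = tree.xpath('//p[text()="Attention Grabbing Subtitle"]')
print(matches[0].get("class"))  # subtitle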
Disadvantages
- Performance - although faster than Regex, XPath is slower than CSS locators.
- Complexity - the syntax and expanded toolkit of XPath make it harder to learn and use than CSS locators.
Because of their powerful targeting abilities, high performance, and modest learning curve, HTML parsing libraries are currently the standard approach to parsing HTML. However, with the growth of machine learning technology, there’s a new approach that has been capturing more and more attention in recent years.
Machine Learning OCR
Optical character recognition (OCR) is a computer-vision technology that extracts text from an image, PDF document, or other resource where text is displayed but not programmatically accessible. Although OCR is not a new technology, advances in machine learning have improved the accuracy of OCR models by incredible margins.
This growing success rate has inspired some to begin experimenting with OCR for web data collection. The process could be broadly outlined as follows, with machine learning involved in the last two steps (a minimal sketch in code appears after the list):
1. The target webpage is accessed by the data collection bot, and all CSS and JavaScript on the page is fully rendered.
2. A screenshot of the target webpage is saved, with machine learning models either directing the area of the screen to capture, or cropping the image so as to retain only the section that contains the relevant data. This is done to reduce irrelevant OCR output and save resources.
3. A machine-learning-trained OCR engine analyzes the resulting image and extracts the relevant text into a text file, CSV, or other storage option.
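Here is a minimal sketch of that pipeline in Python, using Playwright to render and screenshot the page and pytesseract for the OCR step. The URL, crop box, and file names are placeholders; in the approach described above, the crop coordinates would come from a trained vision model rather than being hard-coded:

from playwright.sync_api import sync_playwright
from PIL import Image
import pytesseract

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Step 1: render the page, CSS and JavaScript included.
    page.goto("https://example.com")  # placeholder target
    # Step 2: capture only the region that holds the relevant data.
    # The crop box is hard-coded here; a vision model would normally
    # predict these coordinates.
    page.screenshot(path="region.png",
                    clip={"x": 0, "y": 0, "width": 800, "height": 600})
    browser.close()

# Step 3: run OCR over the cropped image and store the text.
text = pytesseract.image_to_string(Image.open("region.png"))
with open("extracted.txt", "w") as f:
    f.write(text)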
Advantages
- No HTML - because the OCR engine analyzes the page as it is displayed to a user, there is no need to parse through the inconsistent syntax of HTML pages.
- Durability - depending on the accuracy of the machine learning model, the area of relevant data on the page can be recognized very effectively, making this approach more durable to changes in the target website.
- Team Collaboration - team members who do not write code, or write only basic code, can still operate this form of data collection bot, since with the right interface it can be driven graphically. The user visually selects the area of the page they wish to collect, and that data is then targeted by the bot.
Disadvantages
- Recognition - Training the machine learning models that identify the area of the page to screenshot/save is very difficult, and requires a background in machine learning and computer vision.
- Complex - Building OCR and Machine Learning models is difficult, and has a steep learning curve.
- Resource intensive - Images are generally much larger than text files, and thus storing and analyzing them requires more memory and CPU resources than text.
Conclusion
Parsing is an inseparable and often-overlooked stage in the web scraping process. Although it can be challenging to contend with inconsistent HTML and shifting website structures, there are many tools available in a variety of programming languages that were designed to help parse out relevant data. We hope this overview helped give you a better understanding of the available approaches to web data parsing! If you’re interested in learning more about the web scraping process in general, we recommend checking out our blog Three Approaches to Modern Web Scraping.