March 23, 2022

The invisible technology bridging WWW and SQL


Gathering and cataloging the world’s public data is a counterintuitively difficult goal. Google has solved it successfully and offers an excellent solution for B2C applications, but there is no comparably good solution for B2B use.

As a result, businesses suffer from a variety of data-related shortcomings, including not using external data at all, building expensive and inefficient infrastructure, and often using incomplete, inaccurate, or outdated data.

Nimble’s novel browser technology aims to alleviate the issues that have thus far prevented a robust B2B data platform, and to empower any business to collect public web data reliably and at scale.

Introduction

We are surrounded by increasingly sophisticated technologies, embedded so seamlessly in modern life that we take their underlying complexity for granted. When we think of brands such as Amazon or Apple, convenience and quality are the first things that come to mind, and only later do we remember their incredible technical achievements.

For example, Netflix spent 10 years developing “Open Connect” [1], an in-house content distribution network that spans 17,000 servers across 158 countries [2]. The result? When the show Squid Game garnered an overnight audience of 111 million viewers, Netflix’s infrastructure held up flawlessly [3].

Proxy map for data collection

Over the last few years, I’ve focused on solving another seemingly simple problem - making the web an accessible data source for businesses. It would be game-changing if we could query the internet, as simply as we run SQL or a Google search, and receive a structured response drawing from the vast amount of publicly available data.

With modern data stacks, we are able to analyze huge volumes of data and wield powerful computing clouds and artificial intelligence networks (see my blog on the subject here), so indexing public web data at scale seems to me like an attainable goal.

Google handles more than 3.5 billion searches per day [4], and makes it easy for consumers to discover information online. Data has become more accessible and relevant than ever, and getting more data with less work has never been easier. However, for data-driven businesses looking to gather web data to refine their decision-making, there are currently very few options that support the needed scale.

Although it may seem straightforward to collect data from public websites, a number of counterintuitive obstacles have thus far put it out of reach for the majority of companies - despite the fact that businesses stand to gain enormously from using external data.

I occasionally find myself explaining what scanning the internet actually means, and how it can power dynamic price optimization for e-commerce websites and SEO and promotional strategies for marketers, and more broadly help every business monitor its competitors, connect with its customers, and elevate its business intelligence.

In this article, I outline the engineering challenges my team and I have tackled in recent years, what led us to develop novel headless browsing technology, and how we created a data platform that makes collecting external data effortless.

Google - the first successful model for web data collection

One way to approach this question is to study the technology developed by Google, which transformed how we use the internet. Google’s web data collection approach can be broadly divided into two phases: web crawling and indexing.

Web Crawling

First, an automated program called a web crawler (also known as a bot or a spider) discovers any and all web pages by following links from other pages or by direct submission to Google via a sitemap. The crawler analyzes the information and structure of each webpage in order to develop an understanding of the page’s contents.
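To make the mechanics concrete, here is a minimal sketch of a breadth-first crawler in Python. It assumes the requests and beautifulsoup4 packages and leaves out everything a production crawler needs - robots.txt handling, politeness delays, JavaScript rendering, and distributed queues:

```python
# A minimal link-following crawler sketch; not a production crawler.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url: str, max_pages: int = 50) -> dict[str, str]:
    """Breadth-first crawl that returns {url: visible_text} for discovered pages."""
    seen, queue, pages = {seed_url}, deque([seed_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable pages
        soup = BeautifulSoup(resp.text, "html.parser")
        pages[url] = soup.get_text(" ", strip=True)  # store the page's visible text
        for link in soup.find_all("a", href=True):
            absolute = urljoin(url, link["href"])    # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```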

Indexing

The information from the crawler is stored in Google’s Index - a massive repository of all the collected webpage data that spans dozens of data centers and millions of servers around the globe [5]. When you search for a phrase, Google’s algorithms determine which results are most relevant and load them from the Index.
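Conceptually, the core of such an index can be thought of as an inverted index: a mapping from each term to the pages that contain it, so that a query is answered by intersecting posting lists. The toy version below ignores ranking, sharding, and everything else that makes Google’s Index work at planetary scale:

```python
# A toy inverted index: term -> set of URLs containing that term.
from collections import defaultdict

def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return URLs containing every term in the query (no ranking)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())  # intersect posting lists
    return results
```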

Simplified pipeline for collecting web data

This is just a brief overview of how Google’s technology works (read more here). The reality is complex on multiple dimensions, but at its core, it derives from the ethos that making data accessible to users at the largest possible scale produces immense value.

The same value could serve businesses if it were possible for companies to build their own internal data collection service. From my perspective, there are two central challenges that must be overcome in order to create a data platform that fully realizes the potential of connecting businesses with external data.

The first challenge: web browser inefficiency

Essentially, the role of a web browser is to communicate with web servers and give end users access to networks and information - a role that sits at the Application Layer of the OSI model. Netscape Navigator, Internet Explorer, and other first-generation web browsers processed simple, static, and lightweight websites. Over time, websites developed into web apps, which provide better experiences and are significantly more complicated. Uber, for example, recreated its entire mobile app experience as a Progressive Web App (PWA), and in doing so expanded its reach to more devices and 2G networks [6].

To keep pace, modern browsers need to run robust JavaScript frameworks and programming logic. This race to the top has made it impractical to render websites accurately without a browser. As a result, whenever web data needs to be accessed, companies use web browsers to facilitate the effort.

The drawback of these powerful rendering engines is heavy resource consumption. Web browsers are also difficult to automate because of their user-centric interfaces. RPA tools such as UiPath can automate web browsers and significantly reduce the manual work an operator would otherwise do to collect data, and although this approach works well, scaling it up to 100M actions per day isn’t feasible.

Many companies have tried to overcome the resource issue by using headless browsers, which function similarly to ordinary browsers but have no graphical interface. Headless browsers were mainly created to automate QA tasks and save engineering time. Frameworks such as Selenium, Puppeteer, and Playwright are designed to be controlled programmatically - via the Chrome DevTools Protocol or other APIs rather than through a GUI - and to eliminate GPU processing. Advanced configurations are also available through managed QA platforms.
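As a point of reference, here is a minimal headless fetch using Playwright’s Python API, one of the frameworks mentioned above; the error handling and URL are deliberately simplistic:

```python
# A minimal headless-browser fetch with Playwright (pip install playwright,
# then `playwright install chromium`).
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)   # no GUI, driven programmatically
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")     # let client-side JavaScript render
        html = page.content()                        # full post-render HTML
        browser.close()
    return html
```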

Having said that, headless browsers introduce a number of meaningful disadvantages when compared to real browsers. First, although their functionality is very similar to that of real browsers, various APIs may be disabled or modified, and other settings distinguish them from real browsers, which can lead to unexpected behavior online.
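One well-known example of such a difference is that browsers driven by automation expose navigator.webdriver to page JavaScript, and older headless builds also tend to report an empty plugin list. These are just two of many possible signals, shown below purely as an illustration:

```python
# Illustration only: inspect a couple of properties a website's JavaScript could
# read to distinguish an automated or headless session from a regular one.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.evaluate("navigator.webdriver"))       # typically True under automation
    print(page.evaluate("navigator.plugins.length"))  # often 0 in older headless builds
    browser.close()
```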

Additionally, because headless browsers were not designed for web scraping, they have no built-in mechanisms for rate-limiting requests, and source websites will often serve them inaccurate data or refuse to serve content at all. Finally, although headless browsers save resources by not rendering elements graphically, they still (by design) perform the same networking activity as a real browser, which adds up to significant resource consumption when scaled to millions of queries.

The shortcomings of headless browsers meant my team and I were spending most of our time debugging rather than progressing. We quickly realized there was a dire need for new tools and frameworks dedicated to solving the unique challenges we were facing. Our solution was to develop a new browser from the ground up, based on Chromium, that provides a unique feature stack optimized for web data collection at scale - but more on that later.

The second challenge: crawling dynamic content in the modern web

In addition to becoming more technically complex, modern websites collect a variety of parameters that are used to personalize every page’s content for the user. The number of such parameters is effectively unbounded, so it is impossible to create a database of all the public data on the internet - the data structure is simply too complex. At most, you can build a context map or layered vector databases (as Google and Diffbot have done).

Collecting dynamic content over the internet

One such parameter is the user’s country of origin. Using a VPN to access a product page from different countries and hunt for a lower price is a well-known tactic amongst savvy users and is a good example of how websites adapt to user metadata [7].

Another example involves websites serving higher prices to users who browse on Apple devices - the logic being that a person using an Apple device is more likely to spend more money on purchases [8].

The list of factors used by companies to regulate and tailor their content is constantly growing. Websites consider the user’s browser, operating system, user agent, whether or not the user is using a mobile device, the type of network connection, the user’s behavior on the website, and on and on [9, 10].
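A simple way to see this effect is to request the same page with different request metadata. The sketch below uses the requests library with two illustrative header profiles; the URL and header values are hypothetical:

```python
# Fetch the same (hypothetical) product page under two different client profiles.
import requests

URL = "https://example.com/product/123"  # hypothetical product page

PROFILES = {
    "desktop-us": {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept-Language": "en-US,en;q=0.9",
    },
    "iphone-de": {
        "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X)",
        "Accept-Language": "de-DE,de;q=0.9",
    },
}

for name, headers in PROFILES.items():
    resp = requests.get(URL, headers=headers, timeout=10)
    # Prices, currencies, and layout commonly differ per profile on real sites.
    print(name, resp.status_code, len(resp.text))
```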

This intense personalization also applies to web crawlers. Google’s web crawler is well-known and receives preferential treatment, but most bots are generally treated with indifference (at best) or suspicion.

Dynamic personalization is a key challenge because it can degrade accuracy. It’s no longer about how big your dataset is; it’s about how relevant the data is. Inaccurate data undermines companies’ ability to integrate external data in real time and to make critical marketing decisions such as pricing strategies and user LTV predictions (see Voyantis).

Existing methodologies

Businesses today employ engineers who manually code web data pipelines. Each pipeline is uniquely hard-coded for a specific data source (e.g. a set of URLs or a target website) and configured to emulate a set of parameters (such as location and device type), as well as the business logic needed to extract the desired data.

This logic is simple for humans but can be unexpectedly tedious or fragile when followed by a machine. Clicking links or buttons, filling forms, scrolling to a particular area of a page, and selecting text blocks, prices, figures, or titles are just some examples of steps that are easy and intuitive for a person, but that must be explicitly programmed into the web data pipeline’s scraping logic.
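A sketch of what such hand-written, source-specific logic typically looks like is shown below, again using Playwright. The target URL and CSS selectors are hypothetical, and they are exactly the kind of detail that breaks when the site changes its layout:

```python
# A hard-coded, source-specific scraping routine (hypothetical site and selectors).
from playwright.sync_api import sync_playwright

def scrape_listing_prices(search_term: str) -> list[str]:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://shop.example.com")      # hypothetical target site
        page.fill("input#search", search_term)     # fill the search form
        page.click("button.search-submit")         # click, like a human would
        page.wait_for_selector("div.result-card")  # wait for results to render
        prices = page.locator("div.result-card span.price").all_inner_texts()
        browser.close()
    return prices
```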

This kind of manual construction makes existing pipelines inefficient for multiple reasons:

  • Inflexibility - once built, much of the pipeline’s code is not reusable for other sources, and often not even for other pages of the same source.
  • Intra-organizational breakdown - the engineering team builds the pipeline and hands it off to the data team that actually uses the data. However, almost any change to the data or the pipeline requires a request back to the engineering team, which bogs down their work on new tasks or pipelines and blocks the data team until the change is implemented.
  • Unmanageable codebases - as more pipelines are written, the amount of code that needs to be managed grows substantially and can overwhelm engineering teams. Furthermore, data pipelines carry a high risk of duplicated code, which can cause versioning and deployment issues.
  • Fragility - any change to the structure, design, layout, or content of the website risks breaking the data extraction algorithm and, with it, the whole pipeline.

Existing data collection methodologies

Current pipelines are complex, inflexible, fragile, and unmanageable.

Having said that, the benefits and resulting competitive advantage enjoyed by companies who use external data have driven many large organizations to hire vast in-house engineering and data teams. Others, seeking to avoid these issues, have opted to outsource the work to third-party data aggregators.

Although data aggregators eliminate the heavy development burden, they introduce a host of new limitations and restrictions. Firms have little or no input into the data collection process, which is often deliberately broad and general in order to appeal to more clients - at the cost of the specific, contextualized data an individual company needs for accurate insights. Companies cannot experiment with and explore the data to discover unknown unknowns, or mix data sets to surface unexpected correlations.

Furthermore, companies cannot get real-time, streaming updates and must instead rely on the update cadence set by the aggregator, which can be as infrequent as once a month. For example, People Data Labs is a great company that aggregates B2B person profiles, but its data is updated on a monthly interval, which can lead to significant inaccuracies in sales or HR applications.

These approaches do not resolve the underlying issues with current-generation data collection pipelines; instead, they cause inefficiency and friction, which limits the scale and accuracy of web data collection operations.

The Nimble approach

After years of building and maintaining web data collection operations, we realized that new technology is urgently needed. There is widespread demand for infrastructure capable of scanning the internet with the speed, accuracy, and flexibility necessary to maximize the potential of external data.

Our approach at Nimble was to create a platform that eliminates engineering complexity and makes web data gathering accessible to all. We developed the first browsing technology uniquely designed for data collection purposes. We call it the Nimble Browser.

The Nimble Browser emulates a wide variety of devices and browsers and has a light footprint that can dynamically increase resource usage in accordance with the requirements of the source webpage.

It balances performance and compatibility by modulating between several levels of JavaScript rendering and user behavior fingerprinting, and it can scale on cloud computing platforms without vendor lock-in. This is achieved using a combination of real and headless browsers that we heavily modified in order to better simulate real users.

By using Nimble’s IP Services, the Nimble Browser accesses IP addresses from anywhere in the world to maximize data accuracy, with access to high-resolution geographic targeting features down to the city level, as well as real-time load balancing and optimized IP allocation.

It was important to us that this technology elevate businesses on both sides of the aisle, so a very stringent code of “web etiquette” is the foundation of the Nimble Browser. This ethos ensures respect for source websites by responsibly rate-limiting requests, restricting access to public web data only, and protecting user privacy with built-in compliance protocols.
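As an illustration of what responsible rate-limiting can mean in practice, here is a minimal per-host throttle. It is a generic sketch, not a description of Nimble’s implementation:

```python
# A minimal per-host throttle: never hit the same host more often than a fixed interval.
import time
from urllib.parse import urlparse

class PoliteThrottle:
    def __init__(self, min_interval_seconds: float = 2.0):
        self.min_interval = min_interval_seconds
        self.last_request = {}  # host -> timestamp of the last request

    def wait(self, url: str) -> None:
        """Block until it is polite to request this URL's host again."""
        host = urlparse(url).netloc
        elapsed = time.monotonic() - self.last_request.get(host, 0.0)
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request[host] = time.monotonic()
```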

Furthermore, the Nimble Browser offers a number of different methods of scraping and parsing web content. These formats are intended to give users options - from raw output that can be parsed on the client side, to a structured, machine-readable format that can be quickly ingested into a data lake or lakehouse.
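As a purely hypothetical illustration of the two ends of that spectrum, the same product page might be delivered as raw HTML or as a structured record; the field names below are invented for the example:

```python
# Raw output: the page HTML, left for the client to parse.
raw_output = "<html><body><h1>Acme Widget</h1><span class='price'>$19.99</span></body></html>"

# Structured output: a machine-readable record ready for a data lake (illustrative schema).
parsed_output = {
    "url": "https://shop.example.com/widget",
    "title": "Acme Widget",
    "price": 19.99,
    "currency": "USD",
    "collected_at": "2022-03-23T00:00:00Z",
}
```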

We are building a Data Platform that will empower any business to easily establish its own web data pipelines. Although gathering web data is incredibly challenging today, Nimble’s browsing technology and web data platform are a big step towards making it effortless.

This is just the start, and there is a lot more to be done - stay tuned.

Albert Einstein said about truly understanding science:

“You do not really understand something unless you can explain it to your grandmother.”

I wish you were here to listen, and I hope my explanation was clear!

__

Many thanks to the Nimble engineering team, to Ori Hamama, our head of research, for his innovative approach, and to Yuval and Alon for their leadership. Together, we are making it a reality.

  1. http://www.theverge.com/2012/6/5/3064182/netflix-open-connect-cdn-isp
  2. http://www.theverge.com/22787426/netflix-cdn-open-connect
  3. http://www.theverge.com/2021/10/12/22723452/netflix-squid-game-biggest-ever-show-at-launch
  4. http://www.internetlivestats.com/google-search-statistics/
  5. http://www.google.com/about/datacenters/locations/
  6. http://www.simicart.com/blog/progressive-web-apps-examples/#8_Uber
  7. http://surfshark.com/blog/how-to-save-money-with-vpn
  8. http://www.wsj.com/articles/SB10001424052702304458604577488822667325882
  9. http://www.techtarget.com/searchcustomerexperience/tip/How-to-comprehensively-personalize-the-customer-experience
  10. http://brainstation.io/magazine/home-depot-engagement-up-238-with-personalization-data
