
Stepping into Web Data Parsing: An Overview


Click to learn more about author Juras Juršėnas.

For as long as I can remember, the usefulness of online public data has been weighed against the effort of extracting and structuring it. Going from raw data to a well-structured, parsed output takes a considerable amount of time, effort, and resources, and even once the initial prototype has been deployed, constant maintenance is required.

Once scale comes into the picture, data parsing can easily become something only a select few are able to do. What's more, web data parsing poses unique challenges because of how HTML has been used over the years.

What is Data Parsing?

All web scraping activities rely on one particular action: extracting data. Everything starts with downloading the HTML. Unfortunately, while in most cases it holds all the necessary information, it is not structured for further analysis. This is no fault of HTML itself: it is a language intended to be read by browsers and rendered into a visually appealing result for the user. HTML allows a lot of structural flexibility, so developers can take creative approaches to producing that end result.

These creative approaches, however, mean the data ends up scattered differently from one page to another. To glean anything from HTML, analysts have to find ways to parse, structure, and standardize the data points. Usually, that means writing custom scripts or helper tools that define the rules for processing the data and turning it into a structured form.

Retrieving data in a structured format such as JSON or CSV would be ideal. However, that is simply not an option in the current landscape, so data derived from HTML needs to go through parsing in order to become structured.
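
To make that concrete, here is a minimal sketch of what a hand-written parsing rule looks like in practice: it takes raw product-page HTML and turns it into a JSON record. The markup, class names, and fields below are hypothetical, and the tooling (Python with BeautifulSoup) is just one common choice.

# A minimal sketch: turning raw product-page HTML into structured JSON.
# The markup and field names here are hypothetical, purely for illustration.
import json
from bs4 import BeautifulSoup

raw_html = """
<div class="product">
  <h1 class="product-title">Example Widget</h1>
  <span class="price">$19.99</span>
  <div class="description">A small example product.</div>
</div>
"""

soup = BeautifulSoup(raw_html, "html.parser")

# Each rule maps a desired field to a CSS selector in this particular layout.
record = {
    "title": soup.select_one(".product-title").get_text(strip=True),
    "price": soup.select_one(".price").get_text(strip=True),
    "description": soup.select_one(".description").get_text(strip=True),
}

print(json.dumps(record, indent=2))

The rules themselves are trivial; the problem is that they only hold for this one layout.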

Challenges in Data Parsing

When it comes to externally acquired data (usually raw HTML), the primary issue stems from two factors: parsing is absolutely necessary, and HTML was never meant for analysis. HTML exists to visually represent content in a browser, so performing any kind of analysis directly on raw HTML is inefficient.

Since web development practices vary widely between websites, building a one-size-fits-all parser is extremely hard. For example, a product page on two ecommerce websites might look very similar and contain the same information, yet the underlying HTML will differ. The same parser will not work for both: one either needs a custom solution for each site or a more complex parser that handles the differences, and that parser keeps growing more complex with every website added to the scope.
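
As a hedged illustration (the domains, markup, and selectors below are invented), here is how the same two fields can require entirely different extraction rules on two hypothetical shops:

# Why one parser rarely fits all sources: two hypothetical shops expose the
# same product name and price, but under completely different markup, so
# each needs its own selector rules.
from bs4 import BeautifulSoup

SELECTOR_RULES = {
    # Hypothetical site A: semantic class names.
    "shop-a.example": {"name": "h1.product-title", "price": "span.price"},
    # Hypothetical site B: generated, non-descriptive class names.
    "shop-b.example": {"name": "div#pdp h2", "price": "div.c-x91 b"},
}

def parse_product(domain: str, html: str) -> dict:
    """Apply the per-site selector rules to extract the same two fields."""
    rules = SELECTOR_RULES[domain]
    soup = BeautifulSoup(html, "html.parser")
    return {field: soup.select_one(css).get_text(strip=True)
            for field, css in rules.items()}

Every new domain adds another entry to that rule table, and every entry is something a person has to write and keep up to date.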

Even for a single data source, one parser might not be enough. Ecommerce platforms, for example, often have multiple layouts and page types scattered throughout, and each requires its own approach. Helper tools exist, but they take effort to set up and are not a permanent fix, as pages are bound to change over time.

The combination of HTML's flexibility and its nested nature makes this a challenging task for analysts. Unfortunately, the complexity doesn't stop there. Websites are prone to changing layouts or adding new features, all of which affect parsers, and every such change requires a parser update to match the new look.
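
One common defensive pattern, sketched below with invented layouts and selectors, is to keep a list of known layouts, try each in turn, and fail loudly when none match, so the team knows the rules need updating:

# A sketch of coping with multiple layouts on the same site (and with
# redesigns): try each known rule set in order and raise an error when
# none of them match, signalling that the parser needs an update.
# The layout names and selectors are hypothetical.
from bs4 import BeautifulSoup

LAYOUTS = [
    {"name": "redesign", "title": "h1[data-testid='title']", "price": "span[data-testid='price']"},
    {"name": "legacy", "title": "h1.product-title", "price": "span.price"},
]

def parse_with_known_layouts(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    for layout in LAYOUTS:
        title = soup.select_one(layout["title"])
        price = soup.select_one(layout["price"])
        if title and price:
            return {"layout": layout["name"],
                    "title": title.get_text(strip=True),
                    "price": price.get_text(strip=True)}
    # No known layout matched: the site has probably changed again.
    raise ValueError("Unrecognized page layout - parser rules need updating")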

Couple that with the need to acquire data from numerous sources, and the task quickly becomes unsustainable. Large businesses with dedicated developer teams may manage to maintain numerous scrapers and parsers; smaller businesses often have to outsource the process altogether.

Outsourcing parsing does have its benefits. For smaller use cases, the overall cost in people and capital is often lower than assembling a developer team, onboarding them, building a parser, and maintaining it. And while outsourcing makes a small business more dependent on an external provider, it removes the headache of adapting to changing layouts and reduces the impact of loss of service (outages still happen, but for shorter periods of time).

The AI Advantage

Writing a simple parser for one task is not unreasonable even for a small development team. The real challenge lies in scaling: each new source requires at least several new custom parsers, and maintaining a growing number of them is extremely resource-intensive. Since data on the web is scattered across newspapers, forums, social media, and other outlets, gathering and parsing all of it takes a significant amount of time and resources.

There is some hope in machine learning. At the end of the day, HTML is used to create websites that are readable by people. While there are many ways of building the same thing, people can still use a website after it undergoes a redesign, and the coding differences from one website to another are never too large. This suggests that some type of ML approach is possible.

We have already done something along these lines. One of our solutions, Next-Gen Residential Proxies, brings together data acquisition and AI. Setting aside its other features, one of our greatest accomplishments so far has been adaptive parsing.

Before getting into how we created the first versions of the adaptive parser, I have to mention that we didn't do it all on our own. We didn't even have much in-house experience in machine learning and artificial intelligence; we only understood the possibilities. So we gathered machine learning experts from around the world, including both academic researchers and practitioners. Our AI advisory board helped us develop the solution and, in turn, advance the adaptive parsing feature, which can be used to acquire structured data from any ecommerce product page. For a visual representation, watch this YouTube video.

We fed the required data into supervised machine learning models. The process turned out to be less complex and challenging than we had first imagined.
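
To give a rough sense of the supervised idea, and nothing more than that (this toy sketch is not our production model; the features, labels, and data are invented), one can treat each HTML element as a sample, describe it with simple features, and train a classifier to decide whether it is a title, a price, or something else:

# A toy illustration of the supervised idea, not an actual production model:
# each HTML element becomes a sample with simple hand-crafted features,
# and a classifier learns to label it as "title", "price", or "other".
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-labeled elements from a handful of product pages (hypothetical data).
train_features = [
    {"tag": "h1", "class": "product-title", "has_currency": 0, "text_len": 14},
    {"tag": "span", "class": "price", "has_currency": 1, "text_len": 6},
    {"tag": "div", "class": "footer", "has_currency": 0, "text_len": 120},
    {"tag": "h1", "class": "name", "has_currency": 0, "text_len": 22},
    {"tag": "b", "class": "amount", "has_currency": 1, "text_len": 7},
]
train_labels = ["title", "price", "other", "title", "price"]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_features, train_labels)

# An unseen element from a new, differently coded site.
print(model.predict([{"tag": "h2", "class": "pdp-name",
                      "has_currency": 0, "text_len": 18}]))

The appeal of this direction is that the model generalizes across markup it has never seen, instead of relying on selectors written for one specific layout.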

However, putting together a large enough training dataset with labeled fields is very labor-intensive (or expensive). Even for larger businesses that can support in-house development teams and dedicated machine learning experts, I would keep an eye out for a suitable service provider. Most of the time, that decision is best made by weighing the pricing model and features offered by the third-party scraping service.

Conclusion

Web data parsing is an extremely labor-intensive process, yet it is absolutely necessary for acquiring usable information. HTML parsing comes with its own fair share of unique issues. While creating one HTML parser may not be challenging, any large-scale web data acquisition process might require dozens of them.

Therefore, the approach to parsing is changing. Maintaining an in-house web scraping and parsing solution is no longer a necessity: a new breed of AI-based solutions promises to truly deliver data as a service. As parsing advances, new, previously unseen data analysis use cases will appear. Even now, gleaning insights for even the most granular cases takes only a few requests.
