Back in 2020, when the COVID-19 pandemic was in its earliest, scariest stages, researchers found that localized search trends could predict outbreaks more accurately and quickly than other measures. User-generated inputs on the internet provided access to useful and actionable insights. These predictions were drawn from Google Trends and similar search-engine-based tools. However, there is an even more powerful asset that may help us uncover such invaluable insights: web scraping. In short, it’s the process of automatically gathering public data from all over the internet.
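In its simplest form, web scraping means programmatically fetching a page and pulling structured facts out of its HTML. The toy sketch below (Python, standard library only; the HTML snippet and the "trend" class name are invented for illustration) shows the extraction half of that process, parsing a fragment of markup the way a scraper would after downloading it:

```python
from html.parser import HTMLParser

# Invented sample markup standing in for a fetched page; a real scraper
# would first download this with an HTTP client (respecting robots.txt
# and the site's terms of service).
PAGE = """
<ul>
  <li class="trend">covid symptoms</li>
  <li class="trend">loss of smell</li>
  <li class="other">cat videos</li>
</ul>
"""

class TrendExtractor(HTMLParser):
    """Collects the text of every <li class="trend"> element."""
    def __init__(self):
        super().__init__()
        self.in_trend = False
        self.trends = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "li" and ("class", "trend") in attrs:
            self.in_trend = True

    def handle_data(self, data):
        if self.in_trend and data.strip():
            self.trends.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_trend = False

parser = TrendExtractor()
parser.feed(PAGE)
print(parser.trends)  # → ['covid symptoms', 'loss of smell']
```

Production scrapers typically layer crawling, scheduling, and anti-blocking logic on top of this core loop, but the fetch-then-extract pattern stays the same.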
Web scraping has typically been reserved mostly for business. Scientists have acquired research data with web scraping here and there, but widespread adoption is lacking. Yet, web scraping could be immensely beneficial to both scientists and policymakers.
An Issue of Perception
As businesses are always looking for ways to innovate, it’s no surprise that web scraping has taken hold. Gaining access to external data, especially nowadays, is beneficial to any industry. Accurate data, created by consumers, allows businesses to gain a better understanding of market needs.
Moreover, the business world has another advantage – it spends its own money. Since the end goal of any business is to turn a profit, ventures with an obvious path to boosting ROI don’t take long to be adopted. It’s a little different when public money gets involved.
Usually, there’s a lot more convincing to be done when public money has to be spent. That convincing gets exponentially harder when the technology to be deployed is new and difficult to explain to a layperson, and when the process of web scraping itself lacks legal clarity.
The greatest hurdle, I believe, is the above-mentioned lack of legal clarity with regard to web scraping. While those who are in the industry can clearly see the legitimacy of web scraping as long as it is done properly, a lack of universal industry regulation surrounding the process may make it seem too burdensome to outsiders.
Unfortunately, it takes a considerable amount of effort to convince someone otherwise. At my company, we have been pushing for more transparency and creating ethical guidelines, but that can only go so far.
Web Scraping for Science
Science rests on the notion that statements about the world can be tested and verified. Some sciences can directly test their hypotheses through experiments. Others, like economics, would struggle to create proper experiments and instead use data acquired from the field.
Web scraping isn’t likely to help the physical sciences perform experiments. But for social sciences where data and analysis are kings, web scraping is a tool that scientists have been waiting for all this time.
Google Ngram Viewer is a good example since it’s essentially web scraping for books. A simple search reveals how often a particular word has been used over time since the 19th century, with links to all the books found. Ngram Viewer enables a kind of study of language that simply wasn’t possible until recently.
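Conceptually, what a tool like Ngram Viewer computes is a word’s relative frequency per year across a digitized corpus. The sketch below uses a tiny invented corpus (three one-line "books") rather than real Ngram data, but the calculation is the same idea in miniature:

```python
# Tiny invented corpus: year -> text, standing in for digitized books.
corpus = {
    1900: "the telegraph carried the news and the telegraph ruled",
    1950: "the television carried the news across the nation",
    2000: "the internet carried the news and the internet won",
}

def relative_frequency(word, text):
    """Occurrences of `word` as a fraction of all words in `text`."""
    words = text.split()
    return words.count(word) / len(words)

# Track how "telegraph" rises and falls across the corpus years.
trend = {year: relative_frequency("telegraph", text)
         for year, text in corpus.items()}
print(trend)  # nonzero in 1900, zero afterwards
```

The real service works over hundreds of billions of words and normalizes per corpus year, but the output – a frequency curve over time – is the same shape.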
The humanities in general have much to gain from web scraping. Research is often heavily limited by the amount of available data and the inability to search within it. Scraping solves both problems, as long as the sources have been digitized.
However, web scraping can take science even further. Economists have been giving it serious thought for some time now, as evidenced by Benjamin Edelman’s article “Using Internet Data for Economic Research” in the Journal of Economic Perspectives, in which he explicitly discusses scraping data.
The possibilities are endless. Urban economists can gauge the prosperity of cities and regions by scraping data about available restaurants, bars, and other entertainment venues, using month-over-month changes in review counts as a signal of economic activity.
Macroeconomists can use data from large online retailers or e-commerce websites to track price movements across a particular country. Behavioral economists can gather data from marketplaces that sell the same item under different conditions (e.g., a higher product price but lower shipping costs) to estimate the impact particular factors have on an irrational actor.
These examples are not even exhaustive for the study of economics. Other social sciences (e.g., sociology) have even more to gain from web scraping.
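As a toy illustration of the marketplace example above (all listing data invented), the downstream analysis of scraped listings might compare how sellers split the same purchase between sticker price and shipping – the gap between the price-only ranking and the total-cost ranking is exactly the framing effect a behavioral economist would study:

```python
# Invented listings for one identical item, scraped from three sellers.
listings = [
    {"seller": "A", "price": 20.00, "shipping": 6.00},  # cheap sticker, costly shipping
    {"seller": "B", "price": 24.00, "shipping": 0.00},  # "free shipping" framing
    {"seller": "C", "price": 21.00, "shipping": 4.00},
]

def total_cost(listing):
    return listing["price"] + listing["shipping"]

# Ranking a buyer anchored on sticker price would see:
by_sticker = [l["seller"] for l in sorted(listings, key=lambda l: l["price"])]

# Ranking a fully rational buyer (total out-of-pocket cost) would see:
by_total = [l["seller"] for l in sorted(listings, key=total_cost)]

print(by_sticker)  # → ['A', 'C', 'B']
print(by_total)    # → ['B', 'C', 'A']
```

The two orderings are exact opposites here by construction; with real scraped data, the size and direction of that divergence is the measurable quantity.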
Web Scraping for Policy
Information scattered across the web can reveal interesting insights, to say the least. Cybersecurity companies, for example, routinely use web scraping to detect a wide range of crimes (such as online copyright infringement). Web scraping can also be used to inform the public and governments about systematic abuse. The U.S. Center for Public Integrity used web scraping to create “Copy, Paste, Legislate” – a tool that uncovers where influential businesses and special interest groups may be pressuring lawmakers to introduce unfair legislation. A similar project, Manolo, tracks lobbyists who visit Peru’s governmental entities.
Web scraping can also surface specific inequalities that might otherwise remain completely invisible. The COVID Tracking Project, for instance, has provided data showing that the virus disproportionately affects people of color due to social inequality.
Web scraping can be used to enhance democratic processes by delivering public data about the effectiveness of social programs, possible cases of corruption or extremism, and numerous other important social and political issues.
It can even help enforce legislation more effectively. That was the goal of our GovTech Lab challenge project: to provide law enforcement with a way to automatically receive alerts whenever illegal explicit content is published in Lithuanian cyberspace.
Finally, it may even be used to reveal injustices and legal inefficiencies, such as when Detroit homeowners were overcharged by hundreds of millions of dollars in property taxes before the city went bankrupt.
Conclusion
Web scraping holds immense power to help the world progress in science, business, and policy. However, many hurdles currently stop people from fully using that power to do good. We can only call for clearer and more uniform legitimization of web scraping, in the hope that more people will gain access to it.