Influential websites blocking AI from their data

New data from Originality.AI, a maker of AI content detection software, suggests that up to 20 percent of the top 1000 globally ranked websites are blocking AI crawlers, preventing their content from being used in machine learning.

This makes using their content to train AI models such as ChatGPT or Midjourney extremely difficult, if not impossible.

Regulatory failings

The lack of clear and concise legal measures regarding the use of data, content and any other intellectual property in the training of AI models is probably the most significant factor in this case.

The lack of compensation for the commercial use of data to train an AI model could be seen as the most pressing issue for these websites.

Further to this, the misuse or misappropriation of data and content may be seen as an aggravating factor.

The theft of intellectual property has also been cited in recent months, with many artists complaining that their style and other aspects of their work are being used without any form of compensation.

The numbers

OpenAI’s GPTBot crawler, which scrapes data from the internet to train and enhance the company's flagship ChatGPT product, was introduced in August.

It met significant pushback right out of the gate, with numerous high-profile sites including the New York Times, Reuters, and CNN blocking it almost immediately.

Many more websites followed suit shortly after, so much so that in the single week between August 22nd and 29th, the share of the top 1000 globally ranked websites blocking the bot rose by a hefty 3 percent.

Today, the largest websites blocking the bot, including Amazon and Quora, deny it access to the data of potentially hundreds of millions of individuals.

A similar percentage of websites block other crawlers, making this an industry-wide issue.

AI development disregard

Although these websites have explicitly instructed crawling bots not to scrape their data, the lack of regulation means that these instructions cannot be enforced, so many bots may simply disregard them.

Tech conglomerates, most notably Google and Microsoft, regard the work of their data crawling bots as fair use, despite objections raised by many copyright and intellectual property holders since the very beginning of the AI surge.
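To illustrate why such instructions rely on goodwill, here is a minimal sketch that assumes the instructions take the form of a robots.txt file, the usual convention for addressing crawlers. The file contents, the `is_allowed` helper, the "SomeOtherBot" name and the example.com address are hypothetical and used for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not taken from any real site):
# GPTBot is told to stay away from everything, all other crawlers are allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A well-behaved crawler runs a check like this before requesting a page.
gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/articles/1")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/articles/1")
print(gptbot_allowed, other_allowed)
```

The check runs on the crawler's side. Nothing in the file itself stops a bot that never performs it, which is why the instructions amount to a request and not a barrier.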

The way forward

Alternatives to this hostile environment have already been proposed, with some media companies in the early stages of negotiating agreements to license data to AI firms in exchange for compensation.

Time will tell whether this method will prevail, as there is still a massive hole where AI regulation should be.

Actions toward a well-tailored legislative framework surrounding AI may eventually come in the form of case law developed when legal action is taken against AI companies that use data without prior authorisation.

A snail-paced balancing act

Regulators across the world have been extremely reluctant to devise a framework to control AI development, with many worrying that legislation that is not conducive to progress will greatly hinder the culturally altering phenomena stemming from generative AI.

The lucrative nature of the AI industry is also a factor of which administrative bodies remain painfully aware. In many ways, AI could generate a multitude of financial opportunities, both privately and publicly.

On the other hand, the full risks AI holds are not known due to its incredible rate of development. The infringement of intellectual property rights is most certainly one issue among many that will persist if left unattended.

The slow pace of regulatory action may actually be the cause of this unfortunate balancing act. Many have suggested that the development of regulatory guardrails would be beneficial not only to mitigate harm but also to ethically direct AI’s continued development.
