
Protecting Mātauranga on web sites from ChatGPT and Search Engines

This article is primarily for Māori web site content owners who have mātauranga Māori on their web sites that they do not want consumed by ChatGPT and other AI systems or search engines, or who simply want to exercise their Māori Data Sovereignty principles and rights as afforded by Te Tiriti o Waitangi.

As a content owner, there are a number of other issues you should be aware of, including where your web site is hosted and the licence and privacy agreements with your provider (or their providers). This article assumes prior knowledge of these issues.

ChatGPT

It’s relatively easy to disallow GPTBot from crawling your site if you don’t want OpenAI using your content: either add a new robots.txt file or add to an existing one.

Robots.txt is a plain text file used to tell bots what they are allowed or disallowed to crawl on a web site. The file is located in the root directory of your site. You can view it by adding “/robots.txt” after your site’s URL.
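For example, if your site were at https://example.com (a placeholder domain used here for illustration), its robots.txt would be visible at:

https://example.com/robots.txt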

You will need to access and edit the robots.txt file either via FTP or via the File Manager in your web hosting control panel.

The text below should be added to your existing robots.txt file; otherwise, create a new file called robots.txt and upload it to the root of your web site. No other code is required.

User-agent: GPTBot
Disallow: /

To allow GPTBot (and therefore ChatGPT) to access only selected parts of your site, you can add the following to your site’s robots.txt, where ‘directory-1’ is a folder GPTBot may crawl and ‘directory-2’ is a folder it may not, again ensuring the robots.txt file is uploaded to the root of your web site:

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/

Below is GPTBot’s full user-agent string:

“Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)”
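Note that robots.txt is voluntary: well-behaved crawlers such as GPTBot honour it, but it does not enforce anything. If you want your web server itself to refuse GPTBot requests, a minimal sketch for an Apache host (assuming you can use a .htaccess file and that the mod_rewrite module is enabled, which depends on your host) is:

# Return 403 Forbidden to any request whose user-agent mentions GPTBot
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
RewriteRule .* - [F,L]

Sites running Nginx or another server would need the equivalent rule in that server’s own configuration.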

Further and perhaps more up-to-date information can be found in OpenAI’s GPTBot documentation at https://openai.com/gptbot, and OpenAI has also published the IP address ranges that GPTBot crawls from.

Blocking all Search Engines

If you would prefer that your site not be crawled by any bots at all, put an asterisk (*) next to User-agent, which matches every bot, and a slash (/) next to Disallow, which covers the entire site. Be aware that this also means your site will effectively disappear from search engine results.

User-agent: *
Disallow: /
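If you cannot edit robots.txt at all (some hosted platforms do not expose it), a page-level alternative is the standard robots meta tag, placed inside the <head> of each page you want kept out of search results:

<meta name="robots" content="noindex, nofollow" />

A crawler has to fetch a page before it can see this tag, so it controls indexing rather than crawling.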

Blocking specific Search Engines

In the future, specific search engines will likely use their bots’ crawls of web sites to contribute content to various AI systems. If you wish to block only one bot and not all of them, add the following to your robots.txt.

Just replace Googlebot with the name of the bot you wish to block.

User-agent: Googlebot
Disallow: /
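Several bots can be blocked in the same robots.txt by repeating the pattern, one group per bot. As an illustration (not a complete list of AI-related crawlers), the entry below blocks GPTBot together with Common Crawl’s CCBot, whose archive is widely used as training data for AI models:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /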

National Library Web Harvest

Since 2008, the National Library of New Zealand has been harvesting web sites in the .nz space, with over 585 million URLs harvested since then. I made public comments in 2010 expressing concerns, as there are a number of Māori Data Sovereignty discussions to be had and a need for more meaningful engagement with Māori site owners.

More information is available at https://natlib.govt.nz/publishers-and-authors/web-harvesting/domain-harvest

If you are a content owner and have concerns, note that the crawler identifies itself with the following user-agent string, where [year] is the year of the harvest:

NLNZ_IAHarvester[year]
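Because the year suffix changes with each harvest, any robots.txt entry needs to name the current harvest’s user agent; the year 2023 below is an example only. Also check the National Library’s page above before relying on this, as legal deposit harvesters do not always honour robots.txt exclusions:

# 2023 is an example year only; substitute the current harvest's year
User-agent: NLNZ_IAHarvester2023
Disallow: /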

Non-technical resource to identify which country your web site is hosted in

Māori Data Sovereignty principles state that data should be hosted in New Zealand so that the mātauranga is protected by New Zealand laws, though sometimes this is not feasible.

If you are a content owner with a web site, and would like to know where your web site is actually hosted, please use the following online tool: https://www.who-hosts-this.com/
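If you are comfortable with a command line, a rough cross-check is to look up your domain’s IP address and then ask the registry who holds it; both commands below are standard on most systems, and example.co.nz is a placeholder for your own domain:

dig +short example.co.nz
# then query the returned IP address; the output includes the country of the address block
whois 203.0.113.10

(203.0.113.10 is a reserved documentation address, standing in for whatever IP the dig command returns.)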

DISCLAIMER: This post is the personal opinion of Dr Karaitiana Taiuru and is not reflective of the opinions of any organisation that Dr Karaitiana Taiuru is a member of or associates with, unless explicitly stated otherwise.
