Te Kete o Karaitiana Taiuru (Blog)

Treatment of Māori language in language modelling

This is my contribution as a critical Indigenous Researcher, along with many other international voices, to the Nature Journal’s article “Increasing the presence of Black, Indigenous, People of Color (BIPOC) researchers in computational science” regarding language revitalisation and Artificial Intelligence.

The Māori language was banned by native schools and other government-led assimilation practices in the late 19th century, so effectively that by 1980 fewer than 20% of Māori nationwide were native speakers, and within my own tribe we had 3 native speakers. Tribal dialects were also replaced with one standard version of Māori, influenced by the introduction of the written Bible and by ethnographers who chose to ignore the rich and diverse tribal dialects throughout our country in favor of a standard dialect. Although Large Language Models (LLMs) have the ability to revitalize dialects, this is unlikely to be a preference for Māori, because a dialect is a local, distinguishing treasure that is used by tribal members in physical meetings of cultural significance.

Community activism throughout the late 1970s and the establishment of language training for preschoolers and other educational facilities led to the Māori language being recognized as an official language in 1987. The key lesson learned by Māori was that the language had to be normalized in our lives and society, and then spoken by at least three generations within one family, in order to be preserved. We have now reached that lofty dream, but moving forward we need to address the controversial topic of LLMs that are widely used and that already incorporate our language.

The Māori language has a substantial body of digitized online resources, such as legal records, a parliamentary corpus, journals, newspapers, archives and audio-visual materials, along with tens of thousands of new words created to accommodate the translation of products such as Microsoft Office, Windows, and Google. Google had a Māori software engineer contribute to and help develop Google Translate for Māori. This has generated a huge amount of data that allows artificial intelligence (AI) to incorporate the Māori language and to use it relatively well. For instance, ChatGPT already speaks Māori at a reasonable level of accuracy. I estimate that in less than two years it will be as fluent and accurate as a Māori language expert.

A recent spike in people wanting to learn the Māori language has created a demand for teachers that far outweighs the supply. This has led to a large uptake of LLMs among learners of the Māori language, both as a supplementary tool and as an alternative method to learn Māori.

Still, many dangers exist. For example, if we provide AI technologies with our sacred rituals and esoteric knowledge, our language and history will not be our own and risk being commercialized and changed by international corporations. As Māori, we need to revisit our traditions, go back to our tribal lands and re-engage with elders and tribal members to learn the sacred aspect of our language and ensure that those aspects remain in our human world and in our tribal homes and lands. LLM developers can assist in this area by acknowledging copyright in source data and being transparent about data sources.

Other risks include misogynist ethnographers’ historical texts being used to train some LLMs, resulting in incorrect statements about Māori, written in the Māori language. This occurs predominantly with our creation stories and historical knowledge, where the Māori translation of the King James Bible is being offered as Māori creation stories and the LLMs are mixing and matching tribal stories to create new ones.

Another emerging risk is phishing attacks or identity theft, as spammers are using LLMs to correspond with Māori and it’s becoming more difficult to discern real from fake. Previously, Google Translate often made mistakes when translating, making fake messages easier to identify. LLM developers should consider how best to prevent their tools being used in phishing, bullying and other common forms of online scams. Moving forward, we hope that LLMs will be key to the long-term revitalization and normalization of the Māori language.

DISCLAIMER: This post is the personal opinion of Dr Karaitiana Taiuru and is not reflective of the opinions of any organisation that Dr Karaitiana Taiuru is a member of or associates with, unless explicitly stated otherwise.
