Bard, ChatGPT race: whose data drives today’s AI?

With AI innovations like ChatGPT and Bard making headlines, it’s important to consider their global impact – particularly for communities that don’t have large datasets online to train AI models.

Google's recent announcement of its new chatbot, "Bard," has sparked excitement in the technology community, where it is seen as a potential rival to OpenAI's chatbot, ChatGPT. Media reports cited pressure to compete with OpenAI as a key factor in the timing of Google’s announcement, after ChatGPT’s viral success over the past few months.

But the competition facing Google is still heating up. Microsoft, a significant investor in OpenAI, just announced a “limited preview” of a new ChatGPT-enabled Bing search feature. As conversational AI becomes more ingrained in the search tools that many people use online, it’s important to consider the wider implications of this competition between tech giants.

In particular, the fairness and inclusivity of these systems are determined by the inclusivity of their data. That leaves the question: whose data is included and who will be left out? How can we ensure that the benefits of AI are distributed equitably among all members of society?

AI models are often driven by North American data

From “old” algorithms like regression and decision tree models to “modern” deep learning models like BERT and ChatGPT, most of the machine learning models we encounter rely on patterns in training data to generate output. We can alter the math in their algorithms to reflect our values to some extent, but their parameters and effectiveness are still determined by data.

Discussions around fairness in machine learning, then, are really discussions about the fairness of data. Data is a product of tech infrastructure, which is unfairly distributed globally – leaving the Global North with the most data to work with for training AI models. The Global South, on the other hand, has less infrastructure to generate the data needed for machine learning, meaning it can be difficult to create AI models based on the language and culture of these communities.

In turn, the output of large language models risks replicating the biases inherent in their data – particularly in language models that are trained with large datasets scraped from the internet. AI bias in these cases tends to reflect the views of a white, English-speaking, and male audience. Machine learning benchmarks are also skewed towards North American values and standards for success, further incentivising AI firms to build models that comply with North American culture – and leaving out the perspectives of populations that don’t have the same advantages.

This data bias creates a risk for the Global South: the algorithms that we increasingly rely on will further entrench power dynamics on a global scale. In order to prevent this, more needs to be done to gather data from communities without the same tech infrastructure as North America. But another hurdle in achieving this is aligning investment dollars with AI ethics.

AI investment is unevenly distributed

North America also has the largest investments in AI. In 2021, the US invested $52.88B in AI, more than any other country in the world – China was in second place at $17.21B. Greater investment in AI means greater power to decide what AI is used for and who benefits from it.

Often, these decisions are made with profit and engagement as primary goals; AI research tends to focus on performance rather than ethics. And even in cases where AI research begins with ethical interests at the forefront, the cost of the hardware and research needed to develop it tends to call for large investors looking to profit from it.

That means that in addition to incentivising AI that serves North Americans, there is an incentive to outsource difficult tasks such as content annotation and moderation to the lowest-paid workers. Recently, that has resulted in AI companies relying on underpaid gig workers in the Global South for these tasks, which further entrenches economic power dynamics as well as cultural ones.

Who benefits from AI?

The availability of AI datasets, funding, research, and tech infrastructure has an enormous impact on who benefits from it. In the current economic climate, it is difficult to produce AI that is completely free of bias, even for companies with the best intentions. That means that AI firms have a responsibility to make an active effort to reduce bias where they can and to help distribute the benefits of this powerful technology more evenly.

Only 6% of the world’s chatbots speak local languages. That is why part of Proto’s mission is to help ensure that AI supports citizens in the emerging world, using language data from clients in communities without large datasets online. Proto is also committed to paying workers equally with global colleagues and employs a diverse team of talent to help design our AI models. For example, Proto’s recent investment in Rwanda has created new jobs in Kigali.

Achieving fairness in AI data is a problem that requires sustained effort, collaboration, and insight from individuals and organisations around the world. Proto is making every effort it can to be part of the solution.

About Proto

Proto is the leading generative AICX platform for local languages. Its inclusive text and voice AI assistants excel at use cases for customer experience, consumer protection, employee experience, and indoor navigation. The Proto AICX Platform is powered by large language models and the proprietary ProtoAI engine for high accuracy in underserved languages. Proto deployments feature enterprise-grade security and capabilities such as on-premise hosting, customised analytics, and a 24/7 prompt engineering service. Headquartered in Waterloo, Canada, Proto is a global team operating across Latin America, Africa, and Asia.
