The company launched diverse training data sets for natural language processing initiatives.
Training data provider Appen has launched newly developed, diverse training data sets for natural language processing initiatives in an effort to ensure end users receive the same experience, regardless of language variety, dialect, ethnolect, accent, race or gender.
Appen said AI projects that are based on biased or incomplete data don’t work for everyone. The company is helping organizations launch, update and operate unbiased AI models through a variety of projects and partnerships focused on the diversity of languages and dialects, it announced on its website.
In March, a study published in Proceedings of the National Academy of Sciences found that popular automated speech-recognition systems, used for virtual assistants, closed captioning, hands-free computing and more, “exhibit significant racial disparities in performance.”
The report concludes “that more diverse training datasets are needed to reduce these performance differences and ensure speech recognition technology is inclusive. Language interpretation and natural language processing systems suffer from the same challenge and require the same solution.”
“The quality and diversity of training data directly impacts the performance and bias present in AI models,” said Mark Brayan, CEO at Appen, in a press release. “As a data partner, we can supply complete training data for many use cases to ensure AI models work for everyone. It’s critical that we engage a diverse group of individuals to produce, label, and validate the data to ensure the model being trained is not only equitable, but also built responsibly.”
With the goal of creating AI for everyone, Appen has developed a variety of projects and partnerships that focus on the diversity of languages and dialects.
The Appen website cited several examples:
- Translators without Borders partnership: “Appen, in partnership with TWB, Amazon, Carnegie Mellon University, Facebook, Google, Johns Hopkins University, Microsoft, and Translated joined the Translation Initiative for COVID-19 (TICO-19), which supported the development of language technology to make COVID-19 information available in as many languages as possible, including languages in developing countries like Congolese Swahili, Tigrinya, and Nigerian Fulfulde.”
- The Inuktitut translation project: “In collaboration with the Government of Nunavut, Microsoft added Inuktitut, an Indigenous language in North America spoken in the Canadian Arctic, to Microsoft Translator, using Appen services.”
- The Canadian French translation project: “Appen coordinated with native language consultants to help Microsoft add ‘Canadian French’ as a language option in Microsoft Translator.”
- African American Vernacular English off-the-shelf data sets: “Most existing training datasets used in ASR, search engines, voice assistants and sentiment analysis are not representative of AAVE. To make high-quality AAVE data available, Appen is working with AAVE speakers among its crowd of annotators to collect data for an OTS dataset based on conversations about a broad range of topics.”
Even when no one sets out to cause harm, biased AI data can set off a wave of information that is not only of no value to research but can actually be detrimental.
“Biased AI data leads to projects that can fail to deliver the expected business results and harm individuals they are supposed to benefit,” said Dr. Judith Bishop, senior director of AI specialists at Appen. “The scale and complexity of AI projects makes it impossible for most companies to acquire sufficient unbiased high-quality data without partnering with an AI data expert.” She added, “Developing the most diverse and expert crowd of data annotators provides the industry with a clearly differentiated resource for building fair and ethical AI projects.”