It's a lazy summer's day and you are lounging at home. You pick up your iPhone; the screen automatically identifies your face and unlocks in milliseconds. You browse absentmindedly, flicking through content. As you check out the latest TikTok dances, your toes halfheartedly mimicking the movements, the muggy air overwhelms you. "Alexa, living room fan ON". The fan above your head immediately activates; you barely give it a second thought.
Thousands of miles away, a young man in Mumbai, India sits in a dimly lit and crowded warehouse. He spends eight or more hours a day annotating and labeling images of faces, ears, noses and eyes, squeezed between his colleagues doing the same. His eyesight is often blurry and his eyes red from staring intently at the screen. At the end of his shift, he will have made less than USD 2 (INR 160), barely covering the cost of his food and the 40-mile commute back home.
Similarly, in New York, a young college sophomore in need of some extra cash sits on her bed and logs into one of the many digital micro-tasking platforms. Like her "colleague" in India, she spends endless, tiring hours generating English speech data; unlike him, she gets to work from home. On a good day she might make USD 6.50 per hour, less than the already insubstantial minimum wage.
As users, we rarely look behind the curtain to see the mechanics of how we benefit from these luxuries at such a low cost. These two workers, one in India and one in the US, are part of the same AI training dataset generation gig-economy. Unfortunately, neither one of them is respected for their contributions, compensated fairly, or given the opportunity to make digital work a full-time sustainable career choice.
Too often, when people speak about unethical AI practices, they refer largely to technology biases or the harvesting and use of personal data. This narrative often ignores the most fundamental aspect of all AI technologies: the datasets generated and used to train models. By ignoring how our technologies come to be, we are actively turning away from the plight of hundreds of thousands of data workers who can barely make a living thanks to widespread unethical employment practices.
Exposing unethical "data factories"
Since the International Labor Organization's (ILO) 2018 report highlighted the uncertain and exploitative trajectory of digital work, numerous stories of digital exploitation have surfaced. One of the first exposés on the data generation industry was published by The Atlantic, which put the spotlight on Amazon's Mechanical Turk (AMT) platform, a commonly used freelance data generation platform that was found to pay its workers significantly less than the claimed USD 7 per hour; most freelancers reported making an average of USD 2 per hour (Semuels, 2018).
Most recently, Facebook was called out for its partnership with Samasource, a microtasking platform that outsourced its tasks to Kenyans, an arrangement that allowed for significantly lower wages and centers operated with little to no regulatory oversight (Perrigo, 2022). Samasource, however, did not begin with this nefarious intention. Starting as a non-profit organization, Samasource changed its pricing models and strategy over the course of its growth to remain market-competitive. Even after the recent exposé, Sama has continued with the same practices to pad its bottom line. It is not the first company to have been exposed yet carry on with its modus operandi. What will it take for these companies to stop? Does public shaming have any real impact on our ubiquitous, unrelenting, capitalistic economy and mindset?
Research on the growth and value of the training data industry shows no reason for the chronic underpayment and exploitation of data factory workers. Most of the technologies we use today have a component of AI or ML in their backend: voice assistants, maps, even your CAPTCHAs. The lure of these technologies is in their interactivity, responsiveness, and "smart" features. These cannot exist without training AI models on large-scale, high-quality datasets. Recent market reports predict the worldwide data industry will grow to USD 8,607 million by 2030, with a compound annual growth rate of 22.2%, an almost 5x increase from today's valuation (Research and Markets, 2022).
This means there are no excuses for unfair wages. Yet median hourly wages were around USD 2 per hour (Berg et al., 2018). This is drastically low compared to the price of data when eventually sold to technology companies: these datasets are often valued at an average of USD 60-140 per hour of data. This disparity means only one thing: as workers continue to slave away in front of computers for minimal pay and without long-term employment security, technology companies profit handsomely from these unethical practices. Most workers attracted to this type of work are in dire need of supplementary income. This means that despite low remuneration, workers continue to choose this type of work to keep their livelihoods afloat (Guess, 2018). The newness and invisibility of digital micro-tasking work mean there are currently no unions, legislation or regulations for workers to turn to; unethical employment practices are rampant and frequently go unchecked (Semuels, 2018; Inevitable Human, 2018).
The world needs to change, and quickly. Data workers are poised to become unintentional members of the world's newest sweatshops. It is unacceptable that only a small percentage of revenues goes to the actual dataset generators. It is unacceptable that every time we use our technologies, we are actively contributing to the exploitation of hundreds of thousands of data workers. As we move more and more towards a world where technology is at the center of our lives, we must act quickly to alter the course of this injustice and exploitation before we create a world where technologies financially cripple their workers.
Cooperatives - the new future of data?
Cooperatives have long been seen by international development practitioners as a sustainable way for economically disadvantaged individuals to move out of poverty. Farming and agriculture cooperatives can frequently be seen across India, Africa and South-East Asia. Notable cooperatives such as Amul Dairy in India, Land O'Lakes in the US and Rabobank Group in the Netherlands have changed the landscape of their respective industries and the lives of their contributors. Cooperatives are community-owned and operated businesses, ensuring prosperity and empowerment for all workers involved. Together, cooperative workers pool resources and ideas to work on commonly identified goals for their community's success. Cooperative communities work together to find collaborative and contextual solutions to their economic or social needs. Research on the effectiveness of agriculture and healthcare cooperatives reports significant impacts on core socio-economic outcomes such as poverty alleviation, improved healthcare through financial mitigation, and the creation of resilient societies (Dave, 2021; Sizya, 2001; Aazami & Panah, 2016; Yang & Stanley, 2017).
Cooperatives and collectives are one under-tested way of addressing the problem of unethical data practices. Imagine a world where you can be confident that the apps you use aren't causing suffering and indentured employment. A world where the industry of AI training data generation can be as lucrative for workers as it is for companies. A world where digital work can be a sustainable, impactful and fulfilling career. To do this, we must find a way to overturn the currently exploitative culture of dataset generation, creating an expectation and ecosystem in which ethical data practices can thrive.
Data cooperatives, however, are not a new idea. A 2021 Stanford article highlights the great power imbalance between data producers and the companies that profit from this data. Here, the idea is to give data producers (anyone who uses the internet) the opportunity to own and manage their own personal data. These scholars propose using data cooperatives as "intermediary fiduciaries" that negotiate with data companies to establish shared guidelines and protocols on the use of the data (Miller, 2021). Julian Tait, who co-founded Open Data Manchester in 2010, echoes this position; the organization helps people, organizations and communities work together to ensure fair and responsible data usage (Tait, 2021). In just the past 5 years, interest in shared data ownership has skyrocketed. Organizations such as Driver's Seat, FairBnB and Data Worker's Union have emerged to consolidate power and bring true data ownership to their members. These organizations understand the value and the power of the data they generate, ensuring that when people's data is used, they are compensated fairly and the data goes towards developing technologies that empower their communities (Mehta, Dawande & Mukherjee, 2021).
Cooperative data, however, does not refer only to the ownership of personal data; tying this idea to the reality of digital factories is crucial. If we are finally understanding the need to pay people for their information, it should be non-negotiable that people who generate data for a living should, first, be fairly compensated for their work and, second, become the owners of the data they produce.
Karya - a cooperative alternative
Bengaluru - Microsoft Research India, 2018. Manu Chopra and Vivek Seshadri need Indian language speech data for their new AI model. As they research data companies that offer dataset generation services, they notice a large disparity between the sale price of the datasets and the cost of production. This spurs two questions: What if data collection tasks could be a source of fair wages for India's vast underemployed and economically disadvantaged communities? How can digital work be used to transform livelihoods and societies?
Over the next four years, Manu and Vivek conducted research on the viability of producing ethical datasets that offer fair compensation, dignified workplaces, and knock-on social change. Most notably, their research yielded two findings. First, data collected from low-income participants is of comparable quality to the data collected from university students. Second, participants in these studies indicated enthusiasm for the work and were grateful for the opportunity to easily earn more than through physical labor (Abraham et al., 2020; Chopra et al., 2019). These findings indicated the potential for using crowdsourcing as a viable mechanism for collecting speech data from low-income workers and spurred the creation of Karya Inc.
Karya is the world's first ethical data cooperative to bring dignified, digital work to economically disadvantaged Indians, giving them a pathway out of poverty. Our model engages workers in tasks related to speech dataset generation and image annotation, with a plan to expand to higher-skilled, more lucrative tasks in the coming year. Unlike current India- and Africa-based digital micro-tasking platforms, Karya's user-friendly application allows a fully work-from-anywhere model.
Karya's vision is a world where ethical data generation practices are the standard and not an exception. We plan to create this ethical ecosystem in four ways:
First, Karya uses a social enterprise model to prioritize worker profits. Unlike for-profit data collection counterparts, our commitment to ethical data starts with compensation: in India, where the minimum wage is USD 0.28 (INR 22.25) per hour, the average Karya worker makes USD 4.40 (INR 350) per hour of work.
Second, Karya developed a novel Public Data License to upend the current data-ownership silos. This gives workers perpetual ownership of the data they generate, allowing them to continuously profit from its sale.
Third, Karya is currently pursuing avenues to upskill its workers through a learn-to-earn model. By turning education content into digital micro-tasks, workers are incentivized to learn new skills (such as natural farming and entrepreneurship) through completing paid tasks.
Fourth, Karya realizes that an ethical data ecosystem cannot be created alone. In the spirit of inclusion and awareness, Karya plans to develop the world's first ethical data collective.
Despite these ambitions, myriad challenges, roadblocks and fundamental questions present themselves along the journey to ethical data generation. The following lessons lay out a roadmap for tackling the issue of ethical data production:
Creating sustainable employment opportunities through digital work is the antithesis of how current dataset generation companies work. Most companies, including Karya, focus on data that is generated through low-skilled micro-tasks. While these offer opportunities for quick and easy work, they do not offer much beyond supplementary income. Karya's ultimate goal of sustainably moving people out of poverty can only be achieved in two ways: if we are able to create increasingly complex digital workstreams that can guarantee 15-20 years of employment, or if we are able to adequately skill people through micro-tasks so that they can enter other industries. Although the question of how remains unanswered, this is one of Karya's key research and development priorities for 2023.
Demand for ethically generated data can only arise through collective action and a desire to change. One of Karya's key challenges is encouraging industry leaders to see the importance of sourcing ethical data. Although Karya offers companies comparatively lower prices while maintaining data quality and ethical wages, this is not necessarily a motivating factor for technology companies. To truly create an ethical data ecosystem, technology companies must err on the side of social responsibility when sourcing their data and actively move away from suppliers with unethical practices. This is a huge shift in mentality for data users. Unfortunately, the future of ethical data is predominantly in the hands of companies that have historically encouraged unethical practices.
What does the future hold?
Unethical data generation practices are the unseen eyesore of our technological advancements, secretly capitalizing off the world's most vulnerable individuals to pad the pockets of those who already have staggering wealth. A world where workers are fairly compensated for their data generation work is about more than the stamp of ethical data; it's about paving the way for digital work to be a viable source of dignified income.
Thinking differently about data doesn't have to be only in the hands of the Jeff Bezoses of the world; it can come from us, the consumers. As frequent users and beneficiaries of data, it is our duty to ring the alarm bells and call out our favorite companies. Fair wages and humane working conditions should not be relegated only to physical labor, such as clothes manufacturing or warehouse work. Rather, these principles should extend to all work, whether digital or physical, high or low paid. Ethical work ultimately means creating societies where humans are valued and put first, before profits or efficiency.
We call on each of you - technology users - to help upend the current data ecosystem. As customers and consumers, we have the power and the right to demand more. The uncomfortable fact of unethical data practices is not only the fault of large technology companies - it is also our continuous oversight and lack of action and awareness that exacerbate this state of affairs. Individuals who care - the time is now to talk back, encourage companies to support ethical models like Karya, and work as a collective towards action. Let's create ethical data - together!