The Origins of Data for AI Development

Artificial Intelligence (AI) is fundamentally dependent on data. Training algorithms requires vast amounts of information, and the quality and nature of that data directly shape the outcomes AI models produce. Yet a significant issue persists: many AI developers and researchers have little insight into where the data they use comes from. Data collection practices in AI remain notably underdeveloped, especially when contrasted with the sophisticated methodologies employed in AI model development itself.
Data Provenance Initiative
To address these concerns, the Data Provenance Initiative was formed, comprising over 50 researchers from both academia and industry. Their primary objective was to answer a crucial question: Where does the data used to build AI originate? They conducted a thorough audit of nearly 4,000 public datasets, encompassing over 600 languages and spanning 67 countries over three decades. The investigation revealed that the data was sourced from 800 unique locations and nearly 700 organizations.
Findings
The findings, which were shared exclusively with MIT Technology Review, indicate a troubling trend: the data practices within AI are increasingly favoring a few dominant technology companies, leading to a consolidation of power and resources. Shayne Longpre, a researcher at MIT involved in the project, notes that in the early 2010s, data sets were derived from a diverse array of sources, including encyclopedias, web pages, parliamentary transcripts, earnings calls, and even weather reports. At that time, data was curated and collected with specific tasks in mind.
Shift in Data Collection
The introduction of transformer models in 2017 revolutionized the landscape of AI, shifting focus towards larger models and data sets. Today, a substantial portion of AI data sets is created through indiscriminate scraping from the internet, resulting in a widening gap between curated datasets and those simply aggregated from online sources. Longpre emphasizes that in foundation model development, the scale and diversity of data sourced from the web are of paramount importance.
Rise of Multimodal Models
Recent years have also witnessed the emergence of multimodal generative AI models capable of generating various forms of media, including videos and images. Similar to large language models, these systems require extensive amounts of data, with YouTube serving as a primary source. Astonishingly, over 70% of data utilized for both speech and image datasets stems from this singular platform, raising concerns about the concentration of power within a single company—Alphabet, Google’s parent organization.
Implications of Data Concentration
This concentration of data not only grants Alphabet significant control over an essential resource but also raises questions regarding how the company will share this data with competitors. Sarah Myers West, co-executive director at the AI Now Institute, stresses the importance of viewing data as a product of specific processes rather than a naturally occurring resource. The datasets that underpin AI systems often reflect the objectives and designs of profit-driven corporations, influencing the infrastructures of our world in ways that align with corporate interests.
Representation in Data
This monoculture in data sourcing brings forth critical questions about the accuracy of human experiences represented in AI training data. Sara Hooker, vice president of research at Cohere and a participant in the Data Provenance Initiative, argues that the data sourced from platforms like YouTube may not capture the full spectrum of human nuances and experiences. Videos uploaded often target specific audiences, which can limit the breadth of representation in the data.
Hidden Restrictions
Another issue arises from the opaque nature of data usage in AI. Companies frequently do not disclose the data sources they use to train their models, primarily to protect their competitive edge. The researchers found that datasets often come with restrictive licenses that limit how they may be used, making it challenging for developers to select appropriate data and to ensure that they do not inadvertently use copyrighted material.
Exclusive Data Deals
In recent years, major companies like OpenAI and Google have entered into exclusive data-sharing agreements with publishers and social media platforms. This trend exacerbates the concentration of power, partitioning the internet into zones of access and limiting opportunities for smaller companies, researchers, and nonprofits to obtain the data they need for their projects.
Global Skew in Data
Furthermore, the data used to train AI models is disproportionately skewed towards Western regions. The Data Provenance Initiative’s analysis revealed that over 90% of the datasets originate from Europe and North America, with less than 4% sourced from Africa. This bias reflects a narrow view of global culture and experiences, leading to AI models that may misrepresent or overlook non-Western perspectives.
In conclusion, the sources of data for AI development are complex and fraught with challenges. As the reliance on AI continues to grow, it becomes imperative for the tech community to address the issues of data provenance, representation, and equitable access. Ensuring a diverse and comprehensive dataset is essential for creating AI that truly reflects the richness of human experience and fosters a more inclusive technological future.