Artificial intelligence (AI) and machine learning (ML) models can solve well-specified problems, like automatically diagnosing disease or grading student essays, at scale. But applications of AI and ML for major social and scientific problems are often constrained by a lack of high-quality, publicly available data—the foundation on which AI and ML algorithms are built.
The Biden-Harris Administration should launch a multi-agency initiative to coordinate the academic, industry, and government research community to support the identification and development of datasets for applications of AI and ML in domain-specific, societally valuable contexts. The initiative would include activities like generating ideas for high-impact datasets, linking siloed data into larger and more useful datasets, making existing datasets easier to access, funding the creation of real-world testbeds for societally valuable AI and ML applications, and supporting public-private partnerships related to all of the above.
Open-source data challenges are a proven way to attract top researchers to develop ML models. For example, the 14-million-picture ImageNet dataset, first released in 2009, became the basis of a computer-vision challenge in which researchers competed to produce the best image-processing algorithm. But because assembling big datasets is a lengthy and expensive process, private-sector companies often have little incentive to share the big datasets they do create.
Funding the creation of big datasets and other open, shared AI resources is a powerful way for the federal government to drive AI and ML talent toward socially impactful and nationally strategic ends. Datasets can significantly accelerate progress in research related to (a) core AI technologies and techniques (e.g., computer vision, natural-language processing, and meta-learning); (b) applications of AI in science and engineering; and (c) applications of AI to societal problems. For instance, U.S. traffic and transportation data are currently dispersed across thousands of jurisdictions and companies. Integrating these data into a single accessible and responsibly managed dataset would help experts optimize freight routes, reduce transportation emissions, and anticipate supply-chain disruptions.<fn-sp>1<fn-sp> Other high-leverage domains for AI and ML include energy demand forecasting, medical diagnostics, and automated legal assistance.
There is little time to waste, as our nation’s technological competition with China intensifies. The Chinese government is investing significantly in applied AI via its Made in China 2025 plan, data-sharing partnerships with tech companies, and policy support.<fn-sp>2<fn-sp> As a result, many AI products (e.g., automated loan underwriting and facial recognition<fn-sp>3<fn-sp>) achieved widespread public adoption in China before reaching the market in the U.S., despite the fact that average citations for Chinese AI papers lag behind those for American AI research.<fn-sp>4<fn-sp> The new administration must close the data and deployment gap in order to shape global AI governance around our nation’s values.
The Trump administration’s American AI Initiative increased the budget for non-defense AI R&D from $1.118 billion in FY2020 to $1.503 billion in FY2021, including $868 million to the NSF, $125 million to the DOE, $100 million to the USDA, and $50 million to the NIH.<fn-sp>5<fn-sp> Furthermore, OSTP reports have cited the importance of shared datasets, compute, and testbeds.<fn-sp>6<fn-sp> However, work remains to be done in identifying the datasets and other shared resources most needed by researchers, so that funding can be directed toward its most impactful uses. We propose several concrete ideas for identifying and funding these shared AI resources.
The federal government should launch a multi-agency AI for Good initiative with a budget of at least $100 million per year, sourced from the National Science Foundation’s FY2021 budget for AI R&D. The initiative would fund opportunities for external research talent to work within and outside of the government to identify, create, and maintain shared datasets in domains of crucial public importance.
This initiative would be headed by the newly formed National Artificial Intelligence Initiative Office in the White House Office of Science and Technology Policy (OSTP) and operate in coordination with the NSF and the multi-agency Networking and Information Technology Research and Development Program (NITRD). It would support the following types of activities:
The federal government has previously funded multiple applied AI initiatives to achieve broad policy goals. Examples include:
Artificial intelligence is beginning to unlock a range of applications, from enabling social workers to identify at-risk youth<fn-sp>12<fn-sp> to helping cities anticipate their exposure to extreme climate events.<fn-sp>13<fn-sp> But scaling up these efforts and enabling many more applications requires greater access to data. A prerequisite for many more transformative applications of AI and ML – to pressing problems in fields like healthcare, energy, and education – will be shared datasets and infrastructure widely available to top researchers.
American innovation has flourished through a decentralized and complex ecosystem of companies and universities. The new administration should therefore closely collaborate with academia and the private sector to find the best ideas for datasets and other shared resources, construct or release these datasets, and create competitions and other mechanisms to encourage the development of applied solutions at scale. With this toolkit, the new administration could have high leverage against specific hard problems. For example, funding a dataset in a field of strategic national importance like energy would allow the U.S. government to define the problem, set the agenda for an entire field, then attract the most talented researchers and engineers to develop the best solutions at scale. A national AI for Good initiative could attract top talent and spur the development of solutions to some of our greatest national challenges.
Past federal data initiatives have largely not leveraged advances in AI and ML to build data products, partly due to a lack of domain knowledge, the right datasets, and financial resources.
By contrast, the AI for Good initiative aims to produce AI and ML solutions that can be implemented at scale and adopted by the communities who need them by:
The Climate Change AI research organization has suggested a range of high-leverage datasets.<fn-sp>15<fn-sp> Some that seem particularly appropriate for government funding include:
This initiative could catalyze similar intra-institutional efforts in other domains such as economics, healthcare, and education.
Additional resources that could accelerate progress in AI include (1) real-world testbeds for reinforcement learning, (2) open-source libraries, and (3) secure data-labeling tools.
Reinforcement learning is the study of a computer agent as it learns through interactions with the environment. Many researchers have used games to train agents. The federal government can fund more diverse and sophisticated testing environments that more directly relate to envisioned real-world applications. For example, researchers participating in the Autonomous Greenhouse Challenge co-sponsored by Wageningen University in the Netherlands developed algorithms that increased the productivity and sustainability of indoor agriculture.<fn-sp>16<fn-sp>
Open-source libraries expedite development of AI models by allowing researchers to import code for common tasks, such as preloading datasets, parallelizing machine learning, and logging progress.
Data-labeling tools are necessary to efficiently create big datasets that can be used to train AI and ML algorithms. The federal government could create tools tailored to specific domains. For instance, secure tools that respect HIPAA privacy will be needed to create big training datasets for medical applications.
The first step in building a pipeline from research to applied AI is to understand AI researchers’ existing interests and needs. Armed with this knowledge, the federal government can strategically determine what domains are most in need of support, what valuable data needs to be unlocked, and where grant opportunities will be most impactful. The federal government can help meet these needs by forging partnerships with varied members of the AI community (e.g., universities, companies, and policymakers).
Competitions and challenges (along the lines of Kaggle competitions<fn-sp>17<fn-sp>, SpaceNet<fn-sp>18<fn-sp>, or DARPA Prize Challenges<fn-sp>19<fn-sp>) hold great promise for driving researchers toward high-impact work in applied AI. To be most effective, competitions and challenges should focus on a practical AI problem chosen by domain experts. Participation incentives include monetary awards, computing resources, and recognition at top conferences. For federally sponsored competitions and challenges, winning models should be open-sourced, and winning teams should be connected with resources or partners to implement their solutions.
The AI for Good initiative is an opportunity to establish clear standards for ethical, secure, and equitable data science across multiple domains. The following general recommendations should underlie the initiative, with adaptations made as necessary by experts depending on the specific domain in question. In all cases, trust relies on the relevant bodies being transparent about how data is collected and governed.