Chockalingam Muthian

Google Dataset Search


Google’s goal has always been to organize the world’s information, and its first target was the commercial web. Now, it wants to do the same for the scientific community with a new search engine for datasets.


Datasets are easier to find when you provide supporting information such as their name, description, creator and distribution formats as structured data. Google's approach to dataset discovery makes use of schema.org and other metadata standards that can be added to pages that describe datasets. The purpose of this markup is to improve discovery of datasets from fields such as life sciences, social sciences, machine learning, civic and government data, and more.
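
To make the structured data idea concrete, here is a minimal sketch of how a publisher might generate schema.org `Dataset` markup as JSON-LD for a page. The dataset name, organization, and URLs are hypothetical placeholders, and the helper functions `build_dataset_jsonld` and `as_script_tag` are my own illustration, not part of any Google tool. The property names (`name`, `description`, `creator`, `distribution`, `encodingFormat`, `contentUrl`) come from the schema.org vocabulary.

```python
import json


def build_dataset_jsonld(name, description, creator_name, distributions):
    """Return a schema.org Dataset description as a JSON-LD string.

    `distributions` is a list of (encoding_format, content_url) pairs.
    """
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "creator": {"@type": "Organization", "name": creator_name},
        "distribution": [
            {"@type": "DataDownload", "encodingFormat": fmt, "contentUrl": url}
            for fmt, url in distributions
        ],
    }
    return json.dumps(record, indent=2)


def as_script_tag(jsonld):
    """Wrap JSON-LD in the script tag used to embed it in an HTML page."""
    return '<script type="application/ld+json">\n' + jsonld + "\n</script>"


# Hypothetical dataset; every value below is a placeholder.
markup = build_dataset_jsonld(
    name="City Air Quality Readings",
    description="Hourly air quality measurements from a hypothetical sensor network.",
    creator_name="Example University",
    distributions=[("text/csv", "https://example.org/data/air-quality.csv")],
)
print(as_script_tag(markup))
```

The printed block would be pasted into the HTML of the page that describes the dataset. Which properties a given search engine actually requires or recommends should be checked against its own documentation.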



The service, called Dataset Search, launched in September 2018 and is a companion of sorts to Google Scholar, the company’s popular search engine for academic studies and reports. Institutions that publish their data online, such as universities and governments, need to include metadata tags in their webpages that describe their data, including who created it, when it was published, how it was collected, and so on. Dataset Search then indexes this information and combines it with input from Google’s Knowledge Graph. (For example, if dataset X was published by NASA or the American Institute of Physics, some information about that institution is also included in the results.) Google AI, which created Dataset Search, aims to unify the tens of thousands of different dataset repositories online.
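
As a rough illustration of the indexing side, the sketch below shows how any consumer of this metadata might pull `Dataset` descriptions out of a webpage using only the Python standard library. This is not Google's crawler or pipeline. The class `JsonLdExtractor`, the function `find_datasets`, and the sample page are all hypothetical, and the page is assumed to embed its metadata as JSON-LD in a `script` tag.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of application/ld+json script tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.records.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # Skip malformed markup instead of failing the whole page.


def find_datasets(html):
    """Return every top-level JSON-LD object whose @type is exactly "Dataset".

    Simplification: real JSON-LD may use a list for @type or nest objects
    inside @graph; this sketch ignores those cases.
    """
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [r for r in parser.records
            if isinstance(r, dict) and r.get("@type") == "Dataset"]


# Hypothetical page: who created the data, when it was published, how it was collected.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset",
 "name": "City Air Quality Readings",
 "creator": {"@type": "Organization", "name": "Example University"},
 "datePublished": "2018-09-05",
 "measurementTechnique": "Roadside sensor network"}
</script>
</head><body>Dataset landing page</body></html>
"""

for dataset in find_datasets(SAMPLE_PAGE):
    print(dataset["name"], "|", dataset["creator"]["name"], "|", dataset["datePublished"])
```

Malformed JSON is skipped silently here so that one bad tag does not hide valid markup elsewhere on the page; a real indexer would more likely log it.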


The initial release of Dataset Search covers the environmental and social sciences, government data, and datasets from news organizations like ProPublica.


Dataset Search's coverage should expand as open data initiatives around the world publish more datasets. Building a decent search engine requires knowing how to build user-friendly systems and understanding what people mean when they type in certain phrases.


The metadata tags the company is using to make datasets visible to its search crawlers are all open standards. Search engines improve most quickly when a critical mass of users is there to provide data on how they search.


In the coming days I will be working closely with Dataset Search, raising pull requests and sharing meaningful insights on the OpenDL site. Apart from that, I will integrate the API tools into the domains I currently work in. Interesting days ahead.


