Your data + Google Dataset Search

January 16, 2019
Amy E. Hodge

"I was wondering if you know anything about getting datasets discoverable on Google Dataset Search?"

We recently received this query from a Stanford researcher who had deposited content into the Stanford Digital Repository.

The short answer: request a DataCite DOI from Stanford Libraries, which you can do by emailing doi-contact@lists.stanford.edu.

If you are unfamiliar with Google Dataset Search, or are interested in the details behind that answer, read on!

What is Google Dataset Search?

Google Dataset Search is a fairly new project from Google that is intended to work much like Google Scholar, but for data. It's a way for people to identify data resources, no matter where they live on the Web. Google Dataset Search searches the metadata for datasets available on the Web and then tells the user where the data live. It doesn't aggregate any of the data itself. As Google puts it, Google Dataset Search is "a tool designed to make it easier for researchers to discover datasets that can help with their work."(1)

How does Google find the datasets?

Google looks for Web content whose metadata includes certain schema.org metadata fields filled out in a particular way. Not all data repositories have implemented schema.org, and not all of those using schema.org have implemented it in a way that allows Google to find and index their content. Recent work on the Stanford Libraries catalog implemented schema.org for some of our content, whereas DataCite has completed its schema.org implementation so that its content is included in Google Dataset Search.(2)
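To make "schema.org metadata fields" a little more concrete, here is a minimal sketch in Python of what a schema.org Dataset description can look like when written as JSON-LD and embedded in a Web page. The property names (name, description, url, identifier, creator, license) are standard schema.org properties, but the helper function, the example values, and the DOI are hypothetical, and this is not a copy of what DataCite or Stanford Libraries actually publish. Which fields Google requires is described in Google's own guidelines, not here.

```python
import json


def build_dataset_jsonld(name, description, url, doi, creator_name, license_url):
    """Return a schema.org Dataset description as a JSON-LD-ready dict.

    Hypothetical helper for illustration only.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        # A DOI is commonly expressed as a resolvable https://doi.org/ URL.
        "identifier": f"https://doi.org/{doi}",
        "creator": {"@type": "Person", "name": creator_name},
        "license": license_url,
    }


# All values below are made up for the example.
record = build_dataset_jsonld(
    name="Example survey responses",
    description="A hypothetical dataset used to illustrate schema.org markup.",
    url="https://example.org/datasets/example-survey",
    doi="10.1234/example",
    creator_name="Jane Researcher",
    license_url="https://creativecommons.org/licenses/by/4.0/",
)

# JSON-LD is typically embedded in a page inside a script tag like this one.
script_tag = (
    '<script type="application/ld+json">\n'
    + json.dumps(record, indent=2)
    + "\n</script>"
)
print(script_tag)
```

A crawler that understands schema.org can read a block like this from a dataset's landing page without downloading the data itself, which matches the point above that Google Dataset Search indexes metadata rather than aggregating data.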

Who is DataCite?

DataCite is a global non-profit organization that provides Digital Object Identifiers (DOIs) for research data. You can think of DataCite as the counterpart to CrossRef, which is the organization most publishers use to provide DOIs for the articles you publish in their journals. A DOI will uniquely identify your data and its location on the Web. Stanford Libraries is a DataCite Member Organization, which means we are able to provide DOIs to researchers at Stanford for their research data.

What is the Stanford Digital Repository? 

If you are looking for a way to make your research data available to others without having to field individual email requests or maintain your own server, then you're looking for the Stanford Digital Repository. We provide a self-service Web application for depositing your content into the SDR. Your files and a description of the content are then assigned a unique identifier and made available under the license you choose on a persistent URL, or PURL, page. We take care of your content in our preservation system and make sure that it is discoverable through the Libraries' catalog, which is crawled by Google.

How is the SDR related to DataCite, DOIs, and Google Dataset Search?

A DOI is only as good as the location of the content. If your data are in a location that is not being actively maintained and the link goes dead, your DOI will be dead as well. The DOI can be updated to point to a new link, but do you really want to remember to do that every time you have to move your files? That's why it's important to put your content in a location that you can rely on staying constant over time, like the Stanford Digital Repository.
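The idea that a DOI is a stable name pointing at a changeable location can be pictured with a toy model. The sketch below is only an illustration in Python: the ToyDoiRegistry class, the DOI, and the URLs are all hypothetical, and real DOI registration and updates go through DataCite's services rather than anything like this.

```python
class ToyDoiRegistry:
    """Toy stand-in for a DOI registry: maps each DOI to one current URL."""

    def __init__(self):
        self._targets = {}

    def register(self, doi, url):
        """Record the URL that a new DOI should point to."""
        self._targets[doi] = url

    def update(self, doi, new_url):
        """Repoint an existing DOI; the DOI itself never changes."""
        if doi not in self._targets:
            raise KeyError(f"unknown DOI: {doi}")
        self._targets[doi] = new_url

    def resolve(self, doi):
        """Return the URL the DOI currently points to."""
        return self._targets[doi]


registry = ToyDoiRegistry()
doi = "10.1234/example"  # made-up DOI

# The data start out on a server someone has to maintain.
registry.register(doi, "https://old-server.example.org/data.zip")
before = registry.resolve(doi)

# If the files move, someone has to remember to update the record,
# otherwise the DOI keeps pointing at a dead link.
registry.update(doi, "https://repository.example.org/item/abc123")
after = registry.resolve(doi)

print(before)
print(after)
```

In this picture, citations keep using the same DOI throughout; only the target stored behind it changes, and it changes only if someone updates it.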

Once your content is in the Stanford Digital Repository, you can request that we assign it a DataCite DOI. And with DataCite implementing schema.org in a way that is compliant with Google Dataset Search guidelines, your dataset should show up there once the dataset metadata at DataCite has been properly crawled by Google.

Still have questions? 

Contact us at sdr-contact@lists.stanford.edu for questions on depositing datasets or other scholarly content in the SDR. 

Contact us at doi-contact@lists.stanford.edu for questions on how to get a DOI for your dataset.

 

(1) https://ai.googleblog.com/2018/09/building-google-dataset-search-and.html

(2) https://blog.datacite.org/taking-discoverability-to-the-next-level/ 
