Reading: Additional Sources of Datasets

 In this reading, you will learn about:

  • Open datasets and sources
  • Proprietary datasets and sources
  • Dataset license

Open datasets and sources

In this data-driven world, some datasets are freely available for anyone to access, use, modify, and share. These are called open datasets.
Open datasets include a public license and are very useful for your journey as a Data Scientist. Some of the most informative open dataset sources are listed below.

Government Data:

Financial Data Sources:

Crime Data:

Health Data:

Academic and Business Data:

Other General Data:

Propriety datasets and sources

Proprietary datasets contain data primarily owned and controlled by specific individuals or organizations. This data is limited in distribution because it is sold with a licensing agreement.
Some data from private sources cannot be easily disclosed, like public data.

National security data, geological, geophysical, and biological data are examples of propriety data. Copyright laws or patents usually bind this type of data. Proprietary datasets that mainly contain sensitive information are less widely available than open datasets.

Some standard propriety dataset sources are listed below.

Health Care:

https://www.sgim.org/communities/research/dataset-compendium/proprietary-datasets

Financial Market data:

https://datarade.ai/data-categories/proprietary-market-data

Google Cloud based datasets:

https://cloud.google.com/datasets

Dataset licenses

When you select a dataset, it is necessary to look into the license. A license explains whether you can use that dataset or not; or explains if you have to accept certain guidelines to use that dataset. The different license types are listed below.

  1. PUBLIC DOMAIN MARK - PUBLIC DOMAIN
    When a dataset has a Public Domain license, all the rights to use, access, modify and share the dataset are open to everyone. Here there is technically no license.

  2. OPEN DATA COMMONS PUBLIC DOMAIN DEDICATION AND LICENSE – PDDL
    Open Data Commons license has the same features as the Public Domain license, but the difference is the PDDL license uses a licensing mechanism to give the rights to the dataset.

  3. CREATIVE COMMONS ATTRIBUTION 4.0 INTERNATIONAL CC-BY
    This license allows users to share and modify a dataset, but only if they give credit to the creator(s) of the dataset.

  4. COMMUNITY DATA LICENSE AGREEMENT – CDLA PERMISSIVE-2.0
    Like most open-source licenses, this license allows users to use, modify, adapt, and share the dataset, but only if a disclaimer of warranties and liability is also included.

  5. OPEN DATA COMMONS ATTRIBUTION LICENSE - ODC-BY
    This license allows users to share and adapt a dataset, but only if they give credit to the creator(s) of the dataset.

  6. CREATIVE COMMONS ATTRIBUTION-SHAREALIKE 4.0 INTERNATIONAL - CC-BY-SA
    This license allows users to use, share, and adapt a dataset, but only if they give credit to the dataset and show any changes or transformations, they made to the dataset. Users might not want to use this license because they have to share the work they did on the dataset.

  7. COMMUNITY DATA LICENSE AGREEMENT – CDLA-SHARING-1.0
    This license uses the principle of ‘copyleft’: users can use, modify, and adapt a dataset, but only if they don’t add license restrictions on the new work(s) they create with the dataset.

  8. OPEN DATA COMMONS OPEN DATABASE LICENSE - ODC-ODBL
    This license allows users to use, share, and adapt a dataset but only if they give credit to the dataset and show any changes or transformations they make to the dataset. Users might not want to use this license because they have to share the work they did on the dataset.

  9. CREATIVE COMMONS ATTRIBUTION-NONCOMMERCIAL 4.0 INTERNATIONAL - CC BY-NC
    This license is a restrictive license. Users can share and adapt a dataset, provided they give credit to its creator(s) and ensure that the dataset is not used for any commercial purpose.

  10. CREATIVE COMMONS ATTRIBUTION-NO DERIVATIVES 4.0 INTERNATIONAL - CC BY-ND
    This license is also a restrictive license. Users can share a dataset if they give credit to its creator(s). This license does not allow additions, transformations, or changes to the dataset.

  11. CREATIVE COMMONS ATTRIBUTION-NONCOMMERCIAL-SHAREALIKE 4.0 INTERNATIONAL - CC BY-NC-SA
    This license allows users to share a dataset only if they give credit to its creator(s). Users can share additions, transformations, or changes to the dataset, but they cannot use the dataset for commercial purposes.

  12. CREATIVE COMMONS ATTRIBUTION-NONCOMMERCIAL-NODERIVATIVES 4.0 INTERNATIONAL - CC BY-NC-ND
    This license allows users to share a dataset only if they give credit to its creator(s). Users are not allowed to modify the dataset and are not allowed to use it for commercial purposes.

Note: Additional license types exist. Any dataset you use will include details about its license.

Author(s)

Lakshmi Holla

Other Contributor(s)

Malika Singla

Changelog

DateVersionChanged byChange Description
2022-12-140.1Lakshmi HollaInitial version created

Comments

Popular posts from this blog

Lila's Journey to Becoming a Data Scientist: Her Working Approach on the First Task

Text Formulas