Clarifying the HRI data catalog content halves the number of datasets

In January 2016, Helsinki Region Infoshare (HRI) will begin a large task in which the smallest published datasets will be merged into larger ones. As a result, the number of available datasets will be roughly halved. Although the number of datasets will drop drastically, no content will be deleted: the same data will be provided in a more homogeneous format. At the same time, the metadata that describes the datasets will be improved. Lastly, where possible, the datasets will be converted into several different file formats.

In January 2016, HRI is planning to merge its small datasets into larger datasets in order to provide a clear picture of open data and the city.

When HRI started, the first published datasets were small Excel tables from the Statistical Yearbook of Helsinki 2009 and the Statistical Yearbooks of Vantaa 2010 and 2011. Each small Excel table was published as an individual dataset. At the time, this initial content was important for learning the publishing process, for learning how to fill out metadata correctly, and for testing CKAN functionality.

Since then, HRI has learned that, for findability alone, it is advisable to publish datasets in larger collections. For example, the newest Statistical Yearbook of Vantaa (2012-15) has been published as a single dataset. This makes the overall content of the data catalog easier to understand and makes datasets and time series easier to find.

The work will start by combining the small Excel table releases of the remaining statistical yearbooks into single annual releases. Afterwards, the small Excel table releases will be removed from the data catalog. At the same time, all published datasets will be checked and, where possible, merged to form more useful datasets. For example, the traffic noise zones will in future be available as a single dataset that covers the entire Helsinki Metropolitan Area.
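The merging step described above could be sketched roughly as follows. This is only an illustration and not HRI's actual tooling: it assumes each dataset is a dictionary with the CKAN-style fields "name" and "resources", and the `collection_of` function and the "replaces" field are hypothetical names introduced here to show how small releases might be folded into one collection without losing any files.

```python
from collections import defaultdict


def merge_small_datasets(datasets, collection_of):
    """Group datasets by collection and fold each group into one dataset.

    `collection_of` is a caller-supplied (hypothetical) function that returns
    a collection title for a dataset, or None when the dataset should stay
    on its own.
    """
    grouped = defaultdict(list)
    untouched = []
    for dataset in datasets:
        collection = collection_of(dataset)
        if collection is None:
            untouched.append(dataset)
        else:
            grouped[collection].append(dataset)

    merged = []
    for collection, members in grouped.items():
        merged.append({
            "title": collection,
            # Every original file is carried over, so no data is lost.
            "resources": [r for m in members for r in m.get("resources", [])],
            # Hypothetical field: remember the old identifiers so that
            # links to the removed small datasets can be traced later.
            "replaces": [m["name"] for m in members],
        })
    # Merged collections come first, followed by datasets left as they were.
    return merged + untouched
```

Passing the grouping rule in as a function keeps the sketch independent of how a catalog actually marks which yearbook a table belongs to.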

This work is scheduled to commence in January 2016.

For users, these changes will be noticeable not only through data collections that are easier to find but also through a sharp drop in the number of available datasets, which will be reduced by approximately half. The drop in the number of datasets does not affect the amount of data available in the data catalog. All previously published data will remain available; in some cases it will simply be found in an updated dataset.

Published datasets on the merging and removal list (tentative):

Removing and merging datasets will break some of HRI’s dataset links.

However, the benefits clearly outweigh the possible drawbacks:

    • it enables a better overview of the data content in the data catalog
    • datasets can be found more quickly and easily
    • the newest datasets can be found more easily
    • time series can be found more easily
    • duplicates will be removed
    • the amount of manual labour (and possible mistakes) decreases as the number of maintained datasets decreases
    • some datasets are no longer maintained because a newer version has been published as part of a larger collection (e.g. terraces in public places in Helsinki can be found in the register of public areas in the City of Helsinki)

Before removing any datasets, a database dump of the metadata will be taken, which will provide a snapshot of the current situation. It can be useful for later reference, as well as for file history.
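As an illustration of what such a snapshot could look like, the sketch below collects the metadata of all public datasets through the CKAN Action API (`package_search`) and stores it as one JSON file. It is not a true database dump and not necessarily how HRI takes its snapshot; the function names, the `base_url` parameter and the layout of the output file are assumptions made for this example.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone


def fetch_json(url):
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def snapshot_metadata(base_url, fetch=fetch_json, page_size=100):
    """Collect the metadata of every public dataset in a CKAN catalog."""
    datasets = []
    start = 0
    while True:
        # package_search returns results in pages controlled by rows/start.
        query = urllib.parse.urlencode({"rows": page_size, "start": start})
        reply = fetch(f"{base_url}/api/3/action/package_search?{query}")
        if not reply.get("success"):
            raise RuntimeError("CKAN reported a failed request")
        page = reply["result"]["results"]
        datasets.extend(page)
        start += len(page)
        # Stop on an empty page or once the reported total has been reached.
        if not page or start >= reply["result"]["count"]:
            break
    return {
        "taken_at": datetime.now(timezone.utc).isoformat(),
        "dataset_count": len(datasets),
        "datasets": datasets,
    }


def write_snapshot(snapshot, path):
    """Write the snapshot to a UTF-8 JSON file for later reference."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(snapshot, handle, ensure_ascii=False, indent=2)
```

Calling `snapshot_metadata` with the catalog's base address and passing the result to `write_snapshot` would give a dated file that can be compared with the catalog after the merge.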

Also, as a part of this project, the quality of metadata will be improved and GIS datasets will be made available via the geoserver. Furthermore, there will be thorough discussion about better distribution formats for all remaining datasets. Lastly, the criteria and metrics for measuring the current state of open data in Helsinki will be improved. The goal is to provide the end user with a better overview of the overall openness of the city. Solely tracking and relying on the number of available datasets is an outdated approach; it is time to focus on the content and the use of datasets.

Translator: Kaarlo Uutela