Combining datasets improved findability

Almost 1300 datasets in January, around 600 in February, and now the counter on the front page of HRI shows 549. Good grief, has HRI begun deleting data? Not exactly. HRI has carried out a large reorganisation of its datasets, combining and harmonising them where possible. No data has been lost: the previous 1300 datasets are now available in larger packages.

In January HRI still offered 1 276 datasets. Now the same data has been aggregated into 549 datasets.

At the start of the year, HRI began the large task of clarifying the contents of its data catalog, which involved combing through smaller datasets and combining them into larger packages. The purpose was to improve the findability of datasets, as well as make it easier to oversee and maintain datasets in the catalog. This task has now been finished.

For the user, the aggregation has been visible as a drastic drop in the number of available datasets, by more than half, but also as a data catalog that is easier to search. Data has been grouped to provide larger packages of similar data. Before any data was purged from the catalog, a database dump was taken to keep a point of reference for later use.

Small crumbs of information into larger packages

The main targets in need of aggregation were roughly 500 smaller datasets that were opened during the first years of HRI’s journey. These datasets included the Statistical Yearbook of Helsinki 2009, the Statistical Yearbook of Vantaa 2010 and 2011, as well as the Population of Vantaa publications. Altogether, these publications accounted for over 500 datasets in the catalog. These datasets have now been aggregated into annual packages.

A number of other datasets have also been combined with each other. For example, city-specific datasets of traffic noise zones have been harmonised into the traffic noise zones in Helsinki metropolitan area dataset, performance indicators of social services in Helsinki have been packaged into a time series, Helsinki’s terraces can now be found in the land usage permission system for public areas in the City of Helsinki, and parking meters can be found in the Helsinki metropolitan area service map. Additionally, overlapping datasets have been removed. For example, the annually released building land stock SeutuRAMAVA datasets are now available as a single dataset.

In the end, only datasets maintained by a party external to the Helsinki metropolitan cities were removed from the data catalog entirely. For example, corporate tax public records and open data maintained by Metropolia University of Applied Sciences were removed. These datasets can now be found in the government’s opendata.fi service.

During this operation, nearly all datasets requiring data updates have been updated. Some datasets are still waiting for fresh data, but the majority are up to date. Additionally, where possible, datasets were converted into structured machine-readable formats. The metadata has also been improved, checked, supplemented and harmonised, especially for datasets that were released years ago. Altogether, HRI now provides up-to-date information more easily, and similar datasets can be found more effectively.

Inevitably, some of HRI’s dataset links broke during this operation. However, the benefits of optimising the contents of the catalog outweigh the drawbacks:

  • it is now easier to comprehend the data content of the entire data catalog
  • datasets can now be found faster and more reliably
  • fresh datasets are found more reliably
  • time series can be found more easily
  • overlapping datasets have been removed
  • the amount of manual work (and the potential for errors) has decreased, as there are fewer datasets to maintain

Questions along the way

During the cleaning operation, a lot of questions popped up. For the user, is it better that datasets are available as large collections, or would it be better to serve some datasets in smaller packages? For example, are traffic light intersections better as their own dataset, or should they be incorporated into the metropolitan service map? (Either way, they will be found there in the fall.)

What about datasets that were previously maintained by the city but for which the responsibility has since moved elsewhere? For example, Helsinki’s tourism statistics were previously collected by the city-owned marketing company Visit Helsinki, but at the turn of the year the responsibility was transferred to Visit Finland.

This brings us to the lifecycle of datasets. Are older datasets still interesting and necessary? Should datasets that are no longer actively maintained be removed after a certain period of time? Or do some datasets have historical value, and if so, which ones? Or should older datasets be archived?

Another challenge is the format of datasets. It wasn’t feasible to convert all datasets into a structured machine-readable form. For example, the comparisons of daycare and social services between the six largest cities would require an immense amount of time and effort to convert. It is clearly better to open data in an unstructured format than not to open it at all. However, when is it justifiable to spend time and resources on converting datasets? Conversely, when is it acceptable to open data quickly and easily, even if that means it cannot readily be used in an application?

End result: a clear and easily maintained data catalog

All in all, we believe, and hope, that HRI’s data catalog is now easier to browse in its entirety. Thanks to updated metadata, the data search works reliably and fresh data can be found with ease. When opening new datasets, HRI will pay more attention to how data is packaged. Also, if certain data works better as part of a larger service such as the service map, it can be worthwhile to rethink the distribution channel.

Maintenance of the data catalog has become easier with the reduced number of datasets. Maintenance is further simplified by automated scripts written by Helsinki’s Code Fellows, which check monthly for broken dataset resource links and for datasets that require updating. The results are sent to HRI by e-mail, so the HRI staff can be confident that the catalog content is up to date and functional.
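To give an idea of what such a monthly link check could look like, here is a minimal sketch in Python. It is not the Code Fellows’ actual script: the function names, the (dataset, URL) input format and the rule that a status of 400 or above counts as broken are all assumptions made for illustration.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def http_status(url, timeout=10.0):
    """Return the HTTP status code for url, or 0 if no response was received."""
    request = Request(url, method="HEAD")  # HEAD avoids downloading the file
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # The server answered, but with an error status such as 404.
        return error.code
    except (OSError, ValueError):
        # DNS failure, refused connection, timeout or malformed URL.
        return 0


def find_broken_links(resources, get_status=http_status):
    """Return (dataset, url, status) for every resource link that looks broken.

    resources is an iterable of (dataset name, resource URL) pairs; this
    input format is hypothetical, not HRI's real catalog structure.
    """
    broken = []
    for dataset, url in resources:
        status = get_status(url)
        if status == 0 or status >= 400:
            broken.append((dataset, url, status))
    return broken
```

The status lookup is passed in as a parameter so that the check can be tried out without network access. In a real setup the list of resources would come from the data catalog, and the resulting report would be e-mailed to the maintainers. Some servers do not answer HEAD requests correctly, so a production script might fall back to GET before reporting a link as broken.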

It has become evident that, in the future, the number of datasets is not a viable metric for determining the progress of the city’s open data efforts. This year, the overall openness and availability of open data has progressed much further despite the sharp drop in the number of open datasets. Brainstorming about new, better metrics to measure the status of open data efforts in the city continues.

Translated by: Kaarlo Uutela

    Old discussion 2016/09/21 at 17:49 / Petri Kola
    Maybe measuring quantity is not so bad after all. You just have to understand what you are measuring. If you want to measure how much data you have, you should measure the number of statements instead of the number of datasets. Bigger is better. The number of datasets is also interesting. Perhaps it reflects data cohesion? Smaller is better.