The Roma Digital Repository Project

Emily Hanlon

The Open Society Archives (OSA) is the official repository for the Central European University (CEU) and the Open Society Foundations (OSF). Over the past several years, while collecting records for the CEU and OSF organizational archives, OSA has been selecting the digital copies of any OSF Roma-related program materials, including grant files and some scholarship files. OSA has now begun an exciting, experimental information processing and retrieval project that we are calling the Roma Digital Repository Project (RDRP). For this pilot project we are working in cooperation with OSF’s Roma Initiatives Office and a Romanian-American consulting company. Our aim is to create what amounts to a database containing all of the OSF Roma-related program materials. Based on pre-defined queries, this database will produce reports that help the OSF understand which Roma initiatives have been funded and so identify funding gaps.

These reports will be produced using a combination of data mining and processing tools that read patterns in the information contained in the repository, rather than imposing causal relationships on that information. We will explore our data using our human-produced authoritative metadata and three techniques: semantic tagging, full text search and statistical analysis. This approach is very much at the heart of recent big data projects. The pilot system will be an experimental one because, as far as we know, no existing system searches unstructured information in this way.

More on the techniques we will be using to search the contents of our repository:

Semantic tagging

This is a technique that assigns labels to our digital holdings, using an ontology produced for the project, to reflect the meaning of the content. The ontology is a technical language that will define all of the primary, secondary and tertiary concepts within the digital holdings, and will allow the system to search for and find the key concepts we highlight.
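To give a rough idea of how this can work, here is a minimal Python sketch. The concept names, the parent links and the trigger terms are invented for illustration and are not the ontology used in the RDRP; a real ontology would be far richer than a keyword list.

```python
import re

# Hypothetical ontology: concept -> (parent concept, trigger terms).
# A concept with parent None is a primary concept; its children are secondary.
ONTOLOGY = {
    "education": (None, ["education", "school"]),
    "scholarship": ("education", ["scholarship", "stipend"]),
    "health": (None, ["health", "clinic"]),
}


def tag_document(text, ontology=ONTOLOGY):
    """Return the sorted list of concepts whose trigger terms occur in text.

    When a secondary concept matches, its parent concepts are added too,
    so a search on the broader concept still finds the document.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    tags = set()
    for concept, (parent, terms) in ontology.items():
        if words & set(terms):
            tags.add(concept)
            # Walk up the hierarchy and add every ancestor concept.
            while parent is not None:
                tags.add(parent)
                parent = ontology[parent][0]
    return sorted(tags)


print(tag_document("A stipend for Roma university students"))
```

In this sketch the word "stipend" triggers the "scholarship" label, and the document also receives the broader "education" label because of the hierarchy.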

Full text search

This is a tried and true technique. The system searches the entire text of every digitized item in the digital repository. Google, for example, employs this technique.
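One common way to make full text search fast is an inverted index, which maps every word to the documents that contain it. The Python sketch below illustrates the idea only; the document identifiers and texts are made up, and we are not claiming that this is how the RDRP system or any particular search engine is built.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Build an inverted index: word -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical documents, for illustration only.
documents = {
    "grant-001": "Grant for a Roma education program",
    "grant-002": "Scholarship file for a health program",
}
index = build_index(documents)
print(search(index, "education program"))
```

Here the query "education program" matches only the first document, because the second contains "program" but not "education".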

Statistical analysis

We will be using statistical analysis to scan the repository’s information. This is useful because for this pilot we will not be analyzing all of the digitized Roma-related content, only a specific subset. Statistical analysis will help us recognize whether our methods are working on that subset, or, in statistical terms, on the representative sample.
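As a simple illustration of checking a method on a subset, the Python sketch below draws a random sample of document ids and estimates, with a rough 95% interval, the share of sampled documents that a method handled correctly. The sample size, the counts and the use of a normal-approximation interval are our assumptions for the example, not a description of the analysis the RDRP will actually run.

```python
import math
import random


def draw_sample(doc_ids, size, seed=0):
    """Draw a simple random sample of document ids (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(list(doc_ids), size)


def proportion_with_interval(successes, n, z=1.96):
    """Estimate a proportion with a normal-approximation confidence interval.

    successes: number of sampled documents the method handled correctly
    n: sample size
    z: 1.96 corresponds to roughly 95% confidence
    """
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical numbers: 50 documents sampled out of 1000, 40 tagged correctly.
sample = draw_sample(range(1000), 50)
p, low, high = proportion_with_interval(40, len(sample))
print(f"estimated share correct: {p:.2f} (between {low:.2f} and {high:.2f})")
```

The narrower the interval, the more confident we can be that a result seen in the sample also holds for the content we did not analyze.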

Look for more information about this in the coming days!