The emergence of open data coincides with the rapid growth of electronic collections through mass digitization.
With their budgets decreasing, cultural institutions must, on the one hand, rethink the management of their holdings and, on the other, make better use of their data.
Named Entity Recognition (NER) can partially address both requirements. Indeed, the traditional model of manual cataloging and indexing has been under serious pressure for several years.
With ever-shrinking budgets, institutions have to do more with less, and the trend toward semi-automatic computerized cataloging is strong.
This trend is also supported by funding bodies, which encourage enriching data by linking it to external sources of information.
This context has given prominence to the concepts of Linked Data and open data in the cultural world. Recent initiatives such as OpenGLAM and LODLAM illustrate how these developments permeate the field of cultural heritage.
In both the United States and the European Union, digital libraries are adopting Linked Data principles; in France, the National Library of France has a similar project.
The enrichment and integration of heterogeneous collections can be facilitated by dictionaries and vocabularies published according to Linked Data principles.
An introductory overview of Linked Data and the Semantic Web is presented, allowing information professionals to bring themselves up to date on the challenges of a rapidly evolving field. In the same vein, it surveys the different technological building blocks of the Semantic Web and the new possibilities they offer to libraries.
In this article we attempt to answer a few questions. First, we discuss the possibilities and limits of NER and other information-extraction methods for enriching unstructured corpora.
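To make the discussion concrete, the following sketch shows the kind of output an NER step produces on unstructured text. It is a minimal, purely illustrative gazetteer matcher, not any of the services evaluated later; the entity list and type labels are invented for the example, whereas real NER systems rely on statistical models rather than hand-made lists.

```python
import re

# Invented gazetteer for illustration: surface forms mapped to entity types.
GAZETTEER = {
    "Quebec": "LOCATION",
    "National Library of France": "ORGANIZATION",
    "Europeana": "ORGANIZATION",
}

def extract_entities(text):
    """Return (surface form, type, character offset) for each match."""
    entities = []
    for surface, etype in GAZETTEER.items():
        for m in re.finditer(re.escape(surface), text):
            entities.append((surface, etype, m.start()))
    # Sort by position in the text, as an NER service would report them.
    return sorted(entities, key=lambda e: e[2])

doc = "The National Library of France and Europeana hold records about Quebec."
print(extract_entities(doc))
```

The offsets matter in practice: they are what allows an extracted entity to be anchored back into the source record for enrichment.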
A gold standard corpus (GSC) will be used to calculate the precision, recall and F-score achieved by these services. More systemic issues will also be addressed, such as the benefits of using a GSC and how to mitigate its shortcomings.
Indeed, terms such as "paleontology" or "space exploration" enrich a corpus and are an undeniably valuable source of information, yet a GSC does not take them into account because they are not named entities.
Moreover, a GSC is, at first glance, indifferent to the granularity and relevance of a term: it does not distinguish an incidental mention of "Quebec" from a meaningful one. A recurring mention of the city may add little value, yet it still contributes to a high recall score.
Next, we give an overview of the origins and current evolution of NER, with special attention to its use in Linked Data and cultural heritage.
We then present a case study and the methodology used, followed by a contextualization of our results.
The problem of multilingualism will also be addressed, both at the extraction and the disambiguation level. What effect does the language of a corpus have on an extraction algorithm?
Likewise, how can the meaning of the extracted entities still be determined, and their definitions retrieved in the correct language?
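One way to frame the disambiguation question is as a lookup into a multilingual sense inventory: the same surface form may correspond to several senses, and the definition must be retrieved in the language of the corpus. The sketch below uses an invented in-memory table and a deliberately naive heuristic; real entity-linking systems query resources such as Wikidata or DBpedia and use statistical models over far richer context.

```python
# Invented multilingual sense inventory, for illustration only.
SENSES = {
    "Quebec": {
        "en": ["Quebec (province of Canada)", "Quebec City"],
        "fr": ["Québec (province du Canada)", "Ville de Québec"],
    },
}

def disambiguate(surface, lang, context):
    """Pick the sense whose label shares a word with the context.

    Naive overlap heuristic for illustration; falls back to the
    first sense listed for that language, or None if the surface
    form has no senses in that language.
    """
    candidates = SENSES.get(surface, {}).get(lang, [])
    for sense in candidates:
        if any(word.lower() in sense.lower() for word in context.split()):
            return sense
    return candidates[0] if candidates else None

print(disambiguate("Quebec", "en", "the city on the Saint Lawrence"))
# → Quebec City
```

The language parameter is doing real work here: the same surface form and context yield labels in different languages, which is exactly the multilingual difficulty raised above.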
Finally, we consider the general risks of applying NER in bulk, especially with regard to the language of the material.