Position paper presented at the ACM SIGIR-97 Workshop on Networked Information Retrieval, Philadelphia, July 31, 1997

Networked Digital Libraries: the Concept and a Case Study

José Luis Borbinha (IST / INESC-Telematics Systems and Services Group)
João Ferreira (IST-Electrical Engineering and Computers Department)
Joaquim Jorge (IST / INESC-Intelligent Multimodal Interfaces Group)
José Delgado (IST / INESC-Telematics Systems and Services Group)

Abstract

This paper introduces the concept of a networked digital library, defined as a library with the additional mission of stimulating, supporting, disseminating and recording the process of creation of information. The paper also presents ArquiTec, a case study of a networked digital library for an academic and research community.

1. Introduction

ArquiTec is a joint effort by INESC, the Portuguese National Library and JNICT (the Portuguese agency for research funding), to develop a networked digital library for the Portuguese academic and research community.

We intend to use ArquiTec both as a technology demonstrator and as a framework to develop, test and consolidate expertise in core fields related to digital libraries, with a special emphasis on our concept of networked digital library (NDL). In that sense, ArquiTec's added value goes beyond merely providing a useful running system. It will also serve as a laboratory for further identification and study of the technical, social and institutional implications raised by the NDL concept.

In the next section we present our definition of the concept of networked digital library. The remainder of this paper describes the ArquiTec project as an embodiment of that concept.

2. Networked Digital Libraries

Figure 1 summarizes our approach to the networked digital library paradigm.

Traditional libraries are organized around books. The book is traditionally a "sacred" indivisible piece of knowledge and it is intended to be stored "forever". In this scenario, authors decide what to write and when to edit, while librarians decide whether or not to buy the final product.

More recently, increasing specialization brought us thematic journals, reports and conferences, from which a new object emerged: the paper. The paper is formal, being validated by the credibility of an editor or a review committee. It is not intended to be valid forever, but to be discussed for a period of time and refined; in the end, what survives is distilled into books.

It is difficult for traditional libraries to follow the trend of ever-increasing specialization in all fields of knowledge, so libraries themselves become specialized, with a mission to serve specific communities. Since these communities are well identified, it is now possible to anticipate their needs and to provide customized services, such as notification of the arrival of new journal issues, the advertisement of new publications, etc.

Figure 1: The evolution of library paradigms.

The scenario changes again with the arrival of computers. Computer networks allow communities to intensify their interactions. With electronic mail, desktop publishing tools and the WWW, everyone becomes a potential publisher. The speed of interactions increases, and a new focus emerges: the idea. To produce fast results, ideas are presented in informal pre-prints and discussed in informal workshops. Ideas that succeed in this process result in formal papers, which are then published in journals and promoted in formal conferences. Using electronic mail and the WWW, it is now easier for libraries to reach communities and to provide new services. For the same reason, it is now easier for users to interact with libraries, not only to access OPAC (Online Public Access Catalog) services but, in an extreme scenario, also to contribute new kinds of meta-knowledge that can notably enrich the library contents. Examples of such contributions are the tuning and completion of thesauri and catalogues (allowing dynamic and collaborative cataloguing), annotations and comments on stored documents, etc.

This perspective leads us to a vision and a definition of the networked digital library that, from our point of view, comprises the most relevant of the concepts discussed (as shown in figure 1):

A networked digital library (NDL) is defined not only as a repository of information, with the traditional missions of preserving, organizing and providing access to its contents, but also as a system to disseminate that information and actively stimulate, support and record the process of its creation.

3. ArquiTec

The ArquiTec project started at the beginning of 1997, and a limited prototype has been developed so far. The final service is scheduled for public release by the end of the current year, followed by a six-month trial period.

ArquiTec builds on a series of paradigmatic initiatives that have drawn on the expansion of the Internet to provide online collections of scholarly and scientific documents, most notably NCSTRL (Networked Computer Science Technical Reports Library) [1]. ArquiTec is accessible over the Internet, through a WWW interface. It will provide access to different kinds of technical documents (such as papers, reports, theses and dissertations) in different fields of knowledge, and special services (such as a notification service) will also be provided to the community.

Following the NDL concept, the system will provide support for a three-step workflow in the production of information, comprising:

Informal documents: usually known as grey literature (position papers, pre-prints, etc.).
Refereed documents: papers presented in conferences, published in conventional journals, etc.
Formal documents: theses, dissertations, reports, electronic books, etc., which should be archived in a special server.

Figure 2: ArquiTec main entities.

Keeping in mind the definition given for the NDL concept, our system has been developed around three main entities, as shown in figure 2:

Documents Space: documents exist in local repositories managed by distributed servers, with selected documents replicated in a central archive.
User Directory: the library knows about its users, managed in a global directory.
Concept Space: an ontological space, supported by formal and statistical thesauri and user contributions.

These entities are related to one another by:

Index: the relationship between the documents and the concept space (important support for the statistical thesauri).
Authors and Patrons: any user can be an author of a document or just a reader.
User Profiles: users are identified by their interests, which relate to subjects in the thesaurus (likewise for documents).

A new concept of catalog, now viewed as an integration space, makes it possible to explore the main entities and the relationships among them (such as looking for authors of documents related to a specific subject, or for users sharing common interests).
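
As a rough illustration of the catalog as an integration space, the following Python sketch models documents, users and their links to the concept space, together with the two example queries just mentioned. All class, field and function names are hypothetical; they are not taken from the ArquiTec implementation, which is built on NCSTRL and LDAP rather than on in-memory objects.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    title: str
    authors: set[str] = field(default_factory=set)    # user ids (Authors relationship)
    subjects: set[str] = field(default_factory=set)   # concept terms (Index relationship)


@dataclass
class User:
    user_id: str
    interests: set[str] = field(default_factory=set)  # concept terms (User Profiles)


class Catalog:
    """Hypothetical integration space over documents, users and concepts."""

    def __init__(self, documents, users):
        self.documents = list(documents)
        self.users = list(users)

    def authors_for_subject(self, subject):
        """Return ids of users who authored a document indexed under `subject`."""
        found = set()
        for doc in self.documents:
            if subject in doc.subjects:
                found |= doc.authors
        return found

    def users_sharing_interests(self, user_id):
        """Return ids of other users with at least one interest in common."""
        me = next(u for u in self.users if u.user_id == user_id)
        return {u.user_id for u in self.users
                if u.user_id != user_id and u.interests & me.interests}
```

Both queries cross entity boundaries (documents to users through subjects, users to users through interests), which is the sense in which the catalog acts as an integration space.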

Finally, the Portuguese National Library will build and maintain, as an official archive, a collection with a copy of formal or refereed documents.

4. Architecture

The system has a distributed architecture, with each participating institution managing its own collection (and preferably a repository server, although that is not mandatory). The main blocks of ArquiTec are presented in figure 3.

Figure 3: ArquiTec main blocks.

The core of the system is based on a modified and extended version of NCSTRL, using version 4.1.8 of the DIENST protocol [2], which includes Glimpse [3] as the indexing engine.

Figure 4 shows the structure of the ArquiTec local servers. The NCSTRL user interface was modified in order to support multilingual access (Portuguese and English in the first release). Other modifications include, for example, the submission of documents, which can now be done remotely.

Concerning the metadata, we use the original RFC 1807 format from NCSTRL [4]. In fact, we extended its usage, since we also decided to record annotations in that format (each annotation becomes a new metadata file).
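
To make the idea of an annotation stored as its own metadata file more concrete, here is a minimal Python sketch that writes a record in the "FIELD:: value" layout of RFC 1807. The function name, the example identifiers and, above all, the use of the NOTES field to point at the annotated document are assumptions of this sketch; the text above does not say which fields ArquiTec actually uses for annotations.

```python
def format_rfc1807(record_id, entry_date, fields):
    """Serialize a record as RFC 1807 'FIELD:: value' lines.

    `fields` is a list of (name, value) pairs, so repeatable fields
    such as AUTHOR keep their order.
    """
    lines = [
        "BIB-VERSION:: CS-TR-v2.1",   # version tag defined by RFC 1807
        f"ID:: {record_id}",
        f"ENTRY:: {entry_date}",
    ]
    lines += [f"{name}:: {value}" for name, value in fields]
    lines.append(f"END:: {record_id}")
    return "\n".join(lines) + "\n"


# Hypothetical annotation record; identifiers and linking convention are invented.
annotation = format_rfc1807(
    "EXAMPLE//ANNOT-0001",
    "May 13, 1997",
    [
        ("TITLE", "Comment on EXAMPLE//TR-0001"),
        ("AUTHOR", "Silva, Maria"),
        ("NOTES", "Annotates EXAMPLE//TR-0001"),  # assumed way to link to the document
        ("ABSTRACT", "Section 3 could cite earlier work on thesauri."),
    ],
)
```

Because the annotation is an ordinary bibliographic record, it can be gathered and indexed by the same machinery as any other metadata file.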

Figure 4: Architecture of a local ArquiTec server.

The ArquiTec central server has two more modules than the local servers. The submission module was modified in order to allow the management of the archive, which is seen as a collection of the central server. After documents are selected by the National Library staff, a new gather module at the central server copies the chosen documents from the local servers using HTTP and then simulates a local submission. Another new component is the notification service. This is a "batch" service, which is activated by relevant events in the repositories (such as the submission of a new document or an annotation).
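
The gather step can be pictured with the following Python sketch, which copies selected documents over HTTP and hands each one to the same entry point a local submission would use. The names `CentralArchive`, `gather` and `http_fetch`, as well as the retry list, are hypothetical and only illustrate the flow described above; the real module works against DIENST servers.

```python
from urllib.request import urlopen


def http_fetch(url):
    """Download a document over HTTP and return its bytes."""
    with urlopen(url) as response:
        return response.read()


class CentralArchive:
    """Hypothetical stand-in for the archive collection of the central server."""

    def __init__(self):
        self.collection = {}

    def submit(self, doc_id, content):
        # Same entry point a local submission would use.
        self.collection[doc_id] = content


def gather(selected, archive, fetch=http_fetch):
    """Copy each selected document and submit it to the archive.

    `selected` maps document ids to the URL of the copy on a local server.
    Returns the ids that could not be fetched, so they can be retried later.
    """
    failed = []
    for doc_id, url in selected.items():
        try:
            content = fetch(url)
        except OSError:  # urllib's URLError is a subclass of OSError
            failed.append(doc_id)
            continue
        archive.submit(doc_id, content)
    return failed
```

Passing the fetch function as a parameter keeps the sketch testable without a network; in use, the default `http_fetch` would be left in place.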

In ArquiTec the central server gathers the metadata files from the local servers (and not the indexes) and creates the indexes locally. This is a requirement of Glimpse, but it has a positive consequence: it makes it possible to provide a global, fault-tolerant central catalog of documents that does not depend on the availability of the local servers.
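
The following Python sketch illustrates why indexing the gathered metadata centrally gives a catalog that keeps working when local servers are down. A plain in-memory inverted index stands in for Glimpse here, and the function names and data layout are assumptions of this sketch.

```python
import re
from collections import defaultdict


def build_central_index(metadata_by_server):
    """Build one inverted index from metadata gathered from all local servers.

    `metadata_by_server` maps a server name to {doc_id: metadata text}.
    The result maps each lower-cased word to the set of (server, doc_id)
    pairs whose metadata contains it.
    """
    index = defaultdict(set)
    for server, records in metadata_by_server.items():
        for doc_id, text in records.items():
            for word in set(re.findall(r"\w+", text.lower())):
                index[word].add((server, doc_id))
    return index


def search(index, word):
    """Look a word up in the central index; no local server is contacted."""
    return index.get(word.lower(), set())
```

Once the index is built, `search` touches only central data, so a query can be answered even if the server holding the matching document is temporarily unreachable.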

A final requirement was name persistence for officially archived documents. The problem of naming objects in a digital library was generically addressed in the "Kahn/Wilensky Report", from which the concept of the handle as a URN (Uniform Resource Name) emerged [5]. A simplified version of that concept was implemented by OCLC in the PURL (Persistent URL) service, based on the existence of a highly reliable server [6]. A PURL is a normal URL with a logical meaning: when it is used, the request goes to the PURL server, which acts as an HTTP proxy and automatically translates the logical name into the real URL of the object referred to.

A PURL service is provided at the National Library, which automatically generates a PURL for each document archived. However, when one of those documents is retrieved via a query to its local server, its PURL is visible to the user but the retrieved version is the local one (simulating a cache, in a certain way).
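
The name-persistence idea can be sketched in a few lines of Python: a table maps each persistent logical name to the current location of the object, and only the table changes when the object moves. The class, its methods and the example host names are hypothetical and do not reproduce the OCLC PURL software.

```python
class PurlResolver:
    """Hypothetical table mapping persistent logical names to current URLs."""

    def __init__(self, base):
        self.base = base.rstrip("/")
        self.targets = {}

    def register(self, name, url):
        """Create or update the target of a PURL and return the PURL itself."""
        self.targets[name] = url
        return f"{self.base}/{name}"

    def resolve(self, purl):
        """Translate a PURL into the real URL of the object it names."""
        prefix = self.base + "/"
        if not purl.startswith(prefix):
            raise ValueError(f"not a PURL of this server: {purl}")
        return self.targets[purl[len(prefix):]]


resolver = PurlResolver("http://purl.example.org/")
purl = resolver.register("arquitec/tr-0001", "http://archive.example.org/a/tr-0001.ps")
# The archived copy moves; the PURL handed out to readers stays the same.
resolver.register("arquitec/tr-0001", "http://archive.example.org/b/tr-0001.ps")
```

Citations and catalog records only ever store the PURL, so relocating the archive requires updating the resolver table and nothing else.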

5. Users

Users access ArquiTec in one of two modes: anonymous or identified.

Identified users have profiles composed of explicitly provided data (their declared interests) and data implicitly extracted from the history of their interactions with the system (such as the documents they have submitted and retrieved). User profiles serve three main purposes:

Searching: the profile is used to rank search results, for example highlighting documents that best match a user's interests (but never hiding or restricting the access to the other documents).
Filtering: the profile is used for an information dissemination service, supported by electronic mail, through which users can receive automatic notifications of new events.
Collaboration: interactive services for document annotation (which can be reflected in the catalog) and for thesaurus tuning are also provided.
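
The searching use of the profile, in which results are reordered but never hidden, can be illustrated with the short Python sketch below. The overlap count used as a score is an assumption of this sketch; the text above does not specify how ArquiTec computes the match between a document and a profile.

```python
def rank_by_profile(results, interests):
    """Order search results by overlap with the user's interests.

    `results` is a list of (doc_id, keywords) pairs in the order returned
    by the search engine. Every result is kept; documents sharing more
    keywords with `interests` move towards the top, and ties keep their
    original order because Python's sort is stable.
    """
    interests = set(interests)
    return sorted(results, key=lambda r: -len(interests & set(r[1])))
```

For an anonymous user the function can be called with an empty set of interests, in which case the original order is returned unchanged.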

The ArquiTec user directory is based on a structure of LDAP (Lightweight Directory Access Protocol) servers [7]. LDAP is a simplified TCP/IP version of the original DAP (Directory Access Protocol), defined in the X.500 standard [8].

6. Documents

ArquiTec manages three document spaces, as shown in figure 5:

Informal space: for informal documents (supported by HARVEST brokers).
Formal space: for refereed documents.
Official space: an archiving space for formal documents (supported by the central server).

Figure 5: Document spaces in ArquiTec.

Any identified user can contribute new documents to the library and add annotations to existing documents. Annotations are automatically stored and indexed by the digital library, becoming conceptually attached to their documents.

The submission of an annotation may trigger automatic notifications, consisting of electronic mail messages sent to specific users, such as the authors of the document and the users who have retrieved it. A similar situation occurs when a new document is submitted.
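
A minimal Python sketch of how the recipients of such a notification could be selected is given below. The function name and the two lookup tables are hypothetical, and excluding the submitter from the recipients is an assumption of this sketch rather than a documented ArquiTec rule.

```python
def notification_recipients(doc_id, submitter, authors_of, readers_of):
    """Return the users to notify when `doc_id` receives a new annotation.

    `authors_of` and `readers_of` map document ids to sets of user ids.
    The submitter of the annotation is left out (an assumption of this
    sketch), and the result is sorted so that it is deterministic.
    """
    recipients = authors_of.get(doc_id, set()) | readers_of.get(doc_id, set())
    return sorted(recipients - {submitter})
```

The readers table corresponds to the interaction history kept in the user profiles, so the notification service needs no information beyond what the library already records.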

As mentioned above, the Portuguese National Library will maintain a central archive with a copy of formal or refereed documents (and related annotations). Currently, the documents to be archived are copied from local collections only after explicit selection by the library staff. The reason for this procedure is not technical but organizational, since it is difficult to foresee how often and in what manner different communities will use ArquiTec. This is also a completely new reality for the National Library which, despite its experience with printed material, has not yet had the chance to accumulate similar know-how in order to establish equivalent rules for on-line publications. In the future, after this point is clarified, it is expected that the central server will automatically gather specific kinds of documents (such as theses) from the local sites.

7. Concept Space

Thesauri and user contributions support our concept space.

The system recognizes two kinds of thesauri:

Formal sources: it is possible to import external formal thesauri, developed independently of ArquiTec.
Statistical thesauri: we attach great importance to the additional relationships automatically extracted from the document repositories and the user directory (for example, two subjects are related if at least one document refers to both of them or at least one user is interested in both of them, and such a relationship is shown as a link to the document or to the user profile).
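
The rule given for the statistical thesauri can be written down directly. The Python sketch below derives subject-to-subject relations from co-occurrence in documents and in user profiles, keeping the supporting document or user as the link to show. The function name and data layout are hypothetical, and no weighting or threshold is applied beyond the "at least one" criterion stated above.

```python
from itertools import combinations


def statistical_relations(doc_subjects, user_interests):
    """Derive subject-to-subject relations with their supporting evidence.

    `doc_subjects` maps document ids to sets of subjects; `user_interests`
    maps user ids to sets of subjects. Two subjects are related when they
    co-occur in at least one document or in at least one user profile.
    The result maps each alphabetically ordered subject pair to a set of
    ("document", id) or ("user", id) links.
    """
    relations = {}
    sources = (("document", doc_subjects), ("user", user_interests))
    for kind, mapping in sources:
        for owner, subjects in mapping.items():
            for pair in combinations(sorted(subjects), 2):
                relations.setdefault(pair, set()).add((kind, owner))
    return relations
```

Sorting the subjects before pairing them ensures that each relation is stored once, regardless of the order in which the two subjects appear in a document or profile.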

On the other hand, increasing scholarly and scientific activity has resulted in a growth of publications rich in new and interdisciplinary perspectives, raising serious problems for traditional libraries, whose collections have usually been classified with static structures. In order to deal with this dynamic classification problem, our digital library allows users to contribute by:

Suggesting new keywords for documents or questioning existing ones.
Suggesting new relationships to the thesauri or questioning existing ones.

This service gives users a means to interact with the library, not only to access it as an OPAC service but also to contribute new meta-knowledge that can notably enrich the system.

For the multilingual thesauri, the ISO-5964 standard was followed [9], and the structure was developed in MCF, a simple, flexible and portable format for meta-content representation [10].

8. Open Issues

The main research issues identified so far that require our attention include:

URNs: support for a more complete URN service than the one offered by PURL servers must be considered.
Security: requirements for secure authentication and certification authorities, important for example for managing documents with access restrictions (an issue related to the URN problem).
Natural language classification and search: with a special focus on the Portuguese language.
Long-term preservation: to ensure that the official repository survives the evolution of technology, such as new storage systems, document formats, etc. (this is of special concern to the National Library).

We have also been working on the integration of other spaces, accessible through new interfaces to local DIENST servers. Examples are interfaces for Z39.50, useful for the integration of OPAC systems such as the catalogs of conventional libraries, and for HARVEST brokers, useful for the support of informal publications and other similar material such as "home pages", archived mailing lists, etc.

References

[1] Davis, J. R. (1995). Creating a Networked Computer Science Technical Report Library. D-Lib Magazine, September 1995 (Available on 13 May 1997 at http://www.dlib.org/dlib/september95/09davis.html).

[2] Davis, J. R.; Lagoze, C. (1994). A protocol and server for a distributed digital technical report library. Technical Report TR94-1418, Computer Science Department, Cornell University, 1994.

[3] Manber, U.; Wu, S. (1993). GLIMPSE: A Tool to Search Through Entire File Systems. University of Arizona Technical Report TR 93-34.

[4] Lasher, R.; Cohen, D. (1995). RFC 1807: A Format for Bibliographic Records. June 1995 (Available on 13 May 1997 at http://ds.internic.net/rfc/rfc1807.txt).

[5] Kahn, R.; Wilensky, R. (1995). A Framework for Distributed Digital Object Services. CS-TR Report, May 1995 (Available on 13 May 1997 at http://WWW.CNRI.Reston.VA.US/home/cstr/arch/k-w.html).

[6] Weibel, S.; Jul, E. (1995). PURLs to improve access to Internet. OCLC Newsletter, November/December 1995, 19 (Updated version available on 13 May 1997 at http://purl.oclc.org/OCLC/PURL/SUMMARY).

[7] Yeong, W.; Howes, T.; Kille, S. (1995). RFC 1777: Lightweight Directory Access Protocol. IETF Network Working Group, March 1995 (Available on 13 May 1997 at http://ds.internic.net/rfc/rfc1777.txt).

[8] CCITT (1988). The X.500 Directory: Overview of Concepts, Models and Service. CCITT Recommendation X.500, 1988.

[9] International Organization for Standardization (1985). ISO-5964: Documentation. Guidelines for the establishment and development of multilingual thesauri. ISO, 1985.

[10] Guha, R. V. (1997). Meta-Content Framework. Apple Computer White Paper (Available on 13 May 1997 at http://mcf.research.apple.com/hs/mcf.html).