Full-text indexing of non-textual resources (paper by David Byers)

Relevance

http://doi.org/10.1016/S0169-7552(98)00059-2

Full-text indexing of resources on the World Wide Web is limited to simple content types, such as HTML and plain text. More complex content types, such as PostScript, PDF, and proprietary word-processing formats, are excluded, even though such documents are usually rich in content.

These resources are excluded because extracting a textual representation from them would be too expensive and too difficult.

A search engine operator has little incentive to spend the additional resources needed to handle such documents: the gain would be fairly small, and search engines are extremely popular even when limited to HTML and plain-text documents.

The situation is quite different from the content provider's point of view. A site may keep a significant amount of its content in non-textual documents, and the content provider may still want those documents indexed by ordinary search engines.

In this paper we present several server-side solutions that allow existing indexing software to index the textual representation of non-textual resources.
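The abstract does not say what the server-side solutions are, so the following is only a minimal sketch of one possible approach and not the paper's method. It assumes the server keeps a pre-extracted plain-text version next to each non-textual file and serves that version to indexing robots. The crawler markers, the `Resource` type, and the function names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical substrings used to recognise indexing robots by User-Agent.
# These are illustrative assumptions, not taken from the paper.
CRAWLER_MARKERS = ("bot", "crawler", "spider")


@dataclass(frozen=True)
class Resource:
    path: str          # original non-textual file, e.g. a PDF
    content_type: str  # MIME type of the original file
    text_path: str     # assumed pre-extracted plain-text version


def is_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent contains one of the assumed markers."""
    ua = user_agent.lower()
    return any(marker in ua for marker in CRAWLER_MARKERS)


def select_representation(resource: Resource, user_agent: str) -> tuple[str, str]:
    """Pick the file and MIME type to serve for this request.

    Indexing robots receive the plain-text version so that existing
    indexing software can process it; other clients get the original.
    """
    if is_crawler(user_agent):
        return resource.text_path, "text/plain"
    return resource.path, resource.content_type


report = Resource("report.pdf", "application/pdf", "report.txt")
crawler_choice = select_representation(report, "ExampleBot/1.0")
browser_choice = select_representation(report, "Mozilla/5.0")
```

Under these assumptions, a request from a robot is answered with `report.txt` as `text/plain`, while a browser still receives the original PDF. Matching on the User-Agent header is just one way to tell the two apart; the same selection could be driven by the request's `Accept` header instead.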

Copyrighted content. Last updated: 2022-02-07 10:16:17
