Authoritative Sources in a Hyperlinked Environment
Dept. of Computer Science, Cornell University, Ithaca NY 14853. Email: kleinber@cs.cornell.edu.
This work was performed in large part while on leave at the IBM Almaden Research Center, San Jose CA
95120. The author is currently supported by an Alfred P. Sloan Research Fellowship and by NSF Faculty
Early Career Development Award CCR-9701399.
The network structure of a hyperlinked environment can be a rich source of information
about the content of the environment, provided we have effective means for understanding
it. In this work, we develop a set of algorithmic tools for extracting information from the
link structures of such environments, and report on experiments that demonstrate their
effectiveness in a variety of contexts on the World Wide Web (www) [4]. In particular, we
focus on the use of links for analyzing the collection of pages relevant to a broad search topic,
and for discovering the most "authoritative" pages on such topics.
While our techniques are not specific to the www, we find the problems of search and
structural analysis particularly compelling in the context of this domain. The www is a
hypertext corpus of enormous complexity, and it continues to expand at a phenomenal rate.
Moreover, it can be viewed as an intricate form of populist hypermedia, in which millions
of on-line participants, with diverse and often conflicting goals, are continuously creating
hyperlinked content. Thus, while individuals can impose order at an extremely local level,
its global organization is utterly unplanned; high-level structure can emerge only through
a posteriori analysis.
Our work originates in the problem of searching on the www, which we could define roughly as the process of discovering pages that are relevant to a given query. The quality of a search method necessarily requires human evaluation, due to the subjectivity inherent in notions such as relevance. We begin from the observation that improving the quality of search methods on the www is, at the present time, a rich and interesting problem that is in many ways orthogonal to concerns of algorithmic efficiency and storage. In particular, consider that current search engines typically index a sizable portion of the www and respond on
the order of seconds. Although there would be considerable utility in a search tool with a
longer response time, provided that the results were of significantly greater value to a user,
it has typically been very hard to say what such a search tool should be computing with this
extra time. Clearly we are lacking objective functions that are both concretely defined and
correspond to human notions of quality.
Queries and Authoritative Sources. We view searching as beginning from a user-
supplied query. It seems best not to take too unified a view of the notion of a query; there
is more than one type of query, and the handling of each may require different techniques.
Consider, for example, the following types of queries.
Specific queries. E.g., "Does Netscape support the JDK 1.1 code-signing API?"
Broad-topic queries. E.g., "Find information about the Java programming language."
Similar-page queries. E.g., "Find pages 'similar' to java.sun.com."
Concentrating on just the first two types of queries for now, we see that they present very
different sorts of obstacles. The difficulty in handling specific queries is centered, roughly,
around what could be called the Scarcity Problem: there are very few pages that contain the
required information, and it is often difficult to determine the identity of these pages.
For broad-topic queries, on the other hand, one expects to find many thousands of relevant
pages on the www; such a set of pages might be generated by variants of term-matching
(e.g. one enters a string such as "Gates," "search engines," or "censorship" into a search
engine such as AltaVista [17]), or by more sophisticated means. Thus, there is not an
issue of scarcity here. Instead, the fundamental difficulty lies in what could be called the
Abundance Problem: The number of pages that could reasonably be returned as relevant is
far too large for a human user to digest. To provide effective search methods under these
conditions, one needs a way to filter, from among a huge collection of relevant pages, a small
set of the most "authoritative" or "definitive" ones.
This notion of authority, relative to a broad-topic query, serves as a central focus in our
work. One of the fundamental obstacles we face in addressing this issue is that of accurately
modeling authority in the context of a particular query topic. Given a particular page, how
do we tell whether it is authoritative?
It is useful to discuss some of the complications that arise here. First, consider the natural
goal of reporting www.harvard.edu, the home page of Harvard University, as one of the most
authoritative pages for the query "Harvard". Unfortunately, there are over a million pages
on the www that use the term "Harvard," and www.harvard.edu is not the one that uses
the term most often, or most prominently, or in any other way that would favor it under
a text-based ranking function. Indeed, one suspects that there is no purely endogenous
measure of the page that would allow one to properly assess its authority. Second, consider
the problem of finding the home pages of the main www search engines. One could begin
from the query "search engines", but there is an immediate difficulty in the fact that
many of the natural authorities (Yahoo!, Excite, AltaVista) do not use the term on their
pages. This is a fundamental and recurring phenomenon; as another example, there is
no reason to expect the home pages of Honda or Toyota to contain the term "automobile
manufacturers."
Analysis of the Link Structure. Analyzing the hyperlink structure among www pages
gives us a way to address many of the difficulties discussed above. Hyperlinks encode a
considerable amount of latent human judgment, and we claim that this type of judgment
is precisely what is needed to formulate a notion of authority. Specifically, the creation of
a link on the www represents a concrete indication of the following type of judgment: the
creator of page p, by including a link to page q, has in some measure conferred authority on
q. Moreover, links afford us the opportunity to find potential authorities purely through the
pages that point to them; this offers a way to circumvent the problem, discussed above, that
many prominent pages are not sufficiently self-descriptive.
Of course, there are a number of potential pitfalls in the application of links for such
a purpose. First of all, links are created for a wide variety of reasons, many of which
have nothing to do with the conferral of authority. For example, a large number of links
are created primarily for navigational purposes ("Click here to return to the main menu");
others represent paid advertisements.
Another issue is the difficulty in finding an appropriate balance between the criteria of
relevance and popularity, each of which contributes to our intuitive notion of authority. It
is instructive to consider the serious problems inherent in the following simple heuristic for
locating authoritative pages: Of all pages containing the query string, return those with the
greatest number of in-links. We have already argued that for a great many queries ("search
engines", "automobile manufacturers", . . . ), a number of the most authoritative pages
do not contain the associated query string. Conversely, this heuristic would consider a
universally popular page such as www.yahoo.com or www.netscape.com to be highly authoritative
with respect to any query string that it contained.
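To make the heuristic concrete, the following minimal sketch ranks the pages containing the query string by raw in-link count. The page-text dictionary and outgoing-link map are illustrative assumptions, not structures taken from the paper; the sketch exists only to show what the heuristic computes, and hence why it suffers both failure modes just described.

from collections import Counter

def naive_inlink_ranking(query, page_text, out_links, k=10):
    """Return the k pages containing `query` with the most in-links.

    page_text: dict mapping page URL -> page text (hypothetical structure).
    out_links: dict mapping page URL -> iterable of URLs it links to.
    """
    # Count in-links by inverting the outgoing-link map.
    in_link_count = Counter()
    for source, targets in out_links.items():
        for target in targets:
            in_link_count[target] += 1

    # Keep only pages whose text contains the query string.
    candidates = [p for p, text in page_text.items()
                  if query.lower() in text.lower()]

    # Rank candidates purely by in-link count.
    candidates.sort(key=lambda p: in_link_count[p], reverse=True)
    return candidates[:k]

Pages like www.harvard.edu that do not use the query term prominently never enter the candidate list, while a universally popular page that happens to mention the term rises to the top regardless of relevance.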
In this work, we propose a link-based model for the conferral of authority, and show how
it leads to a method that consistently identifies relevant, authoritative www pages for broad
search topics. Our model is based on the relationship that exists between the authorities for
a topic and those pages that link to many related authorities; we refer to pages of this
latter type as hubs. We observe that a certain natural type of equilibrium exists between
hubs and authorities in the graph defined by the link structure, and we exploit this to develop
an algorithm that identifies both types of pages simultaneously. The algorithm operates on
focused subgraphs of the www that we construct from the output of a text-based www
search engine; our technique for constructing such subgraphs is designed to produce small
collections of pages likely to contain the most authoritative pages for a given topic.
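The following is a minimal sketch of the kind of mutually reinforcing hub/authority computation described above, ahead of the formal development in Sections 3 and 4. The graph representation (a dict mapping each page to the pages it links to), the fixed iteration count, and the normalization are illustrative assumptions rather than the paper's exact formulation.

import math

def hubs_and_authorities(out_links, iterations=50):
    """Iteratively refine hub and authority scores over a link graph."""
    pages = set(out_links) | {q for qs in out_links.values() for q in qs}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # A page looks like a good authority if good hubs point to it.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in out_links.items():
            for q in targets:
                new_auth[q] += hub[p]
        # A page looks like a good hub if it points to good authorities.
        new_hub = {p: sum(new_auth[q] for q in out_links.get(p, ()))
                   for p in pages}

        # Normalize so the scores stay bounded from one iteration to the next.
        auth_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / auth_norm for p, v in new_auth.items()}
        hub = {p: v / hub_norm for p, v in new_hub.items()}

    return hub, auth

Pages with the largest authority scores would then be reported as candidate authorities for the topic, and pages with the largest hub scores as candidate hubs.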
Overview. Our approach to discovering authoritative www sources is meant to have a
global nature: We wish to identify the most central pages for broad search topics in the
context of the www as a whole. Global approaches involve basic problems of representing
and filtering large volumes of information, since the entire set of pages relevant to a broad-
topic query can have a size in the millions. This is in contrast to local approaches that seek
to understand the interconnections among the set of www pages belonging to a single logical
site or intranet; in such cases the amount of data is much smaller, and often a different set
of considerations dominates.
It is also important to note the sense in which our main concerns are fundamentally
different from problems of clustering. Clustering addresses the issue of dissecting a hetero-
geneous population into sub-populations that are in some way more cohesive; in the context
of the www, this may involve distinguishing pages related to different meanings or senses
of a query term. Thus, clustering is intrinsically different from the issue of distilling broad
topics via the discovery of authorities, although a subsequent section will indicate some con-
nections. For even if we were able perfectly to dissect the multiple senses of an ambiguous
query term (e.g. "Windows" or "Gates"), we would still be left with the same underlying
problem of representing and filtering the vast number of pages that are relevant to each of
the main senses of the query term.
The paper is organized as follows. Section 2 discusses the method by which we construct
a focused subgraph of the www with respect to a broad search topic, producing a set of
relevant pages rich in candidate authorities. Sections 3 and 4 discuss our main algorithm
for identifying hubs and authorities in such a subgraph, and some of the applications of this
algorithm. Section 5 discusses the connections with related work in the areas of www search,
bibliometrics, and the study of social networks. Section 6 describes how an extension of our
basic algorithm produces multiple collections of hubs and authorities within a common link
structure. Finally, Section 7 investigates the question of how "broad" a topic must be in
order for our techniques to be effective, and Section 8 surveys some work that has been done
on the evaluation of the method presented here.