A Scalable Randomized Method to Compute Link-Based Similarity Rank on the Web Graph

Dániel Fogaras 1,2 and Balázs Rácz 1,2

1 Computer and Automation Research Institute of the Hungarian Academy of Sciences
2 Budapest University of Technology and Economics

fd@cs.bme.hu, bracz+c31@math.bme.hu

Abstract. Several iterative hyperlink-based similarity measures were published to express the similarity of web pages. However, it usually seems hopeless to evaluate complex similarity functions over large repositories containing hundreds of millions of pages. We introduce scalable algorithms computing SimRank scores, which express the contextual similarities of pages based on the hyperlink structure. The proposed methods scale well to large repositories, fulfilling strict requirements about computational complexity, memory usage and parallelization. The algorithms were tested on a set of ten million pages, but parallelization techniques make it possible to compute the SimRank scores even for the entire web with over 4 billion pages. The key idea of this paper is that randomized Monte Carlo methods combined with indexing techniques yield a scalable approximation of SimRank.

1 Introduction

Calculating the similarity of two web pages is a fundamental building block of web mining algorithms. For example, clustering algorithms classify pages based on some similarity function, and the related-pages queries of web search engines enumerate the pages most similar to some source page. Both applications require quickly computable similarity values that resemble human users' similarity preferences.

Successful application areas of link analysis include ranking [3,13] and clustering [1], as well as similarity search. The similarity scores of this paper are also evaluated solely from the hyperlink structure of web pages. Link-based methods are efficient since the link structure forms a homogeneous, language-independent dataset. Recent link-based similarity algorithms iterate equations expressing basic intuitions about similarity preferences. SimRank recursively refines the cocitation measure [12]. The algorithm of [8] follows the philosophy of HITS [13], claiming that related pages form a dense bipartite subgraph within the link structure. Additional similarity metrics for the web graph are presented in [9,14], for example using network flows in local subgraphs. Several more cocitation-based metrics are evaluated in [7] that are not complex enough (e.g., they can be spammed, or they utilize only the 1–2 neighborhood of the queried pages).

Unfortunately, the available implementations of complex link-based algorithms (SimRank, HITS-based methods and network flows) do not scale well to massive datasets. All of them require random access to the web graph (link structure), while the size of a web graph can far exceed the available main memory. Additionally, the memory and time requirements of SimRank are quadratic in the number of pages, which is unrealistic over large repositories. In contrast, PageRank requires only linear-time iterations, and its external memory issues are solved in [6].

This paper introduces novel algorithms computing SimRank scores that scale well to large repositories. For example, on a collection of 10M pages the basic SimRank algorithm would require up to 10^14 steps and a comparable amount of main memory to compute the SimRank scores for all pairs of pages. In contrast, our main algorithm precomputes roughly 100 fingerprints, each of size 800 MB, and the SimRank of two pages can later be computed with two disk seeks. We also give heuristics to reduce the total size of the database by a factor of 10 in a distributed system.
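As a quick sanity check of these sizes (under our own illustrative assumption of 8-byte vertex identifiers, which the paper does not specify): one fingerprint set stores a length-10 path for each of the 10M pages, i.e. 10^7 · 10 · 8 bytes ≈ 800 MB, and N ≈ 100 such sets give the roughly 100 × 800 MB index referred to above.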

Our solution simulates random navigational pathways and stores them as fingerprints in the index database. The expected similarity of two fingerprints equals the SimRank score derived from the random surfer pair model of [12]. Thus, our randomized algorithm approximates SimRank scores from several independent fingerprints. Fingerprints and Monte Carlo methods were already successfully applied to text-based similarity search; see [4,11].

This paper is organized as follows. In Sections 2 and 3 we present the main algorithm and the heuristics to compact the index. In Section 4 we formally analyze the approximation error due to randomization. In Section 5 we show that indexing and query evaluation can be parallelized very effectively to deal with extremely large repositories. Our algorithms were tested on a web crawl with 10M pages, and the method of [10] is applied to numerically compute the overall quality of our output SimRank scores, as described in Section 6.

1.1 Algorithms scaling to the whole Web Graph

In this section we state the strict computational complexity, memory usage and parallelization requirements for our SimRank algorithms. An algorithm that fulfills these requirements can be scaled on a distributed system even to a graph with billions of vertices, like the whole web graph. Similar requirements appear in [18]; furthermore, they characterize the computational environment in which Google's PageRank scores are computed [3,17].

The following components of similarity search need to be implemented:

– Indexing: precompute an index database from the web graph.

– Similarity query: calculate sim(v,w) for the web pages v and w using the index database.

– Top query: for a given query page v, enumerate the k most similar pages to v, i.e., the pages with the largest sim(v,·) scores, where k ≈ 100–500. See Table 1 for a sample top query.

The main contributions of this paper are the algorithms of Sections 2 and 3, fulfilling the following requirements:

– Computational complexity: the time complexity of a similarity query is constant, and that of a top query is proportional to the size of the result list. Furthermore, indexing is a linear-time algorithm in both the number of vertices and the number of edges of the web graph.

Table 1. Sample top query with category labels from the Open Directory Project and SimRank values. For a description of the calculation environment see Section 6.

  Rank  Page                     ODP category                                    SimRank
  2     www.dbum20.20m.com       Science/Technology/Space/Launch_Vehicles        0.0744
  4     www.tmgnow.com           Science/Anomalies_and_Alternative_Science/...   0.0201
  6     www.futurespace.de       Science/Technology/Space/Personal_Pages         0.0175
  8     discovery.larc.nasa.gov  Science/Technology/Space/Spacecraft_and...      0.0161
  10    OrbitExperience.com      Science/Technology/Space/NASA/Astronauts        0.0146
  ...
  422   ...                      ...                                             0.0002
  424   idr.janes.com            Science/Technology/Military_Science/Weapons     0.0002

For two pages v and w the SimRank score is defined by the recursive equation

    sim(v,w) = c / (|I(v)|·|I(w)|) · Σ_{v'∈I(v)} Σ_{w'∈I(w)} sim(v',w'),    (1)

where I(x) denotes the set of pages (vertices) linking to x, and c ∈ (0,1) is a constant, the decay factor. By definition sim(v,v) = 1, and if either I(v) or I(w) is empty, then sim(v,w) is 0. For a given directed graph and decay factor c, the sim(·,·) SimRank scores are defined as the solution of the above system of equations for all pairs of vertices v, w.
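Evaluated naively, equation (1) is a fixed-point iteration over all vertex pairs. The following minimal sketch (our illustration, not the paper's code; it assumes a small in_links map giving I(v) for each vertex) shows that iteration and makes the quadratic memory and time visible:

    def simrank_naive(in_links, c=0.65, iterations=10):
        """Naive fixed-point iteration of equation (1).

        in_links maps each vertex to the list I(v) of vertices linking to it.
        Keeps a score for every vertex pair, hence the quadratic memory and
        time that make this approach infeasible at web scale.
        """
        vertices = list(in_links)
        sim = {(v, w): 1.0 if v == w else 0.0
               for v in vertices for w in vertices}
        for _ in range(iterations):
            new_sim = {}
            for v in vertices:
                for w in vertices:
                    if v == w:
                        new_sim[(v, w)] = 1.0
                        continue
                    I_v, I_w = in_links[v], in_links[w]
                    if not I_v or not I_w:
                        new_sim[(v, w)] = 0.0   # 0 by definition
                        continue
                    total = sum(sim[(vp, wp)] for vp in I_v for wp in I_w)
                    new_sim[(v, w)] = c * total / (len(I_v) * len(I_w))
            sim = new_sim
        return sim

Each iteration refines all V^2 scores using equation (1), which is exactly the cost the fingerprint-based methods below avoid.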

Besides the recursive intuition and equations, SimRank can also be interpreted by the random surfer pair model. This interpretation will play a crucial role throughout our paper, so we recall this approach from [12], too. The random surfer pair model assumes that a pair of vertices (web pages) is similar if two random walks starting from the corresponding vertices and following the links backwards (are expected to) meet within a few steps. Based on this, a rich family of similarity functions can be defined by varying the weight that is given if the random walks first meet after exactly X backward steps. The next theorem of [12] states that by choosing weights exponentially decreasing with X, the obtained similarity scores are equivalent to the SimRank scores.

Table 2. Summary of methods, where V is the number of vertices, N is the number of fingerprints (N ≈ 100), L is the length of the fingerprints (L ≈ 5–10).

                              Basic     Compacted Fingerprint
  Index size                  V^2       2·V·N
  Indexing                    −         stream, N·L edge-scans
  Memory usage (indexing)     ≥ V^2     constant or V (inverted indexing)
  Memory usage (top query)    −         V·N (parallel memory, output sensitive)
  Distributed computing       −         horizontal

Theorem 1. Let X_{v,w} denote the first meeting time of two independent uniform random walks started from vertices v and w on the transposed web graph. Then the values E(c^{X_{v,w}}) solve the SimRank equations with decay factor c.

2 Our Algorithm

Our novel approach calculates the SimRank values based on Theorem 1 using the Monte Carlo method: to evaluate sim(v,w) we simulate N independent pairs of random walks starting from these pages on the transposed graph, and estimate SimRank by the average of the exponentially weighted meeting times. A naive implementation would simply simulate the random walks upon a query sim(v,w). Unfortunately, this requires practically random access to the edge set of the graph, which is prohibited by the memory requirements.
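As an illustration of this naive query-time estimator (a sketch under our own representation: in_links maps a page to the pages linking to it; all names are ours), one can simulate N truncated pairs of reverse walks and average c^X over the pairs that meet:

    import random

    def reverse_walk(in_links, start, length):
        """One truncated random walk following hyperlinks backwards."""
        path, node = [start], start
        for _ in range(length):
            preds = in_links.get(node, [])
            if not preds:            # no in-links: the walk stops early
                break
            node = random.choice(preds)
            path.append(node)
        return path

    def simrank_mc(in_links, v, w, c=0.65, N=100, L=10):
        """Average c**X over N walk pairs, X = first same-time meeting step."""
        total = 0.0
        for _ in range(N):
            pv = reverse_walk(in_links, v, L)
            pw = reverse_walk(in_links, w, L)
            for x in range(min(len(pv), len(pw))):
                if pv[x] == pw[x]:
                    total += c ** x  # pairs that never meet contribute 0
                    break
        return total / N

Every call touches random positions of the edge set, which is exactly what the precomputed index below eliminates.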

2.1 Fingerprints and Indexing

We will pre-compute N random walks for each vertex v and store these paths in an index. For practical purposes we truncate the random walks to a finite length L. Upon a query sim(v,w) we load the two sets of paths from the index database and compute the average weighted meeting time. This requires two database accesses. The algorithms are formalized as Algorithm 1.

A path of a random walk of length L starting from a vertex v and following the hyperlinks backwards will be referred to as a fingerprint of v. This captures the intuition that fingerprints depict the surroundings of vertices. Fingerprint[i][j][k] denotes the k-th node of the i-th fingerprint of vertex j, where i = 1,...,N and k = 1,...,L. Throughout, N = number of fingerprints, L = length of paths, c = decay factor of SimRank.

Algorithm 1 Indexing (uses random access to the graph)
1: for i := 1 to N do
2:   for all vertices j of the web graph do
3:     Fingerprint[i][j][] := random reversed path of length L starting from j
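A compact rendering of Algorithm 1 and the similarity query it enables might look as follows (a sketch reusing reverse_walk from above; the flat array layout of the real index is abstracted into a dictionary):

    def build_index(in_links, vertices, N=100, L=10):
        """Algorithm 1: store N truncated reverse walks per vertex."""
        return {j: [reverse_walk(in_links, j, L) for _ in range(N)]
                for j in vertices}

    def sim_query(fingerprints, v, w, c=0.65):
        """Estimate sim(v, w) from precomputed fingerprints.

        In the deployed system this is two database accesses: load the
        fingerprint sets of v and w, average the weighted meeting times.
        """
        total = 0.0
        for pv, pw in zip(fingerprints[v], fingerprints[w]):
            for x in range(1, min(len(pv), len(pw))):
                if pv[x] == pw[x]:   # first meeting after x backward steps
                    total += c ** x
                    break
        return total / len(fingerprints[v])

Pairing the i-th fingerprint of v with the i-th of w keeps the N estimates independent of each other, as the analysis in Section 4 requires.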

Algorithm 2 Indexing with external-memory edge-scans. N = number of fingerprints, L = length of paths. Uses the subroutine GenRndInEdges, which generates a random in-edge for each vertex of the graph and stores its source in an array.
1: for i := 1 to N do
2:   for all vertices j of the web graph do   /* start a path from each vertex */
3:     PathEnd[j] := j
4:   for k := 1 to L do
5:     NextIn[] := GenRndInEdges()
6:     for all vertices j of the web graph do   /* prolong the current paths with the new in-edges */
7:       PathEnd[j] := NextIn[PathEnd[j]]
8:     save PathEnd[] as Path_k[]
9:   merge the Path_k[j] arrays for k = 1..L into Fingerprint[i][j][k]

GenRndInEdges can be implemented by V/M edge-scans (where M denotes the available memory) over the edges as a data stream, or by O(log(E/M)) + 1 edge-scans by sorting the edges with an external memory sorting algorithm and generating all sets of InEdges with an additional scan.
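For illustration, GenRndInEdges itself fits this streaming model: with one counter and one slot per vertex, reservoir sampling picks a uniform random in-edge for every vertex in a single scan (a sketch assuming (source, target) edge tuples and integer vertex ids, which is our convention):

    import random

    def gen_rnd_in_edges(edge_stream, num_vertices):
        """One scan of the edges as a data stream.

        Reservoir sampling with a reservoir of size one: after the scan,
        next_in[w] holds a uniformly chosen in-neighbor of w (or None).
        """
        next_in = [None] * num_vertices
        seen = [0] * num_vertices           # in-degree observed so far
        for source, target in edge_stream:
            seen[target] += 1
            if random.randrange(seen[target]) == 0:
                next_in[target] = source   # replaced with probability 1/seen
        return next_in

When the two size-V arrays do not fit into memory, the vertex range can be partitioned and the stream scanned once per partition, matching the multiple-scan variants mentioned above.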

Algorithm 3 Indexing (compacted SimRank). N = number of fingerprints.
1: for i := 1 to N do
2:   InEdges[] := GenRndInEdges()
3:   OutEdges[] := InEdges[] transposed   /* this is a graph represented as adjacency lists, e.g. for each vertex j, OutEdges[j] returns the set of endpoints of outgoing edges (neighbors) */
4:   save InEdges[] as IndexIn[i][] and OutEdges[] as IndexOut[i][]

Algorithm 4 Top query (compacted SimRank). N = number of fingerprints.
...
10: return SimilarPages sorted by value

7 This path is unique as each vertex is linked to by one (or zero) edge in H.

The subgraph H fits into main memory according to the requirements of Section 1.1. This enables us to run more complex graph algorithms on H; for example, the top query will be calculated by a (modified) breadth-first search.

3.1 Indexing and Query Evaluation

The indexing method generates a compacted fingerprint H for each simulation with the same GenRndInEdges() function as in the previous section; see Algorithm 3 for the pseudocode. The main difference is that the algorithm does not save the actual paths for each vertex as in Algorithm 2, but saves the generating InEdges[] array and the transposed OutEdges[], i.e., the edges of the subgraph H.

To evaluate sim(v,w) queries we simulate the random walks from v and w, respectively, by following the edges pointed to by IndexIn[]. This requires random access to IndexIn[], which still fits into main memory as the array has size V. The top query for vertex q is evaluated as depicted in Algorithm 4: for each independent simulation, we have to generate all the paths that intersect the path of vertex q. So first we calculate the fingerprint path of q using the IndexIn[] array. Then for each node of that path we enumerate the outgoing paths using the transposed fingerprint IndexOut[]. For the k-th element of q's path we take exactly k steps forward in all possible directions pointed to by IndexOut[]. This generates all the vertices with positive similarity to q, and thus has time complexity linear in the output size. In the example of Section 1.1, exactly 424 documents had positive similarity to the query document.
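A sketch of this top-query evaluation (our rendering; index_in[i] is the saved IndexIn array, index_out[i] maps a vertex u to the set of vertices whose chosen random in-neighbor is u, and iterating k in increasing order makes the first credit correspond to the first same-time meeting):

    from collections import defaultdict

    def top_query(index_in, index_out, q, c=0.65, L=10):
        """Enumerate every page with positive compacted-SimRank score to q."""
        scores = defaultdict(float)
        N = len(index_in)
        for i in range(N):                  # one independent simulation
            path, node = [q], q             # fingerprint path of q
            for _ in range(L):
                node = index_in[i][node]
                if node is None:
                    break
                path.append(node)
            credited = {q}                  # first meetings already counted
            for k in range(1, len(path)):
                frontier = {path[k]}        # take exactly k steps forward
                for _ in range(k):
                    frontier = {w for u in frontier
                                  for w in index_out[i].get(u, ())}
                for w in frontier:
                    if w not in credited:   # first same-time meeting with q
                        credited.add(w)
                        scores[w] += c ** k
        return sorted(((w, s / N) for w, s in scores.items()),
                      key=lambda item: -item[1])

The work done is proportional to the number of paths intersecting q's path, i.e. to the output size, as claimed above.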

Both the similarity and top query algorithms use random access to H for each simulation. With the parallelization techniques discussed later in Section 5, it is possible to avoid loading the compacted fingerprints one by one into main memory upon a query.

3.2 Comparison with SimRank

The definition of fingerprints in compacted SimRank shows the analogy to the previous method; we drastically decrease the index size by "reusing" the randomly generated InEdges[]. Thus the estimated SimRank values diverge from the original definition, but in our web search engine tests the resulting similarity scores behave practically as well as the original ones. For the experimental results, see Section 6.1.

For the analogy, recall that original SimRank requires simulating pairs of backward random walks that are pairwise independent until the first meeting time of each pair. The following slightly different facts hold for compacted SimRank about the independence of fingerprints: 1. each fingerprint path makes independent steps until it first crosses itself; 2. a pair of fingerprints is independent until they first meet each other, even if they have performed different numbers of steps before. Notice that these differences are less likely to occur for shorter paths on average. Therefore, compacted SimRank is close to the original SimRank, especially for short path lengths.

The complexities of the two methods are compared in Table 2.

4 How Many Fingerprints are Needed?

In this section we estimate how well our methods approximate the actual SimRank values and how many fingerprints are sufficient for adequate results in a real-world application. It is obvious by Theorem 1 that for a sufficiently large number of fingerprints N, the estimates ŝim(u,v) returned by our randomized methods converge to sim(u,v), the actual SimRank. However, the storage requirement and run time for both indexing and queries are linear in N, so we want to minimize it. In this section we show that about 100 fingerprints are sufficient for applications.

For a given query sim(u,v) the fingerprints give N independent, identically distributed random variables X_1, ..., X_N, each of which has expected value sim(u,v). By the central limit theorem (commonly applied from N = 30 on), the result we calculate,

    ŝim(u,v) = (1/N) Σ_i X_i,

is approximately normally distributed around sim(u,v). To bound the probability of interchanging two pages in the ranking, let Y_1, ..., Y_N denote the analogous variables for a second query sim(u,w), and consider

    Pr{ŝim(u,v) < ŝim(u,w)} = Pr{(1/N) Σ_i X_i < (1/N) Σ_i Y_i} = Pr{(1/N) Σ_i (X_i − Y_i) < 0},

which converges to Φ(−√N · E(X−Y)/σ(X−Y)) by the central limit theorem. As E(X) = sim(u,v) > E(Y) = sim(u,w), and Φ(x) ≤ (1/(√(2π)·|x|))·e^{−x²/2} for x < 0, this gives us

    Pr{ŝim(u,v) < ŝim(u,w)} ≈ Φ(−√N · E(X−Y)/σ(X−Y)) ≤ c₁·e^{−c₂·N}.

The proof is robust enough to apply to practically any ordering calculated by Monte Carlo methods; for example, it applies to our methods where the infinite random walks of Theorem 1 are approximated by paths of length L.

To get actual probabilities from the previous proof one should calculate or estimate E(X−Y), which is very hard. Instead, as is common when evaluating Monte Carlo methods, we measured the empirical variance during our experiments and got typical values ranging from 0.0001 to 0.01. Even in the worst scenario, the probability at N = 100 of a medium-ranked page getting into the top few results is less than 2.8·10^{-7}, while interchanging two top-ranked pages has probability less than 0.0062. The probability of interchanging two low-ranked pages is ≈ 0.16.

This suggests that N = 100 gives more than adequate precision for a search engine. If the result list is ordered by some other function (for example PageRank), then even fewer fingerprints are enough to distinguish between unrelated pages, low-similarity pages and top-similarity pages.
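As a numeric check of this bound (the gaps E(X−Y) below are our illustrative choices; the variance is the worst empirical value quoted above):

    from math import erf, sqrt

    def phi(x):
        """Standard normal CDF."""
        return 0.5 * (1.0 + erf(x / sqrt(2.0)))

    def interchange_prob(mean_gap, variance, N=100):
        """Normal approximation of Pr{sim-hat(u,v) < sim-hat(u,w)}."""
        return phi(-sqrt(N) * mean_gap / sqrt(variance))

    # Worst measured variance 0.01, i.e. sigma(X - Y) = 0.1:
    print(interchange_prob(0.01, 0.01))   # ~0.159, near-tied low-ranked pages
    print(interchange_prob(0.05, 0.01))   # ~2.9e-7, a well-separated pair

With a score gap of 0.01 the swap probability is Φ(−1) ≈ 0.16, and a gap of 0.05 already drives it below 3·10^{-7}, in line with the figures quoted above.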

5 Scaling to the Whole Web

In this section we briefly summarize the possibilities for parallelizing the algorithms described earlier. Our aim is to apply the methods to graphs as large as the whole web graph, i.e., a few billion vertices. The platform considered here for such applications is a very large cluster of PC-category machines [2], interconnected by some form of network. We consider the requirements for minimum operation, load balancing and fault tolerance to show that our methods are applicable even in the largest publicly available search services.

The key observation is the possibility of distributing the N independent fingerprints to N different machines [19]. We call this coarse horizontal parallelization. In this case the index builds can run in parallel, with the necessary edge-scans distributed as broadcast data over the cluster. Each computer uses its own random number generator to calculate a different and independent set of fingerprints. The resulting indices are stored only locally. The query to evaluate is also broadcast to all participating machines, which return their individual result lists for merging by the front-end server. The network transfer required is proportional to the length of the result list (the number of pages with positive similarity to the query page) even with the most naive approach. Alternatively, a more advanced distributed top-query method can be used to merge the result sets with much less network traffic, or even to permit partial evaluation at the index servers. Distributed top queries were studied in depth recently; see for example [5].
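A minimal sketch of the front-end merge under this coarse horizontal parallelization (the result format, partial sums of c^X plus a fingerprint count per server, is our assumption):

    from collections import defaultdict

    def merge_results(server_results):
        """Front-end merge of per-server top-query results.

        server_results: (partial_scores, n_fingerprints) per index server,
        where partial_scores maps page -> sum of c**X over that server's
        own fingerprints. Dropping a failed server only lowers the
        effective N; the merged estimate stays valid.
        """
        totals, n_total = defaultdict(float), 0
        for partial_scores, n in server_results:
            n_total += n
            for page, s in partial_scores.items():
                totals[page] += s
        return sorted(((p, s / n_total) for p, s in totals.items()),
                      key=lambda item: -item[1])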

This utilizes at most N machines, each of which has enough memory for V cells for indexing, or for queries with compacted SimRank. For queries with the original SimRank the only resource required is database accesses (disk seeks).

Thanks to the robustness of Monte Carlo methods, this method offers intrinsic load balancing and fault tolerance. Assume we have more than N machines, each of which has an index. For a result of adequate precision it is enough to ask any N machines and merge their result lists. This offers smooth load balancing for an arbitrary number of index servers. As for fault tolerance, skipping one fingerprint and merging N−1 result lists instead of N does not change the result list significantly, so failed servers do not prevent proper operation. This enables a very effective combination of load balancing and fault tolerance: if the query load is larger than the available number of machines can serve, then the number of fingerprints used for a single query is decreased automatically, losing some precision but maintaining adequate response times. In case of underload, precision can be increased for free to achieve better results.

6 Experiments

In our experiments the value sibling Γ [10] measures the quality of the similarity ranking by comparing it to a ground truth similarity ordering extracted from the Open Directory Project (ODP, [16]) data, a hierarchical collection of webpages. The category tree of ODP provides ground truth similarity pairs by claiming that sim(u,v) > sim(u,w) whenever v lies closer to u in the category tree than w does.

[Figure: sibling Γ (y-axis, 0.40–0.54) plotted against a 10M–90M range (x-axis).]

6.1 Experimental results

Our numerical experiments address the problem of finding the optimal settings over the 3-dimensional parameter space: path length L, decay factor c and number of fingerprints N. Performing a full optimization was beyond the scope of our computing capacity, so we fixed the default values L = 10, N = 100 and c = 0.65; then the parameters were analyzed independently by varying only one parameter at a time, see Fig. 1.

The left-hand-side bars of the clustered bar charts of parts a) and b) of Fig. 1 show that both the number of comparable pairs and the quality increase with L. Recall that the SimRank scores approximated with path length L correspond to L iterations of the recursive equations. Thus, the growing quality supports the recursive intuition of SimRank. However, more recursion slightly reduces the quality among the pages with low SimRank scores, as Γ was slightly decreasing after L = 7.

The right-hand-side bars of parts a) and b) show the quality loss of the compacted SimRank scores compared to SimRank. The results fulfill the expectations of Section 3.2: the longer the paths, the more quality loss occurs on average. However, in a real application it may be reasonable to trade this quality difference for the reduced index database, compacted to L times smaller.

The amount of comparable pairs in figure c) increases linearly with N. The reason is simple: more fingerprint paths cross each other, and thus make more similarity pairs comparable. Part d) generally shows a quality fall in sibling Γ. We argue that in the case of N = 10 the pairs with positive approximated SimRank scores have relatively large SimRank, and such pairs approximate the ground truth similarity better than those having smaller SimRank values.

The last figure, e), shows that the decay factor c is inversely proportional to sibling Γ. The similarity-pairs chart is omitted, since the number of pairs remains unchanged. When c → 0 and N is fixed, the weight given for a shorter meeting time largely supersedes the weight given for longer meeting times. Our measurements imply that shorter paths should generally be given higher weight than the sum of an arbitrary number of longer paths.

7 Conclusion

We showed a method for computing the SimRank similarity scores of web pages. SimRank is based purely on the link structure of the web graph, just as PageRank, but previously had only algorithms with quadratic memory requirements and running time. Our method uses linear-time randomized fingerprint-based calculations and can be implemented on V pages using V cells of memory, which can be further reduced to a constant by using external memory sorts. When parallelized on a cluster of machines this method can cope with the whole web graph, making it suitable for the largest-scale commercial web search engines, featuring load balancing and fault tolerance intrinsically. Furthermore, we presented an approximation with lower precision but drastically reduced index database sizes, which can be served entirely from main memory in case of extreme workloads.

References

1. Pavel Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
2. E. Brewer. Lessons from giant-scale services. IEEE Internet Computing, 5(4):46–55, 2001.
3. Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7):107–117, 1998.
4. A. Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences 1997, page 21. IEEE Computer Society, 1997.
5. Nicolas Bruno, Luis Gravano, and Amelie Marian. Evaluating top-k queries over web-accessible databases. In ICDE, 2002.
6. Yen-Yu Chen, Qingqing Gan, and Torsten Suel. I/O-efficient techniques for computing PageRank. In Proceedings of the Eleventh International Conference on Information and Knowledge Management, pages 549–557. ACM Press, 2002.
7. Marco Cristo, Pável Calado, Edleno Silva de Moura, Nivio Ziviani, and Berthier Ribeiro-Neto. Link information as a similarity measure in web classification. In String Processing and Information Retrieval, pages 43–55. Springer LNCS 2857, 2003.
8. Jeffrey Dean and Monika R. Henzinger. Finding related pages in the World Wide Web. Computer Networks (Amsterdam, Netherlands: 1999), 31(11–16):1467–1479, 1999.
9. Gary William Flake, Steve Lawrence, C. Lee Giles, and Frans Coetzee. Self-organization of the web and identification of communities. IEEE Computer, 35(3):66–71, 2002.
10. Taher H. Haveliwala, Aristides Gionis, Dan Klein, and Piotr Indyk. Evaluating strategies for similarity search on the web. In Proceedings of the 11th World Wide Web Conference (WWW), pages 432–442. ACM Press, 2002.
11. N. Heintze. Scalable document fingerprinting. In 1996 USENIX Workshop on Electronic Commerce, November 1996.
12. Glen Jeh and Jennifer Widom. SimRank: A measure of structural-context similarity. In Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2002.
13. Jon Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
14. Wangzhong Lu, Jeannette Janssen, Evangelos Milios, and Nathalie Japkowicz. Node similarity in networked information spaces. In Proceedings of the 2001 Conference of the Centre for Advanced Studies on Collaborative Research, page 11. IBM Press, 2001.
15. Ulrich Meyer, Peter Sanders, and Jop Sibeyn. Algorithms for Memory Hierarchies, Advanced Lectures. LNCS, Springer, 2003.
16. Open Directory Project (ODP). http://www.dmoz.org.
17. Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.
18. Christopher R. Palmer, Phillip B. Gibbons, and Christos Faloutsos. ANF: A fast and scalable tool for data mining in massive graphs. In Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 81–90. ACM Press, 2002.
19. J. S. Rosenthal. Parallel computing and Monte Carlo algorithms. Far East Journal of Theoretical Statistics, 4:207–236, 2000.
20. Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes (2nd ed.): Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers Inc., 1999.
