
Technische Universität Darmstadt
Knowledge Engineering Group
Hochschulstrasse 10, D-64289 Darmstadt, Germany
http://www.ke.informatik.tu-darmstadt.de
Technical Report TUD–KE–2007–03
Sang-Hyeun Park, Johannes Fürnkranz
Efficient Pairwise Classification and
Ranking
A short version of this paper appeared as:
Sang-Hyeun Park, Johannes Fürnkranz. Efficient Pairwise Classification, Proceedings of the 18th European Conference on Machine Learning (ECML-07),
Springer-Verlag, 2007

Efficient Pairwise Classification and Ranking
Sang-Hyeun Park (park@ke.informatik.tu-darmstadt.de)
Johannes Fürnkranz (juffi@ke.informatik.tu-darmstadt.de)
Knowledge Engineering Group
Department of Computer Science
TU Darmstadt, Germany
Abstract
Pairwise classification is a class binarization procedure that converts a multi-class problem into a series of two-class problems, one problem for each pair of classes. While it can be shown that for training, this procedure is more efficient than the more commonly used one-against-all approach, it still has to evaluate a quadratic number of classifiers when computing the predicted class for a given example. In this paper, we propose a method that allows a faster computation of the predicted class when weighted or unweighted voting are used for combining the predictions of the individual classifiers. While its worst-case complexity is still quadratic in the number of classes, we show that even in the case of completely random base classifiers, our method still outperforms the conventional pairwise classifier. For the more practical case of well-trained base classifiers, its asymptotic computational complexity seems to be almost linear. We also propose a method for approximating the full class ranking, based on the Swiss System, a common scheme for conducting multi-round chess tournaments. Our results indicate that this adaptive scheme offers a better trade-off between approximation quality and number of performed comparisons than alternative, fixed schemes for ordering the evaluation of the pairwise classifiers.
1. Introduction
Many learning algorithms can only deal with two-class problems. For multi-class problems, they have to rely on class binarization procedures that transform the original learning problem into a series of binary learning problems. A standard solution for this problem is the one-against-all approach, which constructs one binary classifier for each class, where the positive training examples are those belonging to this class and the negative training examples are formed by the union of all other classes.
An alternative approach, known as pairwise classification or round robin classification, has recently gained attention (Fürnkranz, 2002; Wu et al., 2004). Its basic idea is to transform a c-class problem into c(c−1)/2 binary problems, one for each pair of classes. This approach has been shown to produce more accurate results than the one-against-all approach for a wide variety of learning algorithms such as support vector machines (Hsu & Lin, 2002) or rule learning algorithms (Fürnkranz, 2002). Moreover, Fürnkranz (2002) has also proved that despite the fact that its complexity is quadratic in the number of classes, the algorithm can in fact be trained faster than the conventional one-against-all technique.¹ However, in order to obtain a final prediction, we still have to combine the predictions of all c(c−1)/2 classifiers, which can be very inefficient for large values of c.
The main contribution of this paper is a novel solution for this problem. Unlike previous proposals (such as (Platt et al., 2000); cf. Section 3.2), our approach is not heuristic but is guaranteed to produce exactly the same prediction as the full pairwise classifier, which in turn has been shown to optimize the Spearman rank correlation with the target labels (Hüllermeier & Fürnkranz, 2004b). In essence, the algorithm iteratively selects and evaluates pairwise classifiers, using a simple heuristic to minimize the number of pairwise classifiers that are needed to determine the correct top rank class of the complete (weighted) voting. We will describe and evaluate this algorithm in Section 3.

¹ It is easy to see this if one considers that in the one-against-all case each training example is used c times (namely in each of the c binary problems), while in the round robin approach each example is only used c−1 times, namely only in those binary problems where its own class is paired against one of the other c−1 classes.
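As a quick numerical illustration of footnote 1 (the concrete numbers are chosen for illustration only and do not appear in the original analysis): with c = 10 classes and n = 100 training examples per class, one-against-all presents each of the 1,000 examples to c = 10 binary problems, i.e. 10 · 1,000 = 10,000 training example presentations in total, whereas the pairwise approach trains c(c−1)/2 = 45 binary classifiers on 2n = 200 examples each, i.e. 45 · 200 = 9,000 = 1,000 · (c−1) presentations.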
In Section 4, we will then investigate the case when we are not only interested in the prediction of a single class, but in a ranking of all possible classes. For this case, we propose the so-called Swiss System, an algorithm that is commonly used for organizing multi-round chess tournaments. Its key idea is to always evaluate classifiers that have similar positions in the current, incomplete ranking. Our results show that this algorithm offers a good trade-off between the number of evaluated classifiers and the quality of the approximation of the complete ranking.
2. Pairwise Classification
In the following, we assume that a multi-class problem has c classes, which we denote with c_1, ..., c_c.
A pairwise or round robin classifier trains a set of c(c−1)/2 binary classifiers C_{i,j}, one for each pair of classes (c_i, c_j), i < j. At classification time, each binary classifier C_{i,j} is queried and issues a vote (a prediction for either c_i or c_j) for the given example. This can be compared with sports and games tournaments, in which each player plays each other player once. In each game, the winner receives a point, and the player with the maximum number of points is the winner of the tournament. In our case, the class with the maximum number of votes is predicted (ties are broken arbitrarily in favor of the larger class). In this paper, we will assume binary classifiers that return class probabilities p(c_i | c_i ∨ c_j) and p(c_j | c_i ∨ c_j). These can be used for weighted voting, i.e., we predict the class that receives the maximum number of votes:

    c* = arg max_{i=1...c} Σ_{j=1}^{c} p(c_i | c_i ∨ c_j)

This procedure optimizes the Spearman rank correlation with the target ranking (Hüllermeier & Fürnkranz, 2004b). Other algorithms for combining votes exist (cf. pairwise coupling (Hastie & Tibshirani, 1998; Wu et al., 2004)), but are not subject of this paper.²

Note that weighted or unweighted voting produce a ranking of all classes. For prediction problems, one is typically only interested in the top ranked class, but in some applications one might also be interested in the complete ranking of classes. We will focus on classification in the next section; ranking will be considered in Section 4.

² Furthermore, comparisons in (Hüllermeier & Fürnkranz, 2004a) with more advanced methods showed that the simple voting method is competitive with more complex methods.

3. Efficient Pairwise Classification

3.1 The QWeighted Algorithm

Weighted or unweighted voting predicts the top rank class by returning the class with the highest accumulated voting mass after evaluation of all pairwise classifiers. During such a procedure there exist many situations where particular classes can be excluded from the set of possible top rank classes, even if they receive the maximal voting mass in all remaining evaluations. Consider the following simple example: Given c classes with c > j, if class a has received more than c − j votes and class b has lost j votes, it is impossible for b to achieve a higher total voting mass than a. Thus further evaluations involving b can be safely ignored.

To increase the reduction of evaluations we are interested in obtaining such exploitable situations frequently. Pairwise classifiers will be selected depending on a loss value, which is the amount of potential voting mass that a class has not received. More specifically, the loss l_i of a class c_i is defined as l_i := p_i − v_i, where p_i is the number of evaluated incident classifiers of c_i and v_i is the current vote amount of c_i. Obviously, the loss begins with a value of zero and is monotonically increasing.³ The class with the current minimal loss is one of the top candidates for the top rank class.
Algorithm 1: QWeighted (Quick Weighted Voting)

    begin
        while c_top not determined do
            c_a ← class c_i ∈ K with minimal l_i;
            c_b ← class c_j ∈ K \ {c_a} with minimal l_j whose classifier C_{a,b} has not yet been evaluated;
            if no such c_b exists then
                c_top ← c_a;
            else
                v_ab ← Evaluate(C_{a,b});
                l_a ← l_a + (1 − v_ab);
                l_b ← l_b + v_ab;
    end

First, the pairwise classifier C_{a,b} is selected for which the losses l_a and l_b of the relevant classes c_a and c_b are minimal, provided that the classifier C_{a,b} has not yet been evaluated. In the case of multiple classes that have the same minimal loss, there exists no further distinction, and we select a class randomly from this set. Then, the losses l_a and l_b are updated based on the evaluation returned by C_{a,b} (recall that v_ab is interpreted as the amount of the voting mass of the classifier C_{a,b} that goes to class c_a and 1 − v_ab is the amount that goes to class c_b). These two steps are repeated until all classifiers for the class c_m with the minimal loss have been evaluated. Thus the current (estimated) loss l_m is the correct loss for this class. As all other classes already have a greater loss, c_m is the correct top rank class.

Theoretically, a minimal number of c − 1 comparisons is possible (best case). Assuming that the incident classifiers of the correct top rank class c_top always return the maximum voting amount (l_top = 0), c_top is always in the set {c_j ∈ K | l_j = min_{c_i ∈ K} l_i}. In addition, c_top should be selected as the first class in step 1 of the algorithm among the classes with the minimal loss value. It follows that exactly c − 1 comparisons will be evaluated, more precisely all incident classifiers of c_top. The algorithm terminates and returns c_top as the correct top rank.

The worst case, on the other hand, is still c(c−1)/2 comparisons, which can, e.g., occur if all pairwise classifiers classify randomly with a probability of 0.5. In practice, the number of comparisons will be somewhere between these two extremes, depending on the nature of the problem. The next section will evaluate this trade-off.

³ This loss is essentially identical to the voting-against principle introduced by Cutzu (2003a; 2003b), which we will discuss later on in Section 3.2.

3.2 Related Work

Cutzu (2003a; 2003b) recognized the importance of the voting-against principle and observed that it allows to reliably conclude a class when not all of the pairwise classifiers are present. For example, Cutzu claims that using the voting-against rule one could correctly predict class i even if none of the pairwise classifiers C_{ik} (k = 1...c, k ≠ i) are used. However, this argument is based on the assumption that all base classifiers classify correctly. Moreover, if there is a second class j that should ideally receive c − 2 votes, voting-against could only conclude a tie between classes i and j, as long as the vote of classifier C_{ij} is not known. The main contribution of his work, however, is a method for computing posterior class probabilities in the voting-against scenario. Our approach builds upon the same ideas as Cutzu's, but our contribution is an algorithm that exploits the voting-against principle to effectively increase the prediction efficiency of pairwise classifiers without changing the predicted results.
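For concreteness, the following is a small illustrative sketch of the QWeighted loop of Algorithm 1 (hypothetical Python code, not the authors' implementation; the callable evaluate(a, b) is an assumed interface that returns the voting mass p(c_a | c_a ∨ c_b) of the binary classifier C_{a,b}):

```python
import random

def qweighted(classes, evaluate):
    """Sketch of QWeighted (Algorithm 1): returns the class that full weighted
    voting would rank first, usually without querying all c(c-1)/2 classifiers.

    classes  -- list of class labels K
    evaluate -- evaluate(a, b) is assumed to return the voting mass v_ab that
                classifier C_{a,b} assigns to class a (a value in [0, 1])
    """
    loss = {c: 0.0 for c in classes}   # l_i = p_i - v_i, initially zero
    done = set()                       # pairs whose classifier was already evaluated

    while True:
        # c_a: a class with currently minimal loss (ties broken randomly)
        min_loss = min(loss.values())
        c_a = random.choice([c for c in classes if loss[c] == min_loss])

        # c_b: class with minimal loss whose classifier with c_a is still unused
        candidates = [c for c in classes
                      if c != c_a and frozenset((c_a, c)) not in done]
        if not candidates:
            return c_a                 # all incident classifiers of c_a evaluated
        c_b = min(candidates, key=lambda c: loss[c])

        v_ab = evaluate(c_a, c_b)      # voting mass going to c_a
        done.add(frozenset((c_a, c_b)))
        loss[c_a] += 1.0 - v_ab
        loss[c_b] += v_ab
```

A full-voting baseline would simply query all c(c−1)/2 classifiers, accumulate the voting masses and return the class with the maximum; the sketch differs only in stopping as soon as the class with minimal loss has no unevaluated incident classifier left.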
The voting-against principle was already used earlier in the form of DDAGs (Platt et al., 2000), which organize the binary base classifiers in a decision graph. Each node represents a binary decision that rules out the class that is not predicted by the corresponding binary classifier. At classification time, only the classifiers on the path from the root to a leaf of the tree (at most c − 1 classifiers) are consulted. While the authors empirically show that the method does not lose accuracy on three benchmark problems, it does not have the guarantee of our method, which will always predict the same class as the full pairwise classifier. Intuitively, one would also presume that a fixed evaluation routine that uses only c − 1 of the c(c−1)/2 base classifiers will sacrifice one of the main strengths of the pairwise approach, namely that the influence of a single incorrectly trained binary classifier is diminished in a large ensemble of classifiers (Fürnkranz, 2003).

3.3 Evaluation

We compare the QWeighted algorithm with the full pairwise classifier and with DDAGs (Platt et al., 2000) on seven arbitrarily selected multi-class datasets from the UCI repository of machine learning databases (Hettich et al., 1998). We used four commonly used learning algorithms as base learners (the rule learner Ripper, a Naive Bayes algorithm, the C4.5 decision tree learner, and a support vector machine), all in their implementations in the Weka machine learning library (Witten & Frank, 2005). Each algorithm was used as a base classifier for QWeighted, and the combination was run on each of the datasets. As QWeighted is guaranteed to return the same predictions as the full pairwise classifier, we are only interested in the number of comparisons needed for determining the winning class.⁴ These are measured for all examples of each dataset via a 10-fold cross-validation, except for letter, where the supplied testset was used. Table 1 shows the results.

With respect to accuracy, there is only one case in a total of 28 experiments (4 base classifiers × 7 datasets) where DDAGs outperformed QWeighted, which, as we have noted above, optimizes the Spearman rank correlation. This, and the fact that, to the best of our knowledge, it is not known what loss function is optimized by DDAGs, confirm our intuition that QWeighted is a more principled approach than DDAGs. It can also be seen that the average number of comparisons needed by QWeighted is much closer to the best case than to the worst case.

Next to the absolute numbers, we show the trade-off between best and worst case (in brackets). A value of 0 indicates that the average number of comparisons is c − 1, a value of 1 indicates that it is c(c−1)/2 (the value in the last column). As we have ordered the datasets by their respective number of classes, we can observe that this value has a clear tendency to decrease with the number of classes.

⁴ As mentioned above, we used a double round robin for Ripper, for both the full pairwise classifier and QWeighted. In order to be comparable to the other results, we in this case divide the observed number of comparisons by two.

Table 1: Comparison of QWeighted and DDAGs with different base learners on seven multi-class datasets. The right-most column shows the number of comparisons needed by a full pairwise classifier (c(c−1)/2). Next to the average numbers of comparisons for QWeighted we show their trade-off (n−(c−1))/(max−(c−1)) between best and worst case (in brackets).
  dataset   c   learner   Acc. QWeighted   Acc. DDAG   Comp. QWeighted   Comp. DDAG   full
  vehicle   4   NB            45.39          44.92       4.27 (0.423)         3          6
                SMO           75.06          75.06       3.64 (0.213)
                J48           71.99          70.92       3.96 (0.320)
                JRip          73.88          72.46       3.98 (0.327)
  glass     7   NB            49.07          49.07       9.58 (0.238)         6         21
                SMO           57.01          57.94       9.92 (0.261)
                J48           71.50          69.16       9.69 (0.246)
                JRip          74.77          74.30       9.75 (0.250)
  image     7   NB            80.09          80.09       9.03 (0.202)         6         21
                SMO           93.51          93.51       8.29 (0.153)
                J48           96.93          96.75       8.55 (0.170)
                JRip          96.62          96.41       8.75 (0.183)
  yeast    10   NB            57.55          57.21      15.86 (0.191)         9         45
                SMO           57.68          57.41      15.52 (0.181)
                J48           58.56          57.75      15.48 (0.180)
                JRip          58.96          58.09      15.87 (0.191)
  vowel    11   NB            63.84          63.1       17.09 (0.158)        10         55
                SMO           81.92          81.52      15.28 (0.117)
                J48           82.93          78.28      17.13 (0.158)
                JRip          82.42          76.67      17.42 (0.165)
  soybean  19   NB            92.97          92.97      27.70 (0.063)        18        171
                SMO           94.14          93.41      28.36 (0.068)
                J48           93.56          91.80      29.45 (0.075)
                JRip          94.00          93.56      27.65 (0.063)
  letter   26   NB            63.08          63.00      44.40 (0.065)        25        325
                SMO           83.80          82.58      42.26 (0.058)
                J48           91.50          86.15      47.77 (0.076)
                JRip          92.33          88.33      45.01 (0.068)

For example, for the 19-class soybean and the 26-class letter datasets, only about 6–7% of the possible number of additional pairwise classifiers are used, i.e., the total number of comparisons seems to grow only linearly with the number of classes. This can also be seen from Figure 1a, which plots the datasets with their respective number of classes together with a curve that indicates the performance of the full pairwise classifier.

Finally, we note that the results are qualitatively the same for all base classifiers. QWeighted does not seem to depend on the choice of base classifiers.

For a more systematic investigation of the complexity of the algorithm, we performed a simulation experiment. We assume classes in the form of numbers from 1...c, and, without loss of generality, 1 is always the correct class. We further assume pairwise base pseudo-classifiers i ≺_ε j which, for two numbers i < j, prefer the smaller (i.e., better) class with probability 1 − ε and the larger one with probability ε. Thus, ε = 0.0 corresponds to a pairwise classifier where all predictions are consistent with a total order of the possible class labels, and ε = 0.5 corresponds to the case where the predictions of the base classifiers are entirely random.

Table 2: Average number n of pairwise comparisons needed by QWeighted in the simulation, for various numbers of classes c and error probabilities ε of the base pseudo-classifiers. Below each average, we show the trade-off (n−(c−1))/(max−(c−1)) between the best and worst case and an estimate of the growth ratio log(n2/n1)/log(c2/c1) of successive values of n (given as "trade-off / growth").

  c      ε=0.0           ε=0.05          ε=0.1           ε=0.2            ε=0.3            ε=0.5            full
  5        5.43            5.72            6.07            6.45             6.90             7.12              10
         0.238 / —       0.287 / —       0.345 / —       0.408 / —        0.483 / —        0.520 / —
  10      14.11           16.19           18.34           21.90            25.39            28.74              45
         0.142 / 1.378   0.200 / 1.501   0.259 / 1.595   0.358 / 1.764    0.455 / 1.880    0.548 / 2.013
  25      42.45           60.01           76.82          113.75           151.19           198.51             300
         0.067 / 1.202   0.130 / 1.430   0.191 / 1.563   0.325 / 1.798    0.461 / 1.974    0.632 / 2.109
  50      91.04          171.53          251.18          422.58           606.74           868.25           1,225
         0.036 / 1.101   0.104 / 1.515   0.172 / 1.709   0.318 / 1.894    0.474 / 2.005    0.697 / 2.129
  100    189.51          530.17          900.29        1,684.21         2,504.54         3,772.45           4,950
         0.019 / 1.058   0.089 / 1.628   0.165 / 1.842   0.327 / 1.995    0.496 / 2.045    0.757 / 2.119

Table 2 shows the results for various numbers of classes (c = 5, 10, 25, 50, 100) and for various settings of the error parameter (ε = 0.0, 0.05, 0.1, 0.2, 0.3, 0.5). Each data point is the average outcome of 1000 trials with the corresponding parameter settings. We can see that even for entirely random data, our algorithm can still save about 1/4 of the pairwise comparisons that would be needed for the entire ensemble. For cases with a total order and error-free base classifiers, the number of needed comparisons approaches the number of classes, i.e., the growth appears to be linear.
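The simulation setup can be reproduced along the following lines (an illustrative sketch, not the original code, under the assumption stated above that a pseudo-classifier for a pair i < j votes for the smaller class with probability 1 − ε; it reuses the hypothetical qweighted function from the earlier sketch):

```python
import random

def make_noisy_oracle(epsilon):
    """Pairwise pseudo-classifier for the simulation: for two classes a, b with
    a < b it votes for the smaller (i.e. better) class with probability
    1 - epsilon and for the larger one with probability epsilon."""
    def evaluate(a, b):
        correct = 1.0 if a < b else 0.0        # vote consistent with the total order
        if random.random() < epsilon:          # flip the vote with probability epsilon
            return 1.0 - correct
        return correct
    return evaluate

def average_comparisons(c, epsilon, trials=1000):
    """Average number of pairwise evaluations QWeighted needs for c classes."""
    total = 0
    for _ in range(trials):
        calls = [0]
        oracle = make_noisy_oracle(epsilon)
        def counted(a, b):
            calls[0] += 1
            return oracle(a, b)
        qweighted(list(range(1, c + 1)), counted)   # reuses the earlier sketch
        total += calls[0]
    return total / trials
```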
To shed more light on this, we provide two more measures below each average: the lower left number shows the trade-off between best and worst case, as defined above. The result confirms that for a reasonable performance of the base classifiers (up to about ε = 0.2), the fraction of additional work reduces with the number of classes. Above that, we start to observe a growth. The reason for this is that with a low number of classes, there is still a good chance that the random base classifiers produce a reasonably ordered class structure, while this chance is decreasing with increasing numbers of classes. On the other hand, the influence of each individual false prediction of a base classifier decreases with an increasing number of classes, so that the true class ordering is still clearly visible and can be better exploited by the QWeighted algorithm.

This is illustrated in Figure 1b, which shows the distribution of the votes produced by the SVM base classifier for the dataset vowel. As shown on the scale to the right, different color codes are used for encoding different numbers of received votes. Each line in the plot represents one example; the left shows the highest number of votes, the right the lowest number of votes. If all classes received the same number of votes, the area would be colored uniformly. However, here we observe a fairly clear change in the color distribution, the bright areas to the left indicating that the top-rank class often receives nine or more votes, and the areas to the right indicating that the lowest ranking class typically receives less than one vote (recall that we use weighted voting).

We tried to directly estimate the exponent of the growth function of the number of comparisons of QWeighted, based on the number of classes c. The resulting exponents, based on two successive measure points, are shown below the absolute numbers in Table 2. For example, the exponent of the growth function between c = 5 and c = 10 is estimated (for ε = 0) as log(14.11/5.43) / log(10/5) ≈ 1.378. We can see that the growth rate starts almost linearly (for a high number of classes and no errors in the base classifiers) and approaches a quadratic growth when the error rate increases.⁵

[Figure 1: a) Efficiency of QWeighted in comparison to a full pairwise classifier: number of comparisons as a function of the number of classes c for the datasets vehicle, glass/image, yeast, vowel, soybean and letter, together with the c(c−1)/2 curve of full voting. b) Distribution of votes for vowel (11-class problem, base learner NB). The x-axis describes the ranking positions.]

In addition to the small datasets from Table 1, we evaluated the QWeighted algorithm on three more real-world datasets with a relatively high number of classes:

Uni-label RCV1-v2. RCV1-v2 (Lewis et al., 2004) is a dataset consisting of over 800,000 categorized news articles from Reuters, Ltd. For the category topic, multiple labels from a total of 103 hierarchically organized labels are assigned to the instances. We transformed this original multi-label dataset to uni-label to be suitable for the QWeighted algorithm. For each instance, the label with the greatest depth in the hierarchical tree among the assigned labels was selected as the true class label. In case multiple labels have the same greatest depth, the respective instance was removed. We applied this procedure to the provided trainset and testset no. 0 by Lewis et al., resulting in a multi-classification dataset with 100 classes, 23,149 train- and 199,328 test-instances, with at least one positive example for each of the 100 classes.⁶ We will refer to this created dataset as urcv1-v2.
ASTRAL 2 & 3. These datasets describe protein sequences retrieved from the SCOP 1.71 protein database (Murzin et al., 1995). We used ASTRAL (Brenner et al., 2000) to filter these sequences so that no two sequences share greater than 95% identity. The class labels are organized in a 3-level hierarchy, consisting of protein folds, superfamilies and families (in descending order). astral3 consists of 1,588 classes and contains the original hierarchy. To fill the gap between the datasets urcv1-v2 and astral3 in terms of number of classes, we constructed a second dataset astral2 by limiting the hierarchical depth to 2. So, two instances that previously belonged to different families of the same superfamily x are now both assigned to superfamily x as their new class label. By decreasing the depth, the number of classes was reduced to 971. Both datasets have 13,006 instances and 21 numeric attributes (20 amino acids plus selenocysteine).

⁵ At first sight, it may be surprising that some of the numbers are greater than 2. This is a result of the fact that c(c−1)/2 = (c² − c)/2 is quadratic in the limit, but for low values of c, the subtraction of the linear term c/2 has a more significant effect. Thus, e.g., the estimated growth of the full pairwise classifier from c = 5 to c = 10 is log(45/10) / log(10/5) ≈ 2.17.

⁶ The amount of train- and test-instances is identical to their unmodified versions. So, there was in no case an instance with multiple possible candidates as most specific label. Furthermore, only one of the originally 101 assigned labels was removed by the described procedure.

Dataset urcv1-v2 uses more than 40,000 features (stemmed words). To reduce the complexity of the multi-classification problem, χ²-based feature selection (Yang & Pedersen, 1997) was applied. After computing the ranking of features with decreasing χ² scores, we selected various numbers s of the top ranked features (s = 250, 500, 1000, 2000, 4000, 8000, 14000, 20000). The decision tree learner J48 was used as base learner because of its fast computational speed, and for validation we used the transformed testset. The results are shown in Table 3a. It is surprising that the accuracy value is only minimally influenced by the feature set size. All accuracy values lie within 61–62 percent for the pairwise approach and approximately 54 percent without any binarization. In fact, the pairwise approach reaches its maximum accuracy when using only 500 features. As a side note, we can see that the pairwise approach outperforms the direct multi-class capability (3rd column) of J48 in terms of accuracy.

Table 3: Results for the datasets with many classes (base learner J48). a) urcv1-v2 with varying numbers s of selected features: accuracy of the pairwise approach (1-vs-1) and of the direct multi-class classifier (Direct), together with the average number of comparisons of QWeighted (QW) and of the full pairwise classifier (full). b) Comparison of QWeighted and the full pairwise classifier on all three datasets; the trade-off (n−(c−1))/(max−(c−1)) between the best and worst case is given in brackets.

(a)
       s    Accuracy 1-vs-1   Accuracy Direct   Comparisons QW   full
     250        60.86             53.4260           297.33       4,950
     500        62.2828           54.4108           306.39       4,950
   1,000        61.5674           54.1043           310           4,950
   2,000        62.0801           54.5458           312.62       4,950
   4,000        62.1107           53.7747           312.60       4,950
   8,000        62.1072           53.4350           312.29       4,950
  14,000        62.0901           53.94             312.21       4,950
  20,000        61.3220           53.6568           312.35       4,950

(b)
  dataset      c      Comparisons QW           full    Acc.
  urcv1-v2     100        312.62 (0.0440)      4,950    62.1
  astral2      971      9,490.81 (0.0018)    470,935    24.8
  astral3    1,588     28,476.20 (0.0213)  1,260,078    20.8
The maximal number of pairwise comparisons (at s = 2000) from Table 3a is listed together with the results of the QWeighted algorithm for astral2 and astral3 in Table 3b. For astral2 and astral3, 66 percent of all instances were used for training and the rest for testing. Once again, trade-off values were estimated for the average number of pairwise comparisons. As these values show, QWeighted uses only a fairly small number of comparisons compared to a full voting aggregation and is clearly closer to the best case than to the worst case (c(c−1)/2 comparisons). One can see an increasing growth of the trade-off values between astral2 and astral3. However, this effect can be explained by the generally poor classification accuracy for protein sequences. According to the simulation results, there exists a correlation between the performance of QWeighted and the performance of the underlying base classifiers. The decreased accuracy on astral3 compared to astral2 (right-most column) indicates weaker base classifiers, which leads to an increasing number of needed pairwise comparisons.

In summary, our results indicate that the QWeighted algorithm always increases the efficiency of the pairwise classifier: for high error rates in the base classifiers, we can only expect improvements by a constant factor, whereas for the practical case of low error rates we can also expect a significant reduction in the asymptotic algorithmic complexity.

4. Efficient Pairwise Ranking

Pairwise classification not only delivers a single class label but a ranking of all classes, and in many applications this is of interest. An example might be speech recognition, where the ranking of possible recognized words can be combined with further background information like context or grammar to gain a higher accuracy. However, the QWeighted algorithm in the previous section focuses entirely on finding the top rank. In this section, we will briefly discuss an algorithm for efficient computation of a complete ranking.

4.1 The Swiss Algorithm

Our algorithm is based on the Swiss system that is commonly used in chess tournaments (FIDE, 2006). Its fundamental ideas are that all players play a fixed number of rounds, where in each round players of similar strength are paired with each other. After each round, a ranking is determined based upon the points collected so far, and the next pairings are drawn based on the current ranking (players with the same number of points play each other). Given n entrants, a limitation of log(n) rounds is typical, resulting in n·log(n) games, as opposed to the quadratic amount in the usual round-robin system. In addition, no pairing will be evaluated twice.
Algorithm 2: Swiss system

    begin
        for round = 1 to max_round do
            pair classes that have a similar score, starting with classes with byes;
            pair the remaining classes among each other;
            remaining classes receive a bye;
            evaluate the corresponding classifiers and update the loss values for each class;
    end

We adopted this system for use in pairwise classification, so that it computes a ranking based on weighted voting (Algorithm 2). Contrary to the regular Swiss system, this algorithm does not try to optimize the pairings, but greedily produces pairings in various stages. First, it pairs classes that had received a bye in the last round, then classes with a similar score, and then all remaining classes. As a result of the greedy pairing, we may have situations where for certain classes no valid pairings can be found, e.g., when all remaining classes were already previously paired with the specific class. In this situation the class receives a bye, which guarantees a pairing at the beginning of the next round, even if there is no similarly rated class available. The main advantage of this greedy approach is that we can also easily linearize it, i.e., we can evaluate the ranking quality not only after complete rounds, but after each individual pairing.

4.2 Evaluation

Unfortunately, most datasets, including those from the UCI repository, provide only the correct top rank class but no reference ranking of classes. The main reasons may be that these datasets were primarily intended for classification, and that even with external expert knowledge it is difficult to provide a full ranking of classes.⁷ For this reason, we used the ranking that is predicted by a full voting procedure as our reference ranking.

For all datasets, Ripper (Cohen, 1995), more precisely JRip, its implementation in Weka, was used as base learner. Ordinary round-based evaluations of the Swiss system lead to a small number of data points.⁸ For the purpose of smoother curves, we plotted the estimated rankings after each single evaluation of a binary classifier. There was no upper limit on the number of rounds, so we could continue until all pairwise comparisons have been performed.

⁷ Pyle (1999) acknowledges this difficulty and, interestingly, proposes a pairwise technique for ranking a list of options.

⁸ The number of rounds is at most (c(c−1)/2) / ⌊c/2⌋; it is not constant because of possible bye situations within the Swiss system, please refer to (FIDE, 2006) for details.

To measure the accuracy, two error measures were used:

• the normalized ranking error, where the error of each position between the predicted ranking τ_p and the true ranking τ_t is counted:

    D_R = (1 / k_c) · Σ_{i=1}^{c} (τ_t(i) − τ_p(i))²

The normalizing constant k_c = c(c²−1)/3 is the maximal possible ranking error. Note that this is a linear transformation of the Spearman rank correlation into a loss function with value range [0, 1].

• the position error, which uses the position of the true class ĉ in the predicted ranking:

    D_P = τ_p⁻¹(ĉ) − 1

It may be viewed as a simple generalization of the {0,1}-loss for problems in which one is interested in finding the correct label with a minimum amount of trials (Hüllermeier & Fürnkranz, 2005).

All values were estimated with 10-fold cross-validation, except for dataset letter, where the supplied testset was used.
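For illustration, the two measures can be computed as follows (a minimal sketch in Python; the representation of rankings as lists of class labels ordered from rank 1 downwards is an assumption made here for convenience and is not prescribed by the paper):

```python
def normalized_ranking_error(true_ranking, pred_ranking):
    """D_R = (1 / k_c) * sum_i (tau_t(i) - tau_p(i))^2 with k_c = c(c^2 - 1)/3.

    Both rankings are lists of the same class labels, best class first;
    tau(i) is the (1-based) rank position of class i in the respective ranking."""
    c = len(true_ranking)
    k_c = c * (c * c - 1) / 3.0               # maximal possible ranking error
    rank_true = {cls: pos + 1 for pos, cls in enumerate(true_ranking)}
    rank_pred = {cls: pos + 1 for pos, cls in enumerate(pred_ranking)}
    return sum((rank_true[cls] - rank_pred[cls]) ** 2 for cls in rank_true) / k_c

def position_error(true_class, pred_ranking):
    """D_P = tau_p^{-1}(c_hat) - 1: position of the true class in the predicted
    ranking, minus one (0 if the true class is ranked first)."""
    return pred_ranking.index(true_class)

# Example with three classes, true order [A, B, C], predicted [B, A, C]:
# normalized_ranking_error(["A", "B", "C"], ["B", "A", "C"]) == 2 / 8 == 0.25
# position_error("A", ["B", "A", "C"]) == 1
```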
First, we investigated the ranking error, i.e., the deviation of the ranking obtained by the Swiss system from the ranking of the full pairwise classifier. Figure 2a shows the ranking error curve for vowel, but all datasets have similar characteristics (Park, 2006). It shows that the ranking error decreases exponentially with increasing numbers of used pairwise classifiers, i.e., in the beginning, the ranking improves substantially with each additional comparison until a certain performance level is reached. After that, each additional comparison yields only small improvements, but the approximation is never 100% accurate.

In order to put this performance into perspective, we compared it to various other, fixed orderings of the pairwise classifiers:

• ascending (C_{1,2}, C_{1,3}, ..., C_{1,n}, C_{2,3}, C_{2,4}, ...)
• descending (C_{n,n−1}, ..., C_{n,1}, C_{n−1,n−2}, C_{n−1,n−3}, ...)
• diagonal (C_{2,1}, C_{3,1}, C_{3,2}, C_{4,3}, C_{4,2}, C_{4,1}, C_{5,4}, ...)
• and a random order.

Class 1 is the first class that is listed in the dataset, class 2 the second, and so on. Table 4 shows the average performance ratios r = D_{i,s} / D_{i,Swiss}, where D_{i,s} is the ranking error of one of the four ordered methods and D_{i,Swiss} is the corresponding ranking error of the Swiss algorithm. Thus, a value greater than 1 indicates a superior performance of the Swiss system. The Swiss system always outperformed the diagonal and random methods, typically by more than 10%. The Swiss system also wins in the majority of the cases against ascending and descending, but here we can find two exceptions, where one of these two wins by a comparably small margin (about 4%). A possible explanation could be that in these datasets, the given fixed class order contains information that holds for most of the data (e.g., the majority class is first in the list). The ascending and descending ordering schemes could exploit such information (also note that in these cases a good performance of one is accompanied by a bad performance of the other), whereas the other two methods distribute the pairwise classifiers for each class fairly evenly across the sequence. Nevertheless, it seems safe to conclude that the Swiss system outperforms all fixed schemes.

The ascending order for datasets image and yeast and the descending order for vowel seem to break ranks, while most values demonstrate the superior performance of the Swiss system. These exceptions may be based on specific characteristics of these datasets and can probably be ignored in the general case.

Table 4: Comparison of the ranking error between the Swiss system and four class-ordered systems. The values show the average performance ratio, where a value of 1 means identical to the Swiss system and a value greater than 1 indicates a better performance of the Swiss algorithm.
  dataset    ascending   descending   diagonal   random
  vehicle      1.102       1.183        1.316     1.170
  glass        1.122       1.328        1.782     1.413
  image        0.968       1.491        1.181     1.106
  yeast        1.123       1.667        1.792     1.486
  vowel        1.285       0.965        1.381     1.106
  soybean      1.053       1.193        1.262     1.028
  letter       1.084       1.419        1.263     1.193

[Figure 2: a) Ranking error curve for vowel (Swiss system, 10-fold CV, base learner Ripper; x-axis: percentage of used classifiers, y-axis: normalized ranking error). b) Position error curves using the Swiss system for the datasets glass, image, yeast, vowel, soybean and letter (10-fold CV, except letter: testset; base learner Ripper; x-axis: percentage of used classifiers, y-axis: position error).]

Figure 2b shows the results for the position error. As can be seen, the position error converges faster than the ranking error to the position error of the full pairwise classifier. Contrary to the ranking error, not all pairwise comparisons are needed to predict the winning class. This observation is consistent with the consideration that if all incident classifiers of the correct top rank are evaluated during such a sequence, the result is a position error of zero, and further pairwise comparisons do not affect it anymore.

However, the saturation point, for which we can say that a reasonable position error has been obtained, differs between datasets. Table 5 takes a closer look at this issue, summarizing the results ordered by the number of classes. Here we computed a relative position error D_R, which is the position of the top-ranked class of the full ranking in the incomplete ranking. We assume that the ratings have converged when D_R < 0.02 (third column) or D_R = 0.0 (last column). Since the datasets are ordered by their number of classes, we can see for D_R < 0.02 a relation between the number of classes and the convergence of the position error: the higher the number of classes of a dataset, the smaller the percentage of pairwise comparisons that is needed to achieve an acceptable position error. This trend is less clear for a perfect position error, as can be seen in the fourth column. Here it can also be seen that in most cases almost all pairwise comparisons were needed for a perfect agreement with the predictions of the full pairwise classifier. Note, however, that contrary to the QWeighted algorithm, the Swiss algorithm still attempts to approximate the full ranking and does not focus on the top rank only.

Table 5: Position error on all datasets. The last two columns show the average number of pairwise comparisons needed for a relative position error within the threshold D_R < 0.02 and for D_R = 0, respectively (in brackets: percentage of the full c(c−1)/2 comparisons).

  dataset    c    D_R < 0.02     D_R = 0
  vehicle    4      5 (83%)       6 (100%)
  glass      7     17 (81%)      19 (90%)
  image      7     14 (67%)      20 (95%)
  yeast     10     16 (36%)      44 (98%)
  vowel     11     27 (49%)      52 (95%)
  soybean   19     39 (23%)     145 (85%)
  letter    26     34 (10%)     315 (97%)

5. Conclusions

In this paper, we have proposed two novel algorithms that allow speeding up the prediction phase for pairwise classifiers. The QWeighted algorithm will always predict the same class as the full pairwise classifier, but in practice the algorithm is close to linear in the number of classes, in particular for large numbers of classes, where the problem is most stringent. For very hard problems, where the performance of the binary classifiers reduces to random guessing, its worst-case performance is still quadratic in the number of classes, but even there practical gains can be expected.
The Swiss algorithm does not focus on the top-ranked class, but tries to approximate the complete ranking of the classes. While we could not directly evaluate the ranking ability of this algorithm (because of a lack of datasets with complete class ranking information for each example), we could demonstrate that this adaptive algorithm outperforms alternative, fixed evaluation orders for the pairwise classifiers, and therefore allows for a better trade-off between the number of evaluated pairwise classifiers and the quality of the approximation of the full class ranking.

A restriction of our approaches is that they are only applicable to combining predictions via voting or weighted voting. There are various other proposals for combining the class probability estimates of the base classifiers into an overall class probability distribution (this is also known as pairwise coupling (Hastie & Tibshirani, 1998; Wu et al., 2004)). Nevertheless, efficient alternatives for other pairwise coupling techniques are an interesting topic for further research.

Acknowledgments

This research was supported by the German Science Foundation (DFG).

References

Brenner, S. E., Koehl, P., & Levitt, M. (2000). The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Research, 28, 254–256.

Cohen, W. W. (1995). Fast effective rule induction. Proceedings of the 12th International Conference on Machine Learning (ML-95) (pp. 115–123). Lake Tahoe, CA: Morgan Kaufmann.

Cutzu, F. (2003a). How to do multi-way classification with two-way classifiers. Proceedings of the International Conference on Artificial Neural Networks (ICANN-03) (pp. 375–384). Springer-Verlag.

Cutzu, F. (2003b). Polychotomous classification with pairwise classifiers: A new voting principle. Proceedings of the 4th International Workshop on Multiple Classifier Systems (pp. 115–124). Springer, Berlin.

FIDE (2006). FIDE Swiss rules. In FIDE handbook, chapter 4. World Chess Federation (Fédération Internationale des Échecs). http://www.fide.com/official/handbook.asp?level=C04.

Fürnkranz, J. (2002). Round robin classification. Journal of Machine Learning Research, 2, 721–747.

Fürnkranz, J. (2003). Round robin ensembles. Intelligent Data Analysis, 7, 385–404.

Hastie, T., & Tibshirani, R. (1998). Classification by pairwise coupling. Advances in Neural Information Processing Systems 10 (NIPS-97) (pp. 507–513). MIT Press.

Hettich, S., Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html. Department of Information and Computer Science, University of California at Irvine, Irvine, CA.

Hsu, C.-W., & Lin, C.-J. (2002). A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13, 415–425.

Hüllermeier, E., & Fürnkranz, J. (2004a). Comparison of ranking procedures in pairwise preference learning. Proceedings of the 10th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU-04). Perugia, Italy.

Hüllermeier, E., & Fürnkranz, J. (2004b). Ranking by pairwise comparison: A note on risk minimization. Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE-04). Budapest, Hungary.

Hüllermeier, E., & Fürnkranz, J. (2005). Learning label preferences: Ranking error versus position error. Advances in Intelligent Data Analysis: Proceedings of the 6th International Symposium (IDA-05) (pp. 180–191). Madrid, Spain: Springer-Verlag.
Lewis, D. D., Yang, Y., Rose, T., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361–397.

Murzin, A. G., Brenner, S. E., Hubbard, T., & Chothia, C. (1995). SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247, 536–540.

Park, S.-H. (2006). Efficient classification and ranking with pairwise comparisons. Diploma thesis, TU Darmstadt. In German.

Platt, J. C., Cristianini, N., & Shawe-Taylor, J. (2000). Large margin DAGs for multiclass classification. Advances in Neural Information Processing Systems 12 (NIPS-99) (pp. 547–553). MIT Press.

Pyle, D. (1999). Data preparation for data mining. San Francisco, CA: Morgan Kaufmann.

Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques with Java implementations. Morgan Kaufmann Publishers. 2nd edition.

Wu, T.-F., Lin, C.-J., & Weng, R. C. (2004). Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5, 975–1005.

Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. Proceedings of the Fourteenth International Conference on Machine Learning (ICML-97) (pp. 412–420). Morgan Kaufmann.
