
Technische Universität Darmstadt
Knowledge Engineering Group
Hochschulstrasse 10, D-64289 Darmstadt, Germany
http://www.ke.informatik.tu-darmstadt.de
Technical Report TUD–KE–2007–03
Sang-Hyeun Park, Johannes Fürnkranz
Efficient Pairwise Classification and
Ranking
A short version of this paper appeared as:
Sang-Hyeun Park, Johannes Fürnkranz. Efficient Pairwise Classification, Proceedings of the 18th European Conference on Machine Learning (ECML-07),
Springer-Verlag, 2007

Efficient Pairwise Classification and Ranking
Sang-Hyeun Park (park@ke.informatik.tu-darmstadt.de)
Johannes Fürnkranz (juffi@ke.informatik.tu-darmstadt.de)
Knowledge Engineering Group
Department of Computer Science
TU Darmstadt, Germany
Abstract
Pairwise classification is a class binarization procedure that converts a multi-class problem into a series of two-class problems, one problem for each pair of classes. While it can be shown that for training, this procedure is more efficient than the more commonly used one-against-all approach, it still has to evaluate a quadratic number of classifiers when computing the predicted class for a given example. In this paper, we propose a method that allows a faster computation of the predicted class when weighted or unweighted voting are used for combining the predictions of the individual classifiers. While its worst-case complexity is still quadratic in the number of classes, we show that even in the case of completely random base classifiers, our method still outperforms the conventional pairwise classifier. For the more practical case of well-trained base classifiers, its asymptotic computational complexity seems to be almost linear. We also propose a method for approximating the full class ranking, based on the Swiss System, a common scheme for conducting multi-round chess tournaments. Our results indicate that this adaptive scheme offers a better trade-off between approximation quality and number of performed comparisons than alternative, fixed schemes for ordering the evaluation of the pairwise classifiers.
1. Introduction
Many learning algorithms can only deal with two-class problems. For multi-class problems, they have to rely on class binarization procedures that transform the original learning problem into a series of binary learning problems. A standard solution for this problem is the one-against-all approach, which constructs one binary classifier for each class, where the positive training examples are those belonging to this class and the negative training examples are formed by the union of all other classes.
An alternative approach, known as pairwise classification or round robin classification, has recently gained attention (Fürnkranz, 2002; Wu et al., 2004). Its basic idea is to transform a c-class problem into c(c−1)/2 binary problems, one for each pair of classes. This approach has been shown to produce more accurate results than the one-against-all approach for a wide variety of learning algorithms such as support vector machines (Hsu & Lin, 2002) or rule learning algorithms (Fürnkranz, 2002). Moreover, Fürnkranz (2002) has also proved that despite the fact that its complexity is quadratic in the number of classes, the algorithm can in fact be trained faster than the conventional one-against-all technique.¹ However, in order to obtain a final prediction, we still have to combine the predictions of all c(c−1)/2 classifiers, which can be very inefficient for large values of c.
The main contribution of this paper is a novel solution for this problem. Unlike previous proposals (such as (Platt et al., 2000); cf. Section 3.2), our approach is not heuristic but is guaranteed to produce exactly the same prediction as the full pairwise classifier, which in turn has been shown to optimize the Spearman rank correlation with the target labels (Hüllermeier & Fürnkranz, 2004b). In essence, the algorithm iteratively selects and evaluates pairwise classifiers, using a simple heuristic to minimize the number of pairwise classifiers that are needed to determine the correct top rank class of the complete (weighted) voting. We will describe and evaluate this algorithm in Section 3.

¹ It is easy to see this if one considers that in the one-against-all case each training example is used c times (namely in each of the c binary problems), while in the round robin approach each example is only used c−1 times, namely only in those binary problems where its own class is paired against one of the other c−1 classes.
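As a quick numerical illustration of footnote 1 (the concrete numbers are chosen for illustration only and do not appear in the original analysis): with c = 10 classes and n = 100 training examples per class, one-against-all presents each of the 1,000 examples to c = 10 binary problems, i.e. 10 · 1,000 = 10,000 training example presentations in total, whereas the pairwise approach trains c(c−1)/2 = 45 binary classifiers on 2n = 200 examples each, i.e. 45 · 200 = 9,000 = 1,000 · (c−1) presentations.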
In Section 4, we will then investigate the case when we are not only interested in the prediction of a single class, but in a ranking of all possible classes. For this case, we propose the so-called Swiss System, an algorithm that is commonly used for organizing multi-round chess tournaments. Its key idea is to always evaluate classifiers that have similar positions in the current, incomplete ranking. Our results show that this algorithm offers a good trade-off between the number of evaluated classifiers and the quality of the approximation of the complete ranking.
2. Pairwise Classification
In the following, we assume that a multi-class problem has c classes, which we denote with c_1, ..., c_c.
A pairwise or round robin classifier trains a set of c(c−1)/2 binary classifiers C_{i,j}, one for each pair of classes (c_i, c_j), i < j. At classification time, each binary classifier C_{i,j} is queried and issues a vote (a prediction for either c_i or c_j) for the given example. This can be compared with sports and games tournaments, in which each player plays each other player once. In each game, the winner receives a point, and the player with the maximum number of points is the winner of the tournament. In our case, the class with the maximum number of votes is predicted (ties are broken arbitrarily in favor of the larger class). In this paper, we will assume binary classifiers that return class probabilities p(c_i | c_i ∨ c_j) and p(c_j | c_i ∨ c_j). These can be used for weighted voting, i.e., we predict the class that receives the maximum number of votes:

    c* = arg max_{i=1...c} Σ_{j=1}^{c} p(c_i | c_i ∨ c_j)

This procedure optimizes the Spearman rank correlation with the target ranking (Hüllermeier & Fürnkranz, 2004b). Other algorithms for combining votes exist (cf. pairwise coupling (Hastie & Tibshirani, 1998; Wu et al., 2004)), but are not subject of this paper.²

Note that weighted or unweighted voting produce a ranking of all classes. For prediction problems, one is typically only interested in the top ranked class, but in some applications one might also be interested in the complete ranking of classes. We will focus on classification in the next section; ranking will be considered in Section 4.

² Furthermore, comparisons in (Hüllermeier & Fürnkranz, 2004a) with more advanced methods showed that the simple voting method is competitive with more complex methods.

3. Efficient Pairwise Classification

3.1 The QWeighted Algorithm

Weighted or unweighted voting predicts the top rank class by returning the class with the highest accumulated voting mass after evaluation of all pairwise classifiers. During such a procedure there exist many situations where particular classes can be excluded from the set of possible top rank classes, even if they receive the maximal voting mass in all remaining evaluations. Consider the following simple example: Given c classes with c > j, if class a has received more than c − j votes and class b has lost j votes, it is impossible for b to achieve a higher total voting mass than a. Thus further evaluations involving b can be safely ignored.

To increase the reduction of evaluations we are interested in obtaining such exploitable situations frequently. Pairwise classifiers will be selected depending on a loss value, which is the amount of potential voting mass that a class has not received. More specifically, the loss l_i of a class c_i is defined as l_i := p_i − v_i, where p_i is the number of evaluated incident classifiers of c_i and v_i is the current vote amount of c_i. Obviously, the loss begins with a value of zero and is monotonically increasing.³ The class with the current minimal loss is one of the top candidates for the top rank class.
Algorithm 1: QWeighted (Quick Weighted Voting)

    begin
        while c_top not determined do
            c_a ← class c_i ∈ K with minimal l_i;
            c_b ← class c_j ∈ K \ {c_a} with minimal l_j whose classifier C_{a,b} has not yet been evaluated;
            if no such c_b exists then
                c_top ← c_a;
            else
                v_ab ← Evaluate(C_{a,b});
                l_a ← l_a + (1 − v_ab);
                l_b ← l_b + v_ab;
    end

First, the pairwise classifier C_{a,b} is selected for which the losses l_a and l_b of the relevant classes c_a and c_b are minimal, provided that the classifier C_{a,b} has not yet been evaluated. In the case of multiple classes that have the same minimal loss, there exists no further distinction, and we select a class randomly from this set. Then, the losses l_a and l_b are updated based on the evaluation returned by C_{a,b} (recall that v_ab is interpreted as the amount of the voting mass of the classifier C_{a,b} that goes to class c_a and 1 − v_ab is the amount that goes to class c_b). These two steps are repeated until all classifiers for the class c_m with the minimal loss have been evaluated. Thus the current (estimated) loss l_m is the correct loss for this class. As all other classes already have a greater loss, c_m is the correct top rank class.

Theoretically, a minimal number of c − 1 comparisons is possible (best case). Assuming that the incident classifiers of the correct top rank class c_top always return the maximum voting amount (l_top = 0), c_top is always in the set {c_j ∈ K | l_j = min_{c_i ∈ K} l_i}. In addition, c_top should be selected as the first class in step 1 of the algorithm among the classes with the minimal loss value. It follows that exactly c − 1 comparisons will be evaluated, more precisely all incident classifiers of c_top. The algorithm terminates and returns c_top as the correct top rank.

The worst case, on the other hand, is still c(c−1)/2 comparisons, which can, e.g., occur if all pairwise classifiers classify randomly with a probability of 0.5. In practice, the number of comparisons will be somewhere between these two extremes, depending on the nature of the problem. The next section will evaluate this trade-off.

³ This loss is essentially identical to the voting-against principle introduced by Cutzu (2003a; 2003b), which we will discuss later on in Section 3.2.

3.2 Related Work

Cutzu (2003a; 2003b) recognized the importance of the voting-against principle and observed that it allows to reliably conclude a class when not all of the pairwise classifiers are present. For example, Cutzu claims that using the voting-against rule one could correctly predict class i even if none of the pairwise classifiers C_{ik} (k = 1...c, k ≠ i) are used. However, this argument is based on the assumption that all base classifiers classify correctly. Moreover, if there is a second class j that should ideally receive c − 2 votes, voting-against could only conclude a tie between classes i and j, as long as the vote of classifier C_{ij} is not known. The main contribution of his work, however, is a method for computing posterior class probabilities in the voting-against scenario. Our approach builds upon the same ideas as Cutzu's, but our contribution is an algorithm that exploits the voting-against principle to effectively increase the prediction efficiency of pairwise classifiers without changing the predicted results.
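For concreteness, the following is a small illustrative sketch of the QWeighted loop of Algorithm 1 (hypothetical Python code, not the authors' implementation; the callable evaluate(a, b) is an assumed interface that returns the voting mass p(c_a | c_a ∨ c_b) of the binary classifier C_{a,b}):

```python
import random

def qweighted(classes, evaluate):
    """Sketch of QWeighted (Algorithm 1): returns the class that full weighted
    voting would rank first, usually without querying all c(c-1)/2 classifiers.

    classes  -- list of class labels K
    evaluate -- evaluate(a, b) is assumed to return the voting mass v_ab that
                classifier C_{a,b} assigns to class a (a value in [0, 1])
    """
    loss = {c: 0.0 for c in classes}   # l_i = p_i - v_i, initially zero
    done = set()                       # pairs whose classifier was already evaluated

    while True:
        # c_a: a class with currently minimal loss (ties broken randomly)
        min_loss = min(loss.values())
        c_a = random.choice([c for c in classes if loss[c] == min_loss])

        # c_b: class with minimal loss whose classifier with c_a is still unused
        candidates = [c for c in classes
                      if c != c_a and frozenset((c_a, c)) not in done]
        if not candidates:
            return c_a                 # all incident classifiers of c_a evaluated
        c_b = min(candidates, key=lambda c: loss[c])

        v_ab = evaluate(c_a, c_b)      # voting mass going to c_a
        done.add(frozenset((c_a, c_b)))
        loss[c_a] += 1.0 - v_ab
        loss[c_b] += v_ab
```

A full-voting baseline would simply query all c(c−1)/2 classifiers, accumulate the voting masses and return the class with the maximum; the sketch differs only in stopping as soon as the class with minimal loss has no unevaluated incident classifier left.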
The voting-against principle was already used earlier in the form of DDAGs (Platt et al., 2000), which organize the binary base classifiers in a decision graph. Each node represents a binary decision that rules out the class that is not predicted by the corresponding binary classifier. At classification time, only the classifiers on the path from the root to a leaf of the tree (at most c − 1 classifiers) are consulted. While the authors empirically show that the method does not lose accuracy on three benchmark problems, it does not have the guarantee of our method, which will always predict the same class as the full pairwise classifier. Intuitively, one would also presume that a fixed evaluation routine that uses only c − 1 of the c(c−1)/2 base classifiers will sacrifice one of the main strengths of the pairwise approach, namely that the influence of a single incorrectly trained binary classifier is diminished in a large ensemble of classifiers (Fürnkranz, 2003).

3.3 Evaluation

We compare the QWeighted algorithm with the full pairwise classifier and with DDAGs (Platt et al., 2000) on seven arbitrarily selected multi-class datasets from the UCI repository of machine learning databases (Hettich et al., 1998). We used four commonly used learning algorithms as base learners (the rule learner Ripper, a Naive Bayes algorithm, the C4.5 decision tree learner, and a support vector machine), all in their implementations in the Weka machine learning library (Witten & Frank, 2005). Each algorithm was used as a base classifier for QWeighted, and the combination was run on each of the datasets. As QWeighted is guaranteed to return the same predictions as the full pairwise classifier, we are only interested in the number of comparisons needed for determining the winning class.⁴ These are measured for all examples of each dataset via a 10-fold cross-validation, except for letter, where the supplied testset was used. Table 1 shows the results.

With respect to accuracy, there is only one case in a total of 28 experiments (4 base classifiers × 7 datasets) where DDAGs outperformed QWeighted, which, as we have noted above, optimizes the Spearman rank correlation. This, and the fact that, to the best of our knowledge, it is not known what loss function is optimized by DDAGs, confirm our intuition that QWeighted is a more principled approach than DDAGs. It can also be seen that the average number of comparisons needed by QWeighted is much closer to the best case than to the worst case.

Next to the absolute numbers, we show the trade-off between best and worst case (in brackets). A value of 0 indicates that the average number of comparisons is c − 1, a value of 1 indicates that it is c(c−1)/2 (the value in the last column). As we have ordered the datasets by their respective number of classes, we can observe that this value has a clear tendency to decrease with the number of classes.

⁴ As mentioned above, we used a double round robin for Ripper, for both the full pairwise classifier and QWeighted. In order to be comparable to the other results, we in this case divide the observed number of comparisons by two.

Table 1: Comparison of QWeighted and DDAGs with different base learners on seven multi-class datasets. The right-most column shows the number of comparisons needed by a full pairwise classifier (c(c−1)/2). Next to the average numbers of comparisons for QWeighted we show their trade-off (n−(c−1))/(max−(c−1)) between best and worst case (in brackets).
  dataset   c   learner   Acc. QWeighted   Acc. DDAG   Comp. QWeighted   Comp. DDAG   full
  vehicle   4   NB            45.39          44.92       4.27 (0.423)         3          6
                SMO           75.06          75.06       3.64 (0.213)
                J48           71.99          70.92       3.96 (0.320)
                JRip          73.88          72.46       3.98 (0.327)
  glass     7   NB            49.07          49.07       9.58 (0.238)         6         21
                SMO           57.01          57.94       9.92 (0.261)
                J48           71.50          69.16       9.69 (0.246)
                JRip          74.77          74.30       9.75 (0.250)
  image     7   NB            80.09          80.09       9.03 (0.202)         6         21
                SMO           93.51          93.51       8.29 (0.153)
                J48           96.93          96.75       8.55 (0.170)
                JRip          96.62          96.41       8.75 (0.183)
  yeast    10   NB            57.55          57.21      15.86 (0.191)         9         45
                SMO           57.68          57.41      15.52 (0.181)
                J48           58.56          57.75      15.48 (0.180)
                JRip          58.96          58.09      15.87 (0.191)
  vowel    11   NB            63.84          63.1       17.09 (0.158)        10         55
                SMO           81.92          81.52      15.28 (0.117)
                J48           82.93          78.28      17.13 (0.158)
                JRip          82.42          76.67      17.42 (0.165)
  soybean  19   NB            92.97          92.97      27.70 (0.063)        18        171
                SMO           94.14          93.41      28.36 (0.068)
                J48           93.56          91.80      29.45 (0.075)
                JRip          94.00          93.56      27.65 (0.063)
  letter   26   NB            63.08          63.00      44.40 (0.065)        25        325
                SMO           83.80          82.58      42.26 (0.058)
                J48           91.50          86.15      47.77 (0.076)
                JRip          92.33          88.33      45.01 (0.068)

For example, for the 19-class soybean and the 26-class letter datasets, only about 6–7% of the possible number of additional pairwise classifiers are used, i.e., the total number of comparisons seems to grow only linearly with the number of classes. This can also be seen from Figure 1a, which plots the datasets with their respective number of classes together with a curve that indicates the performance of the full pairwise classifier.

Finally, we note that the results are qualitatively the same for all base classifiers. QWeighted does not seem to depend on the choice of base classifiers.

For a more systematic investigation of the complexity of the algorithm, we performed a simulation experiment. We assume classes in the form of numbers from 1...c, and, without loss of generality, 1 is always the correct class. We further assume pairwise base pseudo-classifiers i ≺_ε j which, for two numbers i < j, prefer the smaller (i.e., better) class with probability 1 − ε and the larger one with probability ε. Thus, ε = 0.0 corresponds to a pairwise classifier where all predictions are consistent with a total order of the possible class labels, and ε = 0.5 corresponds to the case where the predictions of the base classifiers are entirely random.

Table 2: Average number n of pairwise comparisons needed by QWeighted in the simulation, for various numbers of classes c and error probabilities ε of the base pseudo-classifiers. Below each average, we show the trade-off (n−(c−1))/(max−(c−1)) between the best and worst case and an estimate of the growth ratio log(n2/n1)/log(c2/c1) of successive values of n (given as "trade-off / growth").

  c      ε=0.0           ε=0.05          ε=0.1           ε=0.2            ε=0.3            ε=0.5            full
  5        5.43            5.72            6.07            6.45             6.90             7.12              10
         0.238 / —       0.287 / —       0.345 / —       0.408 / —        0.483 / —        0.520 / —
  10      14.11           16.19           18.34           21.90            25.39            28.74              45
         0.142 / 1.378   0.200 / 1.501   0.259 / 1.595   0.358 / 1.764    0.455 / 1.880    0.548 / 2.013
  25      42.45           60.01           76.82          113.75           151.19           198.51             300
         0.067 / 1.202   0.130 / 1.430   0.191 / 1.563   0.325 / 1.798    0.461 / 1.974    0.632 / 2.109
  50      91.04          171.53          251.18          422.58           606.74           868.25           1,225
         0.036 / 1.101   0.104 / 1.515   0.172 / 1.709   0.318 / 1.894    0.474 / 2.005    0.697 / 2.129
  100    189.51          530.17          900.29        1,684.21         2,504.54         3,772.45           4,950
         0.019 / 1.058   0.089 / 1.628   0.165 / 1.842   0.327 / 1.995    0.496 / 2.045    0.757 / 2.119

Table 2 shows the results for various numbers of classes (c = 5, 10, 25, 50, 100) and for various settings of the error parameter (ε = 0.0, 0.05, 0.1, 0.2, 0.3, 0.5). Each data point is the average outcome of 1000 trials with the corresponding parameter settings. We can see that even for entirely random data, our algorithm can still save about 1/4 of the pairwise comparisons that would be needed for the entire ensemble. For cases with a total order and error-free base classifiers, the number of needed comparisons approaches the number of classes, i.e., the growth appears to be linear.
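The simulation setup can be reproduced along the following lines (an illustrative sketch, not the original code, under the assumption stated above that a pseudo-classifier for a pair i < j votes for the smaller class with probability 1 − ε; it reuses the hypothetical qweighted function from the earlier sketch):

```python
import random

def make_noisy_oracle(epsilon):
    """Pairwise pseudo-classifier for the simulation: for two classes a, b with
    a < b it votes for the smaller (i.e. better) class with probability
    1 - epsilon and for the larger one with probability epsilon."""
    def evaluate(a, b):
        correct = 1.0 if a < b else 0.0        # vote consistent with the total order
        if random.random() < epsilon:          # flip the vote with probability epsilon
            return 1.0 - correct
        return correct
    return evaluate

def average_comparisons(c, epsilon, trials=1000):
    """Average number of pairwise evaluations QWeighted needs for c classes."""
    total = 0
    for _ in range(trials):
        calls = [0]
        oracle = make_noisy_oracle(epsilon)
        def counted(a, b):
            calls[0] += 1
            return oracle(a, b)
        qweighted(list(range(1, c + 1)), counted)   # reuses the earlier sketch
        total += calls[0]
    return total / trials
```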
To shed more light on this, we provide two more measures below each average: the lower left number shows the trade-off between best and worst case, as defined above. The result confirms that for a reasonable performance of the base classifiers (up to about ε = 0.2), the fraction of additional work reduces with the number of classes. Above that, we start to observe a growth. The reason for this is that with a low number of classes, there is still a good chance that the random base classifiers produce a reasonably ordered class structure, while this chance is decreasing with increasing numbers of classes. On the other hand, the influence of each individual false prediction of a base classifier decreases with an increasing number of classes, so that the true class ordering is still clearly visible and can be better exploited by the QWeighted algorithm.

This is illustrated in Figure 1b, which shows the distribution of the votes produced by the SVM base classifier for the dataset vowel. As shown on the scale to the right, different color codes are used for encoding different numbers of received votes. Each line in the plot represents one example; the left shows the highest number of votes, the right the lowest number of votes. If all classes received the same number of votes, the area would be colored uniformly. However, here we observe a fairly clear change in the color distribution, the bright areas to the left indicating that the top-rank class often receives nine or more votes, and the areas to the right indicating that the lowest ranking class typically receives less than one vote (recall that we use weighted voting).

We tried to directly estimate the exponent of the growth function of the number of comparisons of QWeighted, based on the number of classes c. The resulting exponents, based on two successive measure points, are shown below the absolute numbers in Table 2. For example, the exponent of the growth function between c = 5 and c = 10 is estimated (for ε = 0) as log(14.11/5.43) / log(10/5) ≈ 1.378. We can see that the growth rate starts almost linearly (for a high number of classes and no errors in the base classifiers) and approaches a quadratic growth when the error rate increases.⁵

[Figure 1: a) Efficiency of QWeighted in comparison to a full pairwise classifier: number of comparisons as a function of the number of classes c for the datasets vehicle, glass/image, yeast, vowel, soybean and letter, together with the c(c−1)/2 curve of full voting. b) Distribution of votes for vowel (11-class problem, base learner NB). The x-axis describes the ranking positions.]

In addition to the small datasets from Table 1, we evaluated the QWeighted algorithm on three more real-world datasets with a relatively high number of classes:

Uni-label RCV1-v2. RCV1-v2 (Lewis et al., 2004) is a dataset consisting of over 800,000 categorized news articles from Reuters, Ltd. For the category topic, multiple labels from a total of 103 hierarchically organized labels are assigned to the instances. We transformed this original multi-label dataset to uni-label to be suitable for the QWeighted algorithm. For each instance, the label with the greatest depth in the hierarchical tree among the assigned labels was selected as the true class label. In case multiple labels have the same greatest depth, the respective instance was removed. We applied this procedure to the provided trainset and testset no. 0 by Lewis et al., resulting in a multi-classification dataset with 100 classes, 23,149 train- and 199,328 test-instances, with at least one positive example for each of the 100 classes.⁶ We will refer to this created dataset as urcv1-v2.
ASTRAL 2 & 3. These datasets describe protein sequences retrieved from the SCOP 1.71 protein database (Murzin et al., 1995). We used ASTRAL (Brenner et al., 2000) to filter these sequences so that no two sequences share greater than 95% identity. The class labels are organized in a 3-level hierarchy, consisting of protein folds, superfamilies and families (in descending order). astral3 consists of 1,588 classes and contains the original hierarchy. To fill the gap between the datasets urcv1-v2 and astral3 in terms of number of classes, we constructed a second dataset astral2 by limiting the hierarchical depth to 2. So, two instances that previously belonged to different families of the same superfamily x are now both assigned to superfamily x as their new class label. By decreasing the depth, the number of classes was reduced to 971. Both datasets have 13,006 instances and 21 numeric attributes (20 amino acids plus selenocysteine).

⁵ At first sight, it may be surprising that some of the numbers are greater than 2. This is a result of the fact that c(c−1)/2 = (c² − c)/2 is quadratic in the limit, but for low values of c, the subtraction of the linear term c/2 has a more significant effect. Thus, e.g., the estimated growth of the full pairwise classifier from c = 5 to c = 10 is log(45/10) / log(10/5) ≈ 2.17.

⁶ The amount of train- and test-instances is identical to their unmodified versions. So, there was in no case an instance with multiple possible candidates as most specific label. Furthermore, only one of the originally 101 assigned labels was removed by the described procedure.

Dataset urcv1-v2 uses more than 40,000 features (stemmed words). To reduce the complexity of the multi-classification problem, χ²-based feature selection (Yang & Pedersen, 1997) was applied. After computing the ranking of features with decreasing χ² scores, we selected various numbers s of the top ranked features (s = 250, 500, 1000, 2000, 4000, 8000, 14000, 20000). The decision tree learner J48 was used as base learner because of its fast computational speed, and for validation we used the transformed testset. The results are shown in Table 3a. It is surprising that the accuracy value is only minimally influenced by the feature set size. All accuracy values lie within 61–62 percent for the pairwise approach and approximately 54 percent without any binarization. In fact, the pairwise approach reaches its maximum accuracy when using only 500 features. As a side note, we can see that the pairwise approach outperforms the direct multi-class capability (3rd column) of J48 in terms of accuracy.

Table 3: Results for the datasets with many classes (base learner J48). a) urcv1-v2 with varying numbers s of selected features: accuracy of the pairwise approach (1-vs-1) and of the direct multi-class classifier (Direct), together with the average number of comparisons of QWeighted (QW) and of the full pairwise classifier (full). b) Comparison of QWeighted and the full pairwise classifier on all three datasets; the trade-off (n−(c−1))/(max−(c−1)) between the best and worst case is given in brackets.

(a)
       s    Accuracy 1-vs-1   Accuracy Direct   Comparisons QW   full
     250        60.86             53.4260           297.33       4,950
     500        62.2828           54.4108           306.39       4,950
   1,000        61.5674           54.1043           310           4,950
   2,000        62.0801           54.5458           312.62       4,950
   4,000        62.1107           53.7747           312.60       4,950
   8,000        62.1072           53.4350           312.29       4,950
  14,000        62.0901           53.94             312.21       4,950
  20,000        61.3220           53.6568           312.35       4,950

(b)
  dataset      c      Comparisons QW           full    Acc.
  urcv1-v2     100        312.62 (0.0440)      4,950    62.1
  astral2      971      9,490.81 (0.0018)    470,935    24.8
  astral3    1,588     28,476.20 (0.0213)  1,260,078    20.8
The maximal number of pairwise comparisons (at s = 2000) from Table 3a is listed together with the results of the QWeighted algorithm for astral2 and astral3 in Table 3b. For astral2 and astral3, 66 percent of all instances were used for training and the rest for testing. Once again, trade-off values were estimated for the average number of pairwise comparisons. As these values show, QWeighted uses only a fairly small number of comparisons compared to a full voting aggregation and is clearly closer to the best case than to the worst case (c(c−1)/2 comparisons). One can see an increasing growth of the trade-off values between astral2 and astral3. However, this effect can be explained by the generally poor classification accuracy for protein sequences. According to the simulation results, there exists a correlation between the performance of QWeighted and the performance of the underlying base classifiers. The decreased accuracy on astral3 compared to astral2 (right-most column) indicates weaker base classifiers, which leads to an increasing number of needed pairwise comparisons.

In summary, our results indicate that the QWeighted algorithm always increases the efficiency of the pairwise classifier: for high error rates in the base classifiers, we can only expect improvements by a constant factor, whereas for the practical case of low error rates we can also expect a significant reduction in the asymptotic algorithmic complexity.

4. Efficient Pairwise Ranking

Pairwise classification not only delivers a single class label but a ranking of all classes, and in many applications this is of interest. An example might be speech recognition, where the ranking of possible recognized words can be combined with further background information like context or grammar to gain a higher accuracy. However, the QWeighted algorithm in the previous section focuses entirely on finding the top rank. In this section, we will briefly discuss an algorithm for efficient computation of a complete ranking.

4.1 The Swiss Algorithm

Our algorithm is based on the Swiss system that is commonly used in chess tournaments (FIDE, 2006). Its fundamental ideas are that all players play a fixed number of rounds, where in each round players of similar strength are paired with each other. After each round, a ranking is determined based upon the points collected so far, and the next pairings are drawn based on the current ranking (players with the same number of points play each other). Given n entrants, a limitation of log(n) rounds is typical, resulting in n·log(n) games, as opposed to the quadratic amount in the usual round-robin system. In addition, no pairing will be evaluated twice.
Algorithm 2: Swiss system

    begin
        for round = 1 to max_round do
            pair classes that have a similar score, starting with classes with byes;
            pair the remaining classes among each other;
            remaining classes receive a bye;
            evaluate the corresponding classifiers and update the loss values for each class;
    end

We adopted this system for use in pairwise classification, so that it computes a ranking based on weighted voting (Algorithm 2). Contrary to the regular Swiss system, this algorithm does not try to optimize the pairings, but greedily produces pairings in various stages. First, it pairs classes that had received a bye in the last round, then classes with a similar score, and then all remaining classes. As a result of the greedy pairing, we may have situations where for certain classes no valid pairings can be found, e.g., when all remaining classes were already previously paired with the specific class. In this situation the class receives a bye, which guarantees a pairing at the beginning of the next round, even if there is no similarly rated class available. The main advantage of this greedy approach is that we can also easily linearize it, i.e., we can evaluate the ranking quality not only after complete rounds, but after each individual pairing.

4.2 Evaluation

Unfortunately, most datasets, including those from the UCI repository, provide only the correct top rank class but no reference ranking of classes. The main reasons may be that these datasets were primarily intended for classification, and that even with external expert knowledge it is difficult to provide a full ranking of classes.⁷ For this reason, we used the ranking that is predicted by a full voting procedure as our reference ranking.

For all datasets, Ripper (Cohen, 1995), more precisely JRip, its implementation in Weka, was used as base learner. Ordinary round-based evaluations of the Swiss system lead to a small number of data points.⁸ For the purpose of smoother curves, we plotted the estimated rankings after each single evaluation of a binary classifier. There was no upper limit on the number of rounds, so we could continue until all pairwise comparisons have been performed.

⁷ Pyle (1999) acknowledges this difficulty and, interestingly, proposes a pairwise technique for ranking a list of options.

⁸ The number of rounds is at most (c(c−1)/2) / ⌊c/2⌋; it is not constant because of possible bye situations within the Swiss system, please refer to (FIDE, 2006) for details.

To measure the accuracy, two error measures were used:

• the normalized ranking error, where the error of each position between the predicted ranking τ_p and the true ranking τ_t is counted:

    D_R = (1 / k_c) · Σ_{i=1}^{c} (τ_t(i) − τ_p(i))²

The normalizing constant k_c = c(c²−1)/3 is the maximal possible ranking error. Note that this is a linear transformation of the Spearman rank correlation into a loss function with value range [0, 1].

• the position error, which uses the position of the true class ĉ in the predicted ranking:

    D_P = τ_p⁻¹(ĉ) − 1

It may be viewed as a simple generalization of the {0,1}-loss for problems in which one is interested in finding the correct label with a minimum amount of trials (Hüllermeier & Fürnkranz, 2005).

All values were estimated with 10-fold cross-validation, except for dataset letter, where the supplied testset was used.
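For illustration, the two measures can be computed as follows (a minimal sketch in Python; the representation of rankings as lists of class labels ordered from rank 1 downwards is an assumption made here for convenience and is not prescribed by the paper):

```python
def normalized_ranking_error(true_ranking, pred_ranking):
    """D_R = (1 / k_c) * sum_i (tau_t(i) - tau_p(i))^2 with k_c = c(c^2 - 1)/3.

    Both rankings are lists of the same class labels, best class first;
    tau(i) is the (1-based) rank position of class i in the respective ranking."""
    c = len(true_ranking)
    k_c = c * (c * c - 1) / 3.0               # maximal possible ranking error
    rank_true = {cls: pos + 1 for pos, cls in enumerate(true_ranking)}
    rank_pred = {cls: pos + 1 for pos, cls in enumerate(pred_ranking)}
    return sum((rank_true[cls] - rank_pred[cls]) ** 2 for cls in rank_true) / k_c

def position_error(true_class, pred_ranking):
    """D_P = tau_p^{-1}(c_hat) - 1: position of the true class in the predicted
    ranking, minus one (0 if the true class is ranked first)."""
    return pred_ranking.index(true_class)

# Example with three classes, true order [A, B, C], predicted [B, A, C]:
# normalized_ranking_error(["A", "B", "C"], ["B", "A", "C"]) == 2 / 8 == 0.25
# position_error("A", ["B", "A", "C"]) == 1
```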
First, we investigated the ranking error, i.e., the deviation of the ranking obtained by the Swiss system from the ranking of the full pairwise classifier. Figure 2a shows the ranking error curve for vowel, but all datasets have similar characteristics (Park, 2006). It shows that the ranking error decreases exponentially with increasing numbers of used pairwise classifiers, i.e., in the beginning, the ranking improves substantially with each additional comparison until a certain performance level is reached. After that, each additional comparison yields only small improvements, but the approximation is never 100% accurate.

In order to put this performance into perspective, we compared it to various other, fixed orderings of the pairwise classifiers:

• ascending (C_{1,2}, C_{1,3}, ..., C_{1,n}, C_{2,3}, C_{2,4}, ...)
• descending (C_{n,n−1}, ..., C_{n,1}, C_{n−1,n−2}, C_{n−1,n−3}, ...)
• diagonal (C_{2,1}, C_{3,1}, C_{3,2}, C_{4,3}, C_{4,2}, C_{4,1}, C_{5,4}, ...)
• and a random order.

Class 1 is the first class that is listed in the dataset, class 2 the second, and so on. Table 4 shows the average performance ratios r = D_{i,s} / D_{i,Swiss}, where D_{i,s} is the ranking error of one of the four ordered methods and D_{i,Swiss} is the corresponding ranking error of the Swiss algorithm. Thus, a value greater than 1 indicates a superior performance of the Swiss system. The Swiss system always outperformed the diagonal and random methods, typically by more than 10%. The Swiss system also wins in the majority of the cases against ascending and descending, but here we can find two exceptions, where one of these two wins by a comparably small margin (about 4%). A possible explanation could be that in these datasets, the given fixed class order contains information that holds for most of the data (e.g., the majority class is first in the list). The ascending and descending ordering schemes could exploit such information (also note that in these cases a good performance of one is accompanied by a bad performance of the other), whereas the other two methods distribute the pairwise classifiers for each class fairly evenly across the sequence. Nevertheless, it seems safe to conclude that the Swiss system outperforms all fixed schemes.

The ascending order for datasets image and yeast and the descending order for vowel seem to break ranks, while most values demonstrate the superior performance of the Swiss system. These exceptions may be based on specific characteristics of these datasets and can probably be ignored in the general case.

Table 4: Comparison of the ranking error between the Swiss system and four class-ordered systems. The values show the average performance ratio, where a value of 1 means identical to the Swiss system and a value greater than 1 indicates a better performance of the Swiss algorithm.
  dataset    ascending   descending   diagonal   random
  vehicle      1.102       1.183        1.316     1.170
  glass        1.122       1.328        1.782     1.413
  image        0.968       1.491        1.181     1.106
  yeast        1.123       1.667        1.792     1.486
  vowel        1.285       0.965        1.381     1.106
  soybean      1.053       1.193        1.262     1.028
  letter       1.084       1.419        1.263     1.193

[Figure 2: a) Ranking error curve for vowel (Swiss system, 10-fold CV, base learner Ripper; x-axis: percentage of used classifiers, y-axis: normalized ranking error). b) Position error curves using the Swiss system for the datasets glass, image, yeast, vowel, soybean and letter (10-fold CV, except letter: testset; base learner Ripper; x-axis: percentage of used classifiers, y-axis: position error).]

Figure 2b shows the results for the position error. As can be seen, the position error converges faster than the ranking error to the position error of the full pairwise classifier. Contrary to the ranking error, not all pairwise comparisons are needed to predict the winning class. This observation is consistent with the consideration that if all incident classifiers of the correct top rank are evaluated during such a sequence, the result is a position error of zero, and further pairwise comparisons do not affect it anymore.

However, the saturation point, for which we can say that a reasonable position error has been obtained, differs between datasets. Table 5 takes a closer look at this issue, summarizing the results ordered by the number of classes. Here we computed a relative position error D_R, which is the position of the top-ranked class of the full ranking in the incomplete ranking. We assume that the ratings have converged when D_R < 0.02 (third column) or D_R = 0.0 (last column). Since the datasets are ordered by their number of classes, we can see for D_R < 0.02 a relation between the number of classes and the convergence of the position error: the higher the number of classes of a dataset, the smaller the percentage of pairwise comparisons that is needed to achieve an acceptable position error. This trend is less clear for a perfect position error, as can be seen in the fourth column. Here it can also be seen that in most cases almost all pairwise comparisons were needed for a perfect agreement with the predictions of the full pairwise classifier. Note, however, that contrary to the QWeighted algorithm, the Swiss algorithm still attempts to approximate the full ranking and does not focus on the top rank only.

Table 5: Position error on all datasets. The last two columns show the average number of pairwise comparisons needed for a relative position error within the threshold D_R < 0.02 and for D_R = 0, respectively (in brackets: percentage of the full c(c−1)/2 comparisons).

  dataset    c    D_R < 0.02     D_R = 0
  vehicle    4      5 (83%)       6 (100%)
  glass      7     17 (81%)      19 (90%)
  image      7     14 (67%)      20 (95%)
  yeast     10     16 (36%)      44 (98%)
  vowel     11     27 (49%)      52 (95%)
  soybean   19     39 (23%)     145 (85%)
  letter    26     34 (10%)     315 (97%)

5. Conclusions

In this paper, we have proposed two novel algorithms that allow speeding up the prediction phase for pairwise classifiers. The QWeighted algorithm will always predict the same class as the full pairwise classifier, but in practice the algorithm is close to linear in the number of classes, in particular for large numbers of classes, where the problem is most stringent. For very hard problems, where the performance of the binary classifiers reduces to random guessing, its worst-case performance is still quadratic in the number of classes, but even there practical gains can be expected.
The Swiss algorithm does not focus on the top-ranked class, but tries to approximate the complete ranking of the classes. While we could not directly evaluate the ranking ability of this algorithm (because of a lack of datasets with complete class ranking information for each example), we could demonstrate that this adaptive algorithm outperforms alternative, fixed evaluation orders for the pairwise classifiers, and therefore allows for a better trade-off between the number of evaluated pairwise classifiers and the quality of the approximation of the full class ranking.

A restriction of our approaches is that they are only applicable to combining predictions via voting or weighted voting. There are various other proposals for combining the class probability estimates of the base classifiers into an overall class probability distribution (this is also known as pairwise coupling (Hastie & Tibshirani, 1998; Wu et al., 2004)). Nevertheless, efficient alternatives for other pairwise coupling techniques are an interesting topic for further research.

Acknowledgments

This research was supported by the German Science Foundation (DFG).

References

Brenner, S. E., Koehl, P., & Levitt, M. (2000). The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Research, 28, 254–256.

Cohen, W. W. (1995). Fast effective rule induction. Proceedings of the 12th International Conference on Machine Learning (ML-95) (pp. 115–123). Lake Tahoe, CA: Morgan Kaufmann.

Cutzu, F. (2003a). How to do multi-way classification with two-way classifiers. Proceedings of the International Conference on Artificial Neural Networks (ICANN-03) (pp. 375–384). Springer-Verlag.

Cutzu, F. (2003b). Polychotomous classification with pairwise classifiers: A new voting principle. Proceedings of the 4th International Workshop on Multiple Classifier Systems (pp. 115–124). Springer, Berlin.

FIDE (2006). FIDE Swiss rules. In FIDE handbook, chapter 4. World Chess Federation (Fédération Internationale des Échecs). http://www.fide.com/official/handbook.asp?level=C04.

Fürnkranz, J. (2002). Round robin classification. Journal of Machine Learning Research, 2, 721–747.

Fürnkranz, J. (2003). Round robin ensembles. Intelligent Data Analysis, 7, 385–404.

Hastie, T., & Tibshirani, R. (1998). Classification by pairwise coupling. Advances in Neural Information Processing Systems 10 (NIPS-97) (pp. 507–513). MIT Press.

Hettich, S., Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html. Department of Information and Computer Science, University of California at Irvine, Irvine, CA.

Hsu, C.-W., & Lin, C.-J. (2002). A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13, 415–425.

Hüllermeier, E., & Fürnkranz, J. (2004a). Comparison of ranking procedures in pairwise preference learning. Proceedings of the 10th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU-04). Perugia, Italy.

Hüllermeier, E., & Fürnkranz, J. (2004b). Ranking by pairwise comparison: A note on risk minimization. Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE-04). Budapest, Hungary.

Hüllermeier, E., & Fürnkranz, J. (2005). Learning label preferences: Ranking error versus position error. Advances in Intelligent Data Analysis: Proceedings of the 6th International Symposium (IDA-05) (pp. 180–191). Madrid, Spain: Springer-Verlag.
Lewis, D. D., Yang, Y., Rose, T., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361–397.

Murzin, A. G., Brenner, S. E., Hubbard, T., & Chothia, C. (1995). SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247, 536–540.

Park, S.-H. (2006). Efficient classification and ranking with pairwise comparisons. Diploma thesis, TU Darmstadt. In German.

Platt, J. C., Cristianini, N., & Shawe-Taylor, J. (2000). Large margin DAGs for multiclass classification. Advances in Neural Information Processing Systems 12 (NIPS-99) (pp. 547–553). MIT Press.

Pyle, D. (1999). Data preparation for data mining. San Francisco, CA: Morgan Kaufmann.

Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques with Java implementations. Morgan Kaufmann Publishers. 2nd edition.

Wu, T.-F., Lin, C.-J., & Weng, R. C. (2004). Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5, 975–1005.

Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. Proceedings of the Fourteenth International Conference on Machine Learning (ICML-97) (pp. 412–420). Morgan Kaufmann.
