
Volume 129, Issue 6, June 2009, Pages 1532-1561

Information Sciences

Supervised ranking in the weka environment

a Department of Applied Mathematics and Computer Science, Ghent University, Krijgslaan 281 S9, B-9000 Gent, Belgium
b Department of Applied Mathematics, Biometrics and Process Control, Ghent University, Coupure links 653, B-9000 Gent, Belgium
Corresponding author. Tel.: +32 9 264 59 41; fax: +32 9 264 62 20.

Research highlights

  • The supervised ranking problem has its own – very important – distinct characteristics. In this article, we focus on differences with regard to the traditional classification problem.
  • We have given four propositions that all provide solutions to the supervised ranking problem (for more information, we refer to [14–16]).
  • A software implementation of the proposed solutions can be used inside the WEKA environment.
  • We have shown how to get started with this software, and we have given snapshots of WEKA outputs that show that the implementation indeed fits into the framework.

Abstract

Software for solving the supervised ranking problem is presented. Four variants of the Ordinal Stochastic Dominance Learner (OSDL) are given, together with the space and time complexity of their implementations. It is shown that the described software, which includes two further algorithms for supervised ranking, fits seamlessly into the weka environment.


1. The supervised ranking problem

1.1. Problem description

The supervised ranking problem, which is sometimes called the supervised ordered sorting problem, undeniably bears some resemblance to the traditional supervised classification problem, but has its own – very important – distinct characteristics. Given a collection of learning examples or instances, the goal is, just as in traditional supervised classification, to attach a label to (unseen and/or unlabelled) instances in a way that extends the learning examples in the ‘most reasonable way’.

In traditional classification algorithms, instances are mostly described by feature vectors or attribute vectors belonging to a data space Formula , i.e. an object x from the object space Ω is described by an attribute vector Formula . The measurement scale of these features is quite arbitrary, and may range from nominal to numerical. The first important difference is that in the ranking problem, instances are described by true criteria [7], i.e. each ‘feature’ is measured on (at least) an ordinal scale, and has an associated meaning (semantics) of preference. This is important as it now makes sense to compare two objects a and b from the object space Ω on the basis of a single criterion ci. One says that object a is strictly preferred to object b w.r.t. ci if Formula . A second difference concerns the measurement scale of the labels: in traditional classification problems, the set of labels Formula constitutes a nominal scale, while in the ranking problem, the set Formula is a finite chain, again with associated preference semantics. It thus also makes sense to compare objects on the basis of their labels.

These two differences w.r.t. the traditional classification problem give rise, in a very natural way, to a basic monotonicity constraint. Indeed, it would be very unnatural for an object a that, on each criterion, scores at least as well as an object b, to get a global evaluation (label) that is lower than that of b. Mathematically speaking, this means the following: the product ordering turns Formula into a partially ordered set Formula , while Formula is a chain, and we are looking for an order-preserving mapping Formula . More explicitly, it must hold for all elements x and y of Formula that:

(1)Formula (1)
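As a small illustration (our own sketch, not part of the described software; we encode a feature vector as an int array of criterion values), the product ordering and constraint (1) can be checked by brute force:

```java
// Hypothetical helper classes illustrating the product ordering on criterion
// vectors and a brute-force check of the monotonicity constraint (1).
class Monotonicity {
    // x <= y in the product ordering: x scores at most y on every criterion
    static boolean smallerEq(int[] x, int[] y) {
        for (int i = 0; i < x.length; i++)
            if (x[i] > y[i]) return false;
        return true;
    }

    // does the labelling d satisfy constraint (1) on the given instances?
    static boolean isMonotone(int[][] xs, int[] d) {
        for (int i = 0; i < xs.length; i++)
            for (int j = 0; j < xs.length; j++)
                if (smallerEq(xs[i], xs[j]) && d[i] > d[j]) return false;
        return true;
    }
}
```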

Sometimes however, a classification algorithm does not return a single label, but a probability mass function (PMF) over the set of labels Formula . We write this as Formula . For the ranking problem, the set of all probability mass functions over Formula , which we already denoted as Formula , has to be equipped with a partial order supporting the idea that one PMF assigns more weight to the higher labels than another PMF. The weak (first order) stochastic dominance relation is a suitable ordering for Formula [9]. If fX and fY are two elements of Formula , with associated cumulative probability distribution functions (CDF) FX and FY, then

(2)Formula (2)

and we say that fX is (weakly) dominated by fY. The set of all CDFs over Formula is denoted by Formula , and the order Formula naturally carries over from Formula to Formula as there is a bijection between the two sets. This concept of stochastic dominance is of primordial importance in the framework described in the next section.
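Relation (2) amounts to a pointwise comparison of the two CDFs; a minimal sketch (the class and method names are ours, and a CDF is represented as a double array over the label chain):

```java
// Hypothetical sketch of weak (first order) stochastic dominance (2).
class Dominance {
    // fX is (weakly) dominated by fY iff FX(l) >= FY(l) for every label l
    static boolean dominatedBy(double[] FX, double[] FY) {
        for (int l = 0; l < FX.length; l++)
            if (FX[l] < FY[l]) return false;
        return true;
    }
}
```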

In a stochastic context, the supervised ranking problem can be stated as follows: find a learning algorithm, or a mapping Formula , that satisfies the stochastic monotonicity constraint:

(3)Formula (3)

Of course, we must be able to select a single label from a PMF computed by Formula in a way that doesn’t break the required monotonicity. In [10], it is shown that the Bayesian decision (extracting the label with maximal probability) might break the monotonicity, while taking the mean (and rounding) is valid on numerical scales [12]. Taking the “median” is an operation that is valid on ordinal scales, while preserving the monotonicity property [10]. In general, one has a set of medians for a PMF and choosing a label consistently (e.g. left end point, right end point or midpoint) from this set, which is actually an interval, preserves the monotonicity property. In the rest of this article, we will use the midpoint (or the biggest of the two midpoints) of the set of medians to select labels from PMFs.
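This midpoint-of-medians rule can be sketched as follows (a hypothetical helper, not the paper's actual code; we represent a PMF as a double array over the labels 0, …, k−1, and use the fact that ℓ is a median iff F(ℓ) ≥ 1/2 and F(ℓ−1) ≤ 1/2):

```java
// Hypothetical sketch of label selection via the set of medians.
class MedianLabel {
    // index of the midpoint of the set of medians of a PMF f over the labels
    // 0..k-1; for an even-sized median set the biggest midpoint is taken
    static int medianLabel(double[] f) {
        int lo = -1, hi = -1;
        double F = 0.0;                      // running CDF value F(l)
        for (int l = 0; l < f.length; l++) {
            double Fprev = F;                // F(l - 1)
            F += f[l];
            if (F >= 0.5 && Fprev <= 0.5) {  // l belongs to the set of medians
                if (lo < 0) lo = l;
                hi = l;
            }
        }
        return (lo + hi + 1) / 2;            // midpoint, rounded up
    }
}
```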

At this point, we would also like to stress that the collection of learning examples Formula is a multiset, meaning that in general a learning example Formula can occur more than once. In the setting of traditional supervised classification algorithms, doubt may occur, i.e. there may be examples that have an identical feature vector, but with different labels assigned to them. In the context of supervised ranking algorithms, a second kind of mistake, related to the monotonicity constraint, may be present: there may be feature vectors x and y for which Formula , while it holds for their assigned labels d(x) and d(y) that Formula . One says that there is reversed preference between x and y. The algorithms presented here can handle data containing both doubt and reversed preferences, which is a prerequisite for supervised ranking algorithms to be useful in a real-life setting, as most real-life data sets are pervaded with noise, resulting in doubt and reversed preferences.

The outline of this paper is as follows: in Section 2, we briefly recall the theory developed in [10] which serves as the basis for instance-based supervised ranking algorithms. We concentrate on those parts of the theory that have immediate practical consequences and state four propositions each of which gives rise to a different OSDL variant. Section 3 gives pseudo code of the implemented algorithms and we state the time and space complexity of building the classifier and of classifying an arbitrary element. Three of these algorithms have one degree of freedom and we describe how one can ‘tune’ this parameter. The following section then gives some examples of how to use the software from within the weka environment [4]. We have opted for weka as it is an increasingly popular collection of machine learning algorithms for data mining tasks, issued as open source software under the GNU General Public License. It contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes.

2. Methods for solving the supervised ranking problem

In [10], a probabilistic framework for dealing with the supervised ranking problem is developed. This general framework begins with describing two particular CDFs, the so-called minimal and maximal extension Fmin and Fmax, that obey the stochastic monotonicity constraint (3); the framework then gives the necessary and sufficient conditions that any ‘interpolation scheme’ between Fmin and Fmax has to satisfy in order to give rise to a supervised ranking algorithm satisfying the stochastic monotonicity constraint (3). Here, we give the definition of the minimal and maximal extensions Fmin and Fmax, and then state some propositions, which are in fact corollaries of the main theorem of [10], without proof.

Let Formula be a collection of learning examples, and let Formula be the image of Formula under the mapping which associates each object with its feature vector, i.e. Formula . For each Formula , let Formula denote a CDF, that is estimated from the collection of learning examples. For the main theorem in [10], the way in which this estimation is done is immaterial, but for the propositions, we will assume that Formula is in fact the maximum likelihood estimation Formula :

(4)Formula (4)

The set Formula is called the stochastic training data set. It is natural to say that the stochastic training data set Formula is monotone – compare with the monotonicity constraint (3) – if for all elements x and y of Formula it holds that:

Formula
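Under the stated assumption, the maximum likelihood estimation (4) is simply the empirical CDF of the multiset of labels observed at a given feature vector. As a hedged sketch (class and method names are ours):

```java
// Hypothetical sketch of the maximum likelihood CDF estimate (4).
class MleCdf {
    // empirical CDF at a feature vector, computed from the multiset of labels
    // of the learning examples that share that feature vector; k labels: 0..k-1
    static double[] estimate(int[] labelsAtX, int k) {
        double[] F = new double[k];
        for (int l : labelsAtX) F[l] += 1.0;   // frequency counts per label
        double cum = 0.0;
        for (int l = 0; l < k; l++) {
            cum += F[l];
            F[l] = cum / labelsAtX.length;     // cumulative fraction
        }
        return F;
    }
}
```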

Given a stochastic training data set Formula , we define the minimal and maximal extension as

(5)Formula (5)

and

(6)Formula (6)

When Formula , the PMF fmin is defined as the PMF that assigns probability 1 to Formula , i.e. the PMF that is dominated by all other PMFs over Formula . Analogously, when Formula , the PMF fmax assigns probability 1 to Formula ; fmax is then the PMF dominating all other PMFs.

One can easily prove some important properties of these minimal and maximal extensions, the first one being that, for each element x of Formula , Fmin(x) is indeed a CDF over Formula , and for all x and y of Formula one has that Formula implies Formula (and likewise for Fmax). A second important property is that when the stochastic training data set Formula is monotone, then it holds for any monotone extension Formula of Formula from Formula to Formula that Formula , for each element x of Formula . (Note that this also explains the names minimal and maximal extension.) Thirdly, when Formula is not monotone, and when x and y, with Formula , do not satisfy the stochastic monotonicity constraint, i.e. Formula is a so-called couple of reversed preference, then for each z between x and y there exists at least one label ℓ of Formula such that Formula . We often refer to the situation Formula as the monotone situation, while we term Formula the reversed preference situation.

From the first property given in the previous paragraph, it is clear that both the minimal and maximal extension yield a solution to the supervised ranking problem. Next, we will give some ‘interpolation’ schemes for which one can show that they all are valid solutions to the supervised ranking problem. We start with a very simple interpolation scheme, which is especially useful when the stochastic training data set is monotone.

Proposition 2.1 OSDL Let Formula be a collection of learning examples with associated stochastic training data set Formula . For an arbitrary but fixed real number S from the interval [0, 1], define the mapping Formula as follows:

(7)Formula (7)

For all elements x of Formula , the mapping Formula is an element of Formula . Moreover, the partial mapping

Formula

is order preserving.

The only degree of freedom left in this proposition is the choice of the interpolation parameter S. We call the algorithm resulting from this proposition the Ordinal Stochastic Dominance Learner, or OSDL for short.
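The proposition translates almost directly into code. Below is a minimal sketch of our own (not the distributed implementation): consistent with the pseudo code in Section 3, Fmin starts from the point mass on the lowest label and is lowered pointwise by the training CDFs of elements below x, Fmax starts from the point mass on the highest label and is raised by elements above x, and the prediction is the interpolation S·Fmin + (1 − S)·Fmax:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the OSDL prediction of Proposition 2.1.
class OsdlSketch {
    static boolean smallerEq(int[] x, int[] y) {
        for (int i = 0; i < x.length; i++)
            if (x[i] > y[i]) return false;
        return true;
    }

    // F = S * Fmin + (1 - S) * Fmax, where Fmin (resp. Fmax) is the pointwise
    // min (resp. max) of the training CDFs of elements below (resp. above) x
    static double[] predictCdf(List<int[]> xs, List<double[]> cdfs,
                               int[] x, int k, double S) {
        double[] Fmin = new double[k], Fmax = new double[k];
        Arrays.fill(Fmin, 1.0);   // CDF of the point mass on the lowest label
        Fmax[k - 1] = 1.0;        // CDF of the point mass on the highest label
        for (int i = 0; i < xs.size(); i++) {
            double[] Fy = cdfs.get(i);
            if (smallerEq(xs.get(i), x))
                for (int l = 0; l < k; l++) Fmin[l] = Math.min(Fmin[l], Fy[l]);
            if (smallerEq(x, xs.get(i)))
                for (int l = 0; l < k; l++) Fmax[l] = Math.max(Fmax[l], Fy[l]);
        }
        double[] F = new double[k];
        for (int l = 0; l < k; l++) F[l] = S * Fmin[l] + (1 - S) * Fmax[l];
        return F;
    }
}
```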

Let x denote an element of Formula for which one would like to make a prediction. When Formula is very small (≈0) this means that Formula predicts a label strictly greater than ℓ for x. On the other hand, when Formula approaches 1, the maximal extension predicts a label at most ℓ for x. When both situations occur simultaneously, we are likely to end up in the reversed preference situation. When this is the case, the idea is to weigh Fmin and Fmax by the number of learning examples that support their respective cases. We define the following mappings:

Formula

counting the number of instances that indicate that x should receive a label strictly greater than ℓ, and

Formula

which counts the number of instances indicating that x should receive a label at most ℓ. We have the following proposition.

Proposition 2.2 B-OSDL Proposition 2.1 is still valid if one replaces Formula by:

(8)Formula (8)

In (8), we distinguish between the monotone situation, for which the interpolation parameter S is used, and the reversed preference situation for which the counting argument is triggered. The algorithm based on this second proposition is called Balanced OSDL, or B-OSDL for short. Note that this algorithm reduces to the OSDL when the stochastic training data set Formula is monotone, since in this case the reversed preference situation will never occur.
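The counts behind the balancing can be sketched as follows. This is our own hedged reading of the Formula displays above, so treat the exact predicates as assumptions: an example (y, ℓ′) with y ⩽ x and ℓ′ > ℓ argues for a label strictly greater than ℓ at x (the Fmin side), while one with y ⩾ x and ℓ′ ⩽ ℓ argues for a label at most ℓ (the Fmax side):

```java
// Hypothetical sketch of the counting arguments used by B-OSDL; the exact
// definitions of Nmin and Nmax are assumptions reconstructed from the text.
class BalanceCounts {
    static boolean smallerEq(int[] x, int[] y) {
        for (int i = 0; i < x.length; i++)
            if (x[i] > y[i]) return false;
        return true;
    }

    // examples below x whose label exceeds l: support for "label > l"
    static int nMin(int[][] ys, int[] labels, int[] x, int l) {
        int n = 0;
        for (int i = 0; i < ys.length; i++)
            if (smallerEq(ys[i], x) && labels[i] > l) n++;
        return n;
    }

    // examples above x whose label is at most l: support for "label <= l"
    static int nMax(int[][] ys, int[] labels, int[] x, int l) {
        int n = 0;
        for (int i = 0; i < ys.length; i++)
            if (smallerEq(x, ys[i]) && labels[i] <= l) n++;
        return n;
    }
}
```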

It is possible to devise a counting argument for the monotone situation as well. We define the weight of Fmin in the monotone situation as

(9)Formula (9)

while the weight for Fmin in the reversed preference situation, denoted Formula , is defined as in the reversed preference situation of (8).

We now have the following proposition, with a resulting algorithm called B2-OSDL, or doubly balanced OSDL.

Proposition 2.3 B2-OSDL Proposition 2.1 is still valid if one replaces Formula by:

(10)Formula (10)

where Formula and Formula are defined in (9) and (8) respectively.

Finally, let us mention that the monotonicity constraints (1), (3) are sometimes relaxed somewhat, in the sense that we only require them for elements x and y that are separated by a learning example. Stated otherwise, one only requires quasi-monotonicity w.r.t. the set Formula [11,13]. If one does not want to make a distinction between the monotone and reversed preference situations, the weights for Fmin and Fmax should only depend on x (and not on ℓ). A simple way of achieving this is by counting the number of learning examples that are smaller or bigger than x. We define:

Formula

However, to avoid division by zero, we add, if not already present, the elements Formula and Formula to the collection of learning examples Formula . These elements are very natural boundary conditions when trying to predict or reconstruct an order-preserving mapping from Formula to Formula (and this is exactly what the supervised ranking problem is). One now has the following proposition.

Proposition 2.4 Let Formula be a collection of learning examples in which, if not already present, the elements Formula and Formula are added. The associated stochastic training data set is Formula . The mapping Formula is defined as follows:

(11)Formula (11)

For all elements x of Formula the mapping Formula belongs to Formula . Moreover, the partial mapping

Formula

is quasi-monotone w.r.t. Formula .

3. Implementation of the propositions

3.1. Pseudo code of the two phases

Given a collection of learning examples Formula , the first phase in any classification algorithm involves building the classifier. In the case of the various OSDL variants, this is not too difficult as it simply involves the estimation of the CDFs Formula . As the propositions assume the maximum likelihood estimation Formula , this is what we will use in the implementation as well. The pseudo code of this first phase is given as the method buildClassifier in the top part of Algorithm 1. The fact that this first phase is quite simple explains why we have previously termed algorithms resulting from the aforementioned framework instance-based.

In this code, a Map represents an implementation of an associative array as a hash table. An associative array maps or binds keys to values. In this case, the keys are elements of the data space Formula , while the values are e.g. frequency counts of the labels, or elements of Formula . We assume that the implementation of the map is such that retrieval and insertion of key/value pairs both have amortised constant time complexity (ignoring the time to compute the hash code).

In practice, we must attach a hash code to each element of Formula . An arbitrary element of Formula consists of a vector of criterion values, i.e. Formula , where each ci(x) is measured on an ordinal scale Formula . Let Formula be the unique increasing mapping that identifies Formula with the first Formula natural numbers. For simplicity, we denote si(ci(x)) as xi. We now interpret x as a number represented in a mixed radix system [8, Section 4.1]:

Formula

This yields an increasing bijection between Formula and the set of natural numbers smaller than Formula , and for each element of Formula the corresponding hash code is calculated in O(n) time, where n denotes the number of criteria.
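A sketch of this mixed radix encoding (our own code, assuming the translated values xi and the cardinalities of the ordinal scales are given as int arrays):

```java
// Hypothetical sketch of the mixed radix hash code of an ordinal feature vector.
class MixedRadix {
    // xi must lie in {0, ..., radix[i] - 1}; the encoding is an increasing
    // bijection onto {0, ..., radix[0] * ... * radix[n-1] - 1}, computed in O(n)
    static long encode(int[] x, int[] radix) {
        long h = 0;
        for (int i = 0; i < x.length; i++)
            h = h * radix[i] + x[i];
        return h;
    }
}
```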

We can now state the complexity of buildClassifier easily as Formula . Indeed, the loop starting on the second line traverses Formula elements, the get and replace methods each take O(n) time (for the computation of the hash code), while the update method is either constant (when freqCount already exists) or of order O(n) (otherwise). The loop starting on line 7 considers Formula elements, and the conversion of a frequency count to a CDF takes O(n), while putting the CDF in the map again takes O(n) time. As Formula the total time complexity of buildClassifier is Formula . Any classifier that makes use of all learning examples presented has at least this time complexity, as one has to consider each learning example at least once. So in this sense, the time to build the classifier is optimal. As for the space complexity, this is easily seen to be of order Formula (ignoring the space required for storing the initial learning examples).

Once buildClassifier has been executed, a user generally wants to compute the PMF associated with some element x; this is the second phase of the algorithm. For the OSDL variant described in Proposition 2.1, the pseudo code for computing this PMF is given in the second part of Algorithm 1. One simply iterates over the map containing the estimated CDFs; if the element considered (i.e. the key of the map) is incomparable with x, one simply ignores it, otherwise one updates the CDFs Fmin and/or Fmax depending on whether the element considered is smaller and/or bigger than x. An upper bound for the time complexity of classifying an instance is easily seen to be Formula . (Here, we assume that comparing two elements takes Formula time, and that the methods takeMin and takeMax each take Formula time.)

We don’t give explicit pseudo code for the other, slightly more involved, versions of the OSDL algorithm, as this code would be just as straightforward as the code given above. All that is needed are some extra counting arguments in the appropriate places, e.g. to calculate Nmin and Nmax; this is easily achieved without increasing the order of growth of the running time complexity.

Algorithm 1. The Ordinal Stochastic Dominance Learner
buildClassifier()
Require: a collection of learning examples Formula
Ensure: a Map containing Formula pairs for the elements of Formula .
 1: map ← ∅; map_cdf ← ∅
 2: forFormula do
 3:  Formula
 4:  freqCount ← freqCount.update(d(x))
 5:  Formula
 6: end for
 7: forFormula do
 8:  Formula
 9:  Formula
10: end for
11: return map_cdf
distributionForInstance()
Require: an element x of Formula and S ∈ [0,1]
Ensure: a PMF over Formula for x
 1: Fmin ←  (1, 1, … , 1) // minimal CDF
 2: Fmax ←  (0, 0, … , 1) // maximal CDF
 3: forFormula do
 4:  ifFormula then
 5:  Formula
 6:  end if
 7:  ifFormula then
 8:  Formula
 9:  end if
10: end for
11: F ← SFmin +  (1 − S)Fmax
12: return F.toProbabilityMassFunction()

3.2. Tuning the interpolation parameter S

The attentive reader will have noticed that nowhere in the preceding text have we given a hint as to how to choose the interpolation parameter S that turns up in the various algorithms. In order to lift the burden of making a (more or less arbitrary) choice for the parameter S from the shoulders of the user of the program, we have made an attempt to let the program devise a reasonable value for S. Of course, we did not want this feature to increase the running time of the program dramatically. To this end, we use a simple scheme that resembles the way in which weka determines the number of neighbours to use in its implementation of the nearest neighbour algorithm, the IBk class. This scheme is based on leave-one-out cross validation [17, Section 5.4]. That is, for each possible value of S under consideration, we determine the performance of the algorithm for a specified loss function, typically the 0–1 loss function, using leave-one-out cross validation.

This is less expensive than it may sound, since it can be done rather efficiently, as is shown in Algorithm 2. Basically, one considers the learning examples one by one and classifies the learning example under consideration on the basis of the remaining learning examples. One does this for values of S that are equally spaced in the interval [Sl, Su]. The time complexity of this procedure is determined as follows: building the classifier takes Formula time. The operation that has to be done for each element takes Formula time. Basically, this involves two calls to distributionForInstance and a loop calculating the loss between d(x) and the median prediction. Computing the median takes Formula time. So, if we ignore the time to build the classifier (this has to be done anyhow), the time complexity of tuning the interpolation parameter is Formula .

Algorithm 2. Tuning the interpolation parameter S
Require: (1) a collection of learning examples Formula ;
     (2) a lower bound Sl and an upper bound Su for S;
     (3) a natural number N determining the subdivision of [Sl, Su];
     (4) a loss function L
Ensure: the tuned value for S
 1: map ← buildClassifierFormula
 2: step ← (Su − Sl)/N
 3: loss ← (0, 0, … , 0) // length N + 1
 4: for allFormula do
 5:  Formula
 6:  Formula
 7:  Formula
 8:  S ← Sl
 9:  for i = 0 to N do
10:  loss[i] ← loss[i] + L(d(x),median(Sfmin + (1 − S)fmax))
11:  S ← S + step
12:  end for
13:  Formula
14: end for
15: i ← findIndexOfMin(loss)
16: return Sl + i × step
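The efficiency of Algorithm 2 stems from the linearity of the interpolation: fmin and fmax are computed once per example, and the sweep over S (lines 8–12) only recombines them. A sketch of that inner loop in our own (hypothetical) code, reusing the midpoint-of-medians rule from Section 1:

```java
// Hypothetical sketch of the inner loop of Algorithm 2 for one learning example.
class TuneSweep {
    // add the 0-1 loss of the median prediction for every candidate S to `loss`
    static void accumulate(double[] loss, double[] fmin, double[] fmax,
                           int trueLabel, double sl, double step) {
        for (int i = 0; i < loss.length; i++) {
            double S = sl + i * step;
            double[] f = new double[fmin.length];
            for (int l = 0; l < f.length; l++)
                f[l] = S * fmin[l] + (1 - S) * fmax[l];  // recombine the PMFs
            if (medianLabel(f) != trueLabel) loss[i] += 1.0;
        }
    }

    // midpoint-of-medians rule (biggest of the two midpoints)
    static int medianLabel(double[] f) {
        int lo = -1, hi = -1;
        double F = 0.0;
        for (int l = 0; l < f.length; l++) {
            double Fprev = F;
            F += f[l];
            if (F >= 0.5 && Fprev <= 0.5) { if (lo < 0) lo = l; hi = l; }
        }
        return (lo + hi + 1) / 2;
    }
}
```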

4. Use of the software

The described algorithms are available as a Java implementation that is distributed under the GNU General Public License [1]. To facilitate the usage and dissemination of the algorithms, we have chosen to develop an implementation that fits seamlessly into the weka environment. The weka workbench [17], also distributed under the terms of the GNU GPL, is a very comprehensive collection of machine learning algorithms and data preprocessing tools, written in the Java programming language. A major advantage of using Java is that it is platform independent (one only needs a JVM for the target platform), and the fact that weka is open source allows one to learn how actual classifiers are implemented (the provided sources may serve as a basis for implementing your own classifiers).

For a user only interested in employing machine learning algorithms, weka offers various graphical user interfaces, such as the "Explorer" and the "Knowledge Flow" interface, both designed for working conveniently with filters and classifiers, and the "Experimenter" interface, designed for conducting experiments regarding the performance of different classifiers on a range of data sets in a quite automated manner. Since our implementation of the OSDL variants adheres to all the conventions imposed on weka classifiers, it is very easy to make them available as choices in the various menus of these interfaces, as is shown in Fig. 1. One can also use the various algorithms (including the OSDL algorithm) from the command line.

Fig. 1

Fig. 1: Snapshot of weka showing the available algorithms for supervised ranking.

As can be seen in Fig. 1, two other classes also reside in the weka.classifiers.monotone package, namely MinMaxExtension and OLM. These classes implement other solutions to the supervised ranking problem. As the name indicates, MinMaxExtension implements the minimal and maximal extension, two very simple solutions to the supervised ranking problem, which are in fact non-stochastic analogues of the minimal and maximal extensions Fmin and Fmax introduced earlier. The class OLM is an implementation of (some variants of) the Ordinal Learning Method [5,6], the first algorithm devised especially for supervised ranking.

After obtaining and installing weka and the provided implementation of the OSDL algorithm [2], a typical first usage using the command line interface would be as follows:

java weka.classifiers.monotone.OSDL

This will display an error message because no training file was given, but, more interestingly, it will also show the various options available when using the OSDL class. We ignore the options common to all weka-classifiers, but we do give an overview of the options specific to the OSDL class together with their meaning. These can be found in Table 1.

Table 1: Options specific for the OSDL class.

-C 〈classification type〉
Sets the classification type to be used (default: MED)
Valid classification types are: REG, WSUM, MAX, MED, RMED
-B Use the balanced version of the Ordinal Stochastic Dominance Learner
-W Use the weighted version of the Ordinal Stochastic Dominance Learner
-S 〈value of interpolation parameter〉
Sets the value of the interpolation parameter (default: 0.5)
-L 〈Lower bound for interpolation parameter〉
Lower bound for the interpolation parameter (default: 0)
-U 〈Upper bound for interpolation parameter〉
Upper bound for the interpolation parameter (default: 1)
-P 〈Number of parts〉
Determines the step size for tuning the interpolation parameter, namely (U−L)/P
-D Outputs debugging info (default: false)

The -C option sets the way in which a single label is deduced from the PMF generated by the algorithm. The value MED is the most important one, as it uses the median, which is valid on ordinal scales and, on top of that, ensures monotonicity. More specifically, MED chooses the midpoint (or the biggest of the two midpoints) of the set of medians. The value RMED does the same, but does not “round” the result to the closest label, so “halfway” labels are possible, i.e. when the set of medians consists of an even number of labels, an intermediate label is produced. When one chooses MAX, the Bayesian decision is made, which is valid on ordinal scales, but which may break the monotonicity property. Hence, choosing MAX does not guarantee a valid solution to the supervised ranking problem! Finally, REG and WSUM both use the mean to deduce a label from the PMF; the first choice does no rounding, while the second does. The scores used for the calculation of the mean are equidistant, as they are the internal weka values for the labels. Although ensuring monotonicity, these last two choices are not semantically valid on intrinsically ordinal scales.
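Under the stated semantics, and assuming the equidistant scores are simply the label indices 0, …, k−1 (an assumption of ours, as are the class and method names), REG and WSUM amount to:

```java
// Hypothetical sketch of the REG and WSUM label-selection rules.
class MeanLabel {
    // REG: expectation of the label index under the PMF (no rounding)
    static double reg(double[] f) {
        double m = 0.0;
        for (int l = 0; l < f.length; l++) m += l * f[l];
        return m;
    }

    // WSUM: the same expectation, rounded to the closest label
    static int wsum(double[] f) {
        return (int) Math.round(reg(f));
    }
}
```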

When the switch -B is set, one uses B-OSDL; when the switch -W is set, the quasi-monotone version of OSDL, described in Proposition 2.4 is triggered. When both are set simultaneously, the doubly balanced OSDL algorithm, B2-OSDL, is used.

The meaning of the -S option should be clear; it simply sets the interpolation parameter S that is used in the OSDL variants (if the quasi-monotone variant is used, this option is simply ignored). The following three options -L, -U and -P, set the values of Sl, Su and N in Algorithm 2, respectively. The switch -D finally does as it says in the description; when it is set some extra debugging information is shown on the console.

A basic use is thus the following:

java weka.classifiers.monotone.OSDL -t car.arff -C MED -S 0.5

This applies the OSDL algorithm to the data set car.arff, which is the “Car Evaluation Database” from the UCI Machine Learning Repository [3] put into arff format, with interpolation parameter 1/2 and with the median as selection criterion. In Fig. 2, the output of this command is shown. There, one sees that the resubstitution error on the original data is 1.2153%. This means that the data set is quite monotone, but that couples of reversed preference are indeed present in the data. (By construction, the data set contains no doubt, and a property of OSDL is that on a data set containing neither doubt nor reversed preference, the resubstitution error is zero.) An estimate for the error on an independent test set was made using 10-fold stratified cross validation, and in this case it is 3.8194%. Also, some other statistics are shown, together with two confusion matrices.

Fig. 2

Fig. 2: Output from the command java weka.classifiers.monotone.OSDL -t car.arff -C MED -S 0.5.

Finally, as a last example, Fig. 3 shows a snapshot of the “Analyse” tab of the “Experimenter” interface. In fact, it is the result of running the input of Fig. 1. It shows the average percentage of correctly classified instances over 10 runs of 10-fold cross validation. We used six instantiations of the algorithms resulting from the four propositions, and for the first two propositions, we also used the option that allows tuning of the interpolation parameter. One sees that the six instantiations give slightly different results, some of them being statistically different at a significance level of 5%.

Fig. 3

Fig. 3: Screen shot showing the results of running six variants of the OSDL algorithm on the dataset car.arff.

5. Conclusion

In this article, we have stated the supervised ranking problem, focusing on the differences w.r.t. the traditional classification problem. We have given, without proof, four propositions that all provide solutions to the supervised ranking problem. Three of the propositions distinguish between the monotone situation and the reversed preference situation. An alternative approach is to relabel the data set so that it becomes monotone; in that case only the basic Ordinal Stochastic Dominance Learner is of use. For more information, we refer to [14–16].

Our main objective, however, was to introduce a software implementation of these solutions that can be used inside the weka environment. We have shown how to get started using this software, and we have given snapshots of weka outputs that show that the implementation indeed fits into the framework. The software was key to a successful application in group decision making [18–20].

In the near future, we plan to undertake an experimental study that compares the newly developed algorithms for supervised ranking with existing ones and with traditional classification algorithms that do not guarantee the monotonicity of their solutions. The weka environment will be used to carry out this experimental study.

References