Volume 129, Issue 6, June 2009, Pages 1532-1561

Supervised ranking in the weka environment

Stijn Lievens ^a,
Bernard De Baets ^b^⁎

^⁎ Corresponding author. Tel.: +32 9 264 59 41; fax: +32 9 264 62 20.
^a Department of Applied Mathematics and Computer Science, Ghent University, Krijgslaan 281 S9, B-9000 Gent, Belgium
^b Department of Applied Mathematics, Biometrics and Process Control, Ghent University, Coupure links 653, B-9000 Gent, Belgium

Research highlights

The supervised ranking problem has its own – very important – distinct characteristics. In this article, we focus on differences with regard to the traditional classification problem.
We have given four propositions that all provide solutions to the supervised ranking problem (for more information, we refer to [14, 15, 16]).
A software implementation of proposed solutions can be used inside the WEKA environment.
We have shown how to get started using this software, and we have given snapshots of WEKA outputs that show that indeed the implementation fits into the framework.

Abstract

Software for solving the supervised ranking problem is presented. Four variants of the Ordinal Stochastic Dominance Learner (OSDL) are given, together with the space and time complexity of their implementations. It is shown that the described software, which includes two further algorithms for supervised ranking, fits seamlessly into the weka environment.

1. The supervised ranking problem

1.1. Problem description

The supervised ranking problem, which is sometimes called the supervised ordered sorting problem, undeniably bears some resemblances to the traditional supervised classification problem, but has its own – very important – distinct characteristics. Given a collection of learning examples or instances, the goal is also, just as in traditional supervised classification, to attach a label to (unseen and/or unlabelled) instances in a way that extends the learning examples in the ‘most reasonable way’.

In traditional classification algorithms, instances are mostly described by feature vectors or attribute vectors belonging to a data space $X$ , i.e. an object x from the object space Ω is described by an attribute vector $x = (q_{1} (x), q_{2} (x), \dots, q_{n} (x)) \in \prod_{i = 1}^{n} X_{i} \equiv X$ . The measurement scale of these features is quite arbitrary, and may range from nominal to numerical. The first important difference is that in the ranking problem, instances are described by true criteria [7], i.e. each ‘feature’ is measured on (at least) an ordinal scale, and has an associated meaning (semantics) of preference. This is important as it now makes sense to compare two objects a and b from the object space Ω on the basis of a single criterion c_i. One says that object a is strictly preferred to object b w.r.t. c_i if $c_{i} (b) <_{c_{i}} c_{i} (a)$ . A second difference concerns the measurement scale of the labels: in traditional classification problems, the set of labels $L$ constitutes a nominal scale, while in the ranking problem, the set $L$ is a finite chain, again with associated preference semantics. It thus also makes sense to compare objects on the basis of their labels.

These two differences w.r.t. the traditional classification problem give in a very natural way rise to a basic monotonicity constraint. Indeed, it would be very unnatural that an object a that, on each criterion, scores at least as good as an object b, would get a global evaluation (label) that is lower than that of b. Mathematically speaking this means the following: the product ordering turns $X$ into a partially ordered set $(X, ⩽_{X})$ , while $(L, ⩽_{L})$ is a chain, and we are looking for an order-preserving mapping $λ : (X, ⩽_{X}) \to (L, ⩽_{L})$ . More explicitly, it must hold for all elements x and y of $X$ that:(1) $x <_{X} y \Rightarrow λ (x) ⩽_{L} λ (y) .$ Sometimes however, a classification algorithm does not return a single label, but a probability mass function (PMF) over the set of labels $L$ . We write this as $\hat{λ} : X \to F (L)$ . For the ranking problem, the set of all probability mass functions over $L$ , which we already denoted as $F (L)$ , has to be equipped with a partial order supporting the idea that one PMF assigns more weight to the higher labels than another PMF. The weak (first order) stochastic dominance relation is a suitable ordering for $F (L)$ [9]. If f_X and f_Y are two elements of $F (L)$ , with associated cumulative probability distribution functions (CDF) F_X and F_Y, then (2) $f_{X} ⪯_{FSD} f_{Y} \Leftrightarrow (\forall ℓ \in L) (F_{X} (ℓ) ⩾ F_{Y} (ℓ)),$ and we say that f_X is (weakly) dominated by f_Y. The set of all CDFs over $L$ is denoted by $F_{cum} (L)$ , and the order $⪯_{FSD}$ naturally carries over from $F (L)$ to $F_{cum} (L)$ as there is a bijection between the two sets. This concept of stochastic dominance is of primordial importance in the framework described in the next section.

In a stochastic context, the supervised ranking problem can be stated as follows: find a learning algorithm, or a mapping $\tilde{λ} : (X, ⩽_{X}) \to (F (L), ⪯_{FSD})$ , that satisfies the stochastic monotonicity constraint:(3) $x <_{X} y \Rightarrow \tilde{λ} (x) ⪯_{FSD} \tilde{λ} (y) .$ Of course, we must be able to select a single label from a PMF computed by $\tilde{λ}$ in a way that doesn’t break the required monotonicity. In [10], it is shown that the Bayesian decision (extracting the label with maximal probability) might break the monotonicity, while taking the mean (and rounding) is valid on numerical scales [12]. Taking the “median” is an operation that is valid on ordinal scales, while preserving the monotonicity property [10]. In general, one has a set of medians for a PMF and choosing a label consistently (e.g. left end point, right end point or midpoint) from this set, which is actually an interval, preserves the monotonicity property. In the rest of this article, we will use the midpoint (or the biggest of the two midpoints) of the set of medians to select labels from PMFs.

At this point, we would also like to stress that the collection of learning examples $(S, d) = {〈 a, d (a) 〉 | a \in S}$ is a multiset, meaning that in general a learning example $〈 x, d (x) 〉$ can occur more than once. In the setting of traditional supervised classification algorithms, it is possible that there occurs doubt, i.e. there may be examples which have an identical feature vector, but with a different label assigned to them. In the context of supervised ranking algorithms a second kind of mistake, related to the monotonicity constraint, may be present: there may be feature vectors x and y for which $x <_{X} y$ , while it holds for their assigned labels d(x) and d(y) that $d (x) ≰_{L} d (y)$ . One says that there is reversed preference between x and y. The algorithms presented here can handle data containing both doubt and reversed preferences, which is a prerequisite for supervised ranking algorithms to be useful in a real life setting, as most real life data sets are pervaded with noise, resulting in doubt and reversed preferences.

The outline of this paper is as follows: in Section 2, we briefly recall the theory developed in [10] which serves as the basis for instance-based supervised ranking algorithms. We concentrate on those parts of the theory that have immediate practical consequences and state four propositions each of which gives rise to a different OSDL variant. Section 3 gives pseudo code of the implemented algorithms and we state the time and space complexity of building the classifier and of classifying an arbitrary element. Three of these algorithms have one degree of freedom and we describe how one can ‘tune’ this parameter. The following section then gives some examples of how to use the software from within the weka environment [4]. We have opted for weka as it is an increasingly popular collection of machine learning algorithms for data mining tasks, issued as open source software under the GNU General Public License. It contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes.

2. Methods for solving the supervised ranking problem

In [10], a probabilistic framework for dealing with the supervised ranking problem is developed. This general framework begins with describing two particular CDFs, the so-called minimal and maximal extension F_min and F_max, that obey the stochastic monotonicity constraint (3); the framework then gives the necessary and sufficient conditions that any ‘interpolation scheme’ between F_min and F_max has to satisfy in order to give rise to a supervised ranking algorithm satisfying the stochastic monotonicity constraint (3). Here, we give the definition of the minimal and maximal extensions F_min and F_max, and then state some propositions, which are in fact corollaries of the main theorem of [10], without proof.

Let $(S, d)$ be a collection of learning examples, and let $S_{X}$ be the image of $S$ under the mapping which associates each object with its feature vector, i.e. $S_{X} = {a | a \in S}$ . For each $x \in S_{X}$ , let $\hat{F} (x, \cdot) : L \to [0, 1]$ denote a CDF, that is estimated from the collection of learning examples. For the main theorem in [10], the way in which this estimation is done is immaterial, but for the propositions, we will assume that $\hat{F}$ is in fact the maximum likelihood estimation $\hat{F} *$ :(4) $\hat{F} * (x, ℓ) = \frac{|{〈 y, d (y) 〉 \in (S, d) | y = x \land d (y) ⩽_{L} ℓ}|}{|{〈 y, d (y) 〉 \in (S, d) | y = x}|} .$ The set $(S_{X}, \hat{F})$ is called the stochastic training data set. It is natural to say that the stochastic training data set $(S_{X}, \hat{F})$ is monotone – compare with the monotonicity constraint (3) – if for all elements x and y of $S_{X}$ it holds that: $x <_{X} y \Rightarrow \hat{F} (x, \cdot) ⪯_{FSD} \hat{F} (y, \cdot) .$ Given a stochastic training data set $(S_{X}, \hat{F})$ , we define the minimal and maximal extension as (5) $\begin{matrix} F_{\min} : X \times L & \to [0, 1] \\ (x, ℓ) & \mapsto \min \{\hat{F} (y, ℓ) | y \in S_{X} \land y ⩽_{X} x\} \end{matrix}$ and (6) $\begin{matrix} F_{\max} : X \times L & \to [0, 1] \\ (x, ℓ) & \mapsto \max \{\hat{F} (y, ℓ) | y \in S_{X} \land y ⩾_{X} x\} . \end{matrix}$ When ${y \in S_{X} \land y ⩽_{X} x} = \emptyset$ , the PMF f_min is defined as that PMF that assigns probability 1 to $\min L$ , this is the PMF that is dominated by all other PMFs over $L$ . Analogously, when ${y \in S_{X} \land y ⩾_{X} x} = \emptyset$ , the PMF f_max assigns probability 1 to $\max L$ ; f_max is then the PMF dominating all other PMFs.

One can easily prove some important properties of these minimal and maximal extensions, the first one being that for each element x of $X, F_{\min} (x, \cdot) : L \to [0, 1]$ is indeed a CDF over $L$ , and for all x and y of $X$ one has that $x <_{X} y$ implies $F_{\min} (x, \cdot) ⪯_{FSD} F_{\min} (y, \cdot)$ (and likewise for F_max). A second important property is that when the stochastic training data set $(S_{X}, \hat{F})$ is monotone, then it holds for any monotone extension $\tilde{F}$ of $\hat{F}$ from $S_{X}$ to $X$ that $F_{\min} (x, \cdot) ⪯_{FSD} \tilde{F} (x, \cdot) ⪯_{FSD} F_{\max} (x, \cdot)$ , for each element x of $X$ . (Note that this also explains the names minimal and maximal extension.) Thirdly, when $(S_{X}, \hat{F})$ is not monotone, and when x and y, with $x <_{X} y$ , do not satisfy the stochastic monotonicity constraint, i.e. $(x, y)$ is a so-called couple of reversed preference, then for each z between x and y there exists at least one label ℓ of $L$ such that $F_{\min} (z, ℓ) < F_{\max} (z, ℓ)$ . We often refer to the situation $F_{\min} (z, ℓ) ⩾ F_{\max} (z, ℓ)$ as the monotone situation, while we term $F_{\min} (z, ℓ) < F_{\max} (z, ℓ)$ as the reversed preference situation.

From the first property given in the previous paragraph, it is clear that both the minimal and maximal extension yield a solution to the supervised ranking problem. Next, we will give some ‘interpolation’ schemes for which one can show that they all are valid solutions to the supervised ranking problem. We start with a very simple interpolation scheme, which is especially useful when the stochastic training data set is monotone.

Proposition 2.1 OSDL Let $(S, d)$ be a collection of learning examples with associated stochastic training data set $(S_{X}, \hat{F} *)$ . For an arbitrary but fixed real number S from the interval [0, 1], define the mapping ${\tilde{F}}_{OSDL}$ as follows: (7) $\begin{matrix} {\tilde{F}}_{OSDL} : X \times L & \to [0, 1] \\ (x, ℓ) & \mapsto {SF}_{\min} (x, ℓ) + (1 - S) F_{\max} (x, ℓ) . \end{matrix}$ For all elements x of $X$ , the mapping ${\tilde{F}}_{OSDL} (x, \cdot) : L \to [0, 1]$ is an element of $F_{cum} (L)$ . Moreover, the partial mapping $\begin{matrix} {\tilde{F}}_{OSDL} (\cdot, \cdot) : (X, ⩽_{X}) & \to (F_{cum} (L), ⪯_{FSD}) \\ x & \mapsto {\tilde{F}}_{OSDL} (x, \cdot) \end{matrix}$ is order preserving.

The only degree of freedom left in this proposition is the choice of the interpolation parameter S. We call the algorithm resulting from this proposition the Ordinal Stochastic Dominance Learner, or OSDL for short.

Let x denote an element of $X$ for which one would like to make a prediction. When $F_{\min} (x, ℓ)$ is very small (≈0) this means that $F_{\min} (x, \cdot)$ predicts a label strictly greater than ℓ for x. On the other hand, when $F_{\max} (x, ℓ)$ approaches 1, the maximal extension predicts a label at most ℓ for x. When both situations occur simultaneously, we are likely to end up in the reversed preference situation. When this is the case, the idea is to weigh F_min and F_max by the number of learning examples that support their respective cases. We define the following mappings: $\begin{matrix} N_{\min} : X \times L & \to N \\ (x, ℓ) & \mapsto |\{〈 y, d (y) 〉 \in (S, d) | y ⩽_{X} x \land d (y) >_{L} ℓ\}|, \end{matrix}$ counting the number of instances that indicate that x should receive a label strictly greater than ℓ, and $\begin{matrix} N_{\max} : X \times L & \to N \\ (x, ℓ) & \mapsto |\{〈 y, d (y) 〉 \in (S, d) | x ⩽_{X} y \land d (y) ⩽_{L} ℓ\}|, \end{matrix}$ which counts the number of instances indicating that x should receive a label at most ℓ. We have the following proposition.

Proposition 2.2 B-OSDL Proposition 2.1 is still valid if one replaces ${\tilde{F}}_{OSDL}$ by: (8) $\begin{matrix} {\tilde{F}}_{B - OSDL} : X \times L & \to [0, 1] \\ (x, ℓ) & \mapsto \{\begin{matrix} {SF}_{\min} (x, ℓ) + (1 - S) F_{\max} (x, ℓ) if F_{\min} (x, ℓ) ⩾ F_{\max} (x, ℓ), \\ \frac{N_{\min} (x, ℓ) F_{\min} (x, ℓ) + N_{\max} (x, ℓ) F_{\max} (x, ℓ)}{N_{\min} (x, ℓ) + N_{\max} (x, ℓ)} if F_{\min} (x, ℓ) < F_{\max} (x, ℓ) . \end{matrix} \end{matrix}$ In (8), we distinguish between the monotone situation, for which the interpolation parameter S is used, and the reversed preference situation for which the counting argument is triggered. The algorithm based on this second proposition is called Balanced OSDL, or B-OSDL for short. Note that this algorithm reduces to the OSDL when the stochastic training data set $(S_{X}, \hat{F} *)$ is monotone, since in this case the reversed preference situation will never occur.

It is possible to devise a counting argument for the monotone situation as well. We define the weight of F_min in the monotone situation as(9) $s (x, ℓ) = \{\begin{matrix} \frac{N_{\max} (x, ℓ)}{N_{\min} (x, ℓ) + N_{\max} (x, ℓ)} & if N_{\min} (x, ℓ) + N_{\max} (x, ℓ) \neq 0, \\ S & otherwise, \end{matrix}$ while the weight for F_min in the reversed preference situation, denoted $t (x, ℓ)$ , is defined as in the reversed preference situation of (8).

We now have the following proposition, with a resulting algorithm called B2-OSDL, or doubly balanced OSDL.

Proposition 2.3 B2-OSDL Proposition 2.1 is still valid if one replaces ${\tilde{F}}_{OSDL}$ by: (10) $\begin{matrix} {\tilde{F}}_{B 2 - OSDL} : X \times L & \to [0, 1] \\ (x, ℓ) & \mapsto \{\begin{matrix} s (x, ℓ) F_{\min} (x, ℓ) + (1 - s (x, ℓ)) F_{\max} (x, ℓ) if F_{\min} (x, ℓ) ⩾ F_{\max} (x, ℓ), \\ t (x, ℓ) F_{\min} (x, ℓ) + (1 - t (x, ℓ)) F_{\max} (x, ℓ) if F_{\min} (x, ℓ) < F_{\max} (x, ℓ), \end{matrix} \end{matrix}$ where $s (x, ℓ)$ and $t (x, ℓ)$ are defined in (9) and (8) respectively.

Finally, let us mention that sometimes the monotonicity constraints (1), (3) are relaxed somewhat in the sense that we only require it for elements x and y that are separated by a learning example. Stated otherwise, one only requires quasi-monotonicity w.r.t. the set $S_{X}$ [11,13]. If one does not want to make a distinction between the monotone and reversed preference situation, the weights for F_min and F_max should only depend on x (and not on ℓ). A simple way of achieving this is by counting the number of learning examples that are smaller or bigger than x. We define: $A_{\min} (x) = |{a \in S | a ⩽_{X} x}| and A_{\max} (x) = |{a \in S | x ⩽_{X} a}| .$ However, to avoid division by zero, we add, if not already present, the elements $〈 \min X, \min L 〉$ and $〈 \max X, \max L 〉$ to the collection of learning examples $(S, d)$ . These elements are very natural boundary conditions when trying to predict or reconstruct an order-preserving mapping from $X$ to $L$ (and this is exactly what the supervised ranking problem is). One now has the following proposition.

Proposition 2.4 Let $(S, d)$ be a collection of learning examples in which, if not already present, the elements $〈 \min X, \min L 〉$ and $〈 \max X, \max L 〉$ are added. The associated stochastic training data set is $(S_{X}, \hat{F} *)$ . The mapping ${\tilde{F}}_{QM - OSDL}$ is defined as follows: (11) $\begin{matrix} {\tilde{F}}_{QM - OSDL} : X \times L & \to [0, 1] \\ (x, ℓ) & \mapsto \frac{A_{\min} (x) F_{\min} (x, ℓ) + A_{\max} (x) F_{\max} (x, ℓ)}{A_{\min} (x) + A_{\max} (x)} . \end{matrix}$ For all elements x of $X$ the mapping ${\tilde{F}}_{QM - OSDL} (x, \cdot) : L \to [0, 1]$ belongs to $F_{cum} (L)$ . Moreover, the partial mapping $\begin{matrix} {\tilde{F}}_{QM - OSDL} (\cdot, \cdot) : (X, ⩽_{X}) & \to (F_{cum} (L), ⪯_{FSD}) \\ x & \mapsto {\tilde{F}}_{QM - OSDL} (x, \cdot) \end{matrix}$ is quasi-monotone w.r.t. $S_{X}$ .

3. Implementation of the propositions

3.1. Pseudo code of the two phases

Given a collection of learning examples $(S, d)$ , the first phase in any classification algorithm involves building the classifier. In the case of the various OSDL variants, this is not too difficult as it simply involves the estimation of the CDFs $\hat{F}$ . As the propositions assume the maximum likelihood estimation $\hat{F} *$ , this is what we will use in the implementation as well. The pseudo code of this first phase is given as the method buildClassifier in the top part of Algorithm 1. The fact that this first phase is quite simple explains why we have previously termed algorithms resulting from the aforementioned framework as instance-based.

In this code, a Map represents an implementation of an associative array as a hash table. An associative array maps or binds keys to values. In this case, the keys are elements of the data space $X$ , while the values are e.g. frequency counts of the labels, or elements of $F_{cum} (L)$ . We assume that the implementation of the map is such that retrieval and insertion of key/value pairs both have amortised constant time complexity (ignoring the time to compute the hash code).

In practice, we must attach a hash code to each element of $X$ . An arbitrary element of $X$ consists of a vector of criterion values, i.e. $x = (c_{1} (x), c_{2} (x), \dots, c_{n} (x))$ , where each c_i(x) is measured on an ordinal scale $X_{i}$ . Let $s_{i} : X_{i} \to \{0, 1, \dots, | X_{i} | - 1\}$ be the unique increasing mapping that identifies $X_{i}$ with the first $| X_{i} |$ natural numbers. For simplicity, we denote s_i(c_i(x)) as x_i. We now interpret x as a number represented in a mixed radix system [8, Section 4.1]: $[\begin{matrix} x_{n}, x_{n - 1}, \dots, x_{1} \\ | X_{n} |, | X_{n - 1} |, \dots, | X_{1} | \end{matrix}] \equiv x_{1} + x_{2} | X_{1} | + x_{3} | X_{1} | | X_{2} | + \dots + x_{n} | X_{1} | \dots | X_{n - 1} | = x_{1} + | X_{1} | (x_{2} + | X_{2} | (x_{3} + \dots + x_{n} | X_{n - 1} |)) .$ This yields an increasing bijection between $X$ and the set of natural numbers smaller than $| X |$ , and for each element of $X$ the corresponding hash code is calculated in $O$ (n) time, where n denotes the number of criteria.

We can now state the complexity of buildClassifier easily as $O (n | (S, d) |)$ . Indeed, the loop starting on the second line traverses $| (S, d) |$ elements, the get and replace methods each take $O$ (n) time (for the computation of the hash code), while the update method is either constant (when freqCount already exists) or of order $O$ (n) (otherwise). The loop starting on line 7 considers $| S_{X} |$ elements, and the conversion of a frequency count to a CDF takes $O$ (n), while putting the CDF in the map takes again $O$ (n) time. As $| S_{X} | ⩽ | (S, d) |$ the total time complexity of buildClassifier is $O (n | (S, d) |)$ . Any classifier which makes use of all learning examples presented, has at least this time complexity, as one has to consider each learning example at least once. So in this sense, the time to build the classifier is optimal. As for the space complexity, this is easily seen to be of order $O (n | S_{X} |)$ (ignoring the space required for storing the initial learning examples).

Once buildClassifier has been executed, a user generally wants to compute the PMF associated with some element x; this is the second phase of the algorithm. For the OSDL variant described in Proposition 2.1, the pseudo code for computing this PMF is given in the second part of Algorithm 1. One simply iterates over the map containing the estimated CDFs; if the element considered (i.e. the key of the map) is incomparable with x, one simply ignores it, otherwise one updates the CDFs F_min and/or F_max depending on whether the element considered is smaller and/or bigger than x. An upper bound for the time complexity of classifying an instance is easily seen to be $O (| S_{X} | | L | n)$ . (Here, we assume that comparing two elements takes $O (n)$ time, and that the methods takeMin and takeMax each take $O (| L |)$ time.)

We don’t give explicit pseudo code for the other, slightly more involved, versions of the OSDL algorithm, as in fact this code would be equally straightforward as the one just given. All that is needed are some extra counting arguments in the appropriate places, e.g. to calculate N_min and N_max; this is easily achieved without increasing the order of growth of the running time complexity.

Algorithm 1. The Ordinal Stochastic Dominance Learner
buildClassifier()
Require: a collection of learning examples $(S, d)$
Ensure: a Map containing $〈 x, \hat{F} * (x, \cdot) 〉$ pairs for the elements of $S_{X}$ .
1: map ← ∅; map_cdf ← ∅
2: for $〈 x, d (x) 〉 \in (S, d)$ do
3: $freqCount \leftarrow map . get (x)$
4: freqCount ← freqCount.update(d(x))
5: $map . replace (x, freqCount)$
6: end for
7: for $〈 x, freqCount (x) 〉 \in map$ do
8: $\hat{F} * \leftarrow freqCount (x) . toCumulativeDistribution ()$
9: $map_cdf . put (x, \hat{F} *)$
10: end for
11: return map_cdf
distributionForInstance()
Require: an element x of $X$ and S ∈ [0,1]
Ensure: a PMF over $L$ for x
1: F_min ← (1, 1, … , 1) // minimal CDF
2: F_max ← (0, 0, … , 1) // maximal CDF
3: for $〈 y, \hat{F} (y) 〉 \in map_cdf$ do
4: if $y ⩽_{X} x$ then
5: $F_{\min} \leftarrow takeMin (F_{\min}, \hat{F})$
6: end if
7: if $y ⩾_{X} x$ then
8: $F_{\max} \leftarrow takeMax (F_{\max}, \hat{F})$
9: end if
10: end for
11: F ← SF_min + (1 − S)F_max
12: return F.toProbabilityMassFunction()

3.2. Tuning the interpolation parameter S

The attentive reader will have noticed that nowhere in the preceding text have we given a hint as to how to choose the interpolation parameter S that turns up in the various algorithms. In order to lift the burden of making a (more or less arbitrary) choice for the parameter S from the shoulders of the user of the program, we have made an attempt to let the program devise a reasonable value for S. Of course, we did not want this feature to increase the running time of the program dramatically. To this end, we use a simple scheme that resembles the way in which weka determines the number of neighbours to use in its implementation of the nearest neighbour algorithm, the IBk-class. This scheme is based on leave-one-out cross validation [17, Section 5.4]. This is, for each possible value of S under consideration, we determine the performance of the algorithm for a specified loss function, typically the 0–1 loss function, using leave-one-out cross validation.

This is not as bad as it sounds at first sight since this can be done rather efficiently, as is shown in Algorithm 2. Basically, one considers the learning examples one by one and one classifies the learning example under consideration on the basis of the remaining learning examples. One does this for values of S that are equally spaced in the interval [S_l, S_u]. The time complexity of this procedure is determined as follows: building the classifier takes $O (n | S, d |)$ time. The operation that has to be done for each element takes $O (| S_{X} | | L | n + N | L |)$ time. Basically, this involves two calls to distributionForInstance and a loop calculating the loss between d(x) and the median prediction. Computing the median takes $O (| L |)$ time. So, if we ignore the time to build the classifier (this has to be done anyhow), the time complexity of tuning the interpolation parameter is $O (| (S, d) | | S_{X} | | L | n + | (S, d) | | L | N)$ .

Algorithm 2. Tuning the interpolation parameter S

Require: (1) a collection of learning examples

(S, d)

;

(2) a lower bound S_l and an upper bound S_u for S;

(3) a natural number N determining the subdivision of [S_l, S_r];

(4) a loss function L

Ensure: the tuned value for S

1: map ← buildClassifier

((S, d))

2: step ← (S_u − S_l)/N

3: loss ← (0, 0, … , 0) // length N + 1

4: for all

〈 x, d (x) 〉 \in (S, d)

map \leftarrow map . updateRemove (〈 x, d (x) 〉)

f_{\min} \leftarrow map . distributionForInstance (x, 1)

f_{\max} \leftarrow map . distributionForInstance (x, 0)

8: S ← S_l

9: for i = 0 to N do

10: loss[i] ← loss[i] + L(d(x),median(Sf_min + (1 − S)f_max))

11: S ← S + step

12: end for

13:

map \leftarrow map . updateAdd (〈 x, d (x) 〉)

14: end for

15: i ← findIndexOfMin(loss)

16: return S_l + i × step

4. Use of the software

The described algorithms are available as a Java-implementation that is distributed under the GNU General Public License [1]. To facilitate the usage and dissemination of the algorithms, we have chosen to develop an implementation that fits seamlessly into the weka environment. The weka-workbench [17], also distributed under the terms of the GNU GPL, is a very comprehensive collection of machine learning algorithms and data preprocessing tools, written in the Java programming language. A major advantage of using Java as language is that it is platform independent (one only needs a JVM for the goal platform), and the fact that it is open source allows one to learn how actual classifiers are implemented (and the provided sources may serve as a basis for implementing your own classifiers).

For a user only interested in employing machine learning algorithms, weka offers various graphical user interfaces, such as the “Explorer” and the “Knowledge Flow” interface, both designed for working conveniently with filters and classifiers, and the ”Experimenter” interface, designed for conducting experiments regarding the performance of different classifiers on a range of data sets in a quite automated manner. Since our implementation of the OSDL variants adheres to all the conventions imposed on weka classifiers, it is very easy to make them available as choices in the various menus of these interfaces, as is shown in Fig. 1. One can also use the various algorithms (including the OSDL algorithm) from the command line.

Fig. 1: Snapshot of weka showing the available algorithms for supervised ranking.

As can be seen in Fig. 1, there also reside two other classes in theweka.classifiers.monotone package, namely MinMaxExtension and OLM. These classes implement other solutions to the supervised ranking problem. As the name indicates, MinMaxExtension implements the minimal and maximal extension, two very simple solutions to the supervised ranking problem, which are in fact non-stochastic analogues of the minimal and maximal extensions F_min and F_max introduced earlier. The class OLM is an implementation of (some variants of) the Ordinal Learning Method [6,5], the first algorithm devised especially for supervised ranking.

After obtaining and installing weka and the provided implementation of the OSDL algorithm [2], a typical first usage using the command line interface would be as follows:

java weka.classifiers.monotone.OSDL

This will display an error message because no training file was given, but, more interestingly, it will also show the various options available when using the OSDL class. We ignore the options common to all weka-classifiers, but we do give an overview of the options specific to the OSDL class together with their meaning. These can be found in Table 1.

Table 1: Options specific for the OSDL class.

-C	〈classification type〉
	Sets the classification type to be used (default: MED)
	Valid classification types are: REG, WSUM, MAX, MED, RMED
-B	Use the balanced version of the Ordinal Stochastic Dominance Learner
-W	Use the weighted version of the Ordinal Stochastic Dominance Learner
-S	〈value of interpolation parameter〉
	Sets the value of the interpolation parameter (default: 0.5)
-L	〈Lower bound for interpolation parameter〉
	Lower bound for the interpolation parameter (default: 0)
-U	〈Upper bound for interpolation parameter〉
	Upper bound for the interpolation parameter (default: 1)
-P	〈Number of parts〉
	Determines the step size for tuning the interpolation parameter, nl. (U-L)/P
-D	Outputs debugging info (default: false)

The -C option sets the way in which a single label is deduced from the PMF generated by the algorithm. The value MED is the most important one, as this uses the median, which is valid on ordinal scales and on top of that ensures monotonicity. More in particular, MED chooses the midpoint (or the biggest of the two midpoints) of the set of medians. The value RMED does the same, but does not “round” the result to the closest label, so “halfway” labels are possible, i.e. when the set of medians consists of an even number of labels, an intermediate label is produced. When one chooses MAX, the Bayesian decision is made, which is valid on ordinal scales, but which may break the monotonicity property. Hence, choosing MAX does not guarantee a valid solution to the supervised ranking problem! Finally, REG and WSUM bot use the mean to deduce a label from the PMF, the first choice does no rounding, while the second choice does. The scores used for the calculation of the mean are equidistant, as they are the internal weka values for the labels. Although ensuring monotonicity, these last two choices are not semantically valid on intrinsically ordinal scales.

When the switch -B is set, one uses B-OSDL; when the switch -W is set, the quasi-monotone version of OSDL, described in Proposition 2.4 is triggered. When both are set simultaneously, the doubly balanced OSDL algorithm, B2-OSDL, is used.

The meaning of the -S option should be clear; it simply sets the interpolation parameter S that is used in the OSDL variants (if the quasi-monotone variant is used, this option is simply ignored). The following three options -L, -U and -P, set the values of S_l, S_u and N in Algorithm 2, respectively. The switch -D finally does as it says in the description; when it is set some extra debugging information is shown on the console.

A basic use is thus the following:

java weka.classifiers.monotone.OSDL -t car.arff -C MED -S 0.5

This applies the OSDL algorithm to the data set car.arff, which is the “Car Evaluation Database” from the UCI Machine Learning Repository [3] put into arff format, with interpolation parameter 1/2 and with the median as selection criterion. In Fig. 2, the output of this command is shown. There, one sees that the resubstitution error on the original data is 1.2153%. This means that this data set is quite monotone, but that there are indeed couples of reversed preference present in the data. (By construction, the data set does not contain doubt and a property of OSDL is that on a data set containing no doubt nor reversed preference, the resubstitution error is zero.) An estimate for the error on an independent test set was made using 10-fold stratified cross validation and in this case it is 3.8194%. Also, some other statistics are shown, together with two confusion matrices.

Fig. 2: Output from the command java weka.classifiers.monotone.OSDL -t car.arff -C MED -S 0.5.

Finally, as a last example, in Fig. 3 a snapshot of the “Analyse” tab of the Experiment interface is shown. In fact, it is the result of running the input of Fig. 1. It shows the average percentage of correctly classified instances over 10 runs of 10-fold cross validation. We used six instantiations of the algorithms given using the four propositions, and for the first two propositions, we also use the option that allows tuning of the interpolation parameter. One sees that the six instantiations give slightly different results, some of them being statistically different for a significance level of 5%.

Fig. 3: Screen shot showing the results of running six variants of the OSDL algorithm on the dataset car.arff.

5. Conclusion

In this article, we have stated the supervised ranking problem, focusing on the differences w.r.t. the traditional classification problem. We have given, without proof, four propositions that all provide solutions to the supervised ranking problem. Three of the propositions distinguish between the monotone situation and the reversed preference situation. An alternative approach is to relabel the data set so that it becomes monotone; in that case only the basic Ordinal Stochastic Dominance Learner is of use. For more information, we refer to [14,15,16].

Our main objective, however, was to introduce a software implementation of these solutions that can be used inside the weka environment. We have shown how to get started using this software, and we have given snapshots of weka outputs that show that indeed the implementation fits into the framework. The software was a key to a successful application in group decision making [18,19,20].

In the near future, we plan to undertake an experimental study that compares the newly developed algorithms for supervised ranking with existing ones and with traditional classification algorithms that do not guarantee the monotonicity of their solutions. The weka environment will be used to carry out this experimental study.

References

- The GNU General Public License. <http://www.gnu.org>.
- View reference in sidebar
- Software for supervised ranking. <http://www.kermit.ugent.be>.
- View reference in sidebar
- UCI repository of machine learning databases. <http://www.ics.uci.edu/mlearn/MLRepository.html>.
- View reference in sidebar
- Weka 3: Data Mining Software in Java. <http://www.cs.waikato.ac.nz/ml/weka>.
- View reference in sidebar
Ben-David, A.
- Automatic generation of symbolic multiattribute ordinal knowledge-based DSSs: methodology and applications
- Decision Sciences, 23, 1357-1372 (1992)
- Full Text via CrossRef | View Record in Scopus | Cited By in Scopus (?!?)
- View reference in sidebar
Ben-David, A., Sterling, L., Pao, Y.H.
- Learning and classification of monotonic ordinal concepts
- Computational Intelligence, 5, 45-49 (1989)
- Full Text via CrossRef | View Record in Scopus | Cited By in Scopus (?!?)
- View reference in sidebar
Bouyssou, D.
- Building criteria: a prerequisite for MCDA
- Readings in Multiple Criteria Decision Aid, 58-80 (1990)
- Full Text via CrossRef | View Record in Scopus | Cited By in Scopus (?!?)
- View reference in sidebar
Knuth, D.E.
- Seminumerical Algorithms
- The Art of Computer Programming, vol. 2 (1969)
- Full Text via CrossRef | View Record in Scopus | Cited By in Scopus (?!?)
- View reference in sidebar
Levy, H.
- Stochastic Dominance
- Kluwer Academic Publishers, MA (1998)
- Full Text via CrossRef | View Record in Scopus | Cited By in Scopus (?!?)
- View reference in sidebar
Lievens, S., De Baets, B., Cao-Van, K.
- A probabilistic framework for the design of instance-based supervised ranking algorithms in an ordinal setting
- Annals of Operations Research, 80, 105-136 (2008)
- Full Text via CrossRef | View Record in Scopus | Cited By in Scopus (?!?)
- View reference in sidebar
Makino, K., Suda, T., Ono, H., Ibaraki, T.
- Data analysis by positive decision trees
- IEICE Transactions on Information and Systems, E82-D, 76-88 (1999)
- Full Text via CrossRef | View Record in Scopus | Cited By in Scopus (?!?)
- View reference in sidebar
Ogryczak, W., Ruszczyński, A.
- Dual stochastic dominance and related mean-risk models
- SIAM Journal of Optimization, 13, 60-78 (2002)
- Full Text via CrossRef | View Record in Scopus | Cited By in Scopus (?!?)
- View reference in sidebar
- R. Potharst, J.C. Bioch, Quasi-monotone decision trees for ordinal classification, in: F. Verdenius, W. van den Broek (Eds.), Proceedings of the Eighth Belgian-Dutch Conference on Machine Learning, 1998, pp. 122–131.
- Full Text via CrossRef | View Record in Scopus | Cited By in Scopus (?!?)
- View reference in sidebar
- M. Rademaker, B. De Baets, Optimal restoration of stochastic monotonicity w.r.t. cumulative label frequency loss functions, Information Sciences, submitted for publication.
- Full Text via CrossRef | View Record in Scopus | Cited By in Scopus (?!?)
- View reference in sidebar
Rademaker, M., De Baets, B., De Meyer, H.
- Loss optimal monotone relabelling of noisy multi-criteria data sets
- Information Sciences, 179, 4089-4096 (2009)
- Full Text via CrossRef | View Record in Scopus | Cited By in Scopus (?!?)
- View reference in sidebar
- M. Rademaker, B. De Baets, H. De Meyer, Optimal monotone relabelling of partially non-monotone ordinal data, Optimization Methods and Software, submitted for publication, doi:10.1080/10556788.2010.507272.
- Full Text via CrossRef | View Record in Scopus | Cited By in Scopus (?!?)
- View reference in sidebar
Witten, I.H., Frank, E.
- Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
- Morgann Kaufmann (2005)
- Full Text via CrossRef | View Record in Scopus | Cited By in Scopus (?!?)
- View reference in sidebar
Zendehdel, K., Rademaker, M., De Baets, B., Van Huylenbroeck, G.
- Qualitative valuation of environmental criteria through a group consensus based on stochastic dominance
- Ecological Economics, 67, 253-264 (2008)
- Full Text via CrossRef | View Record in Scopus | Cited By in Scopus (?!?)
- View reference in sidebar
Zendehdel, K., Rademaker, M., De Baets, B., Van Huylenbroeck, G.
- Improving tractability of group decision making on environmental problems through the use of social intensities of preferences
- Environmental Modelling and Software, 24, 1457-1466 (2009)
- Full Text via CrossRef | View Record in Scopus | Cited By in Scopus (?!?)
- View reference in sidebar
Zendehdel, K., Rademaker, M., De Baets, B., Van Huylenbroeck, G.
- Environmental decision making with conflicting social groups: a case study of the Lar rangeland in Iran
- Journal of Arid Environments, 74, 394-402 (2010)
- Full Text via CrossRef | View Record in Scopus | Cited By in Scopus (?!?)
- View reference in sidebar

Outline