Proceedings of the 37th Annual Hawaii International Conference on System Sciences, 2004

Abstract

Active learning is a learning paradigm that actively acquires extra information, making an "effort" for a certain "gain", when building learning models. This paper unifies effort and gain by studying active learning in the cost-sensitive framework. A major advantage of studying active cost-sensitive learning is that it aims directly at the business goal of minimizing the total cost; the potential applications of the proposed methods are therefore significant. We first study a simple random active learner that "buys" additional examples at random in order to reduce the total cost of example acquisition and future misclassifications. We then propose a novel pool-based cost-sensitive active learner that "buys" labels of unlabeled examples in a pool. We evaluate our new cost-sensitive active learning algorithms and compare them to previous active cost-sensitive learning methods. Experimental results show that our pool-based cost-sensitive active learner requires fewer examples yet produces a smaller total cost than the previous methods.

1.   Introduction

Active learning, broadly speaking, is a learning paradigm in which the learner actively acquires extra information, making an effort for a certain gain in building learning models. The extra information is acquired from an oracle; it could be a missing feature value of a labeled example, or a class label for an unlabeled example. The learning program determines which information, once acquired, will improve the performance (the "gain") of the learning model the most. Active learning is widespread in building predictive models for real-world applications, for example, medical diagnosis. To build an automatic medical diagnosis system that classifies X-rays, we must first construct a training set containing a series of X-rays and their diagnosis results (positive or negative). Obviously, it takes doctors time to label each X-ray accurately as positive or negative. Given a large number of X-rays, we need to determine which X-rays should be diagnosed, and how many. If we make a small effort and obtain only a few diagnoses, the learning model built from these cases will perform poorly; if we make a great effort and obtain a large number of diagnoses, the model will perform much better. Thus, we need to balance the two factors: "efforts" and "gains".

Most previous work on active learning treats the "efforts" and "gains" in different forms and does not unify the two. For example, in terms of "efforts", the traditional "pool-based" active learner can select examples to label from a pool of unlabeled examples, while active learners with "feature-value acquisition" may acquire values of selected missing features in training or test examples. Similarly, the "gain" from the effort of active learning also varies: accuracy, and often learning curves with accuracy, have been used to show the usefulness (speedup) of active learning, while other work has explored the "deficiency" of active learning. Indeed, Baram et al. [1] note that "there is no consensus on appropriate performance measures for active learning".

A few recent works [8][9] unify the two aspects (effort and gain) of active learning by studying active learning in the cost-sensitive framework (which itself has been extensively studied). That is, both efforts and gains are quantified as costs on the same scale (such as dollars), and active learning becomes the minimization of the total cost. This study further extends the traditional active learning literature into the cost-sensitive framework. We propose a novel method based on the version space for selecting examples to label, and adopt the total cost of example acquisition and misclassification as the evaluation criterion of active learning.

Studying active learning in the cost-sensitive framework has many unique and important advantages. First of all, accuracy (as a performance measure) may not be appropriate in many real-world applications, as different kinds of errors (such as false positives and false negatives in binary classification) can have very different costs. Second, the performance of a learning program can be evaluated by many different measures: accuracy, recall, computation time, and so on. However, as long as they can be converted to costs, they can be included in the total cost of the learning system. Third, as active learning is converted to a minimization problem, many optimization techniques (such as gradient descent) can be used. Finally, and most importantly, the ultimate goal of most real-world applications is to minimize the total cost (or maximize the total profit). Studying active learning in the cost-sensitive framework aims at this business goal directly; the potential applications of the proposed methods are significant.

It is true that it may be difficult to provide exact costs for the efforts and gains. However, as "efforts" and "gains" are two opposing measures, any tractable optimization combining the two has to make some explicit or implicit assumptions about the relative magnitudes of efforts and gains. Even abstract terms such as time, love, and life may be assigned dollar values [3]. In addition, we believe that exact cost values are unnecessary: we will show that small variations in cost assignments do not affect the results much. We do make one simplifying assumption in this work: the total cost is simply the sum of the two (efforts and gains) in active learning. More complicated forms of cost aggregation will be considered in the future.

In this paper we assume that the extra examples or labels of examples to be acquired are the “effort”, and that reduction in misclassification costs in prediction is the “gain” in active learning. The active learner must attempt to strike a “tradeoff” between the efforts and gains by minimizing the total cost of acquiring examples and misclassifications. No other costs (such as computational resources, which are a part of the future work) are considered in this work.

More specifically, we first study a simple case of active learning in the cost-sensitive framework: the random active learner, which may "buy" (acquire) additional random examples, with labels, at a cost. That is, the learner can decide whether to buy extra training examples drawn at random. At first glance, acquiring training examples at random is usually regarded as "passive learning" rather than active learning. However, it can be regarded as a weak form of active learning in which the learner decides whether or not to buy extra examples. Studying this simple case also sets up a baseline against which the pool-based active learning algorithm can be compared.

Next we explore the standard pool-based active learning in the cost-sensitive framework. That is, the learner is given a pool of unlabeled examples, and it may “buy” labels of the selected unlabeled examples in the pool. We propose a novel active learning technique which selects unlabeled examples that can split and reduce the version space [11] most rapidly.

Applications of these two types of active learning are widespread, such as image classification and webpage categorization. An active learner for image classification can determine the right number of extra training examples (images) to buy or collect to reduce the total cost. In webpage categorization, the web pages found by a search engine form a pool of unlabeled examples, and the active learner can select a small number of "critical" pages to be labeled (at the cost of the editors' time) in order to minimize the total cost to the search engine company.

2.   Review of Previous Work

The most popular type of active learning is "pool-based" active learning: a pool of unlabeled examples is given, and the learner can choose which ones to label during learning. Many works on pool-based active learning have been published in recent years, including, for instance, [16][19][13][14][1].1 However, almost all of these previous methods are evaluated by accuracy (or learning curves with accuracy) and the "deficiency" of active learning [1], as well as the number of examples acquired. Margineantu [9] published a short paper on a pool-based active cost-sensitive learner (called Active-CSL). Its goal is to choose the unlabeled example to label that provides the greatest expected gain in terms of the total cost (labeling and misclassification cost). More specifically, Active-CSL estimates the gain by assigning different class labels to an unlabeled example. For each assigned class label, Active-CSL adds the example to the training data to build a new model, and then uses this model to estimate the total cost under the given (mis)classification costs. After all possible class labels have been tried, Active-CSL sums the costs obtained under each potential class label and chooses the unlabeled example whose labeling produces the maximum gain. We propose a novel, different method for selecting examples to label (based on the version space) and a new evaluation criterion (based on the actual cost reduction). We compare our pool-based active learner with the random active learner and Active-CSL in Section 5.2.
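The Active-CSL selection step described above can be sketched as follows. This is our reading of [9], not its actual implementation; `train_model` and `estimate_total_cost` are hypothetical hooks standing in for model induction and total-cost estimation.

```python
# Hedged sketch of the Active-CSL selection step: for each unlabeled example,
# try every possible label, rebuild the model, estimate the resulting cost,
# and pick the example whose labeling promises the largest expected gain.

def select_example_active_csl(labeled, unlabeled, classes,
                              train_model, estimate_total_cost):
    """Pick the unlabeled example whose labeling promises the largest gain."""
    base_cost = estimate_total_cost(train_model(labeled))
    best_gain, best_example = float('-inf'), None
    for x in unlabeled:
        # Estimate the cost under each possible label for x, then average.
        cost_sum = 0.0
        for y in classes:
            model = train_model(labeled + [(x, y)])
            cost_sum += estimate_total_cost(model)
        gain = base_cost - cost_sum / len(classes)
        if gain > best_gain:
            best_gain, best_example = gain, x
    return best_example
```

With toy hooks (a "model" that is just the data, and a cost that shrinks as more informative examples arrive), the example promising the largest cost drop is selected.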

Cost-sensitive learning is also an active research topic in recent years. Turney [18] gives an excellent survey on a variety of costs that may be considered in learning. Most previous works on cost-sensitive learning only consider minimizing misclassification costs (e.g., [17][5][6][15]). However, the learning algorithms proposed are not active learners. That is, they assume that a fixed set of training examples is given, and the learners cannot acquire additional information during learning.

Our active cost-sensitive learning algorithms (presented in Sections 3 and 4) use the cost-sensitive decision tree (CSDT for short) as the base learning algorithm; we briefly review it here. CSDT [8] is similar to C4.5 [12], but it uses the total cost of misclassifications, instead of entropy, as the attribute split criterion. (The original CSDT also considers attribute costs, which we do not have; they are set to zero here.) At each step, CSDT chooses the attribute from the available attribute set with the maximum cost reduction on the training data, analogous to maximum entropy reduction, to build decision trees. The cost reduction (E − E_A) is the difference between the expected cost E before splitting and the sum of the expected costs E_A of the branches after splitting with attribute A. The attribute with the largest cost reduction, if it is greater than 0, is chosen to split the training data (otherwise it is not worth building the branch further, and a leaf is formed). The same procedure is applied recursively to build subtrees. However, CSDT and C4.5 assume that all examples are available to build the tree.
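The split criterion above can be sketched in a few lines. This is a minimal illustration, assuming binary classes, the FP/FN costs used later in Section 5, and zero cost for correct classification; it is not the paper's actual CSDT implementation.

```python
# Sketch of CSDT's cost-reduction split criterion (assumption: binary labels
# 1/0, misclassification costs FP/FN, zero cost for correct classification).

def leaf_cost(pos, neg, FP=2000, FN=6000):
    """Minimum misclassification cost of labeling a leaf positive or negative."""
    cost_if_positive = neg * FP   # negatives in a positive leaf are false positives
    cost_if_negative = pos * FN   # positives in a negative leaf are false negatives
    return min(cost_if_positive, cost_if_negative)

def cost_reduction(examples, attribute, FP=2000, FN=6000):
    """E - sum of E_A: the expected-cost drop from splitting on `attribute`.
    `examples` is a list of (feature_dict, label) pairs."""
    pos = sum(lbl for _, lbl in examples)
    neg = len(examples) - pos
    before = leaf_cost(pos, neg, FP, FN)
    after = 0
    for v in {feats[attribute] for feats, _ in examples}:
        branch = [(f, l) for f, l in examples if f[attribute] == v]
        bpos = sum(l for _, l in branch)
        after += leaf_cost(bpos, len(branch) - bpos, FP, FN)
    return before - after  # split only if this is > 0
```

An attribute that perfectly separates the classes removes the entire pre-split cost, so its reduction equals the cost of the unsplit node.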

3.   Random Cost-Sensitive Active Learning

In this section we describe our random active learner, called RanCA, which uses cost-sensitive decision trees as the base learner, to establish the evaluation criterion and the baseline for pool-based active learning.

3.1. RanCA with Cost-Sensitive Decision Trees

The general idea of RanCA is quite simple, and it can use any cost-sensitive learning algorithm as the base learner, as long as it produces probability estimates with its predictions; we use the cost-sensitive decision tree CSDT reviewed in Section 2. RanCA is an incremental learner: it gradually acquires random examples at a cost while monitoring an evaluation criterion (see Section 3.2) until the criterion is met. The evaluation computes the sum of the acquisition cost and the misclassification cost on test examples to detect when the total reaches its minimum.

More specifically, at each step RanCA acquires m examples at random (adding one example at a time often has too little effect, so m can be greater than 1) and includes them in the training set. A new cost-sensitive decision tree is then built from the expanded training set and evaluated to see whether the total cost of example acquisition and misclassification on test examples has decreased. If it has, the process repeats (more examples are acquired); if not, the learner stops acquiring examples and the current decision tree is produced.
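The loop just described can be sketched as follows. The hooks `acquire_random`, `build_csdt`, and `evaluate_total_cost` are hypothetical stand-ins for example acquisition, CSDT induction, and the LOO-based evaluation of Section 3.2; this is a simplified version without the look-ahead refinement added later.

```python
# Minimal sketch of the RanCA loop: buy m random examples per step until
# the total cost (acquisition + misclassification) stops decreasing.

def ranca(training, acquire_random, build_csdt, evaluate_total_cost, m=10):
    """Return the tree built at the last cost-improving step."""
    tree = build_csdt(training)
    best_cost = evaluate_total_cost(tree, n_acquired=len(training))
    while True:
        candidate = training + acquire_random(m)      # buy m examples at a cost
        new_tree = build_csdt(candidate)
        cost = evaluate_total_cost(new_tree, n_acquired=len(candidate))
        if cost >= best_cost:                         # no improvement: stop
            return tree
        training, tree, best_cost = candidate, new_tree, cost
```

With stub hooks whose total cost is minimized at 30 acquired examples, the loop stops exactly there.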

Note that even though RanCA calls for an evaluation for each batch of examples acquired, it is more "conservative" and accurate than estimating the number of examples needed in advance, as it directly uses the actual cost obtained so far. In addition, the base learner is an efficient decision tree algorithm, and computational resources are not part of the cost in this paper.

We will discuss the evaluation method in the next subsection.

3.2. Evaluation Methods

To evaluate the trees built by the random (and later pool-based) active learners on future test performance, we can hold out a (small) part of the training examples as (future) test examples; these are not used in training. However, from the active learning perspective, which attempts to minimize the total cost of acquiring training examples and future misclassifications, holding out examples for testing excludes them from model building, wasting some of the acquired examples. To reduce this waste, we use leave-one-out cross-validation (LOO for short) to evaluate the learned model, so only one example is "wasted" at a time. As the decision tree learning algorithm is quite efficient, LOO does not pose a major concern in computation time. For example, each iteration of example acquisition with LOO on a typical dataset (such as Thyroid; see Section 5) takes only 71 seconds on average on a modern PC. (Further, the computational cost is not part of the total cost in this paper.)
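The LOO evaluation can be sketched as follows; `build_csdt` and `expected_cost_of_example` are hypothetical hooks for tree induction and the per-example cost estimate derived below.

```python
# Sketch of leave-one-out estimation of the average misclassification cost
# of one future test example: each acquired example is held out once.

def loo_misclassification_cost(training, build_csdt, expected_cost_of_example):
    """Average expected misclassification cost of one held-out example."""
    total = 0.0
    for i in range(len(training)):
        held_out = training[i]
        rest = training[:i] + training[i + 1:]   # only one example "wasted"
        tree = build_csdt(rest)
        total += expected_cost_of_example(tree, held_out)
    return total / len(training)
```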

The following procedure describes how to estimate the misclassification cost of one test example (in LOO). For binary classification (used in this paper), we use the following notation: TP and TN are the costs of a true positive and a true negative; FP and FN are the costs of a false positive and a false negative; tp and fn are the numbers of positive examples in a leaf (classified correctly and incorrectly, respectively), and tn and fp are the numbers of negative examples (classified correctly and incorrectly, respectively). TP and TN are usually non-positive values (benefits of correct classification, usually set to 0), while FP and FN are positive values (misclassification costs). For a leaf of the cost-sensitive decision tree, CP = tp×TP + fp×FP is the total misclassification cost if the leaf is labeled positive, and CN = tn×TN + fn×FN is the total cost if it is labeled negative. The probability of a leaf being positive is then estimated from the relative magnitudes of CP and CN: the smaller the cost, the larger the probability (as minimum cost is sought). Thus the probability of the leaf being positive is 1 − CP/(CP+CN) = CN/(CP+CN), and the probability of the leaf being negative is CP/(CP+CN). However, these probabilities are not used directly in estimating misclassification costs, because the number of training examples in a leaf is usually very small, especially at the beginning of example acquisition. To reduce the effect of extreme probability estimates, we apply the Laplace correction [7][4] to smooth the probabilities in leaves, modifying the original accuracy-based Laplace correction for misclassification costs. The original Laplace correction for accuracy can be expressed as (n_C + 1)/(N + n), where n_C is the number of examples belonging to class C, N is the number of training examples, and n is the number of classes. As we now consider misclassification costs, the Laplace-corrected probability of a leaf being positive is (CN + λ)/(CN + CP + k), where λ = |FP − FN| and k = FP + FN; similarly, the probability of a leaf being negative is (CP + λ)/(CP + CN + k). Thus, the expected misclassification cost of a true negative example is (CN + λ)/(CP + CN + k) × FP, and the expected misclassification cost of a true positive example is (CP + λ)/(CP + CN + k) × FN.
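The Laplace-corrected leaf probabilities and expected costs above can be sketched directly; this is an illustration under the cost values used in Section 5 (FP = 2000, FN = 6000, TP = TN = 0), with λ = |FP − FN| and k = FP + FN as defined above.

```python
# Sketch of the cost-based Laplace-corrected leaf probabilities
# (assumption: binary classes, lam = |FP - FN|, k = FP + FN).

def leaf_probabilities(tp, fp, tn, fn, TP=0, TN=0, FP=2000, FN=6000):
    """Return (P(leaf positive), P(leaf negative)) with Laplace correction."""
    CP = tp * TP + fp * FP        # total cost of labeling the leaf positive
    CN = tn * TN + fn * FN        # total cost of labeling the leaf negative
    lam, k = abs(FP - FN), FP + FN
    p_pos = (CN + lam) / (CN + CP + k)   # cheaper positive labeling -> larger p_pos
    p_neg = (CP + lam) / (CP + CN + k)
    return p_pos, p_neg

def expected_misclassification_cost(p_pos, p_neg, label, FP=2000, FN=6000):
    """Expected cost of one test example in this leaf (label: 1 = positive)."""
    return p_pos * FP if label == 0 else p_neg * FN
```

Note how an empty leaf yields 0.5/0.5 (maximally smoothed), while a leaf containing one false positive tilts toward the negative label.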

The next question is how to integrate the misclassification cost of one test example (obtained above) with the cost of the training examples acquired. A simple sum of the two, as in [9], would not be reasonable, as it depends on how many future test examples (i.e., how often) the model will be used to predict. Clearly, if the model will be used rarely (only once or twice), we can reduce the total cost by building a rough model from only a few acquired examples. On the other hand, if the model will be used very frequently (for instance, millions of times), it is worthwhile to acquire more examples and build a highly accurate model with very low misclassification cost. As we may not know the number of future test examples during model building, we introduce a variable t to represent it. The total cost on test examples is then the misclassification cost of one test example, obtained by LOO, multiplied by t, and the overall total cost is simply the sum of this and the cost of the examples acquired so far.

The next question is how to use the total cost as a stopping criterion for the incremental cost-sensitive active learners. Ideally, we expect the total cost to decrease initially, reach a minimum, and then go up, because the gain in reduced misclassification cost tails off as more and more examples or labels are acquired (at a constant cost each). Thus, we can obtain a learning curve of the total cost against the number of examples acquired (see Figure 2). If the curve is smooth, the learning algorithm has indeed found the optimal number of training examples. However, as we will see in Section 5, the total-cost curve may not be smooth, and a local minimum may not be the global one. It is therefore necessary to "look ahead" and acquire a few more (units of) examples to ensure that a local minimum is the global one. The extra examples used in LOO and look-ahead are extra (or wasted) on top of the minimum total cost found. We describe the look-ahead strategy, obtained empirically, in Section 5.1.
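The look-ahead stopping rule can be sketched as follows; the window of 2 units is an assumption taken from the empirical finding reported in Section 5.1, and `costs_so_far_fn` is a hypothetical hook returning the total cost after i units have been acquired.

```python
# Sketch of the look-ahead stopping rule: keep acquiring units past a local
# minimum until `lookahead` consecutive units fail to improve the total cost.

def find_minimum_with_lookahead(costs_so_far_fn, lookahead=2):
    """Walk the total-cost curve; return (best unit index, best cost)."""
    best_i, best_cost = 0, costs_so_far_fn(0)
    i = 0
    while i - best_i < lookahead:          # keep looking past a local minimum
        i += 1
        cost = costs_so_far_fn(i)          # acquire one more unit, re-evaluate
        if cost < best_cost:
            best_i, best_cost = i, cost
    return best_i, best_cost
```

On a bumpy curve such as 10, 8, 9, 7, 8, 9, 10, a look-ahead of 2 correctly skips the local bump at index 2 and stops at the global minimum.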

4.   Pool-based Cost-Sensitive Active Learning

In this section we propose a novel pool-based cost-sensitive active learner, called PoolCA, which selects examples from a pool of unlabeled examples for labeling at a cost. The basic framework of PoolCA is the same as RanCA's: it is also an incremental learner that uses the same evaluation criterion. However, PoolCA chooses the example(s) to be labeled based on the version space [11], which consists of all hypotheses (trees) consistent with the labeled examples acquired so far. More specifically, PoolCA attempts to split (and reduce) the version space by half, as rapidly as possible, as examples are labeled. The basic idea has been proposed and implemented with SVMs [16] but not with decision trees. The main advantage of using decision trees is the much higher efficiency of tree building compared to SVM learning.

For decision trees, the version space is the set of all possible cost-sensitive decision trees built on the available (labeled) training examples. However, as the version space may contain a very large number of trees, we “sample” the version space by building a small committee of trees. In this paper, we build 50 trees as a committee of trees to represent the version space.

To make the committee representative of small cost-sensitive trees, CSDT is modified to include a random component in attribute selection. While the original CSDT chooses the attribute with the maximum cost reduction deterministically, the modified version chooses an attribute with probability proportional to its cost reduction (at every stage of tree building). This way, less favorable attributes may still be chosen, and a set (or committee) of different but small cost-sensitive decision trees is built from the same training set.
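The randomized attribute selection above can be sketched as weighted roulette-wheel sampling over the cost reductions; this is an illustration assuming non-negative reductions, not the paper's actual implementation.

```python
# Sketch of randomized attribute selection: pick an attribute with
# probability proportional to its cost reduction (assumed non-negative).

import random

def pick_attribute(reductions, rng=random):
    """`reductions` maps attribute name -> cost reduction; returns one name."""
    attrs = list(reductions)
    weights = [reductions[a] for a in attrs]
    total = sum(weights)
    if total <= 0:
        return None                     # no worthwhile split: form a leaf
    r = rng.uniform(0, total)           # roulette-wheel selection
    for a, w in zip(attrs, weights):
        r -= w
        if r <= 0:
            return a
    return attrs[-1]                    # guard against floating-point drift
```

When all reductions are zero the node becomes a leaf, matching CSDT's "not worth splitting" rule; a lone attribute with positive reduction is always chosen.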

After the committee of 50 trees is built, PoolCA uses them to predict the class label (with probability estimates) of each unlabeled example. Each unlabeled example is then assigned an uncertainty score based on the predictions of the 50 trees. Specifically, if P_i is the probability of class 1 estimated by the i-th tree, the committee-averaged probability of class 1 for this example is P = (Σ P_i)/50, and the uncertainty score is 1 − max{P, 1 − P}. Thus, when P is 0.5, the committee's prediction of the example is most uncertain and the uncertainty score is largest.
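The uncertainty score is a one-line computation; the sketch below takes the per-tree probability estimates as a plain list.

```python
# The committee-based uncertainty score: 1 - max(P, 1 - P), where P is the
# committee-averaged probability of class 1 (assumption: `probs` holds each
# tree's estimate of P(class 1)).

def uncertainty_score(probs):
    """Higher score = more disagreement/uncertainty; maximum 0.5 at P = 0.5."""
    p = sum(probs) / len(probs)
    return 1 - max(p, 1 - p)
```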

Then PoolCA chooses m examples from the unlabeled set according to their uncertainty scores. Instead of simply selecting the m examples with the highest scores, sampling is used to introduce some variety into the chosen examples: examples are repeatedly sampled with probability proportional to their uncertainty scores until m examples are obtained.
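The selection step can be sketched as repeated weighted sampling without replacement; this is an illustrative reading of the procedure, not the paper's actual code.

```python
# Sketch of uncertainty-weighted sampling without replacement: repeatedly
# draw examples with probability proportional to their uncertainty scores
# until m distinct examples are chosen.

import random

def sample_by_uncertainty(scores, m, rng=random):
    """`scores` maps example id -> uncertainty score; returns m distinct ids."""
    remaining = dict(scores)
    chosen = []
    while len(chosen) < m and remaining:
        total = sum(remaining.values())
        if total <= 0:                   # all remaining scores are zero
            chosen.extend(list(remaining)[: m - len(chosen)])
            break
        r = rng.uniform(0, total)        # roulette-wheel draw
        picked = None
        for ex_id, s in remaining.items():
            r -= s
            if r <= 0:
                picked = ex_id
                break
        if picked is None:               # floating-point edge case
            picked = next(reversed(remaining))
        chosen.append(picked)
        del remaining[picked]            # sample without replacement
    return chosen
```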

After the chosen examples are labeled (at a cost) and added to the training set, the same evaluation criterion (Section 3.2) is applied. The pseudocode of PoolCA is presented in Figure 1.

Figure 1. The PoolCA algorithm.

Having explained the two cost-sensitive active learning models and reviewed one previous model (Active-CSL) in detail, we summarize the major differences among the compared models (RanCA, Active-CSL, and PoolCA) in Table 1.

TABLE 1: Summary of the differences among the compared models (RanCA , Active-CSL, and PoolCA )

5.   Experiments

We conduct experiments with our cost-sensitive active learning algorithms on 10 real-world datasets from the UCI Machine Learning Repository [2]. These datasets are chosen because they have at least some discrete attributes and a binary class. In addition, these datasets were also used in our previous work.

The datasets are listed in Table 2. As misclassification costs of these datasets are not available, we assign values in a reasonable range as costs, following [5][8][20]. This is fair and reasonable as all experimental comparisons are conducted with the same cost assignments. In this paper we assign random numbers from 500 to 2000 as the cost of a random example or the label of an example in the pool, and FP/FN=2000/6000 (we assume that TP=TN=0). We have tried other cost assignments and results are very similar.

TABLE 2: Features of the 10 Datasets used in the experiments

5.1. Experiments with RanCA

We first conduct experiments with the random active learner RanCA (Section 3.1). The results are shown in Figure 2, which displays the learning curve for a typical dataset (Thyroid). The unit m is set to 10 in all experiments. The horizontal axis is the number of acquired examples, and the vertical axis is the total cost, which is the sum of the cost of the acquired examples and the average misclassification cost of t future test examples, with t = 1,000 here (other values of t have been tried and result in similar curves).

From Figure 2, we can see clearly that the LOO curve produced by RanCA has the desired trend: it goes down at the beginning (after a few units of training examples are acquired) and then goes up after a certain point, forming a global minimum (the minimal total cost). All datasets produce curves similar to Figure 2, except Cars, which has no global minimum: its curve keeps going lower. This can happen when the available training set is not large enough for the global minimum to be observed.

However, the LOO curve is not always smooth, so look-ahead is needed to find the global minimum. We have calculated the minimum length of look-ahead needed for each dataset (Table 3) and found that a look-ahead of 2 units suffices. This is relatively small in terms of the total cost wasted in searching for the global minimum. Note that the total costs reported in this paper include the 2 units of look-ahead.

TABLE 3: The minimum length of look-ahead needed for RanCA to find the true global minimum in the LOO curve for each dataset

5.2. Experiments with PoolCA

We conduct experiments with PoolCA and compare its results with RanCA (the baseline) and Active-CSL [9]. The results are shown in Figure 3. Note that the segments of the curves after the global minima are computed and plotted only to demonstrate the increase of the total cost; the segment of Active-CSL after its global minimum is not computed and plotted completely for four datasets (Breast, Tic-tac-toe, Mushroom, and Kr-vs-kp), simply because Active-CSL is very slow. From Figure 3, we can see clearly that the cost curves of all the cost-sensitive active learners follow a similar trend: they go down at the beginning, after a few units of training examples are acquired, and then go up, forming global minima. Overall, PoolCA has the lowest (and best) total cost. Active-CSL does not always perform better than the random active learner (RanCA); it performs worse than RanCA on two datasets, Thyroid and Australia.

The minimum total costs and the number of units of examples acquired for the 10 datasets are summarized in Table 4. Again the total costs shown have included the 2 units of look-ahead in search of the global minima.

From Table 4, we can draw several interesting and surprising conclusions. First of all, PoolCA reaches the global minimum with fewer units of examples on average than RanCA (14.0 vs. 16.1), and it also requires fewer units of examples than Active-CSL (14.0 vs. 16.1). Nevertheless, PoolCA produces a lower average total cost than both RanCA and Active-CSL. This indicates that by choosing fewer "critical" examples to label, PoolCA can produce better results (lower total costs) than the random learner RanCA with more examples. It is also a good indication that noisy examples were not chosen by PoolCA (verifying this is part of our future research).

Figure 2. Total cost curve with RanCA for a typical dataset (Thyroid).

TABLE 4: Minimum total cost and the number of units of acquired examples (separated by /) by the three methods

Second, looking at individual datasets, PoolCA produces smaller total costs on 8 of the 10 datasets compared to RanCA, and on 7 of 10 compared to Active-CSL. On the datasets where PoolCA loses, the difference in total cost is usually small. Third, on average, Active-CSL performs slightly better than RanCA, beating it (in total cost) on 6 of the 10 datasets. In sum, we conclude that PoolCA performs much better than the random active learner (RanCA) and the previous method Active-CSL.

5.3. Varying the Committee Size for Sampling the Version Space

In the previous subsection, we conducted experiments on PoolCA with 50 trees as the committee sampling the version space. In this subsection, we investigate the performance of PoolCA under different committee sizes on a typical dataset (Thyroid). The committee size (CS for short) varies from 20 to 200, and the experimental results are shown in Figure 4. From Figure 4, we can see that PoolCA performs better as the committee size increases from 20 to 50, but the performance changes little as the committee size increases further. As PoolCA becomes slower with larger committees, we choose 50 as the committee size used in PoolCA, balancing efficiency and effectiveness.

Figure 3. Comparing PoolCA, RanCA, and Active-CSL.

5.4. Varying the Size of Future Test Sets

In the previous subsections, we studied our active learning algorithms under the assumption that the future test size is t = 1,000. In this subsection, we investigate these algorithms under different sizes of the future test set. The results for various test sizes are very similar to the one studied earlier, so we show only a summary of the global minima and the number of iterations needed to reach them for each dataset, with t = 500 and t = 2,000, in Tables 5 and 6 respectively, summarized in Figures 5 and 6. We can make the following conclusions.

TABLE 5: Summary of the global minima and the number of iterations (displayed in each cell separated by /) for each dataset under the three active learning methods with the assumption the number of future test examples t=500

TABLE 6: Summary of the global minima and the number of iterations (displayed in each cell separated by /) for each dataset under the three active learning methods with the assumption the number of future test examples t=2,000

Figure 4. Varying the committee size for sampling the version space.

From Figure 5, we can conclude that the minimum average total cost of the three methods decreases as the size of the future test set increases. This is expected, as the share of the training-example acquisition cost borne by each test example decreases as the number of future test examples increases.

Figure 5. Minimum total costs under different numbers of future test examples.

Figure 6 shows that the number of training examples acquired to reach the minimum total cost increases as the size of the future test set increases. This is also expected: the more future test examples there are, the more accurate the model should be, so more training examples are required.

From Figures 5 and 6, we can also conclude that PoolCA is the best, followed by Active-CSL and then RanCA. Furthermore, in terms of the minimum total cost, the difference between PoolCA and Active-CSL increases as the size of the future test set decreases, while Active-CSL stays very close to RanCA. In terms of the number of training examples acquired, PoolCA acquires fewer examples than Active-CSL and RanCA, and the difference between PoolCA and Active-CSL increases as the size of the future test set increases; Active-CSL in turn acquires fewer examples than RanCA.

5.5. Varying Cost Assignments

As mentioned in the Introduction, one possible criticism of our work is that costs of various types must be assumed or obtained. However, we believe that accurate cost assignments are not necessary, as small variations in costs will not dramatically affect the discrete decision (the number of units of examples). To investigate this, we conduct experiments with RanCA under the same setting, except that we inject a small variation (5%) into the cost of a random example, the false positive cost (FP), and the false negative cost (FN). The results on a typical dataset (Thyroid) are shown in Figure 7. Three curves in the figure correspond to varying each cost (example cost, FP, FN) independently, and the "All" curve corresponds to injecting noise into all three costs simultaneously. We can see clearly that all the curves follow a very similar trend. In addition, they all reach their global minima after acquiring very similar numbers of examples, and the global minima are also very close to each other. We can thus conclude that small variations in the exact values of cost assignments do not affect the results much.

Figure 6. The number of iterations needed to reach the minima under different numbers of future test examples.

Figure 7. Varying cost assignments on a typical dataset (Thyroid).

6.   Conclusions and Future Work

In this paper we study active learning in the cost-sensitive learning framework to achieve a single goal: minimizing the total cost of "effort" and "gain". More specifically, we study active learners that may acquire additional examples at random, or class labels from a pool of unlabeled examples, at a cost, to minimize the total cost of example acquisition and future misclassifications. We compare these new strategies with previous work and conclude that the pool-based active learner (PoolCA) achieves a smaller total cost with fewer examples than the random active learner RanCA and the previous method Active-CSL.

In our future work, we plan to study other forms of active learning (such as stream-based and membership-query learning) in the cost-sensitive learning framework. We also plan to study line-search optimization methods to speed up the search for the number of examples to acquire that minimizes the total cost.
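One concrete way to realize the line-search idea, assuming the total cost is unimodal in the number of acquired examples, is a ternary search over that number. The sketch below is our illustration of this direction, not the paper's algorithm; the `total_cost` function is a hypothetical convex stand-in for a real learner's cost curve.

```python
def ternary_search_min(f, lo, hi):
    """Return the integer minimizer of a strictly unimodal function f
    on [lo, hi], using O(log(hi - lo)) evaluations instead of a full scan."""
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if f(m1) < f(m2):
            hi = m2 - 1  # the minimum cannot lie above m2
        else:
            lo = m1 + 1  # the minimum cannot lie below m1
    return min(range(lo, hi + 1), key=f)

def total_cost(n, c_ex=1.0, c_fp=40.0, c_fn=60.0, n_test=1000):
    # hypothetical convex cost curve: acquisition cost plus expected
    # misclassification cost under an err(n) = 0.5 / sqrt(n + 1) learning curve
    return n * c_ex + n_test * (0.5 / (n + 1) ** 0.5) * 0.5 * (c_fp + c_fn)

best = ternary_search_min(total_cost, 0, 5000)
```

On a unimodal cost curve the search needs only a logarithmic number of cost evaluations, whereas each evaluation in the real setting may require training a classifier, which is where the speed-up would matter.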

Footnotes

  • 1 Our references are surely incomplete; we have included only several typical works published in recent years. See [1] for a good review of active learning approaches.

References


  • [1] Baram, Y., El-Yaniv, R., and Luz, K. 2004. "Online Choice of Active Learning Algorithms". Journal of Machine Learning Research, 5:255–291.
  • [2] Blake, C.L., and Merz, C.J. 1998. UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science.
  • [3] Blanchflower, D.G., and Oswald, A.J. 2004. "Money, Sex and Happiness: An Empirical Study". The Scandinavian Journal of Economics, 106(3):393–415.
  • [4] Cestnik, B. 1990. "Estimating Probabilities: A Crucial Task in Machine Learning". In Proceedings of the 9th European Conference on Artificial Intelligence, 147–149, Sweden.
  • [5] Domingos, P. 1999. "MetaCost: A General Method for Making Classifiers Cost-Sensitive". In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, 155–164. San Diego, CA: ACM Press.
  • [6] Elkan, C. 2001. "The Foundations of Cost-Sensitive Learning". In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2001), 973–978.
  • [7] Good, I.J. 1965. The Estimation of Probabilities: An Essay on Modern Bayesian Methods. Cambridge, MA: M.I.T. Press.
  • [8] Ling, C.X., Yang, Q., Wang, J., and Zhang, S. 2004. "Decision Trees with Minimal Costs". In Proceedings of the Twenty-First International Conference on Machine Learning, Banff, Alberta: Morgan Kaufmann.
  • [9] Margineantu, D.D. 2005. "Active Cost-Sensitive Learning". In Proceedings of the International Joint Conference on Artificial Intelligence.
  • [10] McCallum, A.K., and Nigam, K. 1998. "Employing EM in Pool-Based Active Learning for Text Classification". In Proceedings of the 15th International Conference on Machine Learning. Morgan Kaufmann.
  • [11] Mitchell, T.M. 1982. "Generalization as Search". Artificial Intelligence, 18:203–226.
  • [12] Quinlan, J.R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.
  • [13] Roy, N., and McCallum, A. 2001. "Toward Optimal Active Learning through Sampling Estimation of Error Reduction". In Proceedings of the 18th International Conference on Machine Learning, 441–448.
  • [14] Saar-Tsechansky, M., and Provost, F. 2004. "Active Sampling for Class Probability Estimation and Ranking". Machine Learning, 54(2):153–178.
  • [15] Ting, K.M. 1998. "Inducing Cost-Sensitive Trees via Instance Weighting". In Proceedings of the Second European Symposium on Principles of Data Mining and Knowledge Discovery, 23–26. Springer-Verlag.
  • [16] Tong, S., and Koller, D. 2001. "Support Vector Machine Active Learning with Applications to Text Classification". Journal of Machine Learning Research, 2:45–66.
  • [17] Turney, P.D. 1995. "Cost-Sensitive Classification: Empirical Evaluation of a Hybrid Genetic Decision Tree Induction Algorithm". Journal of Artificial Intelligence Research, 2:369–409.
  • [18] Turney, P.D. 2000. "Types of Cost in Inductive Concept Learning". In Proceedings of the Workshop on Cost-Sensitive Learning at the Seventeenth International Conference on Machine Learning, Stanford University, California.
  • [19] Zhang, T., and Oles, F. 2000. "A Probability Analysis on the Value of Unlabeled Data for Classification Problems". In Proceedings of the 17th International Conference on Machine Learning, 1191–1198.
  • [20] Zubek, V.B., and Dietterich, T. 2002. "Pruning Improves Heuristic Search for Cost-Sensitive Learning". In Proceedings of the Nineteenth International Conference on Machine Learning, 27–35, Sydney, Australia: Morgan Kaufmann.