Exemplar-based pattern recognition via semi-supervised learning

of 6

Please download to get full document.

View again

All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
6 pages
0 downs
Exemplar-based pattern recognition via semi-supervised learning
  Exemplar-based Pattern Recognition via Semi-Supervised Learning Georgios C. Anagnostopoulos Department of ECE, Florida Institute of Technology Melboume, FL 32901, USA Madan Bharadwaj School of EE & CS, University of Central Florida Orlando, FL 32816, USA Michael Georgiopoulos School of EE & CS, University of Central Florida Orlando, FL 32816, USA Stephen J. Verzi Department of Computer Science, University of New Mexico Albuquerque, NM 87131, USA Gregory L. Heileman Department of EECE, University of New Mexico Albuquerque, NM 87131, USA Abstmct -The focus of this paper is semi-supervised earning in the context of pattern recognition. Semi-supervised learning (SSL) refers to the semi-supervised construction of clusters during the training phase of exemplar-based classifiers. Using artificially generated data sets we present experimental results of classifiers that follow the SSL paradigm and we show that, especially for difficult pattern recognition problems featuring high class overlap, for exemplar-based classifiers implementing SSL i) the generalization performance improves, while ii) the number of necessary exemplars decreases significantly, when compared to the srcinal versions of the Classifiers. 1. INTRODUCTION Exemplar-based classifiers (EBC) are pattem recognizers that encode their accumulated evidence with the use of exemplars. These exemplars, whose geometric representation is usually a geometric shape (like a hyper-rectangle, a hyper- sphere, a hyper-ellipsoid and others) embedded in the classification problem's input domain, are formulated via clustering of training patterns attributed with the same class label. In essence, these classifiers use exemplars to summarize training data belonging to the same class and then utilize a similarity or proximity measure to classify a previously unseen test pattern. The associations of exemplars to class labels are typically one-to-one and in exemplar-based Michael Gcorgiopaulos and Madan Bharadwaj acknowledge the partial support from the National Scicncc Foundation through the CRCO gran1 number 0203446 titled Machinc Lcaming Advanecs for Engineering Educalion. neural networks they are recorded via interconnection weights wir relating the neuron containing the information of the jth xemplar to the kth class label, whenever wj,bO. The aforementioned summarization of input patterns corresponds to a form of local learning, since the information regarding a cluster of patterns is represented by a single exemplar rather than being distributed. Therefore, EBCs readily lend themselves to efficient, incremental, online leaming. An example of exemplar-based recognizers is the family of ART neural classifiers that are based on the principle of adaptive resonance theory (ART) studied in [I]. In the context of ART, exemplars are called categories. Typically, EBCs featurefinite, stable learning, with a zera post-training error, that is, their training phase completes after a finite number of epochs and, if the training set is propagated one more time through the classifiers after training has completed, they classify all training patterns correctly. This occurs because of the supervised learning scheme they employ, when forming and expanding exemplars to signify clusters of similar data. Apart from the satisfaction of a similarity or proximity condition, a specific training pattem can influence the formation of a specific exemplar only if both of them are associated with the same class label. Therefore, it is not unusual that for some classification tasks EBCs are forced to employ a large number of exemplars to attain the final, zero post-training error (referred to as categoryproliferation problem in the ART literature). 0-7803-7898-9/03/$17.00 02003 IEEE 2182  In this paper we introduce the concept of semi-supervised learning (SSL) for EBCs, which aims to conserve the property of stable learning, while achieving a non-zero post- training error to avoid training over-fitting and, thus, loss of good generalization performance. As it will he demonstrated via our experimental results, EBCs following the SSL paradigm not only exhibit improved classification accuracy over their fully supervised counterparts, hut also utilize less exemplars in order to cope with their classification tasks, especially when dealing with problems of high class overlap. Moreover, additional advantages that SSL provides are the capability of dealing with inconsistent training patterns and the capability of coping with non-stationary classification environments. The rest of the paper is organized as follows: Section I1 provides more material on the motivation behind SSL, how it can be implemented in general and, finally, the three neural architectures that we have considered equipping with SSL capabilities. Section Ill details our experimental setting using artificial data sets, reports comparative experimental results related to the latter architectures and states some major observations regarding the utility of SSL. Finally, Section IV summarizes our findings and underlines the discovered importance of SSL. 11. SEMI-SUPERVISED EARNTNG A. Motivation behind Semi-supervised Learning Semi-supervised learning (SSL) refers to the semi- supervised manner, according to which exemplars are formed during training to identify clusters. According to the typical, fully supervised learning scheme of EBCs, training patterns that are rendered to he pertinent to an exemplar by virtue of their position in the feature domain can he associated with or can influence the structure of this exemplar only if both of them correspond to the same class label. Furthermore, training is considered incomplete, if there is at least one exemplar that mispredicts the class label of a training pattern. Therefore, while in fully supervised learning mode, an exemplar is not allowed to commit any misclassification error. Eventually, after completion of the learning process, a typical EBC will feature a zero post-training error. The fact that, under fully supervised learning, EBCs trained lo completion attain a zero post-training error may signify that these classifiers have been over-trained. For any classifier andor pattern recognition problem the difference between test set performance and the post-training accuracy is minimized, in general, for a non-zero post-training error (see [2] and [3]). Additionally, as we have mentioned in the previous section, it might be the case that for some classification tasks EBCs using a fully supervised learning mode are forced to employ a large number of exemplars to train to perfection. Instead, a learning scheme that would allow exemplars to occasionally misclassify training patterns by permitting training patterns, under certain circumstances, to modify exemplars associated with not necessarily the. same class, would also increase the post-training error, potentially increase the performance on the test set and, in general, reduce the amount of exemplars utilized by the classifier. While it is highly desirable to.achieve better accuracy on previously unseen patterns by allowing the post-training error to increase, it is also desirable to preserve the stable learning property of EBCs. A 1earning.scheme that has accomplished both objectives is presented in [4], where it is applied to a variation of the EllipsoidARTMAP (EAM) classifier (see [5] and [6]), srcinally named as Boosted EAM. Throughout the rest of text we will refer to the latter architecture as semi- supervised EAM (SEAM). B. Implementation of Semi-supervised Learning Semi-supervised EAM, which extends the main idea behind Boosted ARTMAP-S described in 171, features a tunable network parameter EE [O,l] called categoiyprediction error tolerance. This parameter regulates the amount of permissible misclassification error during training for all exemplars , (categories) maintained by the classifier. Whenever a category is formed it is attributed an initial class label, which is identical to the class label of the training pattern that initiated the category creation. In SEAM every category is guaranteed that it will not exceed a prediction error of 1006 with respect to its initial class label. For &= 0, ssEAM operates in fully supervised learning mode and behaves like the srcinal EAM classifier; no misclassification errors are allowed. At the other extreme, when E= 1, ssEAM allows for every category a maximum misclassification error of 100 with respect to its initial class label. In other words, for E = 1, SEAM forms categoriesiclusters without laking into account the class label information of the training patterns. Therefore, in this case we say that SEAM operates in a fully unsupervised learning mode; for intermediate settings of E (between 0 and.1) we say that ssEAM operates in semi-supervised learning mode. Ultimately, the role of E is to determine the level of ssEAM:s post-training error and, consequently, the level of generalization performance, although this particular objective is being accomplished in a rather indirect manner. It is worth mentioning that ssEAM stores category-class label association frequencies in weights w .k like the ones we have described in the previous section. The interested reader will find more implementation details regarding ssEAM in [6]. Semi-supervised EAM’s learning scheme is conceptually general enough and can be readily applied to other EBCs to maintain stable learning. Apart from EAM, we also implemented semi-supervised learning to two additional neural, exemplar-based classifiers: Fuzzy ARTMAP (FAM) [SI and the planar Restricted Coulomb Energy (RCE) classifier 191. The semi-supervised variations of these two classifiers will be denoted as ssFAM and ssRCE respectively. 2783  While EAM’s exemplars are geometrically represented as arbitrarily oriented hyper-ellipsoids, for FAM they are axes- parallel hyper-rectangles and for RCE they are either hyper- spheres or polytopes depending on the particular geometry chosen. EAM and FAM feature two common network parameters, namely p[O I] and a>O, that primarily determine the maximum size of learned categories; EAM has an extra parameter p~[0,1], which influences the shape of its categories. Finally, RCE features only one network parameter, R>O that specifies the maximum size of its exemplars. Note that ssFAM and ssRCE feature EE[O,~] s their category prediction error tolerance just as ssEAM does. Since by varying the value of E one can obtain a semi- supervised classifier exhibiting different degrees of post- training error, a natural question arises: how should the best value of E be chosen so that we get an optimal test set classification accuracy? For the experiments presented in the next section we employ cross-validation (see [lo]) as the procedure to identify the optimal E value. The exploration of various E choices and their effect on classification performance in conjunction with cross-validation-based parameter selection will lead us to the identification of the best performing classifiers that utilize a minimal amount of exemplars (classifiers of low hypothesis complexity). Therefore, this procedure can be viewed as a structural risk minimizafion process (see again [2] and 131). Before leaving this section, it is worth mentioning a few additional advantages that are gained via the utilization of an SSL scheme. Consider the case, where the training set contains two identical patterns with conflicting class labels (inconsistent patterns). It is shown in [6] that for E >0.5 BEAM can successfully deal with this issue. Slated more generally, SSL is able to accommodate the case, where inconsistent patterns are present in the training set. Additionally, consider the case, where the classification problem at hand is of non-stationary nature, i.e., decision boundaries between classes are changing over time. Again, for >OS, lassifiers such as SEAM, ssFAM and ssRCE will be able track this change, since for this particular value of E exemplars are allowed to adjust class label associations accordingly. 111. EXPERIMENTS In order to show the utility of SSL we conducted a collection of experiments using ssEAM, ssFAM and ssRCE on artificially generated data sets. The advantage of using artificial databases is that we can generate as many training, cross-validation, and test data, as we desire. We experimented with various values of the network parameter including 11 values for E ranging from 0 to 1 with the extreme values corresponding to the fully supervised and the fully unsupervised learning modes respectively. Through this extensive experimentation we chose the network that achieved the maximum generalization performance on the cross-validation set. Working with simulated data allowed us to generate enough validation data-points so that we can safely state that the hest (with respect to generalization) network parameter values found are indeed optimal. The other advantage of the artificial databases is that we can experiment with different input domain dimensionalities, the number of output classes and the amount of inter-class overlap. Specifically, as it will become apparent from the experimental results, the optimum epsilon value is dependent on the amount of inter-class overlap that is present in the data. A Artijicial Databases In this paper we kept the dimensionality of the input patterns fixed, and equal to 2, and we experimented with the number of output classes (2, 4, or 6) and the amount of overlap amongst data belonging to different classes (overlap values of 5 , 15%, 30% and 40 were used). The artificial databases consist of Gaussianly distributed data. Input patterns were drawn from either 2 or 4 or 6 classes. Qualitatively the overlap can he classified as low (5%), medium (15%) or high (30% or 40 ). The amount of overlap of the simulated data was defined to be the error rate of the optimum Bayes classifier designed as pertinent to the data. For instance, if the Bayes classifier for a simulated data set exhibited a misclassification rate of x , then the amount of overlap corresponding to this data-set was defined to be x . For each one of the above 12 sets of artificial data (3 different number of classes x 4 different degrees of class overlap) we generated a training set, a validation set and a test set of 500, 5,000 and 5,000 data-points respectively. In Fig 1 we show a scatter plot of the training data used for a 2, 4 and 6 class problem with 5 overlap. . Fig. I Scatterplots of data points for 2 class, 4 class and 6 class problems with 5 overlap E Observations The major observations from our experiments are identified and elaborated below. Higher overlap data domains are well suited for the semi- supervised learning approach. The validity of the above statement can be gathered from the plots in Fig. 2, where the x-coordinate represents the PCC XV (percent correct classification on the cross-validation set) value, while the y- coordinate represents the corresponding PCC Train (percent 2184  6 _ _ x .__.-ll_--._ll_.. *:--..-..-,.---~-__-l * l_...__.l.ll..l.--._l__-.- 1 *w*sm-* =<w.-- -m--   10 m o . L) P n ,=cn.r-nnr-ar rra-s 1 s x II II m P I sun. Fig. 2. The plots rcprcsent the Pcrccnlage of Cancct Classification on the Cross-validation sct PCC XV) versus thc Percentage ofConcct Classificalian ovcr the training set PCC Train). The best PCC XV values occur at PCC Train values that arc less than 100%. Ilk effect is amplified whcn the data avcrlap in data increases. Gencralizatian pcrfarmance is cnhanccd whcn thcre is some rcsidual training error prescnt aflcr training is complctcd. correct classification on the training set) value. It is worth noting from these figures that, as the amount of overlap amongst the data increases, the best PCC XV values occurs at values of PCC Train that are further away from the 100% training performance that completely supervised networks (such as EAM) enforce. This is a manifestation of the over- training issue associated with neural networks that has been oAeu been reported in the literature. As we see from Fig. 2 the issue of over-training and its detrimental effects gradually becomes more pronounced as the amount of overlap in a problem increases. In other words, training the network to perfection affects the generalization performance of the network more severely for higher overlap values than for lower overlap values. It is worth mentioning that although the results in Fig. 2 correspond to EAM, similar type of results and associated conclusions correspond to FAM and RCE networks (they are omitted due to lack of space). The best epsilon value migrates from lower to higher values for increasing data overlap. In Fig. 3  we depict the PCC Test and the PCC XV for epsilon values ranging from 0 to 1 with step 0.1. Note that a value of &equal o 0 represents EAM. As it can be seen from the figure the hest E value for the low overlap case (overlap of 5 ) is equal to 0.1, while the best E value for the high overlap case (overlap of 40 ) is equal to 0.8. This result is intuitively pleasing. For low overlap data overlapping data are scarcely witnessed. As a result, there is a lesser need for categories to expand and allow high errors in the training, in an effort to avoid over- training. On the other hand high overlap data cause over- training. By incorporating higher E values when data overlap is high, categories are allowed to expand in a way that patterns of the erroneous label are incorporated within them, thus sidestepping the'over-training issue. Hence, by avoiding over-training, higher generalization performances are observed for higher E values when the data overlap is high. Once more, although the figures depict the results pertaining to EAM networks, similar types of results are valid for FAM and RCE networks (again, they are omitted due to lack of space). Ep.b .PCC T P 0mU- lor EIU I0rlCh.d.k Olndng 0I.n.p. s ~- Do I as * Ep.,b Fig. 3. Pcrccnlage of Correct Classification on Tcst set PCC Tcst) versus cpsilon far SEAM. As thc data ovcrlap increases, thc best epsilon value migratcs towards I indicating the nced for less supcrvision. Better compression rates are achieved at higher epsilon values. This is an obvious result because higher E values allow more error during training with the direct effect of creating fewer categories. In Fig. 4 we depict the number of categories created by SEAM for different E values and for the 2 class 40% overlap dataset. What is worth mentioning regarding this figure is that the E value that achieves the highest generalization performance on the test set does not necessarily create a network with the highest number of categories. On the contraly, the number of categories created at the value of E that maximizes generalization is relatively small, at times amongst the smallest number observed for all &values. 2785  0 0 2 03 01 5 8 07 011 oe I Eph Fig. 4. Categories crcatcd and Pcrccntage Correct Classification over Tcsting set PCC Tcst) against epsilon. Note thc drastic fall in the number of catcgarics created from 241 by EAM to 35 by the best SEAM. The epsilon paramctcr ushers not only better classification but also good comprcssion Performance gains are universal In the previous plots we have depicted the performance of the best possible network for each specific E (the one achieving the highest generalization performance on the validation set with respect to the remaining network parameters; e.g., p, a, ,L and order of training pattem presentation for EAM). Fig. 5  depicts the generalization performance of the best network, and the average performance of all the networks that we have experimented with. What is worth mentioning from this figure is that the peak average performance of all the networks coincides with the performance of the best network at the optimal &value. , i 0 01 0 01 04 os o 1 08 os I EO- Fig. 5. PCC Tcst of best ssEAM network and average PCC XV vcrsus epsilon. The peaks coincide, assuring that thc gains witnessed arc universal and not limited to the best network alone. Eplm Fig. 6. Pcrccntagc Concct Classification on Test set PCC Test) of SEAM and ssFAM against epsilon. An optimum epsilon value for one ypc of ncrwork is generally a good epsilon value for anothcr 'ypc of network oo. Furthermore, beyond this optimum &value we notice that the average generalization performance of all the networks exhibits a monotonically decreasing behavior with increasing E values. The optimal epsilon value /or one network is a good epsilon value for the other networks too For illustration purposes please refer to Fig. 6, where the performance results of all ssFAM and ssEAM with a 4-class problem and 25% overlap is depicted. This result is an additional testament to the goodness of the hest E value for a particular data set. Hence, the best E value obtained for a particular data set can be considered more or less independent of the semi- supervised classifier that is employed for classification purposes. Optimizing with respect to epsilon makes goodsense. If we compare the performance of semi-supervised networks with fully supervised networks across all the experiments that we have performed we observe the following: the generalization performance of the best semi-supervised network outperformed the generalization performance of the best fully supervised network by 0.14 in the worst case and by 11.06% in the hest case with an average of around 6.5% performance enhancement. Furthermore, the ratio of utilized categories achieved by the hest semi-supervised networks compared to the best fully supervised network was around 10. Similar types of qualitative results were observed, when the best ssFAM was compared to the hest FAM and the hest ssRCE was compared with the hest RCE. 2786
Related Search
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks