CHIMERGE DISCRETIZATION OF NUMERIC ATTRIBUTES PDF
Published (Last): 3 February 2011
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Discretization algorithms for real value attributes have important uses in many areas, such as artificial intelligence and machine learning.
The algorithms related to the Chi2 algorithm, which include the modified Chi2 algorithm and the extended Chi2 algorithm, are well-known discretization algorithms that exploit techniques from probability and statistics.
In this paper these algorithms are analyzed and their drawbacks are pointed out. Based on the analysis, a new modified algorithm based on interval similarity is proposed.
An Algorithm for Discretization of Real Value Attributes Based on Interval Similarity
The new algorithm defines an interval similarity function, which is used as a new merging standard in the process of discretization. At the same time, two important parameters in the process of discretization (a condition parameter and a tiny move parameter) and the extent of discrepancy between two adjacent intervals are given in the form of functions. The related theoretical analysis and the experimental results show that the presented algorithm is effective. In machine learning and data mining, many algorithms have been developed specifically to process discrete data.
Discretization of real value attributes is an important method for compressing data and simplifying analysis, and it remains an open problem in pattern recognition, machine learning, and rough set analysis.
Journal of Applied Mathematics
The key to discretization lies in choosing the cut points. At present, there are five different axes by which the proposed discretization algorithms can be classified [1-4]. Continuous attributes need to be discretized in many algorithms, such as rule extraction and tag sorting, and especially in rough set theory in data mining research.
For discretization of real value attributes based on rough sets, extensive research has been conducted and many new discretization methods have been proposed [5]; one line of thought is that the compatibility of the decision table must not change during discretization.
The rough set and Boolean logic method proposed by Nguyen and Skowron is quite influential [6]. Two other influential families of supervised discretization methods are the algorithms based on information entropy and the algorithms related to the Chi2 algorithm, which are based on statistics.
Reference [7] gives an algorithm for discretization of real value attributes based on the decision table and information entropy; it is a heuristic, local algorithm that seeks the best results. Reference [8] proposed a discretization algorithm for real value attributes based on information theory, which regards class-attribute interdependence as an important discretization criterion and selects the candidate cut point that leads to the better correlation between the class labels and the discrete intervals.
But this algorithm has the following disadvantages.
It uses a user-specified number of intervals when initializing the discretization intervals. The significance test used in the algorithm requires training to select a confidence interval. It initializes the discretization intervals using a maximum entropy discretization method, and such initialization may be the worst starting point in terms of the CAIR criterion.
It also easily leads to an inadequate degree of discretization. Huang has solved the above problem, but at the expense of very high computational cost [9]. Kurgan and Cios improved the discretization criterion and attempted to maximize class-attribute interdependence [10]. But this criterion considers only the dependence between the majority class in the interval and the attribute, which causes excessive discretization and an imprecise result.
References [3, 4, 11, 12] describe the algorithms related to the Chi2 algorithm, which are based on statistics. The ChiMerge algorithm introduced by Kerber is a supervised global discretization method [11]. The method uses the $\chi^2$ test to determine whether the current cut point is merged or not. Tay and Shen further improved the Chi2 algorithm and proposed the modified Chi2 algorithm in [4].
The authors showed that it is unreasonable to decide the degree of freedom by the number of decision classes of the whole system, as the Chi2 algorithm does. Instead, the degree of freedom should be determined by the number of decision classes in each pair of adjacent intervals. In [3], the authors pointed out that the method of calculating the degrees of freedom in the modified Chi2 algorithm is not accurate and proposed the extended Chi2 algorithm, which replaced the difference $D = \chi^2_\alpha - \chi^2$ with $D/\sqrt{2v}$, where $v$ is the degree of freedom.
Approximate reasoning is an important research topic in artificial intelligence [14-17]. It requires measuring the similarity between different patterns and objects.
A similarity measure is a function used to compare the similarity of information, data, shapes, and so on. In domains such as picture matching, information retrieval, computer vision, image fusion, remote sensing, and weather forecasting, similarity measures are of vital significance [13, 19-22].
The traditional similarity measures often directly adopt results from statistics, such as the cosine distance, the overlap distance, the Euclidean distance, and the Manhattan distance.
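For concreteness, the traditional measures named here can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the overlap distance is assumed to be the count of mismatching positions between two categorical vectors, which is one common reading of the term.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def euclidean_distance(x, y):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan_distance(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def overlap_distance(x, y):
    """Assumed definition: number of positions whose values differ."""
    return sum(1 for a, b in zip(x, y) if a != b)
```

None of these measures takes class labels into account, which is why a supervised discretization method needs a similarity function of its own.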
The main idea of the algorithms related to the Chi2 algorithm is that the $\chi^2$ statistic and the significance level jointly determine whether a cut point can be merged. In this paper, we point out that, in the extended Chi2 algorithm of reference [3], measuring the importance of nodes by the distance divided by $\sqrt{2v}$ lacks a theoretical basis and is not accurate. It is unreasonable to merge first the two adjacent intervals that have the maximal difference value.
At the same time, based on a study of the applied meaning of the $\chi^2$ statistic, the drawback of the algorithm is analyzed. To solve these problems, a new modified algorithm based on interval similarity is proposed. In addition, two important stipulations are given in the algorithm.
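First, a few concepts about discretization are introduced. A single value of a continuous attribute is a cut point, and two cut points delimit an interval. Two adjacent intervals share a cut point. A discretization algorithm for real value attributes is in fact a process of removing cut points and merging adjacent intervals according to definite rules. The formula for computing the $\chi^2$ value is
$$\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}, \qquad (1)$$
where $k$ is the number of classes, $A_{ij}$ is the number of patterns in the $i$th interval and $j$th class, $R_i = \sum_{j=1}^{k} A_{ij}$ is the number of patterns in the $i$th interval, $C_j = \sum_{i=1}^{2} A_{ij}$ is the number of patterns in the $j$th class, $N = \sum_{i=1}^{2} R_i$ is the total number of patterns, and $E_{ij} = R_i C_j / N$ is the expected frequency of $A_{ij}$. In statistics, the asymptotic distribution of the $\chi^2$ statistic computed over $k$ classes is the $\chi^2$ distribution with $k - 1$ degrees of freedom.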
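The $\chi^2$ statistic for a pair of adjacent intervals is the usual statistic of a 2 x k contingency table, where each observed count is compared with the expected frequency (row total times column total, divided by the grand total). A minimal Python sketch follows; the function name `chi_square` is hypothetical, and skipping cells whose expected frequency is zero is an assumption of this sketch rather than something stated in the text.

```python
def chi_square(counts_a, counts_b):
    """Chi-square statistic for two adjacent intervals.

    counts_a and counts_b hold the number of patterns per class in the
    first and second interval (same class order, same length).
    """
    rows = [counts_a, counts_b]
    k = len(counts_a)
    row_totals = [sum(r) for r in rows]                             # R_i
    col_totals = [counts_a[j] + counts_b[j] for j in range(k)]      # C_j
    n = sum(row_totals)                                             # N
    value = 0.0
    for i in range(2):
        for j in range(k):
            expected = row_totals[i] * col_totals[j] / n            # E_ij
            if expected == 0:
                # Assumption: a class absent from both intervals adds nothing.
                continue
            value += (rows[i][j] - expected) ** 2 / expected
    return value
```

For example, `chi_square([10, 0], [0, 10])` gives 20.0 (completely different class distributions), while `chi_square([4, 2], [2, 1])` gives 0.0 because the two intervals have the same class proportions.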
When the condition attribute values of objects are the same but their decision attribute values are not, the classified information of the decision table has a definite inconsistency rate (error rate).
The rectified Chi2 algorithm proposed in this paper uses the inconsistency rate to control the extent of merging and the information loss in the discretization process.
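One way to make the inconsistency rate concrete is sketched below. It assumes the majority-class definition commonly used with Chi2-style algorithms: objects are grouped by identical condition values, and every object that does not carry its group's most frequent decision value counts as inconsistent. That definition, and the name `inconsistency_rate`, are assumptions of this sketch and may differ from the formula used in the paper.

```python
from collections import Counter, defaultdict


def inconsistency_rate(rows):
    """rows: iterable of (condition_values, decision_value) pairs."""
    groups = defaultdict(Counter)
    total = 0
    for conditions, decision in rows:
        groups[tuple(conditions)][decision] += 1
        total += 1
    if total == 0:
        return 0.0
    # Per group of identical condition values: size minus majority count.
    inconsistent = sum(
        sum(decisions.values()) - max(decisions.values())
        for decisions in groups.values()
    )
    return inconsistent / total
```

With the rows `[((1, 0), "y"), ((1, 0), "n"), ((1, 0), "y"), ((0, 1), "n")]`, one of the four objects disagrees with the majority of its group, so the rate is 0.25.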
The extended Chi2 algorithm is shown in Algorithm 1 [1]. In formula (1), $C_j/N$ is the proportion of patterns in the $j$th class among the total number of patterns, and $R_i$ is the number of patterns in the $i$th interval. Therefore, the $\chi^2$ statistic indicates how equal the class distributions of two adjacent intervals are. The smaller the $\chi^2$ value is, the more similar the class distributions are and the less important the cut point is, so it should be merged.
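The merging rule described here (the pair with the smallest $\chi^2$ value has the least important cut point) can be sketched as a bottom-up loop in the style of ChiMerge. This is a simplified illustration under stated assumptions, not Algorithm 1 of the paper: the names `chi_square` and `merge_intervals`, the fixed `threshold` argument, and the representation of an interval as a `(lower_bound, class_counts)` pair are all hypothetical.

```python
def chi_square(counts_a, counts_b):
    """Chi-square statistic of the 2 x k table formed by two intervals."""
    rows = [counts_a, counts_b]
    row_totals = [sum(r) for r in rows]
    col_totals = [a + b for a, b in zip(counts_a, counts_b)]
    n = sum(row_totals)
    value = 0.0
    for i in range(2):
        for j, col_total in enumerate(col_totals):
            expected = row_totals[i] * col_total / n
            if expected > 0:  # assumption: skip classes absent from both
                value += (rows[i][j] - expected) ** 2 / expected
    return value


def merge_intervals(intervals, threshold):
    """Repeatedly merge the adjacent pair with the smallest chi-square.

    intervals: list of (lower_bound, class_counts), sorted by lower_bound.
    Merging stops once every adjacent pair scores at least `threshold`.
    """
    intervals = [(low, list(counts)) for low, counts in intervals]
    while len(intervals) > 1:
        scores = [
            chi_square(intervals[i][1], intervals[i + 1][1])
            for i in range(len(intervals) - 1)
        ]
        best = min(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            break
        low, left = intervals[best]
        right = intervals[best + 1][1]
        merged = [a + b for a, b in zip(left, right)]
        intervals[best:best + 2] = [(low, merged)]  # drop the cut point
    return intervals
```

Calling `merge_intervals([(1, [1, 0]), (3, [1, 0]), (7, [0, 1]), (9, [0, 1])], 3.841)` removes the two cut points that separate intervals with identical class distributions and keeps the one between the two classes; 3.841 is the 5% critical value of the $\chi^2$ distribution with one degree of freedom.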
For the newest extended Chi2 algorithm, it is quite possible to have two groups of adjacent intervals that differ in the number of classes they contain. The two adjacent intervals with fewer classes have a smaller difference in class distribution and a correspondingly smaller $\chi^2$ value, while the two adjacent intervals with more classes have a bigger degree of freedom.
Then the quantile $\chi^2_\alpha$ of the group with more classes may be much larger than that of the group with fewer classes (see Figure 1). Therefore, even if the $\chi^2$ value of the group with more classes is larger, that group can still end up with the larger difference and be merged first. But in fact, two adjacent intervals with the bigger difference in class distribution and the greater number of classes should not be merged first.
This merging standard is therefore not precise in the computation, and it is unreasonable to merge first the two adjacent intervals with the maximal difference.
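A small numeric sketch illustrates the effect. It assumes that the extended Chi2 criterion is $D = (\chi^2_\alpha - \chi^2)/\sqrt{2v}$ with $v$ the degree of freedom of the pair, and it uses the standard 5% critical values of the $\chi^2$ distribution; the function names and the two example pairs are hypothetical.

```python
import math

# Standard upper 5% critical values of the chi-square distribution,
# indexed by degrees of freedom.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def normalized_difference(chi2_value, dof, table=CHI2_CRITICAL_05):
    """Assumed extended-Chi2 criterion: (quantile - chi2) / sqrt(2 * dof)."""
    return (table[dof] - chi2_value) / math.sqrt(2 * dof)


def pick_pair_to_merge(pairs):
    """pairs: list of (name, chi2_value, dof); return the name with largest D."""
    return max(pairs, key=lambda p: normalized_difference(p[1], p[2]))[0]


# Hypothetical pairs: the second one has more classes (dof 4) and a larger
# chi-square value, that is, more dissimilar class distributions.
pairs = [("few_classes", 1.0, 1), ("many_classes", 3.0, 4)]
print(pick_pair_to_merge(pairs))
```

Under these assumptions the pair with more classes is selected even though its class distributions differ more, because its larger quantile outweighs its larger $\chi^2$ value.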
In the algorithms of the Chi2 family, the $\chi^2$ statistic is expanded as in formula (3). Under certain situations formula (3) is not very accurate. When the number of patterns of some class increases (both intervals contain this class, the other counts are unchanged, and the count in one of the two intervals is unchanged), the numerator and the denominator of the expansion increase at the same time.
As a result, the $\chi^2$ value may first increase and then start to decrease. In other words, when the count of that class in one interval is much bigger than in the other, the $\chi^2$ value increases while the degree of freedom does not change, and the probability that the intervals are merged is reduced.
In fact, when the number of patterns of some class increases, this class becomes more independent of the intervals and takes the status of the leading class. Therefore, compared with the case in which the count has not increased, the two intervals should have at least the same opportunity to compete, and they may even deserve to be merged first.
The situation when the $\chi^2$ value is 0 is as follows. There exists the case in which the class distributions of two adjacent intervals are completely uniform, namely, $\chi^2 = 0$.
Thus, the difference is relatively very big, and the two intervals are possibly merged first. But in fact, merging them first may be unreasonable. For example (see Table 1), consider a decision table with several condition attributes, including $c$, and one decision attribute. The number of samples in the two intervals is the same.
For one pair of intervals the classification is completely uniform, namely, $\chi^2 = 0$, so its difference is relatively big. Even if the degree of freedom of the other pair is bigger, the difference between the two degrees of freedom is very small, so the difference of the uniform pair can still be the bigger one. The values computed from Table 1 show exactly this ordering.
Thus, the two intervals with the uniform class distribution will be merged first, after which samples 3 and 4 and samples 1 and 5 could come into conflict, which does not happen in the alternative case. So, when the $\chi^2$ value is equal to 0, using the difference as the standard of interval merging is inaccurate.
Given a database (an information table) and two arrays in it, their similarity degree is defined as a mapping from the pair of arrays to a fixed interval. A good similarity measure should satisfy a number of characteristic properties. Based on the analysis of the drawbacks of the algorithms related to the Chi2 algorithm, we propose the similarity function as follows.
Given two intervals (objects), each value in the first interval and each value in the second interval carries a class label, and the difference between a value of the first interval and a value of the second interval is defined from their class labels. The similarity function of two adjacent intervals is defined by formula (5), which contains a condition parameter. For any two adjacent intervals, the summed difference expresses the degree of difference between them. But because the number of values differs from one group of adjacent intervals to another, it is unreasonable to take the summed difference alone as the standard for measuring difference.
In order to obtain a uniform standard of difference measurement and a fair opportunity to compete among the groups of adjacent intervals, it is reasonable to use a standardized form of this quantity as the difference measure.
In formula (5), when two adjacent intervals contain only one class, the similarity degree between them is obviously the biggest. In order to compare the similarity degrees of various intervals on a uniform scale, we can apply the arc tangent function as a normalization, so that the similarity value is mapped into a fixed range. The formula expresses the average normalized value of the cut points before discretization.