Jump to content

CN2 algorithm

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Derek Ross (talk | contribs) at 21:55, 2 March 2011 (add pseudocode). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

The CN2 induction algorithm is a learning algorithm for rule induction. It is designed to work even when the training data is imperfect. It is based on ideas from the AQ algorithm and the ID3 algorithm. As a consequence it creates a rule set like that created by AQ but is able to handle noisy data like ID3.

Description of Algorithm

The algorithm must be given a set of examples, TrainingSet, which have already been classified in order to generate a list of classification rules. A set of conditions, SimpleConditionSet, which can be applied, alone or in combination, to any set of examples is predefined to be used for the classification.

routine CN2(TrainingSet)

   let the ClassificationRuleList be empty
   repeat
       let the BestConditionSet be Find_BestConditionSet(TrainingSet)
       if the BestConditionSet is not nil 
       then
           let the TrainingSubset be the examples covered by the BestConditionSet
           remove from the TrainingSet the examples in the TrainingSubset
           let the MostCommonClass be the most common class of examples in the TrainingSubset
           append to the ClassificationRuleList the rule 
               'if ' the BestConditionSet ' then the class is ' the MostCommonClass
   until the TrainingSet is empty or the BestConditionSet is nil

return the ClassificationRuleList

routine Find_BestConditionSet(TrainingSet)

   let the ConditionalFormulaSet be the set containing the empty formula
   let the BestConditionSet be nil
   repeat
       let the TrialConditionalFormulaSet be the set of formulae 
           {x and y|x belongs to the ConditionalFormulaSet, y belongs to the SimpleConditionSet}.
       remove all formulae in the TrialConditionalFormulaSet that are either in the ConditionalFormulaSet (i.e.,
       the unspecialized ones) or null (e.g., big = y and big = n)
       for every formula, F, in the TrialConditionalFormulaSet
           if 
               F is statistically significant 
                   and F is better than the BestConditionSet 
                   by user-defined criteria when tested on the TrainingSet 
               then 
                   replace the current value of the BestConditionSet by F
       while the number of formulae in the TrialConditionalFormulaSet > user-defined maximum
           remove the worst formula from the TrialConditionalFormulaSet
       let the TrialConditionalFormulaSet be the TrialConditionalFormulaSet
   until the TrialConditionalFormulaSet is empty

return the BestConditionSet