Class BooleanPerceptronClassifier

  • All Implemented Interfaces:
    Classifier<java.lang.Boolean>

    public class BooleanPerceptronClassifier
    extends java.lang.Object
    implements Classifier<java.lang.Boolean>
    A Boolean Classifier based on a perceptron (see http://en.wikipedia.org/wiki/Perceptron). The weights are calculated using TermsEnum.totalTermFreq(), both on a per-field and a per-document basis, and a corresponding FST is then used for class assignment.
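    As a rough illustration of the perceptron idea described above, the following sketch computes a weighted sum of term frequencies and compares it with a bias. It is plain Java and does not use Lucene: the class PerceptronDecisionSketch, its methods, the whitespace tokenization, and the ">=" decision rule are all assumptions made for illustration, and a plain Map stands in for the FST.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration only; not part of the Lucene API. */
final class PerceptronDecisionSketch {

  /** Counts how often each whitespace-separated, lower-cased token occurs. */
  static Map<String, Integer> termFrequencies(String text) {
    Map<String, Integer> freqs = new HashMap<>();
    for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
      if (!token.isEmpty()) {
        freqs.merge(token, 1, Integer::sum);
      }
    }
    return freqs;
  }

  /** Weighted sum: each known term contributes weight * frequency. */
  static double output(Map<String, Double> weights, Map<String, Integer> freqs) {
    double sum = 0.0;
    for (Map.Entry<String, Integer> e : freqs.entrySet()) {
      sum += weights.getOrDefault(e.getKey(), 0.0) * e.getValue();
    }
    return sum;
  }

  /** Assumed decision rule: true when the weighted sum reaches the bias. */
  static boolean classify(Map<String, Double> weights, String text, double bias) {
    return output(weights, termFrequencies(text)) >= bias;
  }
}
```

    Terms without a learned weight contribute nothing to the sum, so text made up entirely of unknown terms is classified as false for any positive bias in this sketch.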
    • Method Summary

      Modifier and Type Method Description
      ClassificationResult<java.lang.Boolean> assignClass​(java.lang.String text)
      Assign a class (with score) to the given text String.
      java.util.List<ClassificationResult<java.lang.Boolean>> getClasses​(java.lang.String text)
      Get all the classes (sorted by score, descending) assigned to the given text String.
      java.util.List<ClassificationResult<java.lang.Boolean>> getClasses​(java.lang.String text, int max)
      Get the first max classes (sorted by score, descending) assigned to the given text String.
      private void updateFST​(java.util.SortedMap<java.lang.String,​java.lang.Double> weights)  
      private void updateWeights​(TermVectors termVectors, int docId, java.lang.Boolean assignedClass, java.util.SortedMap<java.lang.String,​java.lang.Double> weights, double modifier, boolean updateFST)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • bias

        private final java.lang.Double bias
      • textTerms

        private final Terms textTerms
      • analyzer

        private final Analyzer analyzer
      • textFieldName

        private final java.lang.String textFieldName
      • fst

        private FST<java.lang.Long> fst
    • Constructor Detail

      • BooleanPerceptronClassifier

        public BooleanPerceptronClassifier​(IndexReader indexReader,
                                           Analyzer analyzer,
                                           Query query,
                                           java.lang.Integer batchSize,
                                           java.lang.Double bias,
                                           java.lang.String classFieldName,
                                           java.lang.String textFieldName)
                                    throws java.io.IOException
        Parameters:
        indexReader - the reader on the index to be used for classification
        analyzer - an Analyzer used to analyze unseen text
        query - a Query used to optionally filter the docs used for training the classifier, or null if all the indexed docs should be used
        batchSize - the size of the batch of docs to use for updating the perceptron weights
        bias - the bias used for class separation
        classFieldName - the name of the field used as the output for the classifier
        textFieldName - the name of the field used as input for the classifier
        Throws:
        java.io.IOException - if building the underlying FST fails or if the TermsEnum for the text field cannot be found
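        One possible reading of the batchSize and bias parameters is sketched below in plain Java, without Lucene. It assumes that weight corrections from misclassified docs are accumulated and only become visible to the classifier once a batch is complete. BatchTrainingSketch, its methods, the +1/-1 modifier, and the Map-based documents are hypothetical; the actual constructor may train differently.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; not part of the Lucene API. */
final class BatchTrainingSketch {

  /** Weighted sum of term frequencies using the given weights. */
  static double output(Map<String, Double> weights, Map<String, Integer> termFreqs) {
    double sum = 0.0;
    for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
      sum += weights.getOrDefault(e.getKey(), 0.0) * e.getValue();
    }
    return sum;
  }

  /**
   * Single pass over the docs. Corrections are collected in "pending" and
   * merged into "applied" (the weights used for prediction) after every
   * batchSize docs, and once more at the end of the input.
   */
  static Map<String, Double> train(List<Map<String, Integer>> docs,
                                   List<Boolean> labels,
                                   int batchSize,
                                   double bias) {
    Map<String, Double> applied = new TreeMap<>();
    Map<String, Double> pending = new TreeMap<>();
    for (int i = 0; i < docs.size(); i++) {
      Map<String, Integer> termFreqs = docs.get(i);
      boolean predicted = output(applied, termFreqs) >= bias;
      boolean expected = labels.get(i);
      if (predicted != expected) {
        // Assumed modifier: push weights up for missed positives, down otherwise.
        double modifier = expected ? 1.0 : -1.0;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
          pending.merge(e.getKey(), modifier * e.getValue(), Double::sum);
        }
      }
      boolean batchComplete = (i + 1) % batchSize == 0 || i == docs.size() - 1;
      if (batchComplete) {
        pending.forEach((term, delta) -> applied.merge(term, delta, Double::sum));
        pending.clear();
      }
    }
    return applied;
  }
}
```

        In this sketch a larger batch size means fewer rebuilds of the prediction weights, at the cost of classifying later docs in the batch with stale weights.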
    • Method Detail

      • updateWeights

        private void updateWeights​(TermVectors termVectors,
                                   int docId,
                                   java.lang.Boolean assignedClass,
                                   java.util.SortedMap<java.lang.String,​java.lang.Double> weights,
                                   double modifier,
                                   boolean updateFST)
                            throws java.io.IOException
        Throws:
        java.io.IOException
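        updateWeights is private and not described on this page. The following is a minimal sketch of a classic perceptron weight update, assuming each term's weight changes by modifier times the term's frequency in the document. WeightUpdateSketch and the Map of term frequencies (standing in for TermVectors) are hypothetical, and the real method may differ, for example in how it bounds weights or when it rebuilds the FST.

```java
import java.util.Map;
import java.util.SortedMap;

/** Hypothetical illustration only; not part of the Lucene API. */
final class WeightUpdateSketch {

  /**
   * Adds modifier * frequency to the weight of every term in the document.
   * Terms without a previous weight start from that first contribution.
   */
  static void updateWeights(Map<String, Integer> termFreqs,
                            SortedMap<String, Double> weights,
                            double modifier) {
    for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
      weights.merge(e.getKey(), modifier * e.getValue(), Double::sum);
    }
  }
}
```

        A positive modifier raises the weights of the document's terms and a negative one lowers them. Keeping the weights in a SortedMap means the terms can later be iterated in sorted order, which is the order Lucene's FST construction expects for its input.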
      • updateFST

        private void updateFST​(java.util.SortedMap<java.lang.String,​java.lang.Double> weights)
                        throws java.io.IOException
        Throws:
        java.io.IOException
      • assignClass

        public ClassificationResult<java.lang.Boolean> assignClass​(java.lang.String text)
                                                            throws java.io.IOException
        Description copied from interface: Classifier
        Assign a class (with score) to the given text String.
        Specified by:
        assignClass in interface Classifier<java.lang.Boolean>
        Parameters:
        text - a String containing text to be classified
        Returns:
        a ClassificationResult holding the assigned class (of type T, here Boolean) and its score
        Throws:
        java.io.IOException - If there is a low-level I/O error.
      • getClasses

        public java.util.List<ClassificationResult<java.lang.Boolean>> getClasses​(java.lang.String text)
        Description copied from interface: Classifier
        Get all the classes (sorted by score, descending) assigned to the given text String.
        Specified by:
        getClasses in interface Classifier<java.lang.Boolean>
        Parameters:
        text - a String containing text to be classified
        Returns:
        the full list of ClassificationResult objects, each holding a class and its score, or null if the classifier cannot produce lists
      • getClasses

        public java.util.List<ClassificationResult<java.lang.Boolean>> getClasses​(java.lang.String text,
                                                                                  int max)
        Description copied from interface: Classifier
        Get the first max classes (sorted by score, descending) assigned to the given text String.
        Specified by:
        getClasses in interface Classifier<java.lang.Boolean>
        Parameters:
        text - a String containing text to be classified
        max - the maximum number of elements in the returned list
        Returns:
        the list of ClassificationResult objects, each holding a class and its score, truncated to at most max elements, or null if the classifier cannot produce lists
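        Because both getClasses overloads are documented as possibly returning null, callers may want to normalize the result before iterating. The helper below is a hypothetical, Lucene-free sketch: NullSafeResults, orEmpty, and firstMax are illustrative names, and firstMax only mirrors the documented "first max elements" behaviour on a list that is already sorted.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration only; not part of the Lucene API. */
final class NullSafeResults {

  /** Replaces a null result list with an empty, immutable list. */
  static <T> List<T> orEmpty(List<T> results) {
    return results == null ? Collections.<T>emptyList() : results;
  }

  /** Returns a view of at most max leading elements of a sorted list. */
  static <T> List<T> firstMax(List<T> sorted, int max) {
    List<T> safe = orEmpty(sorted);
    int end = Math.max(0, Math.min(max, safe.size()));
    return safe.subList(0, end);
  }
}
```

        A caller could wrap a getClasses(text) call in orEmpty(...) and loop over the result without a separate null check.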