Class MlBreakEngine


  • public class MlBreakEngine
    extends java.lang.Object
    • Constructor Summary

      Constructors 
      Constructor Description
      MlBreakEngine​(UnicodeSet digitOrOpenPunctuationOrAlphabetSet, UnicodeSet closePunctuationSet)
      Constructor for Chinese and Japanese phrase breaking.
    • Method Summary

      Modifier and Type Method Description
      int divideUpRange​(java.text.CharacterIterator inText, int startPos, int endPos, java.text.CharacterIterator inString, int codePointLength, int[] charPositions, DictionaryBreakEngine.DequeI foundBreaks)
      Divide up a range of characters handled by this break engine.
      private void evaluateBreakpoint​(java.lang.String inputStr, int[] indexList, int startIdx, int numCodeUnits, java.util.ArrayList<java.lang.Integer> boundary)
      Evaluate whether the current position is a potential breakpoint.
      private int initIndexList​(java.text.CharacterIterator inString, int[] indexList, int codePointLength)
      Initialize the index list from the input string.
      private void initKeyValue​(UResourceBundle rb, java.lang.String keyName, java.lang.String valueName, java.util.HashMap<java.lang.String,​java.lang.Integer> map)
      Load features and their scores from the machine learning model file, using the given key and value names.
      private void loadMLModel()
      Load the machine learning model file.
      private java.lang.String transform​(java.text.CharacterIterator inString)
      Transform a CharacterIterator into a String.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • fDigitOrOpenPunctuationOrAlphabetSet

        private UnicodeSet fDigitOrOpenPunctuationOrAlphabetSet
      • fClosePunctuationSet

        private UnicodeSet fClosePunctuationSet
      • fModel

        private java.util.List<java.util.HashMap<java.lang.String,​java.lang.Integer>> fModel
      • fNegativeSum

        private int fNegativeSum
    • Constructor Detail

      • MlBreakEngine

        public MlBreakEngine​(UnicodeSet digitOrOpenPunctuationOrAlphabetSet,
                             UnicodeSet closePunctuationSet)
        Constructor for Chinese and Japanese phrase breaking.
        Parameters:
        digitOrOpenPunctuationOrAlphabetSet - A Unicode set containing digits, open punctuation, and alphabetic characters.
        closePunctuationSet - A Unicode set containing close punctuation.
    • Method Detail

      • divideUpRange

        public int divideUpRange​(java.text.CharacterIterator inText,
                                 int startPos,
                                 int endPos,
                                 java.text.CharacterIterator inString,
                                 int codePointLength,
                                 int[] charPositions,
                                 DictionaryBreakEngine.DequeI foundBreaks)
        Divide up a range of characters handled by this break engine.
        Parameters:
        inText - An input text.
        startPos - The start index of the input text.
        endPos - The end index of the input text.
        inString - An input string normalized from the range of inText between startPos and endPos.
        codePointLength - The number of code points in inString.
        charPositions - A map that transforms inString's code point index to code unit index.
        foundBreaks - A list to store the breakpoints.
        Returns:
        The number of breakpoints.
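        The charPositions parameter is described as a map from code point index to code unit index. The following is a minimal sketch of what such a map could look like, built with standard java.lang.String methods only. The class name, the helper buildCharPositions, and the choice to append the total length as a final entry are assumptions made for illustration, not the engine's actual layout.

```java
import java.util.Arrays;

public class CharPositionsSketch {
    // Hypothetical helper: entry i holds the code unit index at which
    // code point i starts. The extra last entry (total length in code
    // units) is an assumption of this sketch.
    static int[] buildCharPositions(String s) {
        int codePointLength = s.codePointCount(0, s.length());
        int[] charPositions = new int[codePointLength + 1];
        for (int i = 0; i <= codePointLength; i++) {
            charPositions[i] = s.offsetByCodePoints(0, i);
        }
        return charPositions;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (a surrogate pair, two code units), 'b'
        String s = "a\uD83D\uDE00b";
        System.out.println(Arrays.toString(buildCharPositions(s)));
    }
}
```

        With a map like this, a breakpoint found at code point index i can be reported at code unit index charPositions[i], which matters whenever the text contains supplementary characters.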
      • transform

        private java.lang.String transform​(java.text.CharacterIterator inString)
        Transform a CharacterIterator into a String.
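        As an illustration of this kind of conversion, the sketch below copies every code unit in a CharacterIterator's range into a String using only java.text APIs. It is one possible approach under the assumption that the whole range from getBeginIndex() to getEndIndex() is wanted; the class name is hypothetical and the private method may be implemented differently.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

public class TransformSketch {
    // Copies the iterator's full range into a String.
    // Note: this moves the iterator's current position to the end.
    static String transform(CharacterIterator it) {
        StringBuilder sb = new StringBuilder();
        // Loop on the index rather than on CharacterIterator.DONE,
        // because U+FFFF could in principle occur in the text itself.
        for (it.first(); it.getIndex() < it.getEndIndex(); it.next()) {
            sb.append(it.current());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        CharacterIterator it = new StringCharacterIterator("abcdef", 1, 4, 1);
        System.out.println(transform(it));
    }
}
```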
      • evaluateBreakpoint

        private void evaluateBreakpoint​(java.lang.String inputStr,
                                        int[] indexList,
                                        int startIdx,
                                        int numCodeUnits,
                                        java.util.ArrayList<java.lang.Integer> boundary)
        Evaluate whether the current position is a potential breakpoint.
        Parameters:
        inputStr - An input string to be segmented.
        indexList - A code unit index list of the inputStr.
        startIdx - The start index of the indexList.
        numCodeUnits - The current code unit boundary of the indexList.
        boundary - A list including the index of the breakpoint.
      • initIndexList

        private int initIndexList​(java.text.CharacterIterator inString,
                                  int[] indexList,
                                  int codePointLength)
        Initialize the index list from the input string.
        Parameters:
        inString - An input string to be segmented.
        indexList - A code unit index list of the inString.
        codePointLength - The number of code points in the input string.
        Returns:
        The number of code units in the first six characters of inString.
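        The return value counts code units rather than characters, so it differs from six when the leading characters include surrogate pairs. The sketch below shows one way to compute such a count while walking a CharacterIterator. The class name, the constant PREFIX_CODE_POINTS, and the way indexList is filled (one starting offset per code point) are assumptions of this example and may not match the private method.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

public class InitIndexListSketch {
    static final int PREFIX_CODE_POINTS = 6; // hypothetical name for "first six"

    static int initIndexList(CharacterIterator inString, int[] indexList,
                             int codePointLength) {
        int numCodeUnits = 0;    // code units consumed so far
        int prefixCodeUnits = 0; // code units in the first six code points
        inString.first();
        for (int i = 0; i < codePointLength
                && inString.getIndex() < inString.getEndIndex(); i++) {
            if (i < indexList.length) {
                indexList[i] = numCodeUnits; // start offset of code point i
            }
            int width = 1;
            if (Character.isHighSurrogate(inString.current())) {
                char next = inString.next();
                if (inString.getIndex() < inString.getEndIndex()
                        && Character.isLowSurrogate(next)) {
                    width = 2;       // well-formed surrogate pair
                    inString.next();
                }
            } else {
                inString.next();
            }
            numCodeUnits += width;
            if (i < PREFIX_CODE_POINTS) {
                prefixCodeUnits = numCodeUnits;
            }
        }
        return prefixCodeUnits;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00bcdefg"; // 8 code points, 9 code units
        int[] indexList = new int[8];
        System.out.println(initIndexList(new StringCharacterIterator(s), indexList, 8));
    }
}
```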
      • loadMLModel

        private void loadMLModel()
        Load the machine learning model file.
      • initKeyValue

        private void initKeyValue​(UResourceBundle rb,
                                  java.lang.String keyName,
                                  java.lang.String valueName,
                                  java.util.HashMap<java.lang.String,​java.lang.Integer> map)
        Load features and their scores from the machine learning model file, using the given key and value names.
        Parameters:
        rb - A UResourceBundle corresponding to the model file.
        keyName - The key name in the model file.
        valueName - The value name in the model file.
        map - A HashMap to store the pairs of the feature and its score.
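        The pairing of features with scores can be pictured with a small stand-in. Because the layout of the model's resource bundle is not described here, the sketch below replaces the UResourceBundle lookups with two parallel arrays; the class name, the array parameters, and the sample feature strings and scores are all hypothetical.

```java
import java.util.HashMap;

public class InitKeyValueSketch {
    // Stand-in for reading the entries named by keyName and valueName:
    // keys[i] is a feature, values[i] is that feature's score.
    static void initKeyValue(String[] keys, int[] values,
                             HashMap<String, Integer> map) {
        if (keys.length != values.length) {
            throw new IllegalArgumentException("keys and values differ in length");
        }
        for (int i = 0; i < keys.length; i++) {
            map.put(keys[i], values[i]);
        }
    }

    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<>();
        // Made-up features and scores, for illustration only.
        initKeyValue(new String[] {"AB", "CD"}, new int[] {12, -7}, map);
        System.out.println(map.get("AB") + " " + map.get("CD"));
    }
}
```

        Storing each feature group in its own map of this shape is consistent with the declared type of fModel, a list of HashMap<String, Integer> instances.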