Package com.ibm.icu.impl
Class UnicodeRegex
- java.lang.Object
-
- com.ibm.icu.impl.UnicodeRegex
-
- All Implemented Interfaces:
StringTransform, Transform<java.lang.String,java.lang.String>, Freezable<UnicodeRegex>, java.lang.Cloneable
public class UnicodeRegex extends java.lang.Object implements java.lang.Cloneable, Freezable<UnicodeRegex>, StringTransform
Contains utilities to supplement the JDK Regex, since it doesn't handle Unicode well.
TODO: Move to com.ibm.icu.dev.somewhere. 2015-sep-03: This is used there, and also in CLDR and in UnicodeTools.
-
-
Field Summary
Modifier and Type Field Description
private java.lang.String
bnfCommentString
private java.lang.String
bnfLineSeparator
private java.lang.String
bnfVariableInfix
private java.util.Comparator<java.lang.Object>
LongestFirst
private static UnicodeRegex
STANDARD
private static java.util.regex.Pattern
SUPP_ESCAPE
private SymbolTable
symbolTable
-
Constructor Summary
Constructor Description
UnicodeRegex()
-
Method Summary
Modifier and Type Method Description
static java.util.List<java.lang.String>
appendLines(java.util.List<java.lang.String> result, java.io.InputStream inputStream, java.lang.String encoding)
Utility for loading lines from a UTF-8 file.
static java.util.List<java.lang.String>
appendLines(java.util.List<java.lang.String> result, java.lang.String file, java.lang.String encoding)
Utility for loading lines from a file.
UnicodeRegex
cloneAsThawed()
Provides for the clone operation.
static java.util.regex.Pattern
compile(java.lang.String regex)
Compile a regex string, after processing by fix(...).
static java.util.regex.Pattern
compile(java.lang.String regex, int options)
Compile a regex string, after processing by fix(...).
java.lang.String
compileBnf(java.lang.String bnfLines)
Compile a composed string from a set of BNF lines; see the List version for more information.
java.lang.String
compileBnf(java.util.List<java.lang.String> lines)
Compile a composed string from a set of BNF lines, such as for composing a regex expression.
static java.lang.String
fix(java.lang.String regex)
Convenience static function, using standard parameters.
UnicodeRegex
freeze()
Freezes the object.
java.lang.String
getBnfCommentString()
java.lang.String
getBnfLineSeparator()
java.lang.String
getBnfVariableInfix()
SymbolTable
getSymbolTable()
Gets the symbol table used for internal processing.
private java.util.Map<java.lang.String,java.lang.String>
getVariables(java.util.List<java.lang.String> lines)
boolean
isFrozen()
Determines whether the object has been frozen or not.
private int
processSet(java.lang.String regex, int i, java.lang.StringBuilder result, UnicodeSet temp, java.text.ParsePosition pos)
void
setBnfCommentString(java.lang.String bnfCommentString)
void
setBnfLineSeparator(java.lang.String bnfLineSeparator)
void
setBnfVariableInfix(java.lang.String bnfVariableInfix)
UnicodeRegex
setSymbolTable(SymbolTable symbolTable)
Sets the symbol table used for internal processing.
java.lang.String
transform(java.lang.String regex)
Adds full Unicode property support, with the latest version of Unicode, to Java Regex, bringing it up to Level 1 (see http://www.unicode.org/reports/tr18/).
-
-
-
Field Detail
-
SUPP_ESCAPE
private static final java.util.regex.Pattern SUPP_ESCAPE
-
symbolTable
private SymbolTable symbolTable
-
STANDARD
private static final UnicodeRegex STANDARD
-
bnfCommentString
private java.lang.String bnfCommentString
-
bnfVariableInfix
private java.lang.String bnfVariableInfix
-
bnfLineSeparator
private java.lang.String bnfLineSeparator
-
LongestFirst
private java.util.Comparator<java.lang.Object> LongestFirst
-
-
Method Detail
-
getSymbolTable
public SymbolTable getSymbolTable()
Gets the symbol table used for internal processing.
-
setSymbolTable
public UnicodeRegex setSymbolTable(SymbolTable symbolTable)
Sets the symbol table used for internal processing.
-
transform
public java.lang.String transform(java.lang.String regex)
Adds full Unicode property support, with the latest version of Unicode, to Java Regex, bringing it up to Level 1 (see http://www.unicode.org/reports/tr18/). It does this by preprocessing the regex pattern string and interpreting the character classes (\p{...}, \P{...}, [...]) according to their syntax and meaning in UnicodeSet. With this utility, Java regex expressions can be updated to work with the latest version of Unicode, and with all Unicode properties. Note, however, that the UnicodeSet syntax has not yet been updated to be completely consistent with Java regex, so be careful of the differences.
Not thread-safe; create a separate copy for different threads.
In the future, we may extend this to support other regex packages.
- Specified by:
transform
in interfaceStringTransform
- Specified by:
transform
in interfaceTransform<java.lang.String,java.lang.String>
- Parameters:
regex
- A modified Java regex pattern, as in the input to Pattern.compile(), except that all "character classes" are processed as if they were UnicodeSet patterns. Example: "abc[:bc=N:]". See UnicodeSet for the differences in syntax.
- Returns:
- A processed Java regex pattern, suitable for input to Pattern.compile().
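The preprocessing idea can be illustrated without ICU. The following is only a toy sketch, not the UnicodeRegex implementation: it rewrites POSIX-style `[:name:]` classes into plain JDK character classes using a small hand-written table. The class name `ToyRegexPreprocessor` and the table contents are hypothetical; the real method resolves every character class through UnicodeSet and also handles \p{...} and \P{...}.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ToyRegexPreprocessor {
    // Hypothetical lookup table (assumption for illustration only).
    // UnicodeRegex derives the sets from UnicodeSet instead.
    private static final Map<String, String> CLASSES = Map.of(
            "ascii", "\\x00-\\x7F",
            "digit", "0-9");

    // Matches a POSIX-style class such as [:digit:] and captures its name.
    private static final Pattern POSIX_CLASS = Pattern.compile("\\[:([a-z]+):\\]");

    /** Rewrites each [:name:] into a bracket expression the JDK understands. */
    static String transform(String regex) {
        Matcher m = POSIX_CLASS.matcher(regex);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = CLASSES.get(m.group(1));
            if (body == null) {
                throw new IllegalArgumentException("Unknown class: " + m.group(1));
            }
            // quoteReplacement keeps backslashes in the replacement literal.
            m.appendReplacement(out, Matcher.quoteReplacement("[" + body + "]"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String processed = transform("abc[:digit:]+");
        System.out.println(processed);
        System.out.println(Pattern.compile(processed).matcher("abc123").matches());
    }
}
```

The output of such a preprocessing step is an ordinary pattern string, which is why it can be passed straight to Pattern.compile().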
-
fix
public static java.lang.String fix(java.lang.String regex)
Convenience static function, using standard parameters.
- Parameters:
regex
- as in transform()
- Returns:
- processed regex pattern, as in transform()
-
compile
public static java.util.regex.Pattern compile(java.lang.String regex)
Compile a regex string, after processing by fix(...).
- Parameters:
regex
- Raw regex pattern, as in fix(...).
- Returns:
- Pattern
-
compile
public static java.util.regex.Pattern compile(java.lang.String regex, int options)
Compile a regex string, after processing by fix(...).
- Parameters:
regex
- Raw regex pattern, as in fix(...).
options
- Match flags, as in Pattern.compile(String, int).
- Returns:
- Pattern
-
compileBnf
public java.lang.String compileBnf(java.lang.String bnfLines)
Compile a composed string from a set of BNF lines; see the List version for more information.
- Parameters:
bnfLines
- Series of BNF lines.
- Returns:
- The composed pattern string.
-
compileBnf
public java.lang.String compileBnf(java.util.List<java.lang.String> lines)
Compile a composed string from a set of BNF lines, such as for composing a regex expression. The lines can be in any order, but there must not be any cycles. The result can be used as input for fix().
Example:
uri = (?: (scheme) \\:)? (host) (?: \\? (query))? (?: \\u0023 (fragment))?;
scheme = reserved+;
host = // reserved+;
query = [\\=reserved]+;
fragment = reserved+;
reserved = [[:ascii:][:alphabetic:]];
Caveats: at this point the parsing is simple. For example, # cannot be quoted (use \\u0023); you can set the comment string to null to disable comments. The equality sign and a few other delimiters can be changed with the setBnf...() setters.
- Parameters:
lines
- Series of lines that represent a BNF expression. The lines contain a series of statements of the form x=y;. A statement can span multiple lines, but there can't be multiple statements on a line. A hash (#) comments out the rest of the line.
- Returns:
- The composed pattern string.
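The composition described above can be sketched in plain Java. This is a simplified illustration under stated assumptions, not the ICU code: the class `BnfSketch` and its methods are hypothetical, the root variable is passed in explicitly, whitespace is kept as written, and variable references are replaced by plain text substitution with longer names tried first so that a short name cannot clobber part of a longer one.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class BnfSketch {
    /** Parses "name = value;" statements, dropping '#' comments. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> vars = new LinkedHashMap<>();
        StringBuilder statement = new StringBuilder();
        for (String line : lines) {
            int hash = line.indexOf('#');
            if (hash >= 0) line = line.substring(0, hash); // comment to end of line
            line = line.trim();
            if (line.isEmpty()) continue;
            statement.append(line); // a statement may span several lines
            if (line.endsWith(";")) {
                String s = statement.substring(0, statement.length() - 1);
                int eq = s.indexOf('='); // the first '=' separates name and value
                if (eq < 0) throw new IllegalArgumentException("Missing '=': " + s);
                vars.put(s.substring(0, eq).trim(), s.substring(eq + 1).trim());
                statement.setLength(0);
            }
        }
        if (statement.length() > 0) {
            throw new IllegalArgumentException("Unterminated statement: " + statement);
        }
        return vars;
    }

    /** Recursively replaces variable names in the value of {@code name}. */
    static String expand(String name, Map<String, String> vars, Set<String> inProgress) {
        if (!inProgress.add(name)) throw new IllegalArgumentException("Cycle at " + name);
        String value = vars.get(name);
        List<String> names = new ArrayList<>(vars.keySet());
        names.sort((a, b) -> b.length() - a.length()); // longest names first
        for (String other : names) {
            if (value.contains(other)) {
                value = value.replace(other, expand(other, vars, inProgress));
            }
        }
        inProgress.remove(name);
        return value;
    }

    static String compose(List<String> lines, String root) {
        Map<String, String> vars = parse(lines);
        if (!vars.containsKey(root)) throw new IllegalArgumentException("No such variable: " + root);
        return expand(root, vars, new HashSet<>());
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "pair = key=value;",
                "key = word;",
                "value = (?:word|digit+);",
                "word = [a-z]+;   # lowercase letters",
                "digit = [0-9];");
        System.out.println(compose(lines, "pair"));
    }
}
```

Because substitution is purely textual, a value that should act as a unit (such as an alternation) needs its own grouping, as in `(?:word|digit+)` above.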
-
getBnfCommentString
public java.lang.String getBnfCommentString()
-
setBnfCommentString
public void setBnfCommentString(java.lang.String bnfCommentString)
-
getBnfVariableInfix
public java.lang.String getBnfVariableInfix()
-
setBnfVariableInfix
public void setBnfVariableInfix(java.lang.String bnfVariableInfix)
-
getBnfLineSeparator
public java.lang.String getBnfLineSeparator()
-
setBnfLineSeparator
public void setBnfLineSeparator(java.lang.String bnfLineSeparator)
-
appendLines
public static java.util.List<java.lang.String> appendLines(java.util.List<java.lang.String> result, java.lang.String file, java.lang.String encoding) throws java.io.IOException
Utility for loading lines from a file.
- Parameters:
result
- The list to which the lines are appended.
file
- The file to open as an input stream.
encoding
- if null, then UTF-8
- Returns:
- filled list
- Throws:
java.io.IOException
- If there were problems opening the file for input stream.
-
appendLines
public static java.util.List<java.lang.String> appendLines(java.util.List<java.lang.String> result, java.io.InputStream inputStream, java.lang.String encoding) throws java.io.UnsupportedEncodingException, java.io.IOException
Utility for loading lines from a UTF-8 file.
- Parameters:
result
- The list to which the lines are appended.
inputStream
- The input stream.
encoding
- if null, then UTF-8
- Returns:
- filled list
- Throws:
java.io.IOException
- If there were problems opening the input stream for reading.
java.io.UnsupportedEncodingException
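The documented behavior (append each line to the given list, fall back to UTF-8 when the encoding is null) can be sketched with JDK classes only. The class `LineLoaderSketch` is hypothetical, and closing the stream at the end is an assumption of this sketch rather than something the description above states.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class LineLoaderSketch {
    /** Appends every line of the stream to {@code result} and returns that same list. */
    static List<String> appendLines(List<String> result, InputStream in, String encoding)
            throws IOException {
        String charsetName = (encoding == null) ? "UTF-8" : encoding;
        // InputStreamReader throws UnsupportedEncodingException for unknown names.
        // Assumption: this sketch closes the stream when it is done.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charsetName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "scheme = reserved+;\nhost = // reserved+;\n".getBytes(StandardCharsets.UTF_8);
        List<String> lines = appendLines(new ArrayList<>(), new ByteArrayInputStream(data), null);
        System.out.println(lines);
    }
}
```

A list filled this way has the shape expected by compileBnf(java.util.List<java.lang.String>).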
-
cloneAsThawed
public UnicodeRegex cloneAsThawed()
Description copied from interface:Freezable
Provides for the clone operation. Any clone is initially unfrozen.
- Specified by:
cloneAsThawed
in interfaceFreezable<UnicodeRegex>
-
freeze
public UnicodeRegex freeze()
Description copied from interface:Freezable
Freezes the object.
- Specified by:
freeze
in interfaceFreezable<UnicodeRegex>
- Returns:
- the object itself.
-
isFrozen
public boolean isFrozen()
Description copied from interface:Freezable
Determines whether the object has been frozen or not.
- Specified by:
isFrozen
in interfaceFreezable<UnicodeRegex>
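The three Freezable methods follow a general freeze/thaw pattern, which the sketch below illustrates with a stand-alone class. `FreezableSettings` is hypothetical and does not reproduce UnicodeRegex itself; in particular, having the setter reject changes on a frozen instance is an assumption about the general pattern, since this page does not say how UnicodeRegex's own setters behave after freeze().

```java
final class FreezableSettings implements Cloneable {
    private String commentString = "#";
    private boolean frozen;

    boolean isFrozen() {
        return frozen;
    }

    /** Marks this object immutable and returns the object itself. */
    FreezableSettings freeze() {
        frozen = true;
        return this;
    }

    /** Returns a mutable copy; any clone is initially unfrozen. */
    FreezableSettings cloneAsThawed() {
        try {
            FreezableSettings copy = (FreezableSettings) super.clone();
            copy.frozen = false;
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: the class is Cloneable
        }
    }

    String getCommentString() {
        return commentString;
    }

    void setCommentString(String commentString) {
        // Assumption: a frozen instance rejects modification.
        if (frozen) {
            throw new UnsupportedOperationException("Attempt to modify frozen object");
        }
        this.commentString = commentString;
    }

    public static void main(String[] args) {
        FreezableSettings shared = new FreezableSettings().freeze();
        FreezableSettings mine = shared.cloneAsThawed(); // private, mutable copy
        mine.setCommentString("//");
        System.out.println(shared.getCommentString() + " " + mine.getCommentString());
    }
}
```

Taking a thawed clone per thread is one way to follow the advice under transform to create a separate copy for different threads.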
-
processSet
private int processSet(java.lang.String regex, int i, java.lang.StringBuilder result, UnicodeSet temp, java.text.ParsePosition pos)
-
getVariables
private java.util.Map<java.lang.String,java.lang.String> getVariables(java.util.List<java.lang.String> lines)
-
-