A mutable set of Unicode characters and multicharacter strings. Objects of this class
represent character classes used in regular expressions.
A character class specifies a subset of Unicode code points. Legal
code points are U+0000 to U+10FFFF, inclusive.
The UnicodeSet class is not designed to be subclassed.
UnicodeSet supports two APIs. The first is the operand API that allows the
caller to modify the value of a UnicodeSet object. It conforms to Java 2's
java.util.Set interface, although UnicodeSet does not actually implement that
interface. All methods of Set are supported, with the modification that they
take a character range or single character instead of an Object, and they
take a UnicodeSet instead of a Collection. The operand API may be thought of
in terms of boolean logic: a boolean OR is implemented by add, a boolean AND
is implemented by retain, a boolean XOR is implemented by complement taking
an argument, and a boolean NOT is implemented by complement with no argument.
In terms of traditional set theory function names, add is a union, retain is
an intersection, remove is an asymmetric difference, and complement with no
argument is a set complement with respect to the superset range
MIN_VALUE-MAX_VALUE.
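The boolean-logic reading of the operand API can be illustrated without UnicodeSet itself. The following is a minimal sketch that models a set over a toy universe of eight "code points" with java.util.BitSet. The class name OperandLogicSketch, the helper of(), and the universe size are assumptions made for illustration and are not part of this class's API.

```java
import java.util.BitSet;

class OperandLogicSketch {
    // Toy universe 0..7, standing in for MIN_VALUE..MAX_VALUE (an assumption).
    static final int UNIVERSE = 8;

    static BitSet of(int... members) {
        BitSet b = new BitSet(UNIVERSE);
        for (int m : members) b.set(m);
        return b;
    }

    // add ~ boolean OR (union)
    static BitSet add(BitSet a, BitSet b) { BitSet r = (BitSet) a.clone(); r.or(b); return r; }

    // retain ~ boolean AND (intersection)
    static BitSet retain(BitSet a, BitSet b) { BitSet r = (BitSet) a.clone(); r.and(b); return r; }

    // remove ~ AND NOT (asymmetric difference)
    static BitSet remove(BitSet a, BitSet b) { BitSet r = (BitSet) a.clone(); r.andNot(b); return r; }

    // complement with an argument ~ boolean XOR
    static BitSet complement(BitSet a, BitSet b) { BitSet r = (BitSet) a.clone(); r.xor(b); return r; }

    // complement with no argument ~ boolean NOT over the whole universe
    static BitSet complement(BitSet a) { BitSet r = (BitSet) a.clone(); r.flip(0, UNIVERSE); return r; }

    public static void main(String[] args) {
        BitSet a = of(1, 2, 3);
        BitSet b = of(3, 4);
        System.out.println(add(a, b));        // {1, 2, 3, 4}
        System.out.println(retain(a, b));     // {3}
        System.out.println(remove(a, b));     // {1, 2}
        System.out.println(complement(a, b)); // {1, 2, 4}
        System.out.println(complement(a));    // {0, 4, 5, 6, 7}
    }
}
```

Each helper returns a new BitSet so the operands stay unchanged; the methods of this class instead modify the receiver and return it for chaining.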
The second API is the applyPattern()/toPattern() API from the
java.text.Format-derived classes. Unlike the methods that add characters,
add categories, and control the logic of the set, the method applyPattern()
sets all attributes of a UnicodeSet at once, based on a string pattern.
Pattern syntax
Patterns are accepted by the constructors and the applyPattern() methods and
returned by the toPattern() method. These patterns follow a syntax
similar to that employed by version 8 regular expression character
classes. Here are some simple examples:
[] | No characters |
[a] | The character 'a' |
[ae] | The characters 'a' and 'e' |
[a-e] | The characters 'a' through 'e' inclusive, in Unicode code point order |
[\\u4E01] | The character U+4E01 |
[a{ab}{ac}] | The character 'a' and the multicharacter strings "ab" and "ac" |
[\p{Lu}] | All characters in the general category Uppercase Letter |
Any character may be preceded by a backslash in order to remove any special
meaning. White space characters, as defined by UCharacterProperty.isRuleWhiteSpace(), are
ignored, unless they are escaped.
Property patterns specify a set of characters having a certain
property as defined by the Unicode standard. Both the POSIX-like
"[:Lu:]" and the Perl-like syntax "\p{Lu}" are recognized. For a
complete list of supported property patterns, see the User's Guide
for UnicodeSet at
http://icu.sourceforge.net/userguide/unicodeSet.html.
Actual determination of property data is defined by the underlying
Unicode database as implemented by UCharacter.
Patterns specify individual characters, ranges of characters, and
Unicode property sets. When elements are concatenated, they
specify their union. To complement a set, place a '^' immediately
after the opening '['. Property patterns are inverted by modifying
their delimiters: "[:^foo:]" and "\P{foo}". In any other location,
'^' has no special meaning.
Ranges are indicated by placing a '-' between two
characters, as in "a-z". This specifies the range of all
characters from the left to the right, in Unicode order. If the
left character is greater than or equal to the
right character it is a syntax error. If a '-' occurs as the first
character after the opening '[' or '[^', or if it occurs as the
last character before the closing ']', then it is taken as a
literal. Thus "[a\\-b]", "[-ab]", and "[ab-]" all indicate the same
set of three characters, 'a', 'b', and '-'.
Sets may be intersected using the '&' operator, and the asymmetric
set difference may be taken using the '-' operator. For example,
"[[:L:]&[\\u0000-\\u0FFF]]" indicates the set of all Unicode letters
with values less than 4096. Operators ('&' and '-') have equal
precedence and bind left-to-right. Thus
"[[:L:]-[a-z]-[\\u0100-\\u01FF]]" is equivalent to
"[[[:L:]-[a-z]]-[\\u0100-\\u01FF]]". This only really matters for
difference; intersection is commutative.
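Why grouping matters for difference but not for intersection can be checked on small sets. The sketch below is illustrative only and uses java.util.TreeSet rather than UnicodeSet. The class DifferenceOrderSketch and its helpers chars(), diff() and intersect() are hypothetical names introduced here.

```java
import java.util.Set;
import java.util.TreeSet;

class DifferenceOrderSketch {
    // Builds a sorted set holding each char of the string.
    static Set<Character> chars(String s) {
        Set<Character> r = new TreeSet<>();
        for (char c : s.toCharArray()) r.add(c);
        return r;
    }

    // Asymmetric difference a - b, leaving both operands untouched.
    static Set<Character> diff(Set<Character> a, Set<Character> b) {
        Set<Character> r = new TreeSet<>(a);
        r.removeAll(b);
        return r;
    }

    // Intersection a & b, leaving both operands untouched.
    static Set<Character> intersect(Set<Character> a, Set<Character> b) {
        Set<Character> r = new TreeSet<>(a);
        r.retainAll(b);
        return r;
    }

    public static void main(String[] args) {
        Set<Character> a = chars("abcd"), b = chars("bc"), c = chars("c");

        // Left-to-right grouping, as the pattern syntax specifies: (a - b) - c
        System.out.println(diff(diff(a, b), c));           // [a, d]
        // The other grouping gives a different set: a - (b - c)
        System.out.println(diff(a, diff(b, c)));           // [a, c, d]

        // For intersection the grouping makes no difference.
        System.out.println(intersect(intersect(a, b), c)); // [c]
        System.out.println(intersect(a, intersect(b, c))); // [c]
    }
}
```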
[a] | The set containing 'a' |
[a-z] | The set containing 'a' through 'z' and all letters in between, in Unicode order |
[^a-z] | The set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF |
[[pat1][pat2]] | The union of sets specified by pat1 and pat2 |
[[pat1]&[pat2]] | The intersection of sets specified by pat1 and pat2 |
[[pat1]-[pat2]] | The asymmetric difference of sets specified by pat1 and pat2 |
[:Lu:] or \p{Lu} | The set of characters having the specified Unicode property; in this case, Unicode uppercase letters |
[:^Lu:] or \P{Lu} | The set of characters not having the given Unicode property |
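For readers more familiar with java.util.regex, the three set operations in the table have close analogues in that package's character-class syntax, although the spelling differs. The sketch below uses only java.util.regex.Pattern. The UnicodeSet patterns appear in comments for comparison and are not parsed by this code. The class name RegexClassAnalogy and the helper in() are assumptions.

```java
import java.util.regex.Pattern;

class RegexClassAnalogy {
    // UnicodeSet "[[a-d][m-p]]"  ~ java.util.regex union
    static final Pattern UNION = Pattern.compile("[a-d[m-p]]");
    // UnicodeSet "[[a-z]&[def]]" ~ java.util.regex intersection (note the doubled &&)
    static final Pattern INTERSECTION = Pattern.compile("[a-z&&[def]]");
    // UnicodeSet "[[a-z]-[bc]]"  ~ java.util.regex subtraction, written as
    // an intersection with a negated class
    static final Pattern DIFFERENCE = Pattern.compile("[a-z&&[^bc]]");

    // True if the single character c belongs to the character class.
    static boolean in(Pattern characterClass, char c) {
        return characterClass.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        System.out.println(in(UNION, 'n'));        // true
        System.out.println(in(UNION, 'e'));        // false
        System.out.println(in(INTERSECTION, 'e')); // true
        System.out.println(in(INTERSECTION, 'a')); // false
        System.out.println(in(DIFFERENCE, 'a'));   // true
        System.out.println(in(DIFFERENCE, 'b'));   // false
    }
}
```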
Warning: you cannot add an empty string ("") to a UnicodeSet.
Formal syntax
pattern := ('[' '^'? item* ']') | property
item := char | (char '-' char) | pattern-expr
pattern-expr := pattern | pattern-expr pattern | pattern-expr op pattern
op := '&' | '-'
special := '[' | ']' | '-'
char := any character that is not special | ('\\' any character) | ('\u' hex hex hex hex)
hex := any character for which Character.digit(c, 16) returns a non-negative result
property := a Unicode property set pattern
Legend:
a := b      a may be replaced by b
a?          zero or one instance of a
a*          zero or more instances of a
a | b       either a or b
'a'         the literal string between the quotes
To iterate over the contents of a UnicodeSet, use the UnicodeSetIterator class.
_generatePattern
public StringBuffer _generatePattern(StringBuffer result,
boolean escapeUnprintable)
Generate and append a string representation of this set to result.
This does not use this.pat, the cleaned up copy of the string
passed to applyPattern().
result
- the buffer into which to generate the pattern
escapeUnprintable
- escape unprintable characters if true
_generatePattern
public StringBuffer _generatePattern(StringBuffer result,
boolean escapeUnprintable,
boolean includeStrings)
Generate and append a string representation of this set to result.
This does not use this.pat, the cleaned up copy of the string
passed to applyPattern().
includeStrings
- if false, doesn't include the strings.
add
public final UnicodeSet add(String s)
Adds the specified multicharacter string to this set if it is not already
present. If this set already contains the multicharacter string,
the call leaves this set unchanged.
Thus "ch" => {"ch"}
Warning: you cannot add an empty string ("") to a UnicodeSet.
- this object, for chaining
add
public final UnicodeSet add(int c)
Adds the specified character to this set if it is not already
present. If this set already contains the specified character,
the call leaves this set unchanged.
add
public UnicodeSet add(int start,
int end)
Adds the specified range to this set if it is not already
present. If this set already contains the specified range,
the call leaves this set unchanged. If start > end
then an empty range is added, leaving the set unchanged.
start
- first character, inclusive, of range to be added
to this set.
end
- last character, inclusive, of range to be added
to this set.
addAll
public void addAll(Collection source)
Add the contents of the collection (as strings) into this UnicodeSet.
source
- the collection to add
addAll
public final UnicodeSet addAll(String s)
Adds each of the characters in this string to the set. Thus "ch" => {"c", "h"}
If this set already contains any particular character, it has no effect on that character.
- this object, for chaining
addAll
public UnicodeSet addAll(UnicodeSet c)
Adds all of the elements in the specified set to this set if
they're not already present. This operation effectively
modifies this set so that its value is the union of the two
sets. The behavior of this operation is unspecified if the specified
collection is modified while the operation is in progress.
c
- set whose elements are to be added to this set.
addAllTo
public void addAllTo(Collection target)
Add the contents of the UnicodeSet (as strings) into a collection.
target
- collection to add into
addMatchSetTo
public void addMatchSetTo(UnicodeSet toUnionTo)
Implementation of UnicodeMatcher API. Union the set of all
characters that may be matched by this object into the given
set.
- addMatchSetTo in interface UnicodeMatcher
toUnionTo
- the set into which to union the source characters
applyIntPropertyValue
public UnicodeSet applyIntPropertyValue(int prop,
int value)
Modifies this set to contain those code points which have the
given value for the given binary or enumerated property, as
returned by UCharacter.getIntPropertyValue. Prior contents of
this set are lost.
prop
- a property in the range
UProperty.BIN_START..UProperty.BIN_LIMIT-1 or
UProperty.INT_START..UProperty.INT_LIMIT-1 or
UProperty.MASK_START..UProperty.MASK_LIMIT-1.
value
- a value in the range
UCharacter.getIntPropertyMinValue(prop)..
UCharacter.getIntPropertyMaxValue(prop), with one exception.
If prop is UProperty.GENERAL_CATEGORY_MASK, then value should not be
a UCharacter.getType() result, but rather a mask value produced
by logically ORing (1 << UCharacter.getType()) values together.
This allows grouped categories such as [:L:] to be represented.
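The mask construction described for GENERAL_CATEGORY_MASK can be sketched with the JDK alone. The example below uses java.lang.Character.getType() as a stand-in for UCharacter.getType(); that substitution, the class name CategoryMaskSketch, and the helper matches() are assumptions made for illustration. It builds a mask for the grouped category L by ORing (1 << type) for each letter subcategory and then tests code points against it.

```java
class CategoryMaskSketch {
    // Mask for the grouped category "L": one bit per letter subcategory.
    static final int LETTER_MASK =
          (1 << Character.UPPERCASE_LETTER)
        | (1 << Character.LOWERCASE_LETTER)
        | (1 << Character.TITLECASE_LETTER)
        | (1 << Character.MODIFIER_LETTER)
        | (1 << Character.OTHER_LETTER);

    // True if the code point's general category has its bit set in the mask.
    static boolean matches(int mask, int codePoint) {
        return (mask & (1 << Character.getType(codePoint))) != 0;
    }

    public static void main(String[] args) {
        System.out.println(matches(LETTER_MASK, 'A'));    // true  (uppercase letter)
        System.out.println(matches(LETTER_MASK, 'z'));    // true  (lowercase letter)
        System.out.println(matches(LETTER_MASK, 0x4E01)); // true  (other letter)
        System.out.println(matches(LETTER_MASK, '7'));    // false (decimal digit)
        System.out.println(matches(LETTER_MASK, ' '));    // false (space separator)
    }
}
```

A single getType() result can name only one subcategory, which is why the grouped category needs a mask rather than a type value.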
applyPattern
public final UnicodeSet applyPattern(String pattern)
Modifies this set to represent the set specified by the given pattern.
See the class description for the syntax of the pattern language.
Whitespace is ignored.
pattern
- a string specifying what characters are in the set
applyPattern
public UnicodeSet applyPattern(String pattern,
boolean ignoreWhitespace)
Modifies this set to represent the set specified by the given pattern,
optionally ignoring whitespace.
See the class description for the syntax of the pattern language.
pattern
- a string specifying what characters are in the set
ignoreWhitespace
- if true then characters for which
UCharacterProperty.isRuleWhiteSpace() returns true are ignored
applyPattern
public UnicodeSet applyPattern(String pattern,
int options)
Modifies this set to represent the set specified by the given pattern,
optionally ignoring whitespace.
See the class description for the syntax of the pattern language.
pattern
- a string specifying what characters are in the set
options
- a bitmask indicating which options to apply.
Valid options are IGNORE_SPACE and CASE.
applyPropertyAlias
public UnicodeSet applyPropertyAlias(String propertyAlias,
String valueAlias)
Modifies this set to contain those code points which have the
given value for the given property. Prior contents of this
set are lost.
propertyAlias
- a property alias, either short or long.
The name is matched loosely. See PropertyAliases.txt for names
and a description of loose matching. If the value string is
empty, then this string is interpreted as either a
General_Category value alias, a Script value alias, a binary
property alias, or a special ID. Special IDs are matched
loosely and correspond to the following sets:
"ANY" = [\u0000-\U0010FFFF],
"ASCII" = [\u0000-\u007F].
valueAlias
- a value alias, either short or long. The
name is matched loosely. See PropertyValueAliases.txt for
names and a description of loose matching. In addition to
aliases listed, numeric values and canonical combining classes
may be expressed numerically, e.g., ("nv", "0.5") or ("ccc",
"220"). The value string may also be empty.
applyPropertyAlias
public UnicodeSet applyPropertyAlias(String propertyAlias,
String valueAlias,
SymbolTable symbols)
Modifies this set to contain those code points which have the
given value for the given property. Prior contents of this
set are lost.
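propertyAlias
- a property alias, as for applyPropertyAlias(String, String)
valueAlias
- a value alias, as for applyPropertyAlias(String, String)
symbols
- if not null, then symbols are first called to see if a property
is available. If true, then everything else is skipped.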
charAt
public int charAt(int index)
Returns the character at the given index within this set, where
the set is ordered by ascending code point. If the index is
out of range, return -1. The inverse of this method is indexOf().
index
- an index from 0..size()-1
- the character at the given index, or -1.
clear
public UnicodeSet clear()
Removes all of the elements from this set. This set will be
empty after this call returns.
clone
public Object clone()
Return a new set that is equivalent to this one.
closeOver
public UnicodeSet closeOver(int attribute)
Close this set over the given attribute. For the attribute
CASE, the result is to modify this set so that:
1. For each character or string 'a' in this set, all strings
'b' such that foldCase(a) == foldCase(b) are added to this set.
(For most 'a' that are single characters, 'b' will have
b.length() == 1.)
2. For each string 'e' in the resulting set, if e !=
foldCase(e), 'e' will be removed.
Example: [aq\u00DF{Bc}{bC}{Fi}] => [aAqQ\u00DF\uFB01{ss}{bc}{fi}]
(Here foldCase(x) refers to the operation
UCharacter.foldCase(x, true), and a == b actually denotes
a.equals(b), not pointer comparison.)
attribute
- bitmask for attributes to close over.
Currently only the CASE bit is supported. Any undefined bits
are ignored.
compact
public UnicodeSet compact()
Reallocate this object's internal structures to take up the least
possible space, without changing this object's value.
complement
public UnicodeSet complement()
This is equivalent to complement(MIN_VALUE, MAX_VALUE).
complement
public final UnicodeSet complement(String s)
Complement the specified string in this set.
The string will be removed if it is in this set, or will be
added if it is not in this set.
Warning: you cannot add an empty string ("") to a UnicodeSet.
s
- the string to complement
- this object, for chaining
complement
public final UnicodeSet complement(int c)
Complements the specified character in this set. The character
will be removed if it is in this set, or will be added if it is
not in this set.
complement
public UnicodeSet complement(int start,
int end)
Complements the specified range in this set. Any character in
the range will be removed if it is in this set, or will be
added if it is not in this set. If start > end
then an empty range is complemented, leaving the set unchanged.
start
- first character, inclusive, of range to be complemented
in this set.
end
- last character, inclusive, of range to be complemented
in this set.
complementAll
public final UnicodeSet complementAll(String s)
Complement EACH of the characters in this string. Note: "ch" == {"c", "h"}
Each character is removed if it is in this set, or added if it is not.
- this object, for chaining
complementAll
public UnicodeSet complementAll(UnicodeSet c)
Complements in this set all elements contained in the specified
set. Any character in the other set will be removed if it is
in this set, or will be added if it is not in this set.
c
- set that defines which elements will be complemented in
this set.
contains
public final boolean contains(String s)
Returns true if this set contains the given
multicharacter string.
s
- string to be checked for containment
- true if this set contains the specified string
contains
public boolean contains(int c)
Returns true if this set contains the given character.
- contains in interface UnicodeFilter
c
- character to be checked for containment
- true if the test condition is met
contains
public boolean contains(int start,
int end)
Returns true if this set contains every character
of the given range.
start
- first character, inclusive, of the range
end
- last character, inclusive, of the range
- true if the test condition is met
containsAll
public boolean containsAll(String s)
Returns true if there is a partition of the string such that this set contains each of the partitioned strings.
For example, for the Unicode set [a{bc}{cd}]
containsAll is true for each of: "a", "bc", "cdbca"
containsAll is false for each of: "acb", "bcda", "bcx"
s
- string containing characters to be checked for containment
- true if the test condition is met
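The partition test described for containsAll(String) can be sketched as a small dynamic program. This is an illustrative model, not the implementation used by this class: the set is represented as a plain java.util.Set of strings (single characters as one-character strings), and the class PartitionSketch and its method are hypothetical names. Position i is marked reachable when some prefix of the string ending at i splits into set elements.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class PartitionSketch {
    // True if s can be split into consecutive pieces that are all in elements.
    // Works on char units, which is enough for the BMP examples used here.
    static boolean containsAll(Set<String> elements, String s) {
        boolean[] reachable = new boolean[s.length() + 1];
        reachable[0] = true; // the empty prefix is trivially partitioned
        for (int i = 0; i < s.length(); i++) {
            if (!reachable[i]) continue;
            for (String e : elements) {
                if (!e.isEmpty() && s.startsWith(e, i)) {
                    reachable[i + e.length()] = true;
                }
            }
        }
        return reachable[s.length()];
    }

    public static void main(String[] args) {
        // Models the set [a{bc}{cd}] from the description above.
        Set<String> set = new HashSet<>(Arrays.asList("a", "bc", "cd"));
        for (String s : new String[] {"a", "bc", "cdbca", "acb", "bcda", "bcx"}) {
            System.out.println(s + " -> " + containsAll(set, s));
        }
    }
}
```

With this model, "cdbca" succeeds through the split cd, bc, a, while "acb" fails because nothing in the set matches at the position after the leading a.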
containsAll
public boolean containsAll(UnicodeSet c)
Returns true if this set contains all the characters and strings
of the given set.
c
- set to be checked for containment
- true if the test condition is met
containsNone
public boolean containsNone(String s)
Returns true if this set contains none of the characters
of the given string.
s
- string containing characters to be checked for containment
- true if the test condition is met
containsNone
public boolean containsNone(UnicodeSet c)
Returns true if none of the characters or strings in this UnicodeSet appears in the string.
For example, for the Unicode set [a{bc}{cd}]
containsNone is true for: "xy", "cb"
containsNone is false for: "a", "bc", "bcd"
c
- set to be checked for containment
- true if the test condition is met
containsNone
public boolean containsNone(int start,
int end)
Returns true if this set contains none of the characters
of the given range.
start
- first character, inclusive, of the range
end
- last character, inclusive, of the range
- true if the test condition is met
containsSome
public final boolean containsSome(String s)
Returns true if this set contains one or more of the characters
of the given string.
s
- string containing characters to be checked for containment
- true if the condition is met
containsSome
public final boolean containsSome(UnicodeSet s)
Returns true if this set contains one or more of the characters
and strings of the given set.
s
- set to be checked for containment
- true if the condition is met
containsSome
public final boolean containsSome(int start,
int end)
Returns true if this set contains one or more of the characters
in the given range.
start
- first character, inclusive, of the range
end
- last character, inclusive, of the range
- true if the condition is met
equals
public boolean equals(Object o)
Compares the specified object with this set for equality. Returns
true if the specified object is also a set, the two sets
have the same size, and every member of the specified set is
contained in this set (or equivalently, every member of this set is
contained in the specified set).
o
- Object to be compared for equality with this set.
- true if the specified Object is equal to this set.
from
public static UnicodeSet from(String s)
Makes a set from a multicharacter string. Thus "ch" => {"ch"}
Warning: you cannot add an empty string ("") to a UnicodeSet.
- a newly created set containing the given string
fromAll
public static UnicodeSet fromAll(String s)
Makes a set from each of the characters in the string. Thus "ch" => {"c", "h"}
- a newly created set containing the given characters
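The difference between from() and fromAll() (and likewise between add(String) and addAll(String)) is whether the string is treated as one element or as a bag of characters. The sketch below models both with java.util.TreeSet of strings. It is a stand-in for illustration only, and the class name FromVersusFromAllSketch is an assumption.

```java
import java.util.Set;
import java.util.TreeSet;

class FromVersusFromAllSketch {
    // Models from(): the whole string becomes a single element. "ch" => {"ch"}
    static Set<String> from(String s) {
        Set<String> r = new TreeSet<>();
        r.add(s);
        return r;
    }

    // Models fromAll(): each code point becomes its own element. "ch" => {"c", "h"}
    static Set<String> fromAll(String s) {
        Set<String> r = new TreeSet<>();
        s.codePoints().forEach(cp -> r.add(new String(Character.toChars(cp))));
        return r;
    }

    public static void main(String[] args) {
        System.out.println(from("ch"));    // [ch]
        System.out.println(fromAll("ch")); // [c, h]
        System.out.println(fromAll("hh")); // [h]
    }
}
```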
getRangeCount
public int getRangeCount()
Iteration method that returns the number of ranges contained in
this set.
getRangeEnd
public int getRangeEnd(int index)
Iteration method that returns the last character in the
specified range of this set.
getRangeStart
public int getRangeStart(int index)
Iteration method that returns the first character in the
specified range of this set.
getRegexEquivalent
public String getRegexEquivalent()
- regex pattern equivalent to this UnicodeSet
hashCode
public int hashCode()
Returns the hash code value for this set.
- the hash code value for this set.
indexOf
public int indexOf(int c)
Returns the index of the given character within this set, where
the set is ordered by ascending code point. If the character
is not in this set, return -1. The inverse of this method is charAt().
- an index from 0..size()-1, or -1
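How indexOf() and charAt() invert each other, and how they relate to the range accessors getRangeCount(), getRangeStart() and getRangeEnd(), can be shown with a small model. The sketch below stores a set as sorted, non-overlapping inclusive ranges. That storage layout and the class name RangeIndexSketch are assumptions for illustration, and the model ignores multicharacter strings.

```java
class RangeIndexSketch {
    // Sorted, non-overlapping inclusive ranges, each as {start, end}.
    private final int[][] ranges;

    RangeIndexSketch(int[][] ranges) { this.ranges = ranges; }

    int getRangeCount() { return ranges.length; }
    int getRangeStart(int i) { return ranges[i][0]; }
    int getRangeEnd(int i) { return ranges[i][1]; }

    // Number of code points across all ranges.
    int size() {
        int n = 0;
        for (int i = 0; i < getRangeCount(); i++) {
            n += getRangeEnd(i) - getRangeStart(i) + 1;
        }
        return n;
    }

    // Code point at the given index in ascending order, or -1 if out of range.
    int charAt(int index) {
        if (index < 0) return -1;
        for (int i = 0; i < getRangeCount(); i++) {
            int length = getRangeEnd(i) - getRangeStart(i) + 1;
            if (index < length) return getRangeStart(i) + index;
            index -= length;
        }
        return -1;
    }

    // Index of the code point in ascending order, or -1 if absent.
    int indexOf(int c) {
        int offset = 0;
        for (int i = 0; i < getRangeCount(); i++) {
            if (c < getRangeStart(i)) return -1;
            if (c <= getRangeEnd(i)) return offset + (c - getRangeStart(i));
            offset += getRangeEnd(i) - getRangeStart(i) + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Models the set [a-cx-z].
        RangeIndexSketch set = new RangeIndexSketch(new int[][] {{'a', 'c'}, {'x', 'z'}});
        System.out.println(set.size());           // 6
        System.out.println((char) set.charAt(3)); // x
        System.out.println(set.indexOf('y'));     // 4
        System.out.println(set.charAt(6));        // -1
        System.out.println(set.indexOf('d'));     // -1
    }
}
```

For every valid index i in this model, indexOf(charAt(i)) == i.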
isEmpty
public boolean isEmpty()
Returns true if this set contains no elements.
- true if this set contains no elements.
matchesIndexValue
public boolean matchesIndexValue(int v)
Implementation of UnicodeMatcher API. Returns true if
this set contains any character whose low byte is the given
value. This is used by RuleBasedTransliterator for
indexing.
- matchesIndexValue in interface UnicodeMatcher
remove
public final UnicodeSet remove(String s)
Removes the specified string from this set if it is present.
The set will not contain the specified string once the call
returns.
s
- the string to be removed
- this object, for chaining
remove
public final UnicodeSet remove(int c)
Removes the specified character from this set if it is present.
The set will not contain the specified character once the call
returns.
c
- the character to be removed
- this object, for chaining
remove
public UnicodeSet remove(int start,
int end)
Removes the specified range from this set if it is present.
The set will not contain the specified range once the call
returns. If start > end
then an empty range is removed, leaving the set unchanged.
start
- first character, inclusive, of range to be removed
from this set.
end
- last character, inclusive, of range to be removed
from this set.
removeAll
public final UnicodeSet removeAll(String s)
Remove EACH of the characters in this string. Note: "ch" == {"c", "h"}
If this set does not contain a particular character, removing it has no effect.
- this object, for chaining
removeAll
public UnicodeSet removeAll(UnicodeSet c)
Removes from this set all of its elements that are contained in the
specified set. This operation effectively modifies this
set so that its value is the asymmetric set difference of
the two sets.
c
- set that defines which elements will be removed from
this set.
resemblesPattern
public static boolean resemblesPattern(String pattern,
int pos)
Return true if the given position, in the given pattern, appears
to be the start of a UnicodeSet pattern.
retain
public final UnicodeSet retain(String s)
Retain the specified string in this set if it is present.
Upon return this set will be empty if it did not contain s, or
will only contain s if it did contain s.
s
- the string to be retained
- this object, for chaining
retain
public final UnicodeSet retain(int c)
Retain the specified character from this set if it is present.
Upon return this set will be empty if it did not contain c, or
will only contain c if it did contain c.
c
- the character to be retained
- this object, for chaining
retain
public UnicodeSet retain(int start,
int end)
Retain only the elements in this set that are contained in the
specified range. If start > end
then an empty range is retained, leaving the set empty.
start
- first character, inclusive, of range to be retained
in this set.
end
- last character, inclusive, of range to be retained
in this set.
retainAll
public final UnicodeSet retainAll(String s)
Retains EACH of the characters in this string. Note: "ch" == {"c", "h"}
Elements of this set that are not among those characters are removed.
- this object, for chaining
retainAll
public UnicodeSet retainAll(UnicodeSet c)
Retains only the elements in this set that are contained in the
specified set. In other words, removes from this set all of
its elements that are not contained in the specified set. This
operation effectively modifies this set so that its value is
the intersection of the two sets.
c
- set that defines which elements this set will retain.
set
public UnicodeSet set(UnicodeSet other)
Make this object represent the same set as other.
other
- a UnicodeSet whose value will be copied to this object
set
public UnicodeSet set(int start,
int end)
Make this object represent the range start - end.
If start > end then this object is set to an empty range.
start
- first character in the set, inclusive
end
- last character in the set, inclusive
size
public int size()
Returns the number of elements in this set (its cardinality).
Note that the elements of a set may include both individual
code points and strings.
- the number of elements in this set (its cardinality).
toPattern
public String toPattern(boolean escapeUnprintable)
Returns a string representation of this set. If the result of
calling this function is passed to a UnicodeSet constructor, it
will produce another set that is equal to this one.
- toPattern in interface UnicodeMatcher
toString
public String toString()
Return a programmer-readable string representation of this object.