Class PercentEscaper

java.lang.Object
com.google.gdata.util.common.base.UnicodeEscaper
com.google.gdata.util.common.base.PercentEscaper
All Implemented Interfaces:
Escaper

public class PercentEscaper extends UnicodeEscaper
A UnicodeEscaper that escapes some set of Java characters using the URI percent encoding scheme. The set of safe characters (those which remain unescaped) can be specified on construction.

For details on escaping URIs for use in web pages, see section 2.4 of RFC 3986.

In most cases this class should not need to be used directly. If you have no special requirements for escaping your URIs, you should use either CharEscapers#uriEscaper() or CharEscapers#uriEscaper(boolean).

When encoding a String, the following rules apply:

  • The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.
  • Any additionally specified safe characters remain the same.
  • If plusForSpace was specified, the space character " " is converted into a plus sign "+".
  • All other characters are converted into one or more bytes using UTF-8 encoding and each byte is then represented by the 3-character string "%XY", where "XY" is the two-digit, uppercase, hexadecimal representation of the byte value.
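
The rules above can be sketched with JDK classes only. This is an illustrative approximation, not the class's actual implementation: the class name PercentEncodingSketch and the helper encode are hypothetical, and unlike the real escaper this sketch does not validate surrogate pairs.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch of the four encoding rules; not part of the library.
final class PercentEncodingSketch {
    private static final char[] UPPER_HEX = "0123456789ABCDEF".toCharArray();

    static String encode(String input, String extraSafeChars, boolean plusForSpace) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (isAlphanumeric(cp) || extraSafeChars.indexOf(cp) >= 0) {
                // Rules 1 and 2: alphanumerics and extra safe characters pass through.
                out.appendCodePoint(cp);
            } else if (cp == ' ' && plusForSpace) {
                // Rule 3: space becomes '+' when plusForSpace is set.
                out.append('+');
            } else {
                // Rule 4: every UTF-8 byte becomes "%XY" with uppercase hex digits.
                byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    out.append('%').append(UPPER_HEX[(b >> 4) & 0xF]).append(UPPER_HEX[b & 0xF]);
                }
            }
        }
        return out.toString();
    }

    private static boolean isAlphanumeric(int cp) {
        return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z') || (cp >= '0' && cp <= '9');
    }

    public static void main(String[] args) {
        // prints "caf%C3%A9+au+lait"
        System.out.println(encode("caf\u00E9 au lait", "-_.", true));
    }
}
```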

RFC 2396 specifies the set of unreserved characters as "-", "_", ".", "!", "~", "*", "'", "(" and ")". It goes on to state:

Unreserved characters can be escaped without changing the semantics of the URI, but this should not be done unless the URI is being used in a context that does not allow the unescaped character to appear.

For performance reasons the only currently supported character encoding of this class is UTF-8.

Note: This escaper produces uppercase hexadecimal sequences. From RFC 3986:
"URI producers and normalizers should use uppercase hexadecimal digits for all percent-encodings."

  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    private final boolean
    plusForSpace
    If true we should convert space to the + character.
    static final String
    SAFECHARS_URLENCODER
    A string of safe characters that mimics the behavior of URLEncoder.
    private final boolean[]
    safeOctets
    An array of flags where for any char c if safeOctets[c] is true then c should remain unmodified in the output.
    static final String
    SAFEPATHCHARS_URLENCODER
    A string of characters that do not need to be encoded when used in URI path segments, as specified in RFC 3986.
    static final String
    SAFEQUERYSTRINGCHARS_URLENCODER
    A string of characters that do not need to be encoded when used in URI query strings, as specified in RFC 3986.
    private static final char[]
    UPPER_HEX_DIGITS
    private static final char[]
    URI_ESCAPED_SPACE
  • Constructor Summary

    Constructors
    Constructor
    Description
    PercentEscaper(String safeChars, boolean plusForSpace)
    Constructs a URI escaper with the specified safe characters and optional handling of the space character.
  • Method Summary

    Modifier and Type
    Method
    Description
    private static boolean[]
    createSafeOctets(String safeChars)
    Creates a boolean[] with entries corresponding to the character values for 0-9, A-Z, a-z and those specified in safeChars set to true.
    protected char[]
    escape(int cp)
    Escapes the given Unicode code point in UTF-8.
    String
    escape(String s)
    Returns the escaped form of a given literal string.
    protected int
    nextEscapeIndex(CharSequence csq, int index, int end)
    Scans a sub-sequence of characters from a given CharSequence, returning the index of the next character that requires escaping.

    Methods inherited from class com.google.gdata.util.common.base.UnicodeEscaper

    codePointAt, escape, escapeSlow

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • SAFECHARS_URLENCODER

      public static final String SAFECHARS_URLENCODER
      A string of safe characters that mimics the behavior of URLEncoder.
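
      For comparison, the behavior being mimicked is that of the JDK's java.net.URLEncoder, which leaves ".", "-", "*" and "_" unchanged, turns a space into "+", and percent-encodes everything else, including "~". The sketch below only demonstrates that JDK behavior; the class name UrlEncoderBehavior is hypothetical, and the Charset overload used here assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it exercises only the JDK's URLEncoder, not PercentEscaper.
final class UrlEncoderBehavior {
    static String encode(String s) {
        // URLEncoder.encode(String, Charset) is available from Java 10 onward.
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // prints "a+b-c_d.e*f%7Eg"
        System.out.println(encode("a b-c_d.e*f~g"));
    }
}
```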
    • SAFEPATHCHARS_URLENCODER

      public static final String SAFEPATHCHARS_URLENCODER
      A string of characters that do not need to be encoded when used in URI path segments, as specified in RFC 3986. Note that some of these characters do need to be escaped when used in other parts of the URI.
    • SAFEQUERYSTRINGCHARS_URLENCODER

      public static final String SAFEQUERYSTRINGCHARS_URLENCODER
      A string of characters that do not need to be encoded when used in URI query strings, as specified in RFC 3986. Note that some of these characters do need to be escaped when used in other parts of the URI.
    • URI_ESCAPED_SPACE

      private static final char[] URI_ESCAPED_SPACE
    • UPPER_HEX_DIGITS

      private static final char[] UPPER_HEX_DIGITS
    • plusForSpace

      private final boolean plusForSpace
      If true we should convert space to the + character.
    • safeOctets

      private final boolean[] safeOctets
      An array of flags where for any char c if safeOctets[c] is true then c should remain unmodified in the output. If c >= safeOctets.length then it should be escaped.
  • Constructor Details

    • PercentEscaper

      public PercentEscaper(String safeChars, boolean plusForSpace)
      Constructs a URI escaper with the specified safe characters and optional handling of the space character.
      Parameters:
      safeChars - a non-null string specifying additional safe characters for this escaper (the ranges 0..9, a..z and A..Z are always safe and should not be specified here)
      plusForSpace - true if ASCII space should be escaped to + rather than %20
      Throws:
      IllegalArgumentException - if any of the parameters were invalid
  • Method Details

    • createSafeOctets

      private static boolean[] createSafeOctets(String safeChars)
      Creates a boolean[] with entries corresponding to the character values for 0-9, A-Z, a-z and those specified in safeChars set to true. The array is as small as is required to hold the given character information.
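
      A minimal sketch of how such a lookup table might be built, assuming the array is sized to the largest safe character plus one. The class SafeOctetsSketch and the helper isSafe are hypothetical and are not the library's private code.

```java
// Hypothetical sketch of a compact safe-character lookup table.
final class SafeOctetsSketch {
    private static final String ALPHANUMERICS =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    static boolean[] createSafeOctets(String safeChars) {
        String all = ALPHANUMERICS + safeChars;
        // Find the largest character value so the array is no bigger than needed.
        int max = 0;
        for (int i = 0; i < all.length(); i++) {
            max = Math.max(max, all.charAt(i));
        }
        boolean[] octets = new boolean[max + 1];
        for (int i = 0; i < all.length(); i++) {
            octets[all.charAt(i)] = true;
        }
        return octets;
    }

    // Characters at or beyond the end of the array are never safe.
    static boolean isSafe(boolean[] safeOctets, char c) {
        return c < safeOctets.length && safeOctets[c];
    }
}
```

      With no additional safe characters the largest entry is 'z' (122), so the array in this sketch has 123 elements; adding '~' (126) grows it to 127.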
    • nextEscapeIndex

      protected int nextEscapeIndex(CharSequence csq, int index, int end)
      Description copied from class: UnicodeEscaper
      Scans a sub-sequence of characters from a given CharSequence, returning the index of the next character that requires escaping.

      Note: When implementing an escaper, it is a good idea to override this method for efficiency. The base class implementation determines successive Unicode code points and invokes UnicodeEscaper.escape(int) for each of them. If the semantics of your escaper are such that code points in the supplementary range are either all escaped or all unescaped, this method can be implemented more efficiently using CharSequence.charAt(int).

      Note however that if your escaper does not escape characters in the supplementary range, you should either continue to validate the correctness of any surrogate characters encountered or provide a clear warning to users that your escaper does not validate its input.

      See PercentEscaper for an example.

      Overrides:
      nextEscapeIndex in class UnicodeEscaper
      Parameters:
      csq - a sequence of characters
      index - the index of the first character to be scanned
      end - the index immediately after the last character to be scanned
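
      The charAt-based scan described in the note can be sketched as follows. This is an illustration under assumptions: the class NextEscapeIndexSketch, the extra safeOctets parameter and the helper asciiAlphanumerics are hypothetical. The shortcut is valid here because surrogate chars (0xD800 and above) lie beyond the end of the table, so supplementary code points are always treated as needing escaping.

```java
// Hypothetical sketch of scanning for the next character that needs escaping.
final class NextEscapeIndexSketch {
    static int nextEscapeIndex(CharSequence csq, int index, int end, boolean[] safeOctets) {
        for (; index < end; index++) {
            char c = csq.charAt(index);
            // Stop at the first char that is outside the table or not flagged safe.
            if (c >= safeOctets.length || !safeOctets[c]) {
                break;
            }
        }
        return index;
    }

    // Builds a table in which only 0-9, A-Z and a-z are safe.
    static boolean[] asciiAlphanumerics() {
        boolean[] safe = new boolean['z' + 1];
        for (char c = '0'; c <= '9'; c++) safe[c] = true;
        for (char c = 'A'; c <= 'Z'; c++) safe[c] = true;
        for (char c = 'a'; c <= 'z'; c++) safe[c] = true;
        return safe;
    }
}
```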
    • escape

      public String escape(String s)
      Description copied from class: UnicodeEscaper
      Returns the escaped form of a given literal string.

      If you are escaping input in arbitrary successive chunks, then it is not generally safe to use this method. If an input string ends with an unmatched high surrogate character, then this method will throw IllegalArgumentException. You should either ensure your input is valid UTF-16 before calling this method or use an escaped Appendable (as returned by UnicodeEscaper.escape(Appendable)) which can cope with arbitrarily split input.

      Note: When implementing an escaper it is a good idea to override this method for efficiency by inlining the implementation of UnicodeEscaper.nextEscapeIndex(CharSequence, int, int) directly. Doing this for PercentEscaper more than doubled the performance for unescaped strings (as measured by CharEscapersBenchmark).
      Specified by:
      escape in interface Escaper
      Overrides:
      escape in class UnicodeEscaper
      Parameters:
      s - the literal string to be escaped
      Returns:
      the escaped form of string
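
      The chunking hazard described above comes from splitting a string between the two halves of a surrogate pair. The following JDK-only sketch shows how to detect that case and pick a split point that avoids it; the class SurrogateSplitSketch and both helpers are hypothetical and do not call the escaper itself.

```java
// Hypothetical helpers for splitting UTF-16 input without breaking surrogate pairs.
final class SurrogateSplitSketch {
    // True if the chunk ends with a high surrogate that has no low surrogate after it.
    static boolean endsWithUnmatchedHighSurrogate(String chunk) {
        return !chunk.isEmpty() && Character.isHighSurrogate(chunk.charAt(chunk.length() - 1));
    }

    // Returns a split index no greater than limit that does not separate a surrogate pair.
    static int safeSplitPoint(String s, int limit) {
        int end = Math.min(limit, s.length());
        if (end > 0 && end < s.length()
                && Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            end--; // back up so the whole pair goes into the next chunk
        }
        return end;
    }
}
```

      For the four-char string "a\uD83D\uDE00b", a naive split at index 2 leaves a chunk ending in an unmatched high surrogate, while safeSplitPoint in this sketch backs up to index 1.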
    • escape

      protected char[] escape(int cp)
      Escapes the given Unicode code point in UTF-8.
      Specified by:
      escape in class UnicodeEscaper
      Parameters:
      cp - the Unicode code point to escape if necessary
      Returns:
      the replacement characters, or null if no escaping was needed
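
      A sketch of what escaping a single code point involves, using explicit UTF-8 bit arithmetic. It is an approximation for illustration: the class EscapeCodePointSketch and the extra safeChars and plusForSpace parameters are hypothetical (the real method reads that state from the instance), and invalid code points are not rejected here.

```java
// Hypothetical sketch of escaping one code point as UTF-8 percent-encoded bytes.
final class EscapeCodePointSketch {
    private static final char[] UPPER_HEX = "0123456789ABCDEF".toCharArray();

    static char[] escape(int cp, String safeChars, boolean plusForSpace) {
        boolean alphanumeric = (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z')
                || (cp >= '0' && cp <= '9');
        if (alphanumeric || safeChars.indexOf(cp) >= 0) {
            return null; // no escaping needed
        }
        if (cp == ' ' && plusForSpace) {
            return new char[] {'+'};
        }
        int[] bytes;
        if (cp <= 0x7F) {
            bytes = new int[] {cp};
        } else if (cp <= 0x7FF) {
            bytes = new int[] {0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)};
        } else if (cp <= 0xFFFF) {
            bytes = new int[] {0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)};
        } else {
            bytes = new int[] {0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                    0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)};
        }
        // Each byte becomes '%' followed by two uppercase hex digits.
        char[] out = new char[bytes.length * 3];
        for (int i = 0; i < bytes.length; i++) {
            out[3 * i] = '%';
            out[3 * i + 1] = UPPER_HEX[bytes[i] >> 4];
            out[3 * i + 2] = UPPER_HEX[bytes[i] & 0xF];
        }
        return out;
    }
}
```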