com.ibm.icu.text

Class UnicodeCompressor

Implemented Interfaces:
com.ibm.icu.text.SCSU

public final class UnicodeCompressor
extends Object
implements com.ibm.icu.text.SCSU

A compression engine implementing the Standard Compression Scheme for Unicode (SCSU) as outlined in Unicode Technical Report #6.

The SCSU works by using dynamically positioned windows, each covering 128 consecutive Unicode characters. During compression, characters within a window are encoded in the compressed stream as the bytes 0x80 - 0xFF. The SCSU provides transparency for the characters (bytes) between U+0000 - U+00FF. The SCSU approximates the storage size of traditional character sets, for example 1 byte per character for ASCII or Latin-1 text, and 2 bytes per character for CJK ideographs.
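The window arithmetic described above can be illustrated with a short, self-contained sketch. It is not the ICU4J implementation: the class and method names are hypothetical, and it assumes the convention from Unicode Technical Report #6 that a character inside the active dynamic window is written as the single byte 0x80 plus its distance from the start of the window.

```java
public class ScsuWindowSketch {

    /** Number of consecutive characters covered by one dynamic window. */
    public static final int WINDOW_SIZE = 128;

    /**
     * Returns the byte value (0x80..0xFF) that represents c when the active
     * window starts at windowStart, or -1 if c lies outside that window.
     * Hypothetical helper for illustration only.
     */
    public static int encodeInWindow(char c, int windowStart) {
        int delta = c - windowStart;
        if (delta < 0 || delta >= WINDOW_SIZE) {
            return -1; // a real encoder would switch, define, or quote a window here
        }
        return 0x80 + delta;
    }

    /** Inverse of encodeInWindow for a byte value in 0x80..0xFF. */
    public static char decodeFromWindow(int byteValue, int windowStart) {
        return (char) (windowStart + (byteValue - 0x80));
    }

    public static void main(String[] args) {
        int cyrillicWindow = 0x0400; // assumed window start covering Cyrillic
        // U+044F is 0x4F past the window start, so it maps to 0x80 + 0x4F
        System.out.printf("0x%02X%n", encodeInWindow('\u044F', cyrillicWindow)); // prints 0xCF
    }
}
```

Because every character in the window costs one byte, text that stays within a single 128-character block compresses to roughly one byte per character, which is the storage behaviour the paragraph above describes.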

USAGE

The static methods on UnicodeCompressor may be used in a straightforward manner to compress simple strings:

  String s = ... ; // get string from somewhere
  byte [] compressed = UnicodeCompressor.compress(s);
 

The static methods have a fairly large memory footprint. For finer-grained control over memory usage, UnicodeCompressor offers more powerful APIs allowing iterative compression:

  // Compress an array "chars" of length "len" using a buffer of 512 bytes
  // to the OutputStream "out"

  UnicodeCompressor myCompressor         = new UnicodeCompressor();
  final int         BUFSIZE              = 512;
  byte []           byteBuffer           = new byte [ BUFSIZE ];
  int               bytesWritten         = 0;
  int []            unicharsRead         = new int [1];
  int               totalCharsCompressed = 0;
  int               totalBytesWritten    = 0;

  do {
    // do the compression
    bytesWritten = myCompressor.compress(chars, totalCharsCompressed, 
                                         len, unicharsRead,
                                         byteBuffer, 0, BUFSIZE);

    // do something with the current set of bytes
    out.write(byteBuffer, 0, bytesWritten);

    // update the no. of characters compressed
    totalCharsCompressed += unicharsRead[0];

    // update the no. of bytes written
    totalBytesWritten += bytesWritten;

  } while(totalCharsCompressed < len);

  myCompressor.reset(); // reuse compressor
 
Author:
Stephen F. Booth
See Also:
UnicodeDecompressor

Fields inherited from interface com.ibm.icu.text.SCSU

ARMENIANINDEX, COMPRESSIONOFFSET, GREEKINDEX, HALFWIDTHKATAKANAINDEX, HIRAGANAINDEX, INVALIDCHAR, INVALIDWINDOW, IPAEXTENSIONINDEX, KATAKANAINDEX, LATININDEX, MAXINDEX, NUMSTATICWINDOWS, NUMWINDOWS, RESERVEDINDEX, SCHANGE0, SCHANGE1, SCHANGE2, SCHANGE3, SCHANGE4, SCHANGE5, SCHANGE6, SCHANGE7, SCHANGEU, SDEFINE0, SDEFINE1, SDEFINE2, SDEFINE3, SDEFINE4, SDEFINE5, SDEFINE6, SDEFINE7, SDEFINEX, SINGLEBYTEMODE, SQUOTE0, SQUOTE1, SQUOTE2, SQUOTE3, SQUOTE4, SQUOTE5, SQUOTE6, SQUOTE7, SQUOTEU, SRESERVED, UCHANGE0, UCHANGE1, UCHANGE2, UCHANGE3, UCHANGE4, UCHANGE5, UCHANGE6, UCHANGE7, UDEFINE0, UDEFINE1, UDEFINE2, UDEFINE3, UDEFINE4, UDEFINE5, UDEFINE6, UDEFINE7, UDEFINEX, UNICODEMODE, UQUOTEU, URESERVED, sOffsetTable, sOffsets

Constructor Summary

UnicodeCompressor()
Create a UnicodeCompressor.

Method Summary

static byte[]
compress(String buffer)
Compress a string into a byte array.
static byte[]
compress(char[] buffer, int start, int limit)
Compress a Unicode character array into a byte array.
int
compress(char[] charBuffer, int charBufferStart, int charBufferLimit, int[] charsRead, byte[] byteBuffer, int byteBufferStart, int byteBufferLimit)
Compress a Unicode character array into a byte array.
void
reset()
Reset the compressor to its initial state.

Constructor Details

UnicodeCompressor

public UnicodeCompressor()
Create a UnicodeCompressor. Sets all windows to their default values.

Method Details

compress

public static byte[] compress(String buffer)
Compress a string into a byte array.
Parameters:
buffer - The string to compress.
Returns:
A byte array containing the compressed characters.
See Also:
compress(char [], int, int)

compress

public static byte[] compress(char[] buffer,
                              int start,
                              int limit)
Compress a Unicode character array into a byte array.
Parameters:
buffer - The character buffer to compress.
start - The start of the character run to compress.
limit - The limit of the character run to compress.
Returns:
A byte array containing the compressed characters.

compress

public int compress(char[] charBuffer,
                    int charBufferStart,
                    int charBufferLimit,
                    int[] charsRead,
                    byte[] byteBuffer,
                    int byteBufferStart,
                    int byteBufferLimit)
Compress a Unicode character array into a byte array. This function will only consume input that can be completely output.
Parameters:
charBuffer - The character buffer to compress.
charBufferStart - The start of the character run to compress.
charBufferLimit - The limit of the character run to compress.
charsRead - A one-element array. If not null, on return its first element holds the number of characters read from charBuffer.
byteBuffer - A buffer to receive the compressed data. This buffer must be at least four bytes in size.
byteBufferStart - The starting offset to which to write compressed data.
byteBufferLimit - The limiting offset for writing compressed data.
Returns:
The number of bytes written to byteBuffer.
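
The calling contract of this method (consume only what fits, report the count through charsRead, return the bytes written) can be exercised without ICU4J on the classpath by using a stand-in. In the sketch below, the ChunkCompressor interface, the LOW_BYTE implementation, and the drain helper are all hypothetical names introduced for illustration. LOW_BYTE simply copies the low byte of each character and is not SCSU; only the parameter order mirrors the signature documented above.

```java
import java.io.ByteArrayOutputStream;

public class ChunkedCompressLoopSketch {

    /** Hypothetical stand-in with the same parameter order as the iterative compress method. */
    public interface ChunkCompressor {
        int compress(char[] charBuffer, int charBufferStart, int charBufferLimit,
                     int[] charsRead,
                     byte[] byteBuffer, int byteBufferStart, int byteBufferLimit);
    }

    /** Toy implementation (NOT SCSU): writes the low byte of each char until the output is full. */
    public static final ChunkCompressor LOW_BYTE =
        (chars, start, limit, charsRead, bytes, byteStart, byteLimit) -> {
            int in = start;
            int out = byteStart;
            while (in < limit && out < byteLimit) {
                bytes[out++] = (byte) chars[in++];
            }
            if (charsRead != null) {
                charsRead[0] = in - start; // characters consumed in this call
            }
            return out - byteStart;        // bytes produced in this call
        };

    /** Repeatedly calls compress with a small buffer until len characters are consumed. */
    public static byte[] drain(ChunkCompressor compressor, char[] chars, int len, int bufSize) {
        byte[] buffer = new byte[bufSize];
        int[] charsRead = new int[1];
        int consumed = 0;
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        while (consumed < len) {
            int written = compressor.compress(chars, consumed, len, charsRead,
                                              buffer, 0, bufSize);
            sink.write(buffer, 0, written); // hand off this chunk before reusing the buffer
            consumed += charsRead[0];       // resume from the first unconsumed character
        }
        return sink.toByteArray();
    }
}
```

Passing a real UnicodeCompressor in place of the stand-in would need a small adapter, since the class does not implement this hypothetical interface. The loop itself has the same shape as the one in the USAGE section.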

reset

public void reset()
Reset the compressor to its initial state.

Copyright (c) 2006 IBM Corporation and others.