A compression engine implementing the Standard Compression Scheme
for Unicode (SCSU) as outlined in Unicode Technical Report #6.
SCSU works by using dynamically positioned windows, each consisting of
128 consecutive Unicode characters. During compression, characters within
a window are encoded in the compressed stream as the bytes 0x80 - 0xFF.
SCSU provides transparency for the characters (bytes) in the range
U+0000 - U+00FF. SCSU approximates the storage size of traditional
character sets, for example 1 byte per character for ASCII or Latin-1
text, and 2 bytes per character for CJK ideographs.
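
To make the window idea concrete, here is a minimal sketch of how a single character inside a window could map to one byte. It assumes the 128 window positions map onto the 128 byte values 0x80 to 0xFF, as UTR #6 describes for single-byte mode. The class and method names (WindowSketch, inWindow, encodeInWindow, decodeInWindow) are hypothetical and are not part of UnicodeCompressor; the real engine also handles window selection, quoting, and Unicode mode, none of which is shown here.

```java
// Illustrative only: a toy model of one SCSU-style dynamic window.
// None of these names exist in UnicodeCompressor.
final class WindowSketch {
    /** Number of consecutive characters covered by one window. */
    static final int WINDOW_SIZE = 128;

    /** True if codePoint lies in the window that starts at windowStart. */
    static boolean inWindow(int codePoint, int windowStart) {
        return codePoint >= windowStart && codePoint < windowStart + WINDOW_SIZE;
    }

    /** Maps a character in the window to a byte value in 0x80..0xFF. */
    static int encodeInWindow(int codePoint, int windowStart) {
        if (!inWindow(codePoint, windowStart)) {
            throw new IllegalArgumentException("code point is outside the window");
        }
        return 0x80 + (codePoint - windowStart);
    }

    /** Inverse of encodeInWindow: maps a byte value in 0x80..0xFF back to a character. */
    static int decodeInWindow(int byteValue, int windowStart) {
        if (byteValue < 0x80 || byteValue > 0xFF) {
            throw new IllegalArgumentException("byte value is not a window byte");
        }
        return windowStart + (byteValue - 0x80);
    }
}
```

With a window positioned at U+0400, for example, WindowSketch.encodeInWindow(0x0414, 0x0400) yields 0x94, so a run of Cyrillic text costs one byte per character once the window is in place.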
USAGE
The static methods on UnicodeCompressor may be used in a
straightforward manner to compress simple strings:

    String s = ... ; // get string from somewhere
    byte [] compressed = UnicodeCompressor.compress(s);
The static methods have a fairly large memory footprint.
For finer-grained control over memory usage, UnicodeCompressor
offers more powerful APIs allowing iterative compression:

    // Compress an array "chars" of length "len" using a buffer of 512 bytes
    // to the OutputStream "out"
    UnicodeCompressor myCompressor = new UnicodeCompressor();
    final int BUFSIZE = 512;
    byte [] byteBuffer = new byte [ BUFSIZE ];
    int bytesWritten = 0;
    int [] unicharsRead = new int [1];
    int totalCharsCompressed = 0;
    int totalBytesWritten = 0;

    do {
        // do the compression
        bytesWritten = myCompressor.compress(chars, totalCharsCompressed,
                                             len, unicharsRead,
                                             byteBuffer, 0, BUFSIZE);

        // do something with the current set of bytes
        out.write(byteBuffer, 0, bytesWritten);

        // update the no. of characters compressed
        totalCharsCompressed += unicharsRead[0];

        // update the no. of bytes written
        totalBytesWritten += bytesWritten;

    } while(totalCharsCompressed < len);

    myCompressor.reset(); // reuse compressor

compress
public static byte[] compress(String buffer)
Compress a string into a byte array.
Parameters:
    buffer - The string to compress.
Returns:
    A byte array containing the compressed characters.
See Also:
    compress(char [], int, int)
compress
public static byte[] compress(char[] buffer,
int start,
int limit)
Compress a Unicode character array into a byte array.
Parameters:
    buffer - The character buffer to compress.
    start - The start of the character run to compress.
    limit - The limit of the character run to compress.
Returns:
    A byte array containing the compressed characters.
compress
public int compress(char[] charBuffer,
int charBufferStart,
int charBufferLimit,
int[] charsRead,
byte[] byteBuffer,
int byteBufferStart,
int byteBufferLimit)
Compress a Unicode character array into a byte array.
This function will only consume input that can be completely
output.
Parameters:
    charBuffer - The character buffer to compress.
    charBufferStart - The start of the character run to compress.
    charBufferLimit - The limit of the character run to compress.
    charsRead - A one-element array. If not null, on return it holds
        the number of characters read from charBuffer.
    byteBuffer - A buffer to receive the compressed data. This
        buffer must be at least four bytes in size.
    byteBufferStart - The starting offset to which to write
        compressed data.
    byteBufferLimit - The limiting offset for writing compressed data.
Returns:
    The number of bytes written to byteBuffer.
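
Because this method consumes only input that can be completely output, a caller's loop depends on every call making some progress. The following is a minimal sketch of a defensive driver loop that stops if a call neither reads characters nor writes bytes. ChunkEncoder is a hypothetical interface that merely mirrors the signature above so the sketch is self-contained, and LOW_BYTE_COPY is a stand-in that does not perform SCSU compression. Whether UnicodeCompressor can ever report zero progress with a buffer of four or more bytes is not stated here, so the guard is an assumption-driven precaution.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

final class ChunkedDriverSketch {
    /** Hypothetical interface mirroring the seven-argument compress signature. */
    interface ChunkEncoder {
        int compress(char[] charBuffer, int charBufferStart, int charBufferLimit,
                     int[] charsRead,
                     byte[] byteBuffer, int byteBufferStart, int byteBufferLimit);
    }

    /** Stand-in for illustration: copies the low 8 bits of each char. NOT SCSU. */
    static final ChunkEncoder LOW_BYTE_COPY =
        (chars, start, limit, read, bytes, bStart, bLimit) -> {
            int n = Math.min(limit - start, bLimit - bStart);
            for (int i = 0; i < n; i++) {
                bytes[bStart + i] = (byte) chars[start + i];
            }
            if (read != null) {
                read[0] = n;
            }
            return n;
        };

    /** Feeds chars[0..len) through enc in bufSize-byte chunks; returns total bytes written. */
    static int drain(ChunkEncoder enc, char[] chars, int len,
                     OutputStream out, int bufSize) {
        byte[] byteBuffer = new byte[bufSize];
        int[] charsRead = new int[1];
        int consumed = 0;
        int totalBytes = 0;
        while (consumed < len) {
            int written = enc.compress(chars, consumed, len, charsRead,
                                       byteBuffer, 0, bufSize);
            // Guard: without progress the loop would never terminate.
            if (written == 0 && charsRead[0] == 0) {
                throw new IllegalStateException(
                    "no progress; the output buffer may be too small");
            }
            try {
                out.write(byteBuffer, 0, written);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            consumed += charsRead[0];
            totalBytes += written;
        }
        return totalBytes;
    }
}
```

With the real class, a lambda or method reference that forwards to a UnicodeCompressor instance's compress method could presumably be passed in place of LOW_BYTE_COPY.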
reset
public void reset()
Reset the compressor to its initial state.