1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
|
/**
Original pseudocode : Thomas Weidenfeller
Implementation tweaked: Aki Nieminen
http://www.unicode.org/unicode/faq/utf_bom.html
BOMs:
00 00 FE FF = UTF-32, big-endian
FF FE 00 00 = UTF-32, little-endian
FE FF = UTF-16, big-endian
FF FE = UTF-16, little-endian
EF BB BF = UTF-8
Win2k Notepad:
Unicode format = UTF-16LE
***/
import java.io.*;
/**
* Generic unicode textreader, which will use BOM mark
* to identify the encoding to be used.
*/
public class UnicodeReader extends Reader {
PushbackInputStream internalIn;
InputStreamReader internalIn2 = null;
String defaultEnc;
private static final int BOM_SIZE = 4;
/*
Default encoding is used only if BOM is not found. If
defaultEncoding is NULL then systemdefault is used.
*/
UnicodeReader(InputStream in, String defaultEnc) {
internalIn = new PushbackInputStream(in,
BOM_SIZE);
this.defaultEnc = defaultEnc;
}
public String getDefaultEncoding() {
return defaultEnc;
}
public String getEncoding() {
if (internalIn2 == null) return null;
return internalIn2.getEncoding();
}
/**
* Read-ahead four bytes and check for BOM marks. Extra
bytes are
* unread back to the stream, only BOM bytes are skipped.
*/
protected void init() throws IOException {
if (internalIn2 != null) return;
String encoding;
byte bom[] = new byte[BOM_SIZE];
int n, unread;
n = internalIn.read(bom, 0, bom.length);
if ( (bom[0] == (byte)0xEF) && (bom[1] == (byte)0xBB)
&&
(bom[2] == (byte)0xBF) ) {
encoding = "UTF-8";
unread = n - 3;
} else if ( (bom[0] == (byte)0xFE) && (bom[1] == (byte)
0xFF) ) {
encoding = "UTF-16BE";
unread = n - 2;
} else if ( (bom[0] == (byte)0xFF) && (bom[1] == (byte)
0xFE) ) {
encoding = "UTF-16LE";
unread = n - 2;
} else if ( (bom[0] == (byte)0x00) && (bom[1] == (byte)
0x00) &&
(bom[2] == (byte)0xFE) && (bom[3] == (byte)
0xFF)) {
encoding = "UTF-32BE";
unread = n - 4;
} else if ( (bom[0] == (byte)0xFF) && (bom[1] == (byte)
0xFE) &&
(bom[2] == (byte)0x00) && (bom[3] == (byte)
0x00)) {
encoding = "UTF-32LE";
unread = n - 4;
} else {
// Unicode BOM mark not found, unread all bytes
encoding = defaultEnc;
unread = n;
}
// System.out.println("read=" + n + ", unread=" + unread);
if (unread > 0) internalIn.unread(bom, (n - unread),
unread);
else if (unread < -1) internalIn.unread(bom, 0, 0);
// Use given encoding
if (encoding == null) {
internalIn2 = new InputStreamReader(internalIn);
} else {
internalIn2 = new InputStreamReader(internalIn,
encoding);
}
}
public void close() throws IOException {
init();
internalIn2.close();
}
public int read(char[] cbuf, int off, int len) throws
IOException {
init();
return internalIn2.read(cbuf, off, len);
}
} |
Partager