[FileInputStream] Problem with UTF-8

05/12/2005, 11h11
Guest

[FileInputStream] Problem with UTF-8
Hello everyone,

I have a problem reading a txt file that contains XML.
The file was probably not created by a Java program.
When I open the file in an editor (Notepad or UE), it displays correctly.
But when I read it with a FileInputStream, what I receive looks different.
For example, here is the beginning of the file:
Code:

<?xml version="1.0" encoding="UTF-16"?><PDE><Statusdate>2004-12-30T00:00:00+01:00</Statusdate>
and this is what I get when I read it into my byte[]:
Code:

[-1, -2, 60, 0, 63, 0, 120, 0, 109, 0, 108, 0, 32, 0, 118, 0, 101, 0, 114, 0, 115, 0, 105, 0, 111, 0, 110,...
Notice that a 0 byte is inserted after each character...

I can't even copy/paste the String here, because it contains 'squares' between the characters (those 0 bytes)!
What is this strange phenomenon, and how can I get the file's contents into a String as they appear in an ordinary editor?
Help
Thanks
septentryon
05/12/2005, 12h20
Gfx

Java encodes characters on 16 bits, so with a byte[] array you are bound to see strange things. Can you show the code you use to read the file? Try using Readers rather than InputStreams. Streams are meant for "binary" data and Readers for characters.
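
To illustrate the Reader suggestion, here is a minimal sketch that pushes the bytes quoted in the question through an InputStreamReader created with an explicit UTF-16 charset. The class and method names are made up for this example, and the sketch assumes the file really is UTF-16 with a byte-order mark, as its first two bytes (-1, -2, i.e. FF FE) suggest.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original thread.
final class Utf16ReaderSketch {

    // Decodes the bytes through a Reader using the UTF-16 charset. That charset
    // reads the byte-order mark (FF FE or FE FF) to pick the byte order and
    // does not return the mark as part of the text.
    static String decodeUtf16(byte[] data) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_16)) {
            char[] buffer = new char[1024];
            int count;
            // Read whole blocks of characters instead of one byte at a time.
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // First bytes of the file as quoted in the question: BOM, then "<?xml".
        byte[] sample = {-1, -2, 60, 0, 63, 0, 120, 0, 109, 0, 108, 0};
        System.out.println(decodeUtf16(sample)); // prints "<?xml"
    }
}
```

For a real file, the ByteArrayInputStream would be replaced by a FileInputStream on the file's path; the rest should stay the same.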

Here is the method ...

Here is the method I use.
The only way I have found is to read the file byte by byte and ignore the 0 bytes.
But of course reading is extremely slow! I have roughly 350 files to read.
So I need another way to read the file directly in the right format.
The boolean begin is only there to skip the first negative bytes, which have nothing to do with the text.

Code:

	private static PDEHolder getXMLFromDirectory(String path) throws Exception {
		String fileContent = "";
		PDEHolder pdeHolder = null;

		try {
			XStream xstream = null;

			try {
				FileInputStream fis = new FileInputStream(path);
				boolean begin = false;
				byte[] b = new byte[1];
				// Read one byte at a time (this is what makes it so slow)
				while (true) {
					int result = fis.read(b);
					if (result == -1) break;
					String s = new String(b, 0, result);

					// Skip the leading negative bytes (the byte-order mark)
					if (!begin && b[0] > 0) { begin = true; }
					// Drop the 0 bytes and keep everything else
					if (begin && b[0] != 0) {
						fileContent += s;
					}
				} // end while
			} catch (FileNotFoundException ex) {
				System.err.println("Could not find file " + path);
			} catch (IOException ex) {
				System.err.println(ex);
			}
			System.out.println();

			xstream = XmlServices.getXStreamInstance();
			pdeHolder = (PDEHolder) xstream.fromXML(fileContent);
		} catch (Throwable t) {
			t.printStackTrace();
		}

		return pdeHolder;
	}

}

Thanks in advance for the help
septentryon
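
Besides being slow, dropping the 0 bytes only works while every character fits in a single byte. The sketch below is a simplified imitation of that workaround (it is not the poster's exact method, and the class and method names are invented here); it shows the approach breaking on a euro sign, whose UTF-16LE bytes are AC 20.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original thread.
final class ZeroSkippingSketch {

    // Simplified imitation of the workaround: keep each non-zero byte
    // as one character and throw the zero bytes away.
    static String dropZeroBytes(byte[] data) {
        StringBuilder text = new StringBuilder();
        for (byte b : data) {
            if (b != 0) {
                text.append((char) (b & 0xFF));
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String original = "5\u20AC"; // the digit 5 followed by a euro sign
        // UTF-16LE without a byte-order mark: 35 00 AC 20
        byte[] data = original.getBytes(StandardCharsets.UTF_16LE);

        // The euro sign has no zero byte, so it comes out as two wrong characters.
        System.out.println(dropZeroBytes(data).equals(original)); // prints "false"

        // Decoding with the charset that produced the bytes restores the text.
        String decoded = new String(data, StandardCharsets.UTF_16LE);
        System.out.println(decoded.equals(original)); // prints "true"
    }
}
```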

05/12/2005, 13h06
Gfx

Use a Reader; what you are doing goes against the very way Java handles character strings internally.

OK

Here is my change:
Code:

try {
    FileReader reader = new FileReader(new java.io.File(path));
    char[] b = new char[70000];
    while (true) {
        int result = reader.read(b);
        if (result == -1) break;
        String s = new String(b, 0, result);
        fileContent+=s;
    }
But the problem is the same: the first 3 characters are unintelligible and it breaks at the first 'square' it encounters...
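
A likely explanation: a FileReader built this way decodes with the platform's default charset, so switching from a stream to a Reader changes nothing unless the charset is named explicitly. The sketch below assumes that default is a single-byte charset such as ISO-8859-1 (an assumption; the thread does not say which platform was used) and decodes the first bytes from the question with it. The class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original thread.
final class WrongCharsetSketch {

    // Decodes every byte as one character, which is what a single-byte
    // default charset does to UTF-16 data.
    static String decodeAsLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // BOM (FF FE) followed by "<?" in UTF-16LE, as quoted in the question.
        byte[] sample = {-1, -2, 60, 0, 63, 0};
        String text = decodeAsLatin1(sample);

        System.out.println(text.length());              // prints "6": one char per byte
        System.out.println((int) text.charAt(0));       // prints "255": the BOM shows up as junk
        System.out.println(text.charAt(2));             // prints "<"
        System.out.println(text.charAt(3) == '\u0000'); // prints "true": the 'square'
    }
}
```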

Try this:

Code:

    StringBuffer contents = new StringBuffer();
    BufferedReader input = new BufferedReader(new FileReader(aFile));
    try {
      String line = null;
      // Read line by line and restore the line separators
      while ((line = input.readLine()) != null) {
        contents.append(line);
        contents.append(System.getProperty("line.separator"));
      }
    } finally {
      input.close();
    }
    String text = contents.toString();

05/12/2005, 13h30
Guest

Sorry

Sorry, ... it's the same ...
05/12/2005, 13h44
Gfx

I have never seen this problem. But if your file is saved as UTF-8, have a look here: http://bugs.sun.com/bugdatabase/view_bug.do;:YfiG?bug_id=4508058
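
For the UTF-8 case, the behaviour in question is that Java's UTF-8 decoder keeps a leading byte-order mark (EF BB BF) in the decoded text as the character U+FEFF. The following is a small sketch of that behaviour and of one possible way to strip the mark afterwards; the class and method names are invented for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original thread.
final class Utf8BomSketch {

    static final char BOM = '\uFEFF';

    static String decodeUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    // Removes a single leading U+FEFF if the text starts with one.
    static String stripBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        // UTF-8 byte-order mark followed by "<a/>".
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        String raw = decodeUtf8(data);

        System.out.println(raw.length());         // prints "5": the BOM is still there
        System.out.println(raw.charAt(0) == BOM); // prints "true"
        System.out.println(stripBom(raw));        // prints "<a/>"
    }
}
```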

It is indeed a bug...

Great, thanks to the link I solved the problem.
Here is an excerpt from the page you pointed me to.
It may be useful to someone else ...

Workaround code, use this Reader for all unicode textstreams.
It will recognize all BOMs and will skip bytes accordingly.

Code:

/**
 Original pseudocode   : Thomas Weidenfeller
 Implementation tweaked: Aki Nieminen

 http://www.unicode.org/unicode/faq/utf_bom.html
 BOMs:
   00 00 FE FF    = UTF-32, big-endian
   FF FE 00 00    = UTF-32, little-endian
   FE FF          = UTF-16, big-endian
   FF FE          = UTF-16, little-endian
   EF BB BF       = UTF-8

 Win2k Notepad:
   Unicode format = UTF-16LE
***/

import java.io.*;

/**
 * Generic unicode textreader, which will use BOM mark
 * to identify the encoding to be used.
 */
public class UnicodeReader extends Reader {
   PushbackInputStream internalIn;
   InputStreamReader   internalIn2 = null;
   String              defaultEnc;

   private static final int BOM_SIZE = 4;

   /*
    * Default encoding is used only if BOM is not found. If
    * defaultEncoding is NULL then systemdefault is used.
    */
   UnicodeReader(InputStream in, String defaultEnc) {
      internalIn = new PushbackInputStream(in, BOM_SIZE);
      this.defaultEnc = defaultEnc;
   }

   public String getDefaultEncoding() {
      return defaultEnc;
   }

   public String getEncoding() {
      if (internalIn2 == null) return null;
      return internalIn2.getEncoding();
   }

   /**
    * Read-ahead four bytes and check for BOM marks. Extra bytes are
    * unread back to the stream, only BOM bytes are skipped.
    */
   protected void init() throws IOException {
      if (internalIn2 != null) return;

      String encoding;
      byte bom[] = new byte[BOM_SIZE];
      int n, unread;
      n = internalIn.read(bom, 0, bom.length);

      // The UTF-32 tests come before the UTF-16 ones: FF FE 00 00 (UTF-32LE)
      // begins with FF FE and would otherwise be taken for UTF-16LE.
      // They also require four bytes to have been read, so that a short
      // stream padded with zeros in bom[] is not mistaken for UTF-32.
      if ( (n >= 4) && (bom[0] == (byte)0x00) && (bom[1] == (byte)0x00) &&
           (bom[2] == (byte)0xFE) && (bom[3] == (byte)0xFF) ) {
         encoding = "UTF-32BE";
         unread = n - 4;
      } else if ( (n >= 4) && (bom[0] == (byte)0xFF) && (bom[1] == (byte)0xFE) &&
                  (bom[2] == (byte)0x00) && (bom[3] == (byte)0x00) ) {
         encoding = "UTF-32LE";
         unread = n - 4;
      } else if ( (bom[0] == (byte)0xEF) && (bom[1] == (byte)0xBB) &&
                  (bom[2] == (byte)0xBF) ) {
         encoding = "UTF-8";
         unread = n - 3;
      } else if ( (bom[0] == (byte)0xFE) && (bom[1] == (byte)0xFF) ) {
         encoding = "UTF-16BE";
         unread = n - 2;
      } else if ( (bom[0] == (byte)0xFF) && (bom[1] == (byte)0xFE) ) {
         encoding = "UTF-16LE";
         unread = n - 2;
      } else {
         // Unicode BOM mark not found, unread all bytes
         encoding = defaultEnc;
         unread = n;
      }

      // Push back whatever was read beyond the BOM
      if (unread > 0) internalIn.unread(bom, (n - unread), unread);
      else if (unread < -1) internalIn.unread(bom, 0, 0);

      // Use given encoding
      if (encoding == null) {
         internalIn2 = new InputStreamReader(internalIn);
      } else {
         internalIn2 = new InputStreamReader(internalIn, encoding);
      }
   }

   public void close() throws IOException {
      init();
      internalIn2.close();
   }

   public int read(char[] cbuf, int off, int len) throws IOException {
      init();
      return internalIn2.read(cbuf, off, len);
   }

}

Thanks again for the help!
septentryon