[FileInputStream] Problem with UTF-8

Guest · 05/12/2005, 12h11

Hello everyone,

I have a problem reading a txt file that contains XML.
The file was probably not created by a Java program.
When I open the file in an editor (Notepad or UE), I can read it correctly.
But when I use a FileInputStream, the format I get is different.
For example, here is the beginning of the file:

Code :

<?xml version="1.0" encoding="UTF-16"?><PDE><Statusdate>2004-12-30T00:00:00+01:00</Statusdate>

And this is what I get when I read it into my byte[]:

Code :

[-1, -2, 60, 0, 63, 0, 120, 0, 109, 0, 108, 0, 32, 0, 118, 0, 101, 0, 114, 0, 115, 0, 105, 0, 111, 0, 110,...

Note that a 0 byte is inserted after each character...

I cannot even copy/paste the string here, because it contains 'squares' between each character (those famous 0s)!
What is this strange phenomenon, and how can I get the file's content into a String the way a regular editor shows it?
Help
Thanks
septentryon

**Gfx** · 05/12/2005, 13h20

Java encodes characters on 16 bits, so with a byte[] array you are bound to see strange things. Can you show the code you use to read the file? Try using Readers rather than InputStreams. Streams are meant for "binary" data and Readers for characters.
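
The leading -1, -2 in the dump are the bytes FF FE, the UTF-16 little-endian byte order mark (BOM), and each character after it takes two bytes, which is where the 0s come from. A minimal sketch, assuming the array below matches the start of the file; the class name `Utf16BytesDemo` and the trailing 0 that completes the last character are additions for illustration:

```java
import java.nio.charset.StandardCharsets;

class Utf16BytesDemo {
    // First bytes of the file as printed in the post. The final 0 is an
    // assumption, added to complete the two-byte unit for 'n'.
    static final byte[] HEAD = {
        -1, -2, 60, 0, 63, 0, 120, 0, 109, 0, 108, 0, 32, 0,
        118, 0, 101, 0, 114, 0, 115, 0, 105, 0, 111, 0, 110, 0
    };

    // Decodes with the "UTF-16" charset, which uses the BOM to pick the
    // byte order and does not return the BOM as a character.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        System.out.println(decode(HEAD)); // prints "<?xml version"
    }
}
```

Dropping the 0 bytes by hand only happens to work for characters below U+0100; decoding with the right charset handles every character.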

Guest · 05/12/2005, 14h03

Here is the method I use.
The only way I have found is to read the file byte by byte and ignore the 0s.
But of course reading is extremely slow! I have roughly 350 files to read.
So I need another way to read the file directly in the right format.
The boolean begin is only there to skip the first negative bytes, which have nothing to do with the text.

Code :

    private static PDEHolder getXMLFromDirectory(String path) throws Exception {
        RandomAccessFile fic;

        String fileContent = "";
        PDEHolder pdeHolder = null;

        //Properties props = new Properties();

        try {
            XStream xstream = null;

            try {
                FileInputStream fis = new FileInputStream(path);
                //byte[] b = new byte[70000];
                boolean begin = false;
                byte[] b = new byte[1];
                // One read() call and one String concatenation per byte:
                // this is what makes the loop so slow.
                while (true) {
                    int result = fis.read(b);
                    if (result == -1) break;
                    String s = new String(b, 0, result);

                    // Skip the leading negative bytes, then drop every 0 byte.
                    if ((!begin) && (b[0] > 0)) { begin = true; }
                    if (begin) {
                        if (b[0] != 0) {
                            fileContent += s;
                        }
                    }
                } // end while
            } // end try
            // Is this catch strictly necessary?
            catch (FileNotFoundException ex) {
                System.err.println("Could not find file " + path);
            }
            catch (IOException ex) {
                System.err.println(ex);
            }
            System.out.println();
            // end for

            xstream = XmlServices.getXStreamInstance();
            pdeHolder = (PDEHolder) xstream.fromXML(fileContent);
        } catch (Exception t) {
            t.printStackTrace();
        }
        catch (Throwable t) {
            t.printStackTrace();
        }

        return pdeHolder;
    }

}

Thanks in advance for the help
septentryon

**Gfx** · 05/12/2005, 14h06

Use a Reader; what you are doing there goes against the very way Java handles character strings internally.

Guest · 05/12/2005, 14h18

Here is my change:

Code :

try {
    FileReader reader = new FileReader(new java.io.File(path));
    char[] b = new char[70000];

    while (true) {
        int result = reader.read(b);
        if (result == -1) break;
        String s = new String(b, 0, result);
        fileContent += s;
    }

But the problem stays the same: the first 3 characters are incomprehensible, and it breaks at the first 'square' it encounters...
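
One likely reason the FileReader version changes nothing: `FileReader(File)` decodes bytes with the default charset, while the file is UTF-16. A sketch that names the charset explicitly instead and collects the text in a StringBuilder; `Utf16FileReaderDemo`, `readUtf16`, `demo` and the temporary sample file are hypothetical names used only for illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf16FileReaderDemo {
    // Reads a whole file as UTF-16 text, one buffer at a time.
    // The "UTF-16" charset consumes the BOM and uses it to pick the byte order.
    static String readUtf16(Path path) throws IOException {
        StringBuilder content = new StringBuilder();
        try (InputStream in = Files.newInputStream(path);
             Reader reader = new InputStreamReader(in, StandardCharsets.UTF_16)) {
            char[] buffer = new char[8192];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                content.append(buffer, 0, count);
            }
        }
        return content.toString();
    }

    // Hypothetical round trip: writes a small UTF-16LE file that starts
    // with a BOM, reads it back, and deletes it.
    static String demo() {
        try {
            Path sample = Files.createTempFile("pde", ".txt");
            try {
                Files.write(sample, "\uFEFF<PDE/>".getBytes(StandardCharsets.UTF_16LE));
                return readUtf16(sample);
            } finally {
                Files.delete(sample);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "<PDE/>"
    }
}
```

Reading through a buffer and appending to a StringBuilder also avoids the per-byte reads and repeated String concatenation, which matters with several hundred files.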

**Gfx** · 05/12/2005, 14h25

Try this:

Code :

    StringBuffer contents = new StringBuffer();
    BufferedReader input = new BufferedReader(new FileReader(aFile));
    try {
      String line = null;
      while ((line = input.readLine()) != null) {
        contents.append(line);
        contents.append(System.getProperty("line.separator"));
      }
    } finally {
      input.close();
    }
    String text = contents.toString();

Guest · 05/12/2005, 14h30

Sorry... it's the same...

**Gfx** · 05/12/2005, 14h44

I have never seen this problem. But if your file is saved as UTF-8, take a look here: http://bugs.sun.com/bugdatabase/view_bug.do;:YfiG?bug_id=4508058

Guest · 05/12/2005, 15h04

Great, thanks to the link I solved the problem.
Here is an excerpt from the page you pointed me to.
It may be useful to someone else...

Workaround code, use this Reader for all unicode textstreams.
It will recognize all BOMs and will skip bytes accordingly.

Code :

/**
 Original pseudocode   : Thomas Weidenfeller
 Implementation tweaked: Aki Nieminen

 http://www.unicode.org/unicode/faq/utf_bom.html
 BOMs:
   00 00 FE FF    = UTF-32, big-endian
   FF FE 00 00    = UTF-32, little-endian
   FE FF          = UTF-16, big-endian
   FF FE          = UTF-16, little-endian
   EF BB BF       = UTF-8

 Win2k Notepad:
   Unicode format = UTF-16LE
***/

import java.io.*;

/**
 * Generic unicode textreader, which will use BOM mark
 * to identify the encoding to be used.
 */
public class UnicodeReader extends Reader {
   PushbackInputStream internalIn;
   InputStreamReader   internalIn2 = null;
   String              defaultEnc;

   private static final int BOM_SIZE = 4;

   /*
    * Default encoding is used only if BOM is not found. If
    * defaultEncoding is NULL then systemdefault is used.
    */
   UnicodeReader(InputStream in, String defaultEnc) {
      internalIn = new PushbackInputStream(in, BOM_SIZE);
      this.defaultEnc = defaultEnc;
   }

   public String getDefaultEncoding() {
      return defaultEnc;
   }

   public String getEncoding() {
      if (internalIn2 == null) return null;
      return internalIn2.getEncoding();
   }

   /**
    * Read-ahead four bytes and check for BOM marks. Extra bytes are
    * unread back to the stream, only BOM bytes are skipped.
    */
   protected void init() throws IOException {
      if (internalIn2 != null) return;

      String encoding;
      byte bom[] = new byte[BOM_SIZE];
      int n, unread;
      n = internalIn.read(bom, 0, bom.length);

      // The UTF-32 BOMs are tested before the UTF-16 ones: FF FE 00 00
      // (UTF-32LE) starts with FF FE (UTF-16LE), so in the opposite order
      // the UTF-32LE branch could never be reached. The n == BOM_SIZE test
      // keeps the zero padding of a short read from looking like 00 00.
      if ( (bom[0] == (byte)0x00) && (bom[1] == (byte)0x00) &&
           (bom[2] == (byte)0xFE) && (bom[3] == (byte)0xFF) ) {
         encoding = "UTF-32BE";
         unread = n - 4;
      } else if ( (n == BOM_SIZE) &&
                  (bom[0] == (byte)0xFF) && (bom[1] == (byte)0xFE) &&
                  (bom[2] == (byte)0x00) && (bom[3] == (byte)0x00) ) {
         encoding = "UTF-32LE";
         unread = n - 4;
      } else if ( (bom[0] == (byte)0xEF) && (bom[1] == (byte)0xBB) &&
                  (bom[2] == (byte)0xBF) ) {
         encoding = "UTF-8";
         unread = n - 3;
      } else if ( (bom[0] == (byte)0xFE) && (bom[1] == (byte)0xFF) ) {
         encoding = "UTF-16BE";
         unread = n - 2;
      } else if ( (bom[0] == (byte)0xFF) && (bom[1] == (byte)0xFE) ) {
         encoding = "UTF-16LE";
         unread = n - 2;
      } else {
         // Unicode BOM mark not found, unread all bytes
         encoding = defaultEnc;
         unread = n;
      }
//      System.out.println("read=" + n + ", unread=" + unread);

      if (unread > 0) internalIn.unread(bom, (n - unread), unread);
      else if (unread < -1) internalIn.unread(bom, 0, 0);

      // Use given encoding
      if (encoding == null) {
         internalIn2 = new InputStreamReader(internalIn);
      } else {
         internalIn2 = new InputStreamReader(internalIn, encoding);
      }
   }

   public void close() throws IOException {
      init();
      internalIn2.close();
   }

   public int read(char[] cbuf, int off, int len) throws IOException {
      init();
      return internalIn2.read(cbuf, off, len);
   }

}

Thanks again for the help!
septentryon
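
Since the file holds XML that declares its own encoding, another option may be to hand the raw bytes to an XML parser and let it work out the encoding from the BOM and the declaration, with no intermediate String. The sketch below uses the JDK's DOM parser, not the XStream setup from the thread, whose behavior is not shown here; the class and method names, the sample document and its closing `</PDE>` tag are assumptions:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlStreamDemo {
    // Hypothetical sample modeled on the start of the file shown in the
    // first post, closed with </PDE> so that it is well-formed.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-16\"?><PDE>"
        + "<Statusdate>2004-12-30T00:00:00+01:00</Statusdate></PDE>";

    // Parses XML from raw bytes; the parser detects the encoding itself.
    static Document parse(byte[] xmlBytes) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
    }

    // Encodes the sample as UTF-16LE with a leading BOM (bytes FF FE),
    // parses it, and returns the text of the Statusdate element.
    static String statusDateOfSample() {
        try {
            byte[] bytes = ("\uFEFF" + SAMPLE).getBytes(StandardCharsets.UTF_16LE);
            Document doc = parse(bytes);
            return doc.getElementsByTagName("Statusdate").item(0).getTextContent();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(statusDateOfSample());
    }
}
```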
