Encoding + UNIMARC ISO 2709 file

**plnew** · 28/05/2023, 17h27

Hello,

The UNIMARC format is used to exchange records between library software packages.

I am trying, without success, to open an ISO 2709 (UNIMARC) file that is not encoded in UTF-8 with Python.

I am using the pymarc module for this.

If the file is in UTF-8 there is no problem; otherwise I cannot find a way to tell pymarc which encoding is used.

I tried calling .decode() on the string returned by pymarc, specifying the probable encoding detected by chardet (iso8859-9),
but as the output shows, the decoding error is still there.

I am attaching 2 UNIMARC files (the first in UTF-8 and the other in iso8859-?).

Do you have any leads that could help me?

Thanks in advance

Code:

#!/usr/bin/python3
# -*- coding: utf-8 -*-
 
"""
Detect the file type and its MIME type.
Detect the encoding of the ISO 2709 (UNIMARC) file,
then
display the title if present in field 200$a.

with utf-8 encoding -> works fine
with iso8859-x encoding -> fails

tested with chardet (4.0.0)
tested with pymarc (5.0.0)

argument 1 -> UNIMARC file
"""
 
import sys
 
import chardet
import encodings
import magic
from pymarc import MARCReader
 
def analyse(fichier):
    print("file type: " + magic.from_file(fichier))
    print("file MIME type: " + magic.from_file(fichier, mime=True))

    # record counter
    numnotice = 0
    rawdata = open(fichier, 'rb').read()
    result = chardet.detect(rawdata)
    # chardet may report None as the encoding, hence str()
    charenc = str(result['encoding'])
    print("the probable encoding is " + charenc)
    with open(fichier, 'rb') as fh:
        if charenc == "utf-8":
            reader = MARCReader(fh, to_unicode=True, force_utf8=True)
        else:
            print("the file is parsed with encoding " + charenc)
#            reader = MARCReader(fh) # PROBLEM: Unable to parse character 0xa0 in g0=66 g1=69 (same message on 7 lines)
            reader = MARCReader(fh, file_encoding='iso8859_9') # prints the title wrongly decoded (encoding suggested by chardet)
#            reader = MARCReader(fh, file_encoding='iso8859_15') # prints the title wrongly decoded
#            reader = MARCReader(fh, file_encoding='cp850') # prints the title wrongly decoded (DOS Europe)
#            reader = MARCReader(fh, file_encoding='cp1252') # prints the title wrongly decoded (Windows Europe)
        for record in reader:
            numnotice += 1
            print("----------------------------")
            print("record number: " + str(numnotice))
            for field in record.get_fields('200'):
                if field['a'] is not None:
                    if charenc != "utf-8":
                        field = field['a']
                        lenfield = len(field)
                        print("the number of characters in the title is: " + str(lenfield))
                        print("The undecoded title is: \033[41m" + field + "\033[0m")
                        try:
                            # field is a str here and str has no .decode() in Python 3,
                            # so this raises AttributeError, silently swallowed below
                            field2 = field.decode('iso8859_9', errors='strict').encode('utf8', errors='strict')
#                        field2 = field.decode('iso8859_15').encode('utf8', errors='strict')
#                        field2 = field.decode('cp850').encode('utf8', errors='strict')
#                        field2 = field.decode('cp1252').encode('utf8', errors='strict')
                            print("the decoded title is " + str(field2))
                        except AttributeError:
                            pass
                    # the title is in utf8
                    else:
                        print("the title is \033[1;32m" + field['a'].upper())
                        print("\033[0m")
                else:
                    print('no title in 200$a')


if __name__ == "__main__":
    fichier = (sys.argv[1])
    print("the file to analyse is: " + fichier)
    analyse(fichier)

**Sve@r** · 28/05/2023, 18h04

Hello
I like your tests. It shows that you dug fairly deep.
One small reservation about rawdata = open(fichier, 'rb').read(), because the file still needs to be closed. Alternatively you can use read_bytes from pathlib, which takes care of opening and then closing the file => result = chardet.detect(pathlib.Path(fichier).read_bytes()).

Unfortunately I did not get any further than you, with one exception: I printed the whole result, a dictionary that contains the encoding and also its probability (the "confidence" key). For the second file, the probability that the encoding is "ISO-8859-9" is 51%, which is far from a certainty (for the UTF-8 file it answers "99%"!).

Beyond that I do not see what else to do. Running "file" on both files answers "MARC21", which does not help either.

What I can suggest, however, is to test every encoding. You can get the list with print(sorted(set(encodings.aliases.aliases.values()))) after importing the "encodings" library. Put that in a loop and you could build an automaton that checks each encoding in turn...
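
A minimal, stdlib-only sketch of the "test every encoding" loop suggested above. It works on raw bytes rather than going through pymarc, and `candidate_codecs` and `decodable_with` are hypothetical helper names. It also makes one limitation visible: a codec that decodes without raising is not necessarily the right one, since a single-byte codec such as latin_1 accepts any byte sequence.

```python
import encodings.aliases


def candidate_codecs():
    """Return the canonical codec names listed by the standard library."""
    return sorted(set(encodings.aliases.aliases.values()))


def decodable_with(data):
    """Return the codecs that decode `data` without raising (hypothetical helper)."""
    accepted = []
    for name in candidate_codecs():
        try:
            data.decode(name)
        except (ValueError, LookupError):
            # ValueError covers UnicodeDecodeError; LookupError covers codecs
            # that are missing on this platform or are not text encodings.
            continue
        accepted.append(name)
    return accepted


sample = "é".encode("utf-8")  # the two bytes 0xC3 0xA9
accepted = decodable_with(sample)
print(len(accepted), "codecs decode the sample without error")
print("utf_8" in accepted, "latin_1" in accepted, "ascii" in accepted)
```

Because many codecs accept the sample, the loop narrows the candidates but the decoded text still has to be checked by eye or against known words.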

**jurassic pork** · 28/05/2023, 18h24

Hello,
the encoding of the second file is odd: it looks like UTF-8 because the accented characters are coded on 2 bytes, but the byte values do not match UTF-8.
For example:
à -> 0xc1 0x61
è -> 0xc1 0x65
é -> 0xc2 0x65

[EDIT] a lead here

0xC1 0x41 | 0x41 0xCC 0x80 // LATIN CAPITAL LETTER A WITH GRAVE
0xC1 0x45 | 0x45 0xCC 0x80 // LATIN CAPITAL LETTER E WITH GRAVE
0xC1 0x49 | 0x49 0xCC 0x80 // LATIN CAPITAL LETTER I WITH GRAVE
0xC1 0x4F | 0x4F 0xCC 0x80 // LATIN CAPITAL LETTER O WITH GRAVE
0xC1 0x55 | 0x55 0xCC 0x80 // LATIN CAPITAL LETTER U WITH GRAVE
0xC1 0x61 | 0x61 0xCC 0x80 // LATIN SMALL LETTER A WITH GRAVE
0xC1 0x65 | 0x65 0xCC 0x80 // LATIN SMALL LETTER E WITH GRAVE
0xC1 0x69 | 0x69 0xCC 0x80 // LATIN SMALL LETTER I WITH GRAVE
0xC1 0x6F | 0x6F 0xCC 0x80 // LATIN SMALL LETTER O WITH GRAVE
0xC1 0x75 | 0x75 0xCC 0x80 // LATIN SMALL LETTER U WITH GRAVE
0xC1 | 0xEE 0x80 0x82 // NON-SPACING GRAVE ACCENT <ISO-IR-103_C1> (not a real character)
0xC2 0x20 | 0x20 0xCC 0x81 // ACUTE ACCENT
0xC2 0x41 | 0x41 0xCC 0x81 // LATIN CAPITAL LETTER A WITH ACUTE
0xC2 0x43 | 0x43 0xCC 0x81 // LATIN CAPITAL LETTER C WITH ACUTE
0xC2 0x45 | 0x45 0xCC 0x81 // LATIN CAPITAL LETTER E WITH ACUTE
0xC2 0x49 | 0x49 0xCC 0x81 // LATIN CAPITAL LETTER I WITH ACUTE
0xC2 0x4C | 0x4C 0xCC 0x81 // LATIN CAPITAL LETTER L WITH ACUTE
0xC2 0x4E | 0x4E 0xCC 0x81 // LATIN CAPITAL LETTER N WITH ACUTE
0xC2 0x4F | 0x4F 0xCC 0x81 // LATIN CAPITAL LETTER O WITH ACUTE
0xC2 0x52 | 0x52 0xCC 0x81 // LATIN CAPITAL LETTER R WITH ACUTE
0xC2 0x53 | 0x53 0xCC 0x81 // LATIN CAPITAL LETTER S WITH ACUTE
0xC2 0x55 | 0x55 0xCC 0x81 // LATIN CAPITAL LETTER U WITH ACUTE
0xC2 0x59 | 0x59 0xCC 0x81 // LATIN CAPITAL LETTER Y WITH ACUTE
0xC2 0x5A | 0x5A 0xCC 0x81 // LATIN CAPITAL LETTER Z WITH ACUTE
0xC2 0x61 | 0x61 0xCC 0x81 // LATIN SMALL LETTER A WITH ACUTE
0xC2 0x63 | 0x63 0xCC 0x81 // LATIN SMALL LETTER C WITH ACUTE
0xC2 0x65 | 0x65 0xCC 0x81 // LATIN SMALL LETTER E WITH ACUTE
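
The pairs listed above can be read as "diacritic byte first, then base letter", with the right-hand column being the UTF-8 bytes of the base letter followed by a combining accent (0xCC 0x80 is U+0300, 0xCC 0x81 is U+0301). A minimal sketch of a decoder for just the two diacritic bytes quoted in the table, assuming every other byte is plain ASCII; `decode_prefixed_diacritics` is a hypothetical name, and the real character set defines many more codes than this handles.

```python
import unicodedata

# Only the two non-spacing diacritics quoted in the table above.
COMBINING = {
    0xC1: "\u0300",  # combining grave accent
    0xC2: "\u0301",  # combining acute accent
}


def decode_prefixed_diacritics(data):
    """Decode bytes where a diacritic byte precedes its base letter."""
    out = []
    pending = ""  # combining marks waiting for their base letter
    for byte in data:
        if byte in COMBINING:
            pending += COMBINING[byte]
        elif byte < 0x80:
            # Unicode puts the combining mark after the base letter.
            out.append(chr(byte) + pending)
            pending = ""
        else:
            # Byte not covered by this sketch: emit a replacement character.
            out.append("\ufffd")
            pending = ""
    # NFC merges "e" + combining acute into the single character "é".
    return unicodedata.normalize("NFC", "".join(out))


print(decode_prefixed_diacritics(b"\xc2el\xc1eve"))
```

Applied to the three examples given earlier in this post, 0xc1 0x61 comes out as à, 0xc1 0x65 as è and 0xc2 0x65 as é.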

Kind regards, J.P

**plnew** · 28/05/2023, 21h06

Thanks for your quick replies.

I will look into the leads you suggested.

At the moment I use the yaz-marcdump tool
to convert the isoxxx file to UTF-8, after which pymarc works,

but I would like to do it in pure Python and learn a bit more about the format.

**Sve@r** · 28/05/2023, 21h26

Originally posted by plnew

but I would like to do it in pure Python

Why? Python's purpose is not to do everything itself but to be able to delegate everything to whoever knows how to do it. If yaz-marcdump can do the job, you might as well take advantage of it...
That said, if it is just for your own learning then fine, forget I said anything

**N_BaH** · 28/05/2023, 22h17

Why?

because users want tools that work, with no prerequisites.

happiness is within arm's reach,
BUT it (still) takes a little effort to get there.

**plnew** · 28/05/2023, 22h28

None of the encodings in the list works :-(
As for my curiosity about Python: I am a beginner, so learning is the secondary goal; this is part of a larger project to help manage a community association's library.
Getting advice from pros is nice.

I have still made good progress thanks to you: since chardet reports a confidence of 0.99 for utf-8,
I can run the chardet test -> if utf-8 then pymarc, otherwise yaz-marcdump + pymarc.
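
A rough sketch of that fallback, under several assumptions. To stay within the standard library it replaces the chardet test with a strict UTF-8 decode of the whole file, which is a different technique. The yaz-marcdump options used (-f and -t for the source and target character sets, -o marc for ISO 2709 output on stdout) should be checked against the man page on your system, and `source_charset` stands for whatever charset name is currently passed to the tool, which is not stated in this thread. All function names are hypothetical.

```python
import subprocess


def is_utf8(data):
    """Return True when `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def conversion_command(path, source_charset):
    """Build the yaz-marcdump call (assumed options: -f, -t, -o marc)."""
    return ["yaz-marcdump", "-f", source_charset, "-t", "UTF-8", "-o", "marc", path]


def utf8_bytes(path, source_charset):
    """Return the file content as UTF-8, converting with yaz-marcdump if needed."""
    with open(path, "rb") as fh:
        data = fh.read()
    if is_utf8(data):
        return data
    # check=True raises CalledProcessError if the tool exits with an error.
    done = subprocess.run(conversion_command(path, source_charset),
                          check=True, capture_output=True)
    return done.stdout


# 0xC2 0x65 is how the second file stores "é"; it is not valid UTF-8.
print(is_utf8("é".encode("utf-8")), is_utf8(b"\xc2e"))
```

The bytes returned by `utf8_bytes` could then be wrapped in io.BytesIO and handed to MARCReader with the same arguments as in the UTF-8 branch of the script.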

The lead from "jurassic pork" opens up other possibilities.

Fixes

Code:

#!/usr/bin/python3
# -*- coding: utf-8 -*-
 
"""
Detect the file type and its MIME type.
Detect the encoding of the ISO 2709 (UNIMARC) file,
then
display the title if present in field 200$a.

with utf-8 encoding -> works fine
with iso8859-x encoding -> fails

tested with chardet (4.0.0)
tested with pymarc (5.0.0)

argument 1 -> UNIMARC file

Fixes v2
The file is now closed
Switched to pathlib.
Function listingencoding
    testencoding (prints the encoding being tested)
    nbreencoding and nbreencodingstotal to know where we are in the list of encodings.
Added try/except around the MARC parsing loop because some encodings make the script stop.
Prints all the information returned by chardet (encoding; confidence; language)
"""
 
 
import sys
 
import chardet
import encodings
import magic
import pathlib
from pymarc import MARCReader
from pymarc import exceptions as exc
 
def analyse(fichier, testencoding, nbreencoding, nbreencodingstotal):
    print("file type: " + magic.from_file(fichier))
    print("file MIME type: " + magic.from_file(fichier, mime=True))

    # record counter
    numnotice = 0
    result = chardet.detect(pathlib.Path(fichier).read_bytes())
# or
#    rawdata = open(fichier, 'rb').read()
#    result = chardet.detect(rawdata)

    charenc = str(result)
    charencencoding = result['encoding']
    print("the probable encoding is " + charenc)

    with open(fichier, 'rb') as fh:
        if charencencoding == "utf-8":
            reader = MARCReader(fh, to_unicode=True, force_utf8=True)
        else:
            print("the file is parsed with encoding \033[1;32m" + testencoding + " - "
                  + str(nbreencoding) + "/" + str(nbreencodingstotal) + "\033[0m - ")
            reader = MARCReader(fh, file_encoding=testencoding)
#            reader = MARCReader(fh) # PROBLEM: Unable to parse character 0xa0 in g0=66 g1=69 (same message on 7 lines)
#            reader = MARCReader(fh, file_encoding='iso8859_9') # prints the title wrongly decoded (encoding suggested by chardet)
#            reader = MARCReader(fh, file_encoding='iso8859_15') # prints the title wrongly decoded
#            reader = MARCReader(fh, file_encoding='cp850') # prints the title wrongly decoded (DOS Europe)
#            reader = MARCReader(fh, file_encoding='cp1252') # prints the title wrongly decoded (Windows Europe)
        try:
            # the loop runs for both branches; some encodings make the parser raise
            for record in reader:
                numnotice += 1
                print("----------------------------")
                print("record number: " + str(numnotice))
                for field in record.get_fields('200'):
                    if field['a'] is not None:
                        if charencencoding != "utf-8":
                            field = field['a']
                            lenfield = len(field)
                            print("the number of characters in the title is: " + str(lenfield))
                            print("The undecoded title is: \033[41m" + field + "\033[0m")
                            try:
                                # field is a str here and str has no .decode() in Python 3,
                                # so this raises AttributeError, silently swallowed below
                                field2 = field.decode('iso8859_9', errors='strict').encode('utf8', errors='strict')
#                                field2 = field.decode('iso8859_15').encode('utf8', errors='strict')
#                                field2 = field.decode('cp850').encode('utf8', errors='strict')
#                                field2 = field.decode('cp1252').encode('utf8', errors='strict')
                                print("the decoded title is " + str(field2))
                            except AttributeError:
                                pass
                        # the title is in utf8
                        else:
                            print("the title is \033[1;32m" + field['a'].upper())
                            print("\033[0m")
                    else:
                        print('no title in 200$a')
        except Exception:
            print("\033[41mEncoding problem: stopped before the end of the 1st record\033[0m")
 
def listingencoding(fichier):
    nbreencoding = 0
    listeencodings=sorted(set(encodings.aliases.aliases.values()))
    nbreencodingstotal=len(listeencodings)
    for testencoding in listeencodings:
        nbreencoding +=1
        print("############################################################")
        analyse(fichier, testencoding, nbreencoding, nbreencodingstotal)
 
 
if __name__ == "__main__":
    fichier = (sys.argv[1])
    print("the file to analyse is: " + fichier)
    listingencoding(fichier)
