Python discussion:

Encoding + UNIMARC ISO 2709 file [Python 3.X]

  1. #1
    New Club Member (hobbyist; France, Jura (Franche-Comté); joined February 2012; 32 posts; 27 points)
    Hello,

    The UNIMARC format is used to exchange information between library software packages.

    I am trying, without success, to open an ISO 2709 (UNIMARC) file that is not encoded in UTF-8 with Python.

    I am using the pymarc module for this.

    If the file is in UTF-8 there is no problem; otherwise I cannot find a way to tell pymarc which encoding the file uses.

    I tried calling .decode() on the string returned by pymarc, passing the probable encoding detected by chardet (iso8859-9),
    but as the output shows, the decoding error is still there.

    I am attaching two UNIMARC files (the first in UTF-8 and the other in iso8859-?).

    Do you have any lead that could help me?

    Thanks in advance

    Code:
    #!/usr/bin/python3
    # -*- coding: utf-8 -*-

    """
    Detect the file type and its MIME type.
    Detect the encoding of the ISO 2709 (UNIMARC) file,
    then
    display the title if present in field 200$a.

    with utf-8 encoding -> works fine
    with iso8859-x encoding -> KO

    tested with chardet (4.0.0)
    tested with pymarc (5.0.0)

    argument 1 -> UNIMARC file
    """

    import sys

    import chardet
    import encodings
    import magic
    from pymarc import MARCReader

    def analyse(fichier):
        print("file type: " + magic.from_file(fichier))
        print("file MIME type: " + magic.from_file(fichier, mime=True))

        # record counter
        numnotice = 0
        rawdata = open(fichier, 'rb').read()
        result = chardet.detect(rawdata)
        charenc = result['encoding']
        print("the probable encoding is " + charenc)
        with open(fichier, 'rb') as fh:
            if charenc == "utf-8":
                reader = MARCReader(fh, to_unicode=True, force_utf8=True)
            else:
                print("the file is parsed with the encoding " + charenc)
    #            reader = MARCReader(fh) # problem: Unable to parse character 0xa0 in g0=66 g1=69 (same message on 7 lines)
                reader = MARCReader(fh, file_encoding='iso8859_9') # title is displayed badly decoded (encoding from chardet)
    #            reader = MARCReader(fh, file_encoding='iso8859_15') # title is displayed badly decoded
    #            reader = MARCReader(fh, file_encoding='cp850') # title is displayed badly decoded (DOS Europe)
    #            reader = MARCReader(fh, file_encoding='cp1252') # title is displayed badly decoded (Windows Europe)
            for record in reader:
                numnotice += 1
                print("----------------------------")
                print("record number: " + str(numnotice))
                for field in record.get_fields('200'):
                    if field['a'] is not None:
                        if charenc != "utf-8":
                            field = field['a']
                            lenfield = len(field)
                            print("number of characters in the title: " + str(lenfield))
                            print("The undecoded title is: \033[41m" + field + "\033[0m")
                            try:
                                field2 = field.decode('iso8859_9', errors='strict').encode('utf8', errors='strict')
    #                        field2 = field.decode('iso8859_15').encode('utf8', errors='strict')
    #                        field2 = field.decode('cp850').encode('utf8', errors='strict')
    #                        field2 = field.decode('cp1252').encode('utf8', errors='strict')
                                print("the decoded title is " + str(field2))
                            except AttributeError:
                                # field is already a str here, and str has no .decode() in Python 3
                                pass
                        # the title is in utf8
                        else:
                            print("the title is \033[1;32m" + field['a'].upper())
                            print("\033[0m")
                    else:
                        print('no title in 200$a')


    if __name__ == "__main__":
        fichier = sys.argv[1]
        print("the file to analyse is: " + fichier)
        analyse(fichier)

  2. #2
    Sve@r, Senior Eminent Expert (software development engineer; aerospace, marine, space, defense; France, Oise (Picardie); joined February 2006; 12,685 posts; 30,974 points)
    Hello
    I like your tests. It shows that you dug fairly deep.
    One small caveat about rawdata = open(fichier, 'rb').read(): the file still needs to be closed. Alternatively you can use read_bytes from pathlib, which opens the file and closes it afterwards => result = chardet.detect(pathlib.Path(fichier).read_bytes()).

    Unfortunately I did not get any further than you, with one exception: I printed the whole result, a dictionary that contains the encoding and also its probability (the "confidence" field). For the second file, the probability that the encoding is "ISO-8859-9" is 51%, which is really not a certainty (for utf-8 it answers "99%"!!!).

    Beyond that I don't see what more to do. Running "file" on both files reports "MARC21", which doesn't help either.

    What I can suggest, though, is to test every encoding. You can list them with print(sorted(set(encodings.aliases.aliases.values()))) after importing the "encodings" library. By putting that in a loop you could build an automated check that tries each encoding in turn...
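The loop suggested above could look roughly like the sketch below. It works on raw bytes rather than on a pymarc reader, and the helper name candidate_encodings and the idea of searching for an expected word are assumptions made for illustration, not something taken from the thread. The stdlib list only covers codecs shipped with Python, so a codec that is missing from it (as turns out to be the case later in this thread) will simply never show up.

```python
import encodings.aliases


def candidate_encodings(raw: bytes, expected: str) -> list:
    """Return the stdlib codecs that decode `raw` into text containing `expected`."""
    matches = []
    for name in sorted(set(encodings.aliases.aliases.values())):
        try:
            text = raw.decode(name)
        except Exception:
            # Wrong encoding, bytes-to-bytes codec (base64, zlib, ...) or
            # platform-specific codec (mbcs, oem): skip it and move on.
            continue
        if isinstance(text, str) and expected in text:
            matches.append(name)
    return matches


if __name__ == "__main__":
    sample = "forêt".encode("latin_1")
    print(candidate_encodings(sample, "forêt"))
```

Checking for a word you know must appear in a title is what makes the loop useful: most single-byte codecs decode any byte string without error, so "no exception" alone proves very little.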
    My tutorial on «Python» programming
    My tutorial on «Shell» programming
    There are plenty of others. Don't forget the various FAQs available on this site either
    And post your code between [code] and [/code] tags

  3. #3
    jurassic pork, Eminent Expert (tinkerer; industry; France; joined December 2008; 3,950 posts; 9,279 points)
    Hello,
    the encoding of the second file is odd: it looks like UTF-8 because the accented characters are coded on 2 bytes, but the byte values do not match UTF-8.
    For example:
    à -> 0xc1 0x61
    è -> 0xc1 0x65
    é -> 0xc2 0x65

    [EDIT] a lead here:

    0xC1 0x41 | 0x41 0xCC 0x80 // LATIN CAPITAL LETTER A WITH GRAVE
    0xC1 0x45 | 0x45 0xCC 0x80 // LATIN CAPITAL LETTER E WITH GRAVE
    0xC1 0x49 | 0x49 0xCC 0x80 // LATIN CAPITAL LETTER I WITH GRAVE
    0xC1 0x4F | 0x4F 0xCC 0x80 // LATIN CAPITAL LETTER O WITH GRAVE
    0xC1 0x55 | 0x55 0xCC 0x80 // LATIN CAPITAL LETTER U WITH GRAVE
    0xC1 0x61 | 0x61 0xCC 0x80 // LATIN SMALL LETTER A WITH GRAVE
    0xC1 0x65 | 0x65 0xCC 0x80 // LATIN SMALL LETTER E WITH GRAVE
    0xC1 0x69 | 0x69 0xCC 0x80 // LATIN SMALL LETTER I WITH GRAVE
    0xC1 0x6F | 0x6F 0xCC 0x80 // LATIN SMALL LETTER O WITH GRAVE
    0xC1 0x75 | 0x75 0xCC 0x80 // LATIN SMALL LETTER U WITH GRAVE
    0xC1 | 0xEE 0x80 0x82 // NON-SPACING GRAVE ACCENT <ISO-IR-103_C1> (not a real character)
    0xC2 0x20 | 0x20 0xCC 0x81 // ACUTE ACCENT
    0xC2 0x41 | 0x41 0xCC 0x81 // LATIN CAPITAL LETTER A WITH ACUTE
    0xC2 0x43 | 0x43 0xCC 0x81 // LATIN CAPITAL LETTER C WITH ACUTE
    0xC2 0x45 | 0x45 0xCC 0x81 // LATIN CAPITAL LETTER E WITH ACUTE
    0xC2 0x49 | 0x49 0xCC 0x81 // LATIN CAPITAL LETTER I WITH ACUTE
    0xC2 0x4C | 0x4C 0xCC 0x81 // LATIN CAPITAL LETTER L WITH ACUTE
    0xC2 0x4E | 0x4E 0xCC 0x81 // LATIN CAPITAL LETTER N WITH ACUTE
    0xC2 0x4F | 0x4F 0xCC 0x81 // LATIN CAPITAL LETTER O WITH ACUTE
    0xC2 0x52 | 0x52 0xCC 0x81 // LATIN CAPITAL LETTER R WITH ACUTE
    0xC2 0x53 | 0x53 0xCC 0x81 // LATIN CAPITAL LETTER S WITH ACUTE
    0xC2 0x55 | 0x55 0xCC 0x81 // LATIN CAPITAL LETTER U WITH ACUTE
    0xC2 0x59 | 0x59 0xCC 0x81 // LATIN CAPITAL LETTER Y WITH ACUTE
    0xC2 0x5A | 0x5A 0xCC 0x81 // LATIN CAPITAL LETTER Z WITH ACUTE
    0xC2 0x61 | 0x61 0xCC 0x81 // LATIN SMALL LETTER A WITH ACUTE
    0xC2 0x63 | 0x63 0xCC 0x81 // LATIN SMALL LETTER C WITH ACUTE
    0xC2 0x65 | 0x65 0xCC 0x81 // LATIN SMALL LETTER E WITH ACUTE
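To make the table concrete: each accented letter is stored as an accent byte followed by the base letter, while Unicode puts the combining mark after the base letter. Here is a minimal sketch of that swap. It only knows the two accent bytes listed above (0xC1 grave, 0xC2 acute), the function name is made up for the example, and any other non-ASCII byte is replaced by U+FFFD, so this is in no way a full decoder for the encoding.

```python
import unicodedata

# Accent bytes taken from the table above; the values are Unicode combining marks.
COMBINING = {
    0xC1: "\u0300",  # combining grave accent
    0xC2: "\u0301",  # combining acute accent
}


def decode_prefixed_accents(data: bytes) -> str:
    """Decode ASCII text where an accent byte precedes its base letter."""
    chars = []
    i = 0
    while i < len(data):
        byte = data[i]
        nxt = data[i + 1] if i + 1 < len(data) else None
        if byte in COMBINING and nxt is not None and nxt < 0x80:
            # Unicode order is base letter first, combining mark second.
            chars.append(chr(nxt) + COMBINING[byte])
            i += 2
        elif byte < 0x80:
            chars.append(chr(byte))
            i += 1
        else:
            # Byte not covered by this sketch.
            chars.append("\ufffd")
            i += 1
    # NFC merges "e" + combining acute into the single character "é".
    return unicodedata.normalize("NFC", "".join(chars))


if __name__ == "__main__":
    print(decode_prefixed_accents(b"\xc1a \xc1e \xc2e"))
```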
    Best regards, J.P
    Jurassic computer: Sinclair ZX81 - Zilog Z80A at 3.25 MHz - 1 KB RAM - 8 KB ROM

  4. #4
    New Club Member (original poster)
    Thank you for your quick replies.

    I will look into the leads you suggested.

    At the moment I use the yaz-marcdump tool
    to convert the isoxxx file to UTF-8, after which pymarc works.


    But I would like to do it in pure Python and learn a bit more about the format.

  5. #5
    Sve@r
    Quote, originally posted by plnew:
    but I would like to do it in pure Python
    Why? Python's purpose is not to do everything itself but to be able to hand any job to whatever already knows how to do it. If yaz-marcdump does the job, you might as well take advantage of it...
    Then again, if it is just for your own learning, fine, forget I said anything.

  6. #6
    N_BaH, Moderator (France; joined February 2008; 7,549 posts; 19,378 points)
    Why?
    Because users want tools that work, with no prerequisites.

    Happiness is within reach,
    BUT you (still) need to make a small effort to get there.

    Don't forget to consult the shell tutorials, the FAQ, and the man pages.

  7. #7
    New Club Member (original poster)
    None of the encodings in the list fits :-(
    As for my curiosity about Python: I am a beginner, so learning is the secondary goal; this is part of a larger project to help manage a community library.
    Getting advice from pros is nice.

    I have still made good progress thanks to you: since chardet reports a confidence of 0.99 for utf-8,
    I can run the chardet test -> if utf-8 then pymarc, otherwise yaz-marcdump + pymarc.
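That decision rule can be sketched on its own, independently of pymarc and yaz-marcdump. The dictionary shape (the "encoding" and "confidence" keys) is the one chardet reports earlier in the thread; the function name and the 0.9 threshold are assumptions chosen for the example and would need tuning on real files.

```python
def choose_pipeline(detection: dict, min_confidence: float = 0.9) -> str:
    """Pick the processing route from a chardet-style detection result."""
    encoding = (detection.get("encoding") or "").lower()
    confidence = detection.get("confidence") or 0.0
    if encoding in ("utf-8", "utf_8") and confidence >= min_confidence:
        # Confident UTF-8: the file can go straight to pymarc.
        return "pymarc"
    # Anything else is converted to UTF-8 by yaz-marcdump first.
    return "yaz-marcdump + pymarc"


if __name__ == "__main__":
    print(choose_pipeline({"encoding": "utf-8", "confidence": 0.99}))
    print(choose_pipeline({"encoding": "ISO-8859-9", "confidence": 0.51}))
```

Routing on "confidently UTF-8 or not" avoids trusting a low-confidence guess such as the 51% ISO-8859-9 result reported above.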

    The lead from "jurassic pork" opens up other possibilities.


    Fixes

    Code:
    #!/usr/bin/python3
    # -*- coding: utf-8 -*-

    """
    Detect the file type and its MIME type.
    Detect the encoding of the ISO 2709 (UNIMARC) file,
    then
    display the title if present in field 200$a.

    with utf-8 encoding -> works fine
    with iso8859-x encoding -> KO

    tested with chardet (4.0.0)
    tested with pymarc (5.0.0)

    argument 1 -> UNIMARC file

    Fixes in v2
    File is now closed.
    Switched to pathlib.
    listingencoding function
        testencoding (displays the encoding being tested)
        nbreencoding and nbreencodingstotal to show the position in the list of encodings.
    Added try/except around the MARC parsing loop because some encodings stop the script.
    Display of all the information from chardet (encoding; confidence; language)
    """


    import sys

    import chardet
    import encodings
    import magic
    import pathlib
    from pymarc import MARCReader
    from pymarc import exceptions as exc

    def analyse(fichier, testencoding, nbreencoding, nbreencodingstotal):
        print("file type: " + magic.from_file(fichier))
        print("file MIME type: " + magic.from_file(fichier, mime=True))

        # record counter
        numnotice = 0
        result = chardet.detect(pathlib.Path(fichier).read_bytes())

        charenc = str(result)
        charencencoding = result['encoding']
        print("the probable encoding is " + charenc)

        with open(fichier, 'rb') as fh:
            if charencencoding == "utf-8":
                reader = MARCReader(fh, to_unicode=True, force_utf8=True)
            else:
                print("the file is parsed with the encoding \033[1;32m" + testencoding + " - "
                      + str(nbreencoding) + "/" + str(nbreencodingstotal) + "\033[0m - ")
                reader = MARCReader(fh, file_encoding=testencoding)
    #            reader = MARCReader(fh) # problem: Unable to parse character 0xa0 in g0=66 g1=69 (same message on 7 lines)
    #            reader = MARCReader(fh, file_encoding='iso8859_9') # title is displayed badly decoded (encoding from chardet)
    #            reader = MARCReader(fh, file_encoding='iso8859_15') # title is displayed badly decoded
    #            reader = MARCReader(fh, file_encoding='cp850') # title is displayed badly decoded (DOS Europe)
    #            reader = MARCReader(fh, file_encoding='cp1252') # title is displayed badly decoded (Windows Europe)
            try:
                # the loop now runs for the utf-8 branch too
                for record in reader:
                    numnotice += 1
                    print("----------------------------")
                    print("record number: " + str(numnotice))
                    for field in record.get_fields('200'):
                        if field['a'] is not None:
                            # compare the detected encoding name, not str(result)
                            if charencencoding != "utf-8":
                                field = field['a']
                                lenfield = len(field)
                                print("number of characters in the title: " + str(lenfield))
                                print("The undecoded title is: \033[41m" + field + "\033[0m")
                                try:
                                    field2 = field.decode('iso8859_9', errors='strict').encode('utf8', errors='strict')
    #                                field2 = field.decode('iso8859_15').encode('utf8', errors='strict')
    #                                field2 = field.decode('cp850').encode('utf8', errors='strict')
    #                                field2 = field.decode('cp1252').encode('utf8', errors='strict')
                                    print("the decoded title is " + str(field2))
                                except AttributeError:
                                    # field is already a str here, and str has no .decode() in Python 3
                                    pass
                            # the title is in utf8
                            else:
                                print("the title is \033[1;32m" + field['a'].upper())
                                print("\033[0m")
                        else:
                            print('no title in 200$a')
            except Exception:
                print("\033[41mEncoding problem: stopped before the end of the 1st record\033[0m")
        # no fh.close() needed: the with block closes the file

    def listingencoding(fichier):
        nbreencoding = 0
        listeencodings = sorted(set(encodings.aliases.aliases.values()))
        nbreencodingstotal = len(listeencodings)
        for testencoding in listeencodings:
            nbreencoding += 1
            print("############################################################")
            analyse(fichier, testencoding, nbreencoding, nbreencodingstotal)


    if __name__ == "__main__":
        fichier = sys.argv[1]
        print("the file to analyse is: " + fichier)
        listingencoding(fichier)

  8. #8
    jurassic pork
    Hello,
    on PyPI there is indeed a module that seems to decode your kind of encoding (smc.bibencodings), which would be ISO 5426. The problem is that this module is old and does not seem to work with recent versions of Python.
    Example with Python 3.10:
    >>> import smc.bibencodings
    >>> b'la for\xc3et'.decode("iso5426")
    Traceback (most recent call last):
      File "D:\Logiciels\Thonny\lib\site-packages\smc\bibencodings\iso5426.py", line 153, in decode
        return decode(input, errors)
      File "D:\Logiciels\Thonny\lib\site-packages\smc\bibencodings\iso5426.py", line 60, in decode
        o = ord(c)
    TypeError: ord() expected string of length 1, but int found

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: decoding with 'iso5426' codec failed (TypeError: ord() expected string of length 1, but int found)

  9. #9
    New Club Member (original poster)
    Thanks, jurassic pork, for the great link to smc.bibencodings.

    That page links to "another link",

    and its MAB2 column seems to match the encoding of my file.

    Indeed, it fails with Python 3 on Linux; I tried with the module's own example:

    Code:
    >>> import smc.bibencodings
    >>> b"Abr\xc2eg\xc2e Historique De L'Origine".decode("mab2")
    Traceback (most recent call last):
      File "/home/yo/.local/lib/python3.10/site-packages/smc/bibencodings/iso5426.py", line 153, in decode
        return decode(input, errors)
      File "/home/yo/.local/lib/python3.10/site-packages/smc/bibencodings/iso5426.py", line 60, in decode
        o = ord(c)
    TypeError: ord() expected string of length 1, but int found
     
    The above exception was the direct cause of the following exception:
     
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: decoding with 'mab2' codec failed (TypeError: ord() expected string of length 1, but int found)

    Could you please explain what causes this kind of error?

  10. #10
    New Club Member (original poster)
    I tested with Python 2:

    Python 2.7.18 (default, Jul 1 2022, 10:30:50)
    [GCC 11.2.0] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import smc.bibencodings
    >>> b"Abr\xc2eg\xc2e Historique De L'Origine".decode("mab2")
    u"Abr\xe9g\xe9 Historique De L'Origine"
    >>>

  11. #11
    Sve@r
    Quote, originally posted by plnew:
    Could you please explain what causes this kind of error?
    The message seems fairly explicit. ord() expects a str as its argument and receives an int. And an int is not a str.
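For what it's worth, the error can be reproduced without the module. In Python 3, indexing or iterating a bytes object yields ints, whereas in Python 2 the same operation yielded one-character strings, which is why ord() used to work there. A small sketch (the helper name ord_accepts is only for the demonstration):

```python
data = b"for\xc3et"


def ord_accepts(value) -> bool:
    """Return True if ord() accepts the value, False if it raises TypeError."""
    try:
        ord(value)
    except TypeError:
        return False
    return True


if __name__ == "__main__":
    print(type(data[0]))           # the type of one element of a bytes object
    print(ord_accepts(data[0]))    # element access: an int, rejected by ord()
    print(ord_accepts(data[0:1]))  # slice: a length-1 bytes object, accepted
```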

  12. #12
    jurassic pork
    Quote, originally posted by Sve@r:
    The message seems fairly explicit. ord() expects a str as its argument and receives an int. And an int is not a str.
    Sve@r, do you think you could fix the code of the iso5426.py file in the smc.bibencodings module?
    Code:
    def decode(input, errors='strict', special=None):
        """Decode unicode from ISO-5426
        """
        if errors not in set(['strict', 'replace', 'ignore', 'repr']):
            raise ValueError("Invalid errors argument %s" % errors)
        result = []
        di = DecodeIterator(input)
        # optimizations
        rappend = result.append
        cget = charmap.get
        for c in di:
            o = ord(c)  # first error here
            # ASCII chars
            if c < b'\x7f':
                rappend(chr(o))
                #i += 1
                continue
    
    
            c1, c2 = di.peek(2)
            ccc2 = None
            # 0xc0 to 0xdf signals a combined char
            if 0xc0 <= o <= 0xdf and c1 is not None:
                # special case 0xc9: both 0xc9 and 0xc9 are combining diaeresis
                # use 0xc8 in favor of 0xc9
                if c == b'\xc9':
                    c = b'\xc8'
                if c1 == b'\xc9':
                    c1 = b'\xc8'
                # double combined char
                if 0xc0 <= ord(c1) <= 0xdf and c2 is not None:
                    ccc2 = c + c1 + c2
                    r = cget(ccc2)
                    if r is not None:
                        # double combined found in table
                        rappend(r)
                        di.evolve(2)
                        continue
                    # build combining unicode
                    dc1 = cget(c)
                    dc2 = cget(c1 + c2)
                    if dc1 is not None and dc2 is not None: # pragma: no branch
                        # reverse order, in unicode, the combining char comes after the char
                        rappend(dc2 + dc1)
                        di.evolve(2)
                        continue
                else:
                    cc1 = c + c1
                    r = cget(cc1)
                    if r is not None:
                        rappend(r)
                        di.evolve(1)
                        continue
                    # denormalized unicode: char + combining
                    r = cget(c)
                    rn = cget(c1)
                    if r is not None and rn is not None: # pragma: no branch
                        rappend(rn + r)
                        di.evolve(1)
                        continue
    
    
                # just the combining
                #r = cget(c)
                #if r is not None:
                #    result.append(r)
                #    continue
    
    
            # other chars, 0x80 <= o <= 0xbf or o >= 0xe0 or last combining
            if special is not None:
                r = special.get(c)
                if r is not None:
                    rappend(r)
                    continue
    
    
            r = cget(c)
            if r is not None:
                rappend(r)
                continue
    
    
            # only reached when no result was found
            if errors == "strict":
                p = di.position
                raise UnicodeError("Can't decode byte%s %r at position %i (context %r)" %
                                   ("" if ccc2 is None else "s",
                                    c if ccc2 is None else ccc2,
                                    p, input[p - 3:p + 3]))
            elif errors == "replace":
                rappend('\ufffd')
            elif errors == "ignore":
                pass
            elif errors == "repr":
                rappend('\\x%x' % o)
            else: # pragma: no cover
                # should never be reached
                raise ValueError("Invalid errors argument %s" % errors)
    
    
        return "".join(result), di.position
    To test:
    Code:
    decode(b'la for\xc3et')

  13. #13
    Sve@r
    Quote, originally posted by jurassic pork:
    Sve@r, do you think you could fix the code of the iso5426.py file in the smc.bibencodings module?
    Not at all. That is not what I said!!!
    plnew asked for an explanation of the error, and I gave it to him. Nothing more.

    That said, one can add that the error probably comes from an internal Python 2 / Python 3 incompatibility in the module, which seems logical given that the module has not been maintained since 2012. But fixing (or porting, certainly the more fitting verb) this module to Python 3...

  14. #14
    Original poster
    Thanks to you both, you have really helped me understand the obscure iso2709 format.

    I see that the yaz-marcdump solution has a bright future ahead of it :-)

    I may ask the module's author to update it, but that looks like a lot of work and, apparently, it does not have many users.

    Thanks again for all the explanations.

  15. #15
    papajoker
    Hello,
    Quote: Originally posted by jurassic pork
    can you fix the code in the iso5426.py file of the smc.bibencodings module?
    Code:
        for c in di:
            o = ord(c)  # first error here: in Python 3, c is already an int
            # ASCII chars
            if c < b'\x7f':
                rappend(chr(o))
                #i += 1
                continue
    
    
            c1, c2 = di.peek(2)
            ccc2 = None
    Nowadays a bytes object is a "list" of integers (try print(list(b'la for\xc3et'))), whereas in Python 2 iterating over a str yielded 1-character strings, not integers.
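
To make this concrete, here is a minimal sketch of how a `bytes` object behaves in Python 3. The sample value is the one used in this thread, with the escape written as `\xc3`; the variable names are only illustrative.

```python
data = b'la for\xc3et'

# Iterating over bytes in Python 3 yields integers, not 1-byte strings.
values = list(data)

# Indexing also yields an int, whereas slicing keeps the bytes type.
first_as_int = data[0]
first_as_bytes = data[0:1]

# ord() expects a str or bytes object of length 1, so ord(int) raises TypeError.
try:
    ord(data[0])
    ord_failed = False
except TypeError:
    ord_failed = True

# bytes([n]) turns a single int back into a 1-byte bytes object.
rebuilt = bytes([data[6]])
```

This is why comparisons such as `c < b'\x7f'` and calls such as `ord(c)` from the Python 2 code stop working once `c` comes from iterating over `bytes`.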

    A quick test would be to convert this integer back to bytes (it remains to be seen whether the final conversion of the string then turns out to be invalid...).
    See the two lines marked "# PYTHON 3" in this code excerpt:

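    di = DecodeIterator(input)
    ...
        for c in di:
            c = bytes([c])    # PYTHON 3: iteration yields ints, rebuild a 1-byte bytes object
            o = ord(c)  # first error here in the original code (ord() of an int)
            # ASCII chars
            if c < b'\x7f':
                rappend(chr(o))
                #i += 1
                continue
    
    
            c1, c2 = di.peek(2)
            # PYTHON 3: same conversion; peek() pads with None, which must be left untouched
            c1, c2 = [None if b is None else bytes([b]) for b in (c1, c2)]
            ccc2 = None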
    If that works, it would be cleaner to modify the DecodeIterator class directly.
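
A minimal sketch of that idea, assuming the rest of `decode()` keeps comparing against bytes literals as in the Python 2 code: the iterator itself hands out 1-byte `bytes` objects, so the call sites need no `bytes([c])` conversions. The class name `BytesDecodeIterator` is hypothetical and is not part of smc.bibencodings.

```python
class BytesDecodeIterator:
    """Iterate over a bytes object, yielding 1-byte bytes, with peek and evolve."""

    __slots__ = ("_data", "_length", "_pos")

    def __init__(self, data):
        self._data = bytes(data)
        self._length = len(self._data)
        self._pos = 0

    def __iter__(self):
        # A plain loop condition ends the generator without raising StopIteration.
        while self._pos < self._length:
            pos = self._pos
            # Slicing keeps the bytes type; indexing would give an int.
            yield self._data[pos:pos + 1]
            self._pos += 1

    @property
    def position(self):
        return self._pos

    def peek(self, amount=2):
        """Return the next `amount` items as 1-byte bytes, padded with None."""
        nextpos = self._pos + 1
        chunk = self._data[nextpos:nextpos + amount]
        result = [chunk[i:i + 1] for i in range(len(chunk))]
        result.extend([None] * (amount - len(result)))
        return result

    def evolve(self, amount=1):
        """Skip `amount` extra bytes that were already consumed through peek()."""
        self._pos += amount


di = BytesDecodeIterator(b'\xc3et')
items = []
for c in di:
    items.append((c, di.peek(2)))
```

With this variant, `c`, `c1` and `c2` are either `bytes` of length 1 or `None`, which is what the original comparisons such as `b'\xc0' <= c1 <= b'\xdf'` expect.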

    PS: modifying a library causes problems if the final application has to be installed on X machines.

  16. #16
    jurassic pork
    OK, here is a standalone script (assembled from the smc.bibencodings files) that performs the mab2 decoding. In the last loop iteration the string is indeed equal to "la forêt". The problem is that I never reach the return that hands this string back. I get this exception:
    >>> %Debug testIso5426.py
    Traceback (most recent call last):
      File "D:\Temp\testIso5426.py", line 17, in __iter__
        raise StopIteration
    StopIteration

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "D:\Temp\testIso5426.py", line 826, in <module>
        decode(b'la for\xc3et')
      File "D:\Temp\testIso5426.py", line 54, in decode
        for c in di:
    RuntimeError: generator raised StopIteration
    >>>
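
The `RuntimeError` in this traceback is most likely the effect of PEP 479: since Python 3.7, a `StopIteration` raised inside a generator body is converted to `RuntimeError` instead of silently ending the iteration. A minimal sketch, with hypothetical function names, contrasting the failing pattern with a plain `return`:

```python
def broken_iter(data):
    pos = 0
    while True:
        if pos >= len(data):
            # Python 2 idiom: under PEP 479 this becomes a RuntimeError.
            raise StopIteration
        yield data[pos]
        pos += 1


def fixed_iter(data):
    pos = 0
    while True:
        if pos >= len(data):
            # Returning from a generator ends the iteration cleanly.
            return
        yield data[pos]
        pos += 1


try:
    list(broken_iter(b'ab'))
    broken_error = None
except RuntimeError as exc:
    broken_error = str(exc)

fixed_values = list(fixed_iter(b'ab'))
```

Applied to `DecodeIterator.__iter__`, replacing `raise StopIteration` with `return` should let `decode()` reach its final `return` statement.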
    Here is the code to test:
    from __future__ import unicode_literals, print_function
     
    class DecodeIterator(object):
        """Decoding iterator with peek and evolve
        """
     
        __slots__ = ("_data", "_length", "_pos")
        def __init__(self, data):
            self._data = data
            self._length = len(data)
            self._pos = 0
     
     
        def __iter__(self):
            while True:
                pos = self._pos
                if pos >= self._length:
                    raise StopIteration  # Python 3.7+ (PEP 479): surfaces as RuntimeError; a bare "return" would end the generator cleanly
                yield self._data[pos]
                self._pos += 1
     
     
        def __len__(self):
            return self._length
     
     
        #def __getitem__(self, item):
        #    return self._data.__getitem__(item)
     
        @property
        def position(self):
            return self._pos
     
        def peek(self, amount=2):
            nextpos = self._pos + 1
            result = list(self._data[nextpos:nextpos + amount])
            if len(result) < amount:
                result.extend([None] * (amount - len(result)))
            return result
     
        def evolve(self, amount=1):
            self._pos += amount
     
     
        #def residual(self, amount=1):
        #    return self._length - self._pos > amount
     
     
    def decode(input, errors='strict', special=None):
        """Decode unicode from ISO-5426
        """
        if errors not in set(['strict', 'replace', 'ignore', 'repr']):
            raise ValueError("Invalid errors argument %s" % errors)
        result = []
        di = DecodeIterator(input)
        # optimizations
        rappend = result.append
        cget = charmap.get
        for c in di:
            o = c  # first error was here: in Python 3, c is already an int, so ord() is not needed
            # ASCII chars
            if c < 0x7f:
                rappend(chr(o))
                #i += 1
                continue
     
     
            c1, c2 = di.peek(2)
            ccc2 = None
            # 0xc0 to 0xdf signals a combined char
            if 0xc0 <= o <= 0xdf and c1 is not None:
                # special case 0xc9: both 0xc8 and 0xc9 are combining diaeresis
                # use 0xc8 in favor of 0xc9
                if c == 0xc9:
                    c = 0xc8
                if c1 == 0xc9:
                    c1 = 0xc8
                # double combined char
                if 0xc0 <= c1 <= 0xdf and c2 is not None:
                    ccc2 = (c<<16) + (c1<<8) + c2
                    r = cget(ccc2.to_bytes(3, "big"))  # three bytes: two combining marks plus the base letter
                    if r is not None:
                        # double combined found in table
                        rappend(r)
                        di.evolve(2)
                        continue
                    # build combining unicode
                    # keys are assumed to be raw ISO 5426 byte sequences, so build them without zero padding
                    dc1 = cget(bytes([c]))
                    dc2 = cget(bytes([c1, c2]))
                    if dc1 is not None and dc2 is not None: # pragma: no branch
                        # reverse order, in unicode, the combining char comes after the char
                        rappend(dc2 + dc1)
                        di.evolve(2)
                        continue
                else:
                    cc1 = (c<<8) + c1
                    r = cget(cc1.to_bytes(2,"big"))
                    if r is not None:
                        rappend(r)
                        di.evolve(1)
                        continue
                    # denormalized unicode: char + combining
                    r = cget(bytes([c]))    # single-byte key, assuming charmap is keyed by raw bytes
                    rn = cget(bytes([c1]))
                    if r is not None and rn is not None: # pragma: no branch
                        rappend(rn + r)
                        di.evolve(1)
                        continue
     
     
     
     
                # just the combining
                #r = cget(c)
                #if r is not None:
                #    result.append(r)
                #    continue
     
     
     
     
            # other chars, 0x80 <= o <= 0xbf or o >= 0xe0 or last combining
            if special is not None:
                r = special.get(bytes([c]))  # keys such as b'\xa4' are one byte long; to_bytes(2, ...) would never match
                if r is not None:
                    rappend(r)
                    continue
     
     
     
     
            r = cget(bytes([c]))  # single-byte key, assuming charmap is keyed by raw bytes
            if r is not None:
                rappend(r)
                continue
     
     
     
     
            # only reached when no result was found
            if errors == "strict":
                p = di.position
                raise UnicodeError("Can't decode byte%s %r at position %i (context %r)" %
                                   ("" if ccc2 is None else "s",
                                    c if ccc2 is None else ccc2,
                                    p, input[p - 3:p + 3]))
            elif errors == "replace":
                rappend('\ufffd')
            elif errors == "ignore":
                pass
            elif errors == "repr":
                rappend('\\x%x' % o)
            else: # pragma: no cover
                # should never be reached
                raise ValueError("Invalid errors argument %s" % errors)
     
     
     
     
        return "".join(result), di.position
     
     
     
     
    # special identity mapping for 0xa4, 0xe0-0xff
    special_xe0_map = {
        b'\xa4': '\xa4',
        b'\xe0': '\xe0',
        b'\xe1': '\xe1',
        b'\xe2': '\xe2',
        b'\xe3': '\xe3',
        b'\xe4': '\xe4',
        b'\xe5': '\xe5',
        b'\xe6': '\xe6',
        b'\xe7': '\xe7',
        b'\xe8': '\xe8',
        b'\xe9': '\xe9',
        b'\xea': '\xea',
        b'\xeb': '\xeb',
        b'\xec': '\xec',
        b'\xed': '\xed',
        b'\xee': '\xee',
        b'\xef': '\xef',
        b'\xf0': '\xf0',
        b'\xf1': '\xf1',
        b'\xf2': '\xf2',
        b'\xf3': '\xf3',
        b'\xf4': '\xf4',
        b'\xf5': '\xf5',
        b'\xf6': '\xf6',
        b'\xf7': '\xf7',
        b'\xf8': '\xf8',
        b'\xf9': '\xf9',
        b'\xfa': '\xfa',
        b'\xfb': '\xfb',
        b'\xfc': '\xfc',
        b'\xfd': '\xfd',
        b'\xfe': '\xfe',
        b'\xff': '\xff'}
     
     
     
     
    unicodemap = {
        '\u001d': b'\x1d', # <control>
        '\u001e': b'\x1e', # <control>
        '\u001f': b'\x1f', # <control>
        '\u0020': b' ', # SPACE
        '\u0021': b'!', # EXCLAMATION MARK
        '\u0022': b'"', # QUOTATION MARK
        '\u0023': b'#', # NUMBER SIGN
        '\u0024': b'\xa4', # DOLLAR SIGN
        '\u0025': b'%', # PERCENT SIGN
        '\u0026': b'&', # AMPERSAND
        '\u0027': b"'", # APOSTROPHE
        '\u0028': b'(', # LEFT PARENTHESIS
        '\u0029': b')', # RIGHT PARENTHESIS
        '\u002a': b'*', # ASTERISK
        '\u002b': b'+', # PLUS SIGN
        '\u002c': b',', # COMMA
        '\u002d': b'-', # HYPHEN-MINUS
        '\u002e': b'.', # FULL STOP
        '\u002f': b'/', # SOLIDUS
        '\u0030': b'0', # DIGIT ZERO
        '\u0031': b'1', # DIGIT ONE
        '\u0032': b'2', # DIGIT TWO
        '\u0033': b'3', # DIGIT THREE
        '\u0034': b'4', # DIGIT FOUR
        '\u0035': b'5', # DIGIT FIVE
        '\u0036': b'6', # DIGIT SIX
        '\u0037': b'7', # DIGIT SEVEN
        '\u0038': b'8', # DIGIT EIGHT
        '\u0039': b'9', # DIGIT NINE
        '\u003a': b':', # COLON
        '\u003b': b';', # SEMICOLON
        '\u003c': b'<', # LESS-THAN SIGN
        '\u003d': b'=', # EQUALS SIGN
        '\u003e': b'>', # GREATER-THAN SIGN
        '\u003f': b'?', # QUESTION MARK
        '\u0040': b'@', # COMMERCIAL AT
        '\u0041': b'A', # LATIN CAPITAL LETTER A
        '\u0042': b'B', # LATIN CAPITAL LETTER B
        '\u0043': b'C', # LATIN CAPITAL LETTER C
        '\u0044': b'D', # LATIN CAPITAL LETTER D
        '\u0045': b'E', # LATIN CAPITAL LETTER E
        '\u0046': b'F', # LATIN CAPITAL LETTER F
        '\u0047': b'G', # LATIN CAPITAL LETTER G
        '\u0048': b'H', # LATIN CAPITAL LETTER H
        '\u0049': b'I', # LATIN CAPITAL LETTER I
        '\u004a': b'J', # LATIN CAPITAL LETTER J
        '\u004b': b'K', # LATIN CAPITAL LETTER K
        '\u004c': b'L', # LATIN CAPITAL LETTER L
        '\u004d': b'M', # LATIN CAPITAL LETTER M
        '\u004e': b'N', # LATIN CAPITAL LETTER N
        '\u004f': b'O', # LATIN CAPITAL LETTER O
        '\u0050': b'P', # LATIN CAPITAL LETTER P
        '\u0051': b'Q', # LATIN CAPITAL LETTER Q
        '\u0052': b'R', # LATIN CAPITAL LETTER R
        '\u0053': b'S', # LATIN CAPITAL LETTER S
        '\u0054': b'T', # LATIN CAPITAL LETTER T
        '\u0055': b'U', # LATIN CAPITAL LETTER U
        '\u0056': b'V', # LATIN CAPITAL LETTER V
        '\u0057': b'W', # LATIN CAPITAL LETTER W
        '\u0058': b'X', # LATIN CAPITAL LETTER X
        '\u0059': b'Y', # LATIN CAPITAL LETTER Y
        '\u005a': b'Z', # LATIN CAPITAL LETTER Z
        '\u005b': b'[', # LEFT SQUARE BRACKET
        '\u005c': b'\\', # REVERSE SOLIDUS
        '\u005d': b']', # RIGHT SQUARE BRACKET
        '\u005e': b'^', # CIRCUMFLEX ACCENT
        '\u005f': b'_', # LOW LINE
        '\u0060': b'`', # GRAVE ACCENT
        '\u0061': b'a', # LATIN SMALL LETTER A
        '\u0062': b'b', # LATIN SMALL LETTER B
        '\u0063': b'c', # LATIN SMALL LETTER C
        '\u0064': b'd', # LATIN SMALL LETTER D
        '\u0065': b'e', # LATIN SMALL LETTER E
        '\u0066': b'f', # LATIN SMALL LETTER F
        '\u0067': b'g', # LATIN SMALL LETTER G
        '\u0068': b'h', # LATIN SMALL LETTER H
        '\u0069': b'i', # LATIN SMALL LETTER I
        '\u006a': b'j', # LATIN SMALL LETTER J
        '\u006b': b'k', # LATIN SMALL LETTER K
        '\u006c': b'l', # LATIN SMALL LETTER L
        '\u006d': b'm', # LATIN SMALL LETTER M
        '\u006e': b'n', # LATIN SMALL LETTER N
        '\u006f': b'o', # LATIN SMALL LETTER O
        '\u0070': b'p', # LATIN SMALL LETTER P
        '\u0071': b'q', # LATIN SMALL LETTER Q
        '\u0072': b'r', # LATIN SMALL LETTER R
        '\u0073': b's', # LATIN SMALL LETTER S
        '\u0074': b't', # LATIN SMALL LETTER T
        '\u0075': b'u', # LATIN SMALL LETTER U
        '\u0076': b'v', # LATIN SMALL LETTER V
        '\u0077': b'w', # LATIN SMALL LETTER W
        '\u0078': b'x', # LATIN SMALL LETTER X
        '\u0079': b'y', # LATIN SMALL LETTER Y
        '\u007a': b'z', # LATIN SMALL LETTER Z
        '\u007b': b'{', # LEFT CURLY BRACKET
        '\u007c': b'|', # VERTICAL LINE
        '\u007d': b'}', # RIGHT CURLY BRACKET
        '\u007e': b'~', # TILDE
        '\u0088': b'\x88', # <control>
        '\u0089': b'\x89', # <control>
        # XXX not part of the standard but MARC equivalent of \x88, \x89
        #'\u0098': b'\x98', # <control>
        #'\u009c': b'\x9c', # <control>
        '\u00a1': b'\xa1', # INVERTED EXCLAMATION MARK
        '\u00a3': b'\xa3', # POUND SIGN
        '\u00a5': b'\xa5', # YEN SIGN
        '\u00a7': b'\xa7', # SECTION SIGN
        '\u00a9': b'\xad', # COPYRIGHT SIGN
        '\u00ab': b'\xab', # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
        '\u00ae': b'\xaf', # REGISTERED SIGN
        '\u00b7': b'\xb7', # MIDDLE DOT
        '\u00bb': b'\xbb', # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
        '\u00bf': b'\xbf', # INVERTED QUESTION MARK
        '\u00c0': b'\xc1A', # LATIN CAPITAL LETTER A WITH GRAVE
        '\u00c1': b'\xc2A', # LATIN CAPITAL LETTER A WITH ACUTE
        '\u00c2': b'\xc3A', # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
        '\u00c3': b'\xc4A', # LATIN CAPITAL LETTER A WITH TILDE
        '\u00c4': b'\xc8A', # LATIN CAPITAL LETTER A WITH DIAERESIS
        '\u00c5': b'\xcaA', # LATIN CAPITAL LETTER A WITH RING ABOVE
        '\u00c6': b'\xe1', # LATIN CAPITAL LETTER AE
        '\u00c7': b'\xd0C', # LATIN CAPITAL LETTER C WITH CEDILLA
        '\u00c8': b'\xc1E', # LATIN CAPITAL LETTER E WITH GRAVE
        '\u00c9': b'\xc2E', # LATIN CAPITAL LETTER E WITH ACUTE
        '\u00ca': b'\xc3E', # LATIN CAPITAL LETTER E WITH CIRCUMFLEX
        '\u00cb': b'\xc8E', # LATIN CAPITAL LETTER E WITH DIAERESIS
        '\u00cc': b'\xc1I', # LATIN CAPITAL LETTER I WITH GRAVE
        '\u00cd': b'\xc2I', # LATIN CAPITAL LETTER I WITH ACUTE
        '\u00ce': b'\xc3I', # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
        '\u00cf': b'\xc8I', # LATIN CAPITAL LETTER I WITH DIAERESIS
        '\u00d1': b'\xc4N', # LATIN CAPITAL LETTER N WITH TILDE
        '\u00d2': b'\xc1O', # LATIN CAPITAL LETTER O WITH GRAVE
        '\u00d3': b'\xc2O', # LATIN CAPITAL LETTER O WITH ACUTE
        '\u00d4': b'\xc3O', # LATIN CAPITAL LETTER O WITH CIRCUMFLEX
        '\u00d5': b'\xc4O', # LATIN CAPITAL LETTER O WITH TILDE
        '\u00d6': b'\xc8O', # LATIN CAPITAL LETTER O WITH DIAERESIS
        '\u00d8': b'\xe9', # LATIN CAPITAL LETTER O WITH STROKE
        '\u00d9': b'\xc1U', # LATIN CAPITAL LETTER U WITH GRAVE
        '\u00da': b'\xc2U', # LATIN CAPITAL LETTER U WITH ACUTE
        '\u00db': b'\xc3U', # LATIN CAPITAL LETTER U WITH CIRCUMFLEX
        '\u00dc': b'\xc8U', # LATIN CAPITAL LETTER U WITH DIAERESIS
        '\u00dd': b'\xc2Y', # LATIN CAPITAL LETTER Y WITH ACUTE
        '\u00de': b'\xec', # LATIN CAPITAL LETTER THORN
        '\u00df': b'\xfb', # LATIN SMALL LETTER SHARP S
        '\u00e0': b'\xc1a', # LATIN SMALL LETTER A WITH GRAVE
        '\u00e1': b'\xc2a', # LATIN SMALL LETTER A WITH ACUTE
        '\u00e2': b'\xc3a', # LATIN SMALL LETTER A WITH CIRCUMFLEX
        '\u00e3': b'\xc4a', # LATIN SMALL LETTER A WITH TILDE
        '\u00e4': b'\xc8a', # LATIN SMALL LETTER A WITH DIAERESIS
        '\u00e5': b'\xcaa', # LATIN SMALL LETTER A WITH RING ABOVE
        '\u00e6': b'\xf1', # LATIN SMALL LETTER AE
        '\u00e7': b'\xd0c', # LATIN SMALL LETTER C WITH CEDILLA
        '\u00e8': b'\xc1e', # LATIN SMALL LETTER E WITH GRAVE
        '\u00e9': b'\xc2e', # LATIN SMALL LETTER E WITH ACUTE
        '\u00ea': b'\xc3e', # LATIN SMALL LETTER E WITH CIRCUMFLEX
        '\u00eb': b'\xc8e', # LATIN SMALL LETTER E WITH DIAERESIS
        '\u00ec': b'\xc1i', # LATIN SMALL LETTER I WITH GRAVE
        '\u00ed': b'\xc2i', # LATIN SMALL LETTER I WITH ACUTE
        '\u00ee': b'\xc3i', # LATIN SMALL LETTER I WITH CIRCUMFLEX
        '\u00ef': b'\xc8i', # LATIN SMALL LETTER I WITH DIAERESIS
        '\u00f0': b'\xf3', # LATIN SMALL LETTER ETH
        '\u00f1': b'\xc4n', # LATIN SMALL LETTER N WITH TILDE
        '\u00f2': b'\xc1o', # LATIN SMALL LETTER O WITH GRAVE
        '\u00f3': b'\xc2o', # LATIN SMALL LETTER O WITH ACUTE
        '\u00f4': b'\xc3o', # LATIN SMALL LETTER O WITH CIRCUMFLEX
        '\u00f5': b'\xc4o', # LATIN SMALL LETTER O WITH TILDE
        '\u00f6': b'\xc8o', # LATIN SMALL LETTER O WITH DIAERESIS
        '\u00f8': b'\xf9', # LATIN SMALL LETTER O WITH STROKE
        '\u00f9': b'\xc1u', # LATIN SMALL LETTER U WITH GRAVE
        '\u00fa': b'\xc2u', # LATIN SMALL LETTER U WITH ACUTE
        '\u00fb': b'\xc3u', # LATIN SMALL LETTER U WITH CIRCUMFLEX
        '\u00fc': b'\xc8u', # LATIN SMALL LETTER U WITH DIAERESIS
        '\u00fd': b'\xc2y', # LATIN SMALL LETTER Y WITH ACUTE
        '\u00fe': b'\xfc', # LATIN SMALL LETTER THORN
        '\u00ff': b'\xc8y', # LATIN SMALL LETTER Y WITH DIAERESIS
        '\u0100': b'\xc5A', # LATIN CAPITAL LETTER A WITH MACRON
        '\u0101': b'\xc5a', # LATIN SMALL LETTER A WITH MACRON
        '\u0102': b'\xc6A', # LATIN CAPITAL LETTER A WITH BREVE
        '\u0103': b'\xc6a', # LATIN SMALL LETTER A WITH BREVE
        '\u0104': b'\xd3A', # LATIN CAPITAL LETTER A WITH OGONEK
        '\u0105': b'\xd3a', # LATIN SMALL LETTER A WITH OGONEK
        '\u0106': b'\xc2C', # LATIN CAPITAL LETTER C WITH ACUTE
        '\u0107': b'\xc2c', # LATIN SMALL LETTER C WITH ACUTE
        '\u0108': b'\xc3C', # LATIN CAPITAL LETTER C WITH CIRCUMFLEX
        '\u0109': b'\xc3c', # LATIN SMALL LETTER C WITH CIRCUMFLEX
        '\u010a': b'\xc7C', # LATIN CAPITAL LETTER C WITH DOT ABOVE
        '\u010b': b'\xc7c', # LATIN SMALL LETTER C WITH DOT ABOVE
        '\u010c': b'\xcfC', # LATIN CAPITAL LETTER C WITH CARON
        '\u010d': b'\xcfc', # LATIN SMALL LETTER C WITH CARON
        '\u010e': b'\xcfD', # LATIN CAPITAL LETTER D WITH CARON
        '\u010f': b'\xcfd', # LATIN SMALL LETTER D WITH CARON
        '\u0110': b'\xe2', # LATIN CAPITAL LETTER D WITH STROKE
        '\u0111': b'\xf2', # LATIN SMALL LETTER D WITH STROKE
        '\u0112': b'\xc5E', # LATIN CAPITAL LETTER E WITH MACRON
        '\u0113': b'\xc5e', # LATIN SMALL LETTER E WITH MACRON
        '\u0114': b'\xc6E', # LATIN CAPITAL LETTER E WITH BREVE
        '\u0115': b'\xc6e', # LATIN SMALL LETTER E WITH BREVE
        '\u0116': b'\xc7E', # LATIN CAPITAL LETTER E WITH DOT ABOVE
        '\u0117': b'\xc7e', # LATIN SMALL LETTER E WITH DOT ABOVE
        '\u0118': b'\xd3E', # LATIN CAPITAL LETTER E WITH OGONEK
        '\u0119': b'\xd3e', # LATIN SMALL LETTER E WITH OGONEK
        '\u011a': b'\xcfE', # LATIN CAPITAL LETTER E WITH CARON
        '\u011b': b'\xcfe', # LATIN SMALL LETTER E WITH CARON
        '\u011c': b'\xc3G', # LATIN CAPITAL LETTER G WITH CIRCUMFLEX
        '\u011d': b'\xc3g', # LATIN SMALL LETTER G WITH CIRCUMFLEX
        '\u011e': b'\xc6G', # LATIN CAPITAL LETTER G WITH BREVE
        '\u011f': b'\xc6g', # LATIN SMALL LETTER G WITH BREVE
        '\u0120': b'\xc7G', # LATIN CAPITAL LETTER G WITH DOT ABOVE
        '\u0121': b'\xc7g', # LATIN SMALL LETTER G WITH DOT ABOVE
        '\u0122': b'\xd0G', # LATIN CAPITAL LETTER G WITH CEDILLA
        '\u0123': b'\xd0g', # LATIN SMALL LETTER G WITH CEDILLA
        '\u0124': b'\xc3H', # LATIN CAPITAL LETTER H WITH CIRCUMFLEX
        '\u0125': b'\xc3h', # LATIN SMALL LETTER H WITH CIRCUMFLEX
        '\u0128': b'\xc4I', # LATIN CAPITAL LETTER I WITH TILDE
        '\u0129': b'\xc4i', # LATIN SMALL LETTER I WITH TILDE
        '\u012a': b'\xc5I', # LATIN CAPITAL LETTER I WITH MACRON
        '\u012b': b'\xc5i', # LATIN SMALL LETTER I WITH MACRON
        '\u012c': b'\xc6I', # LATIN CAPITAL LETTER I WITH BREVE
        '\u012d': b'\xc6i', # LATIN SMALL LETTER I WITH BREVE
        '\u012e': b'\xd3I', # LATIN CAPITAL LETTER I WITH OGONEK
        '\u012f': b'\xd3i', # LATIN SMALL LETTER I WITH OGONEK
        '\u0130': b'\xc7I', # LATIN CAPITAL LETTER I WITH DOT ABOVE
        '\u0131': b'\xf5', # LATIN SMALL LETTER DOTLESS I
        '\u0132': b'\xe6', # LATIN CAPITAL LIGATURE IJ
        '\u0133': b'\xf6', # LATIN SMALL LIGATURE IJ
        '\u0134': b'\xc3J', # LATIN CAPITAL LETTER J WITH CIRCUMFLEX
        '\u0135': b'\xc3j', # LATIN SMALL LETTER J WITH CIRCUMFLEX
        '\u0136': b'\xd0K', # LATIN CAPITAL LETTER K WITH CEDILLA
        '\u0137': b'\xd0k', # LATIN SMALL LETTER K WITH CEDILLA
        '\u0139': b'\xc2L', # LATIN CAPITAL LETTER L WITH ACUTE
        '\u013a': b'\xc2l', # LATIN SMALL LETTER L WITH ACUTE
        '\u013b': b'\xd0L', # LATIN CAPITAL LETTER L WITH CEDILLA
        '\u013c': b'\xd0l', # LATIN SMALL LETTER L WITH CEDILLA
        '\u013d': b'\xcfL', # LATIN CAPITAL LETTER L WITH CARON
        '\u013e': b'\xcfl', # LATIN SMALL LETTER L WITH CARON
        '\u0141': b'\xe8', # LATIN CAPITAL LETTER L WITH STROKE
        '\u0142': b'\xf8', # LATIN SMALL LETTER L WITH STROKE
        '\u0143': b'\xc2N', # LATIN CAPITAL LETTER N WITH ACUTE
        '\u0144': b'\xc2n', # LATIN SMALL LETTER N WITH ACUTE
        '\u0145': b'\xd0N', # LATIN CAPITAL LETTER N WITH CEDILLA
        '\u0146': b'\xd0n', # LATIN SMALL LETTER N WITH CEDILLA
        '\u0147': b'\xcfN', # LATIN CAPITAL LETTER N WITH CARON
        '\u0148': b'\xcfn', # LATIN SMALL LETTER N WITH CARON
        '\u014c': b'\xc5O', # LATIN CAPITAL LETTER O WITH MACRON
        '\u014d': b'\xc5o', # LATIN SMALL LETTER O WITH MACRON
        '\u014e': b'\xc6O', # LATIN CAPITAL LETTER O WITH BREVE
        '\u014f': b'\xc6o', # LATIN SMALL LETTER O WITH BREVE
        '\u0150': b'\xcdO', # LATIN CAPITAL LETTER O WITH DOUBLE ACUTE
        '\u0151': b'\xcdo', # LATIN SMALL LETTER O WITH DOUBLE ACUTE
        '\u0152': b'\xea', # LATIN CAPITAL LIGATURE OE
        '\u0153': b'\xfa', # LATIN SMALL LIGATURE OE
        '\u0154': b'\xc2R', # LATIN CAPITAL LETTER R WITH ACUTE
        '\u0155': b'\xc2r', # LATIN SMALL LETTER R WITH ACUTE
        '\u0156': b'\xd0R', # LATIN CAPITAL LETTER R WITH CEDILLA
        '\u0157': b'\xd0r', # LATIN SMALL LETTER R WITH CEDILLA
        '\u0158': b'\xcfR', # LATIN CAPITAL LETTER R WITH CARON
        '\u0159': b'\xcfr', # LATIN SMALL LETTER R WITH CARON
        '\u015a': b'\xc2S', # LATIN CAPITAL LETTER S WITH ACUTE
        '\u015b': b'\xc2s', # LATIN SMALL LETTER S WITH ACUTE
        '\u015c': b'\xc3S', # LATIN CAPITAL LETTER S WITH CIRCUMFLEX
        '\u015d': b'\xc3s', # LATIN SMALL LETTER S WITH CIRCUMFLEX
        '\u015e': b'\xd0S', # LATIN CAPITAL LETTER S WITH CEDILLA
        '\u015f': b'\xd0s', # LATIN SMALL LETTER S WITH CEDILLA
        '\u0160': b'\xcfS', # LATIN CAPITAL LETTER S WITH CARON
        '\u0161': b'\xcfs', # LATIN SMALL LETTER S WITH CARON
        '\u0162': b'\xd0T', # LATIN CAPITAL LETTER T WITH CEDILLA
        '\u0163': b'\xd0t', # LATIN SMALL LETTER T WITH CEDILLA
        '\u0164': b'\xcfT', # LATIN CAPITAL LETTER T WITH CARON
        '\u0165': b'\xcft', # LATIN SMALL LETTER T WITH CARON
        '\u0168': b'\xc4U', # LATIN CAPITAL LETTER U WITH TILDE
        '\u0169': b'\xc4u', # LATIN SMALL LETTER U WITH TILDE
        '\u016a': b'\xc5U', # LATIN CAPITAL LETTER U WITH MACRON
        '\u016b': b'\xc5u', # LATIN SMALL LETTER U WITH MACRON
        '\u016c': b'\xc6U', # LATIN CAPITAL LETTER U WITH BREVE
        '\u016d': b'\xc6u', # LATIN SMALL LETTER U WITH BREVE
        '\u016e': b'\xcaU', # LATIN CAPITAL LETTER U WITH RING ABOVE
        '\u016f': b'\xcau', # LATIN SMALL LETTER U WITH RING ABOVE
        '\u0170': b'\xcdU', # LATIN CAPITAL LETTER U WITH DOUBLE ACUTE
        '\u0171': b'\xcdu', # LATIN SMALL LETTER U WITH DOUBLE ACUTE
        '\u0172': b'\xd3U', # LATIN CAPITAL LETTER U WITH OGONEK
        '\u0173': b'\xd3u', # LATIN SMALL LETTER U WITH OGONEK
        '\u0174': b'\xc3W', # LATIN CAPITAL LETTER W WITH CIRCUMFLEX
        '\u0175': b'\xc3w', # LATIN SMALL LETTER W WITH CIRCUMFLEX
        '\u0176': b'\xc3Y', # LATIN CAPITAL LETTER Y WITH CIRCUMFLEX
        '\u0177': b'\xc3y', # LATIN SMALL LETTER Y WITH CIRCUMFLEX
        '\u0178': b'\xc8Y', # LATIN CAPITAL LETTER Y WITH DIAERESIS
        '\u0179': b'\xc2Z', # LATIN CAPITAL LETTER Z WITH ACUTE
        '\u017a': b'\xc2z', # LATIN SMALL LETTER Z WITH ACUTE
        '\u017b': b'\xc7Z', # LATIN CAPITAL LETTER Z WITH DOT ABOVE
        '\u017c': b'\xc7z', # LATIN SMALL LETTER Z WITH DOT ABOVE
        '\u017d': b'\xcfZ', # LATIN CAPITAL LETTER Z WITH CARON
        '\u017e': b'\xcfz', # LATIN SMALL LETTER Z WITH CARON
        '\u01a0': b'\xceO', # LATIN CAPITAL LETTER O WITH HORN
        '\u01a1': b'\xceo', # LATIN SMALL LETTER O WITH HORN
        '\u01af': b'\xceU', # LATIN CAPITAL LETTER U WITH HORN
        '\u01b0': b'\xceu', # LATIN SMALL LETTER U WITH HORN
        '\u01cd': b'\xcfA', # LATIN CAPITAL LETTER A WITH CARON
        '\u01ce': b'\xcfa', # LATIN SMALL LETTER A WITH CARON
        '\u01cf': b'\xcfI', # LATIN CAPITAL LETTER I WITH CARON
        '\u01d0': b'\xcfi', # LATIN SMALL LETTER I WITH CARON
        '\u01d1': b'\xcfO', # LATIN CAPITAL LETTER O WITH CARON
        '\u01d2': b'\xcfo', # LATIN SMALL LETTER O WITH CARON
        '\u01d3': b'\xcfU', # LATIN CAPITAL LETTER U WITH CARON
        '\u01d4': b'\xcfu', # LATIN SMALL LETTER U WITH CARON
        '\u01d5': b'\xc5\xc8U', # LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON
        '\u01d6': b'\xc5\xc8u', # LATIN SMALL LETTER U WITH DIAERESIS AND MACRON
        '\u01d7': b'\xc2\xc8U', # LATIN CAPITAL LETTER U WITH DIAERESIS AND ACUTE
        '\u01d8': b'\xc2\xc8u', # LATIN SMALL LETTER U WITH DIAERESIS AND ACUTE
        '\u01d9': b'\xcf\xc8U', # LATIN CAPITAL LETTER U WITH DIAERESIS AND CARON
        '\u01da': b'\xcf\xc8u', # LATIN SMALL LETTER U WITH DIAERESIS AND CARON
        '\u01db': b'\xc1\xc8U', # LATIN CAPITAL LETTER U WITH DIAERESIS AND GRAVE
        '\u01dc': b'\xc1\xc8u', # LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE
        '\u01de': b'\xc5\xc8A', # LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON
        '\u01df': b'\xc5\xc8a', # LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
        '\u01e0': b'\xc5\xc7A', # LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON
        '\u01e1': b'\xc5\xc7a', # LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
        '\u01e2': b'\xc5\xe1', # LATIN CAPITAL LETTER AE WITH MACRON
        '\u01e3': b'\xc5\xf1', # LATIN SMALL LETTER AE WITH MACRON
        '\u01e6': b'\xcfG', # LATIN CAPITAL LETTER G WITH CARON
        '\u01e7': b'\xcfg', # LATIN SMALL LETTER G WITH CARON
        '\u01e8': b'\xcfK', # LATIN CAPITAL LETTER K WITH CARON
        '\u01e9': b'\xcfk', # LATIN SMALL LETTER K WITH CARON
        '\u01ea': b'\xd3O', # LATIN CAPITAL LETTER O WITH OGONEK
        '\u01eb': b'\xd3o', # LATIN SMALL LETTER O WITH OGONEK
        '\u01ec': b'\xc5\xd3O', # LATIN CAPITAL LETTER O WITH OGONEK AND MACRON
        '\u01ed': b'\xc5\xd3o', # LATIN SMALL LETTER O WITH OGONEK AND MACRON
        '\u01f0': b'\xcfj', # LATIN SMALL LETTER J WITH CARON
        '\u01f4': b'\xc2G', # LATIN CAPITAL LETTER G WITH ACUTE
        '\u01f5': b'\xc2g', # LATIN SMALL LETTER G WITH ACUTE
        '\u01f8': b'\xc1N', # LATIN CAPITAL LETTER N WITH GRAVE
        '\u01f9': b'\xc1n', # LATIN SMALL LETTER N WITH GRAVE
        '\u01fa': b'\xc2\xcaA', # LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE
        '\u01fb': b'\xc2\xcaa', # LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
        '\u01fc': b'\xc2\xe1', # LATIN CAPITAL LETTER AE WITH ACUTE
        '\u01fd': b'\xc2\xf1', # LATIN SMALL LETTER AE WITH ACUTE
        '\u01fe': b'\xc2\xe9', # LATIN CAPITAL LETTER O WITH STROKE AND ACUTE
        '\u01ff': b'\xc2\xf9', # LATIN SMALL LETTER O WITH STROKE AND ACUTE
        '\u0218': b'\xd2S', # LATIN CAPITAL LETTER S WITH COMMA BELOW
        '\u0219': b'\xd2s', # LATIN SMALL LETTER S WITH COMMA BELOW
        '\u021a': b'\xd2T', # LATIN CAPITAL LETTER T WITH COMMA BELOW
        '\u021b': b'\xd2t', # LATIN SMALL LETTER T WITH COMMA BELOW
        '\u021e': b'\xcfH', # LATIN CAPITAL LETTER H WITH CARON
        '\u021f': b'\xcfh', # LATIN SMALL LETTER H WITH CARON
        '\u0226': b'\xc7A', # LATIN CAPITAL LETTER A WITH DOT ABOVE
        '\u0227': b'\xc7a', # LATIN SMALL LETTER A WITH DOT ABOVE
        '\u0228': b'\xd0E', # LATIN CAPITAL LETTER E WITH CEDILLA
        '\u0229': b'\xd0e', # LATIN SMALL LETTER E WITH CEDILLA
        '\u022a': b'\xc5\xc8O', # LATIN CAPITAL LETTER O WITH DIAERESIS AND MACRON
        '\u022b': b'\xc5\xc8o', # LATIN SMALL LETTER O WITH DIAERESIS AND MACRON
        '\u022c': b'\xc5\xc4O', # LATIN CAPITAL LETTER O WITH TILDE AND MACRON
        '\u022d': b'\xc5\xc4o', # LATIN SMALL LETTER O WITH TILDE AND MACRON
        '\u022e': b'\xc7O', # LATIN CAPITAL LETTER O WITH DOT ABOVE
        '\u022f': b'\xc7o', # LATIN SMALL LETTER O WITH DOT ABOVE
        '\u0230': b'\xc5\xc7O', # LATIN CAPITAL LETTER O WITH DOT ABOVE AND MACRON
        '\u0231': b'\xc5\xc7o', # LATIN SMALL LETTER O WITH DOT ABOVE AND MACRON
        '\u0232': b'\xc5Y', # LATIN CAPITAL LETTER Y WITH MACRON
        '\u0233': b'\xc5y', # LATIN SMALL LETTER Y WITH MACRON
        '\u02b9': b'\xbd', # MODIFIER LETTER PRIME
        '\u02ba': b'\xbe', # MODIFIER LETTER DOUBLE PRIME
        '\u02bb': b'\xb0', # MODIFIER LETTER TURNED COMMA
        '\u02bc': b'\xb1', # MODIFIER LETTER APOSTROPHE
        '\u0300': b'\xc1', # COMBINING GRAVE ACCENT
        '\u0301': b'\xc2', # COMBINING ACUTE ACCENT
        '\u0302': b'\xc3', # COMBINING CIRCUMFLEX ACCENT
        '\u0303': b'\xc4', # COMBINING TILDE
        '\u0304': b'\xc5', # COMBINING MACRON
        '\u0306': b'\xc6', # COMBINING BREVE
        '\u0307': b'\xc7', # COMBINING DOT ABOVE
        '\u0308': b'\xc8', # COMBINING DIAERESIS
        '\u0309': b'\xc0', # COMBINING HOOK ABOVE
        '\u030a': b'\xca', # COMBINING RING ABOVE
        '\u030b': b'\xcd', # COMBINING DOUBLE ACUTE ACCENT
        '\u030c': b'\xcf', # COMBINING CARON
        '\u0312': b'\xcc', # COMBINING TURNED COMMA ABOVE
        '\u0315': b'\xcb', # COMBINING COMMA ABOVE RIGHT
        '\u031b': b'\xce', # COMBINING HORN
        '\u031c': b'\xd1', # COMBINING LEFT HALF RING BELOW
        '\u0323': b'\xd6', # COMBINING DOT BELOW
        '\u0324': b'\xd7', # COMBINING DIAERESIS BELOW
        '\u0325': b'\xd4', # COMBINING RING BELOW
        '\u0326': b'\xd2', # COMBINING COMMA BELOW
        '\u0327': b'\xd0', # COMBINING CEDILLA
        '\u0328': b'\xd3', # COMBINING OGONEK
        '\u0329': b'\xda', # COMBINING VERTICAL LINE BELOW
        '\u032d': b'\xdb', # COMBINING CIRCUMFLEX ACCENT BELOW
        '\u032e': b'\xd5', # COMBINING BREVE BELOW
        '\u0332': b'\xd8', # COMBINING LOW LINE
        '\u0333': b'\xd9', # COMBINING DOUBLE LOW LINE
        '\u0340': b'\xc1', # COMBINING GRAVE TONE MARK
        '\u0341': b'\xc2', # COMBINING ACUTE TONE MARK
        '\u0344': b'\xc2\xc8', # COMBINING GREEK DIALYTIKA TONOS
        '\u0374': b'\xbd', # GREEK NUMERAL SIGN
        '\u037e': b';', # GREEK QUESTION MARK
        '\u0387': b'\xb7', # GREEK ANO TELEIA
        '\u1e00': b'\xd4A', # LATIN CAPITAL LETTER A WITH RING BELOW
        '\u1e01': b'\xd4a', # LATIN SMALL LETTER A WITH RING BELOW
        '\u1e02': b'\xc7B', # LATIN CAPITAL LETTER B WITH DOT ABOVE
        '\u1e03': b'\xc7b', # LATIN SMALL LETTER B WITH DOT ABOVE
        '\u1e04': b'\xd6B', # LATIN CAPITAL LETTER B WITH DOT BELOW
        '\u1e05': b'\xd6b', # LATIN SMALL LETTER B WITH DOT BELOW
        '\u1e08': b'\xc2\xd0C', # LATIN CAPITAL LETTER C WITH CEDILLA AND ACUTE
        '\u1e09': b'\xc2\xd0c', # LATIN SMALL LETTER C WITH CEDILLA AND ACUTE
        '\u1e0a': b'\xc7D', # LATIN CAPITAL LETTER D WITH DOT ABOVE
        '\u1e0b': b'\xc7d', # LATIN SMALL LETTER D WITH DOT ABOVE
        '\u1e0c': b'\xd6D', # LATIN CAPITAL LETTER D WITH DOT BELOW
        '\u1e0d': b'\xd6d', # LATIN SMALL LETTER D WITH DOT BELOW
        '\u1e10': b'\xd0D', # LATIN CAPITAL LETTER D WITH CEDILLA
        '\u1e11': b'\xd0d', # LATIN SMALL LETTER D WITH CEDILLA
        '\u1e12': b'\xdbD', # LATIN CAPITAL LETTER D WITH CIRCUMFLEX BELOW
        '\u1e13': b'\xdbd', # LATIN SMALL LETTER D WITH CIRCUMFLEX BELOW
        '\u1e14': b'\xc1\xc5E', # LATIN CAPITAL LETTER E WITH MACRON AND GRAVE
        '\u1e15': b'\xc1\xc5e', # LATIN SMALL LETTER E WITH MACRON AND GRAVE
        '\u1e16': b'\xc2\xc5E', # LATIN CAPITAL LETTER E WITH MACRON AND ACUTE
        '\u1e17': b'\xc2\xc5e', # LATIN SMALL LETTER E WITH MACRON AND ACUTE
        '\u1e18': b'\xdbE', # LATIN CAPITAL LETTER E WITH CIRCUMFLEX BELOW
        '\u1e19': b'\xdbe', # LATIN SMALL LETTER E WITH CIRCUMFLEX BELOW
        '\u1e1c': b'\xc6\xd0E', # LATIN CAPITAL LETTER E WITH CEDILLA AND BREVE
        '\u1e1d': b'\xc6\xd0e', # LATIN SMALL LETTER E WITH CEDILLA AND BREVE
        '\u1e1e': b'\xc7F', # LATIN CAPITAL LETTER F WITH DOT ABOVE
        '\u1e1f': b'\xc7f', # LATIN SMALL LETTER F WITH DOT ABOVE
        '\u1e20': b'\xc5G', # LATIN CAPITAL LETTER G WITH MACRON
        '\u1e21': b'\xc5g', # LATIN SMALL LETTER G WITH MACRON
        '\u1e22': b'\xc7H', # LATIN CAPITAL LETTER H WITH DOT ABOVE
        '\u1e23': b'\xc7h', # LATIN SMALL LETTER H WITH DOT ABOVE
        '\u1e24': b'\xd6H', # LATIN CAPITAL LETTER H WITH DOT BELOW
        '\u1e25': b'\xd6h', # LATIN SMALL LETTER H WITH DOT BELOW
        '\u1e26': b'\xc8H', # LATIN CAPITAL LETTER H WITH DIAERESIS
        '\u1e27': b'\xc8h', # LATIN SMALL LETTER H WITH DIAERESIS
        '\u1e28': b'\xd0H', # LATIN CAPITAL LETTER H WITH CEDILLA
        '\u1e29': b'\xd0h', # LATIN SMALL LETTER H WITH CEDILLA
        '\u1e2a': b'\xd5H', # LATIN CAPITAL LETTER H WITH BREVE BELOW
        '\u1e2b': b'\xd5h', # LATIN SMALL LETTER H WITH BREVE BELOW
        '\u1e2e': b'\xc2\xc8I', # LATIN CAPITAL LETTER I WITH DIAERESIS AND ACUTE
        '\u1e2f': b'\xc2\xc8i', # LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
        '\u1e30': b'\xc2K', # LATIN CAPITAL LETTER K WITH ACUTE
        '\u1e31': b'\xc2k', # LATIN SMALL LETTER K WITH ACUTE
        '\u1e32': b'\xd6K', # LATIN CAPITAL LETTER K WITH DOT BELOW
        '\u1e33': b'\xd6k', # LATIN SMALL LETTER K WITH DOT BELOW
        '\u1e36': b'\xd6L', # LATIN CAPITAL LETTER L WITH DOT BELOW
        '\u1e37': b'\xd6l', # LATIN SMALL LETTER L WITH DOT BELOW
        '\u1e38': b'\xc5\xd6L', # LATIN CAPITAL LETTER L WITH DOT BELOW AND MACRON
        '\u1e39': b'\xc5\xd6l', # LATIN SMALL LETTER L WITH DOT BELOW AND MACRON
        '\u1e3c': b'\xdbL', # LATIN CAPITAL LETTER L WITH CIRCUMFLEX BELOW
        '\u1e3d': b'\xdbl', # LATIN SMALL LETTER L WITH CIRCUMFLEX BELOW
        '\u1e3e': b'\xc2M', # LATIN CAPITAL LETTER M WITH ACUTE
        '\u1e3f': b'\xc2m', # LATIN SMALL LETTER M WITH ACUTE
        '\u1e40': b'\xc7M', # LATIN CAPITAL LETTER M WITH DOT ABOVE
        '\u1e41': b'\xc7m', # LATIN SMALL LETTER M WITH DOT ABOVE
        '\u1e42': b'\xd6M', # LATIN CAPITAL LETTER M WITH DOT BELOW
        '\u1e43': b'\xd6m', # LATIN SMALL LETTER M WITH DOT BELOW
        '\u1e44': b'\xc7N', # LATIN CAPITAL LETTER N WITH DOT ABOVE
        '\u1e45': b'\xc7n', # LATIN SMALL LETTER N WITH DOT ABOVE
        '\u1e46': b'\xd6N', # LATIN CAPITAL LETTER N WITH DOT BELOW
        '\u1e47': b'\xd6n', # LATIN SMALL LETTER N WITH DOT BELOW
        '\u1e4a': b'\xdbN', # LATIN CAPITAL LETTER N WITH CIRCUMFLEX BELOW
        '\u1e4b': b'\xdbn', # LATIN SMALL LETTER N WITH CIRCUMFLEX BELOW
        '\u1e4c': b'\xc2\xc4O', # LATIN CAPITAL LETTER O WITH TILDE AND ACUTE
        '\u1e4d': b'\xc2\xc4o', # LATIN SMALL LETTER O WITH TILDE AND ACUTE
        '\u1e4e': b'\xc8\xc4O', # LATIN CAPITAL LETTER O WITH TILDE AND DIAERESIS
        '\u1e4f': b'\xc8\xc4o', # LATIN SMALL LETTER O WITH TILDE AND DIAERESIS
        '\u1e50': b'\xc1\xc5O', # LATIN CAPITAL LETTER O WITH MACRON AND GRAVE
        '\u1e51': b'\xc1\xc5o', # LATIN SMALL LETTER O WITH MACRON AND GRAVE
        '\u1e52': b'\xc2\xc5O', # LATIN CAPITAL LETTER O WITH MACRON AND ACUTE
        '\u1e53': b'\xc2\xc5o', # LATIN SMALL LETTER O WITH MACRON AND ACUTE
        '\u1e54': b'\xc2P', # LATIN CAPITAL LETTER P WITH ACUTE
        '\u1e55': b'\xc2p', # LATIN SMALL LETTER P WITH ACUTE
        '\u1e56': b'\xc7P', # LATIN CAPITAL LETTER P WITH DOT ABOVE
        '\u1e57': b'\xc7p', # LATIN SMALL LETTER P WITH DOT ABOVE
        '\u1e58': b'\xc7R', # LATIN CAPITAL LETTER R WITH DOT ABOVE
        '\u1e59': b'\xc7r', # LATIN SMALL LETTER R WITH DOT ABOVE
        '\u1e5a': b'\xd6R', # LATIN CAPITAL LETTER R WITH DOT BELOW
        '\u1e5b': b'\xd6r', # LATIN SMALL LETTER R WITH DOT BELOW
        '\u1e5c': b'\xc5\xd6R', # LATIN CAPITAL LETTER R WITH DOT BELOW AND MACRON
        '\u1e5d': b'\xc5\xd6r', # LATIN SMALL LETTER R WITH DOT BELOW AND MACRON
        '\u1e60': b'\xc7S', # LATIN CAPITAL LETTER S WITH DOT ABOVE
        '\u1e61': b'\xc7s', # LATIN SMALL LETTER S WITH DOT ABOVE
        '\u1e62': b'\xd6S', # LATIN CAPITAL LETTER S WITH DOT BELOW
        '\u1e63': b'\xd6s', # LATIN SMALL LETTER S WITH DOT BELOW
        '\u1e64': b'\xc7\xc2S', # LATIN CAPITAL LETTER S WITH ACUTE AND DOT ABOVE
        '\u1e65': b'\xc7\xc2s', # LATIN SMALL LETTER S WITH ACUTE AND DOT ABOVE
        '\u1e66': b'\xc7\xcfS', # LATIN CAPITAL LETTER S WITH CARON AND DOT ABOVE
        '\u1e67': b'\xc7\xcfs', # LATIN SMALL LETTER S WITH CARON AND DOT ABOVE
        '\u1e68': b'\xc7\xd6S', # LATIN CAPITAL LETTER S WITH DOT BELOW AND DOT ABOVE
        '\u1e69': b'\xc7\xd6s', # LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE
        '\u1e6a': b'\xc7T', # LATIN CAPITAL LETTER T WITH DOT ABOVE
        '\u1e6b': b'\xc7t', # LATIN SMALL LETTER T WITH DOT ABOVE
        '\u1e6c': b'\xd6T', # LATIN CAPITAL LETTER T WITH DOT BELOW
        '\u1e6d': b'\xd6t', # LATIN SMALL LETTER T WITH DOT BELOW
        '\u1e70': b'\xdbT', # LATIN CAPITAL LETTER T WITH CIRCUMFLEX BELOW
        '\u1e71': b'\xdbt', # LATIN SMALL LETTER T WITH CIRCUMFLEX BELOW
        '\u1e72': b'\xd7U', # LATIN CAPITAL LETTER U WITH DIAERESIS BELOW
        '\u1e73': b'\xd7u', # LATIN SMALL LETTER U WITH DIAERESIS BELOW
        '\u1e76': b'\xdbU', # LATIN CAPITAL LETTER U WITH CIRCUMFLEX BELOW
        '\u1e77': b'\xdbu', # LATIN SMALL LETTER U WITH CIRCUMFLEX BELOW
        '\u1e78': b'\xc2\xc4U', # LATIN CAPITAL LETTER U WITH TILDE AND ACUTE
        '\u1e79': b'\xc2\xc4u', # LATIN SMALL LETTER U WITH TILDE AND ACUTE
        '\u1e7a': b'\xc8\xc5U', # LATIN CAPITAL LETTER U WITH MACRON AND DIAERESIS
        '\u1e7b': b'\xc8\xc5u', # LATIN SMALL LETTER U WITH MACRON AND DIAERESIS
        '\u1e7c': b'\xc4V', # LATIN CAPITAL LETTER V WITH TILDE
        '\u1e7d': b'\xc4v', # LATIN SMALL LETTER V WITH TILDE
        '\u1e7e': b'\xd6V', # LATIN CAPITAL LETTER V WITH DOT BELOW
        '\u1e7f': b'\xd6v', # LATIN SMALL LETTER V WITH DOT BELOW
        '\u1e80': b'\xc1W', # LATIN CAPITAL LETTER W WITH GRAVE
        '\u1e81': b'\xc1w', # LATIN SMALL LETTER W WITH GRAVE
        '\u1e82': b'\xc2W', # LATIN CAPITAL LETTER W WITH ACUTE
        '\u1e83': b'\xc2w', # LATIN SMALL LETTER W WITH ACUTE
        '\u1e84': b'\xc8W', # LATIN CAPITAL LETTER W WITH DIAERESIS
        '\u1e85': b'\xc8w', # LATIN SMALL LETTER W WITH DIAERESIS
        '\u1e86': b'\xc7W', # LATIN CAPITAL LETTER W WITH DOT ABOVE
        '\u1e87': b'\xc7w', # LATIN SMALL LETTER W WITH DOT ABOVE
        '\u1e88': b'\xd6W', # LATIN CAPITAL LETTER W WITH DOT BELOW
        '\u1e89': b'\xd6w', # LATIN SMALL LETTER W WITH DOT BELOW
        '\u1e8a': b'\xc7X', # LATIN CAPITAL LETTER X WITH DOT ABOVE
        '\u1e8b': b'\xc7x', # LATIN SMALL LETTER X WITH DOT ABOVE
        '\u1e8c': b'\xc8X', # LATIN CAPITAL LETTER X WITH DIAERESIS
        '\u1e8d': b'\xc8x', # LATIN SMALL LETTER X WITH DIAERESIS
        '\u1e8e': b'\xc7Y', # LATIN CAPITAL LETTER Y WITH DOT ABOVE
        '\u1e8f': b'\xc7y', # LATIN SMALL LETTER Y WITH DOT ABOVE
        '\u1e90': b'\xc3Z', # LATIN CAPITAL LETTER Z WITH CIRCUMFLEX
        '\u1e91': b'\xc3z', # LATIN SMALL LETTER Z WITH CIRCUMFLEX
        '\u1e92': b'\xd6Z', # LATIN CAPITAL LETTER Z WITH DOT BELOW
        '\u1e93': b'\xd6z', # LATIN SMALL LETTER Z WITH DOT BELOW
        '\u1e97': b'\xc8t', # LATIN SMALL LETTER T WITH DIAERESIS
        '\u1e98': b'\xcaw', # LATIN SMALL LETTER W WITH RING ABOVE
        '\u1e99': b'\xcay', # LATIN SMALL LETTER Y WITH RING ABOVE
        '\u1ea0': b'\xd6A', # LATIN CAPITAL LETTER A WITH DOT BELOW
        '\u1ea1': b'\xd6a', # LATIN SMALL LETTER A WITH DOT BELOW
        '\u1ea2': b'\xc0A', # LATIN CAPITAL LETTER A WITH HOOK ABOVE
        '\u1ea3': b'\xc0a', # LATIN SMALL LETTER A WITH HOOK ABOVE
        '\u1ea4': b'\xc2\xc3A', # LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE
        '\u1ea5': b'\xc2\xc3a', # LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
        '\u1ea6': b'\xc1\xc3A', # LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE
        '\u1ea7': b'\xc1\xc3a', # LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
        '\u1ea8': b'\xc0\xc3A', # LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
        '\u1ea9': b'\xc0\xc3a', # LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
        '\u1eaa': b'\xc4\xc3A', # LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE
        '\u1eab': b'\xc4\xc3a', # LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
        '\u1eac': b'\xc3\xd6A', # LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW
        '\u1ead': b'\xc3\xd6a', # LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
        '\u1eae': b'\xc2\xc6A', # LATIN CAPITAL LETTER A WITH BREVE AND ACUTE
        '\u1eaf': b'\xc2\xc6a', # LATIN SMALL LETTER A WITH BREVE AND ACUTE
        '\u1eb0': b'\xc1\xc6A', # LATIN CAPITAL LETTER A WITH BREVE AND GRAVE
        '\u1eb1': b'\xc1\xc6a', # LATIN SMALL LETTER A WITH BREVE AND GRAVE
        '\u1eb2': b'\xc0\xc6A', # LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE
        '\u1eb3': b'\xc0\xc6a', # LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
        '\u1eb4': b'\xc4\xc6A', # LATIN CAPITAL LETTER A WITH BREVE AND TILDE
        '\u1eb5': b'\xc4\xc6a', # LATIN SMALL LETTER A WITH BREVE AND TILDE
        '\u1eb6': b'\xc6\xd6A', # LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW
        '\u1eb7': b'\xc6\xd6a', # LATIN SMALL LETTER A WITH BREVE AND DOT BELOW
        '\u1eb8': b'\xd6E', # LATIN CAPITAL LETTER E WITH DOT BELOW
        '\u1eb9': b'\xd6e', # LATIN SMALL LETTER E WITH DOT BELOW
        '\u1eba': b'\xc0E', # LATIN CAPITAL LETTER E WITH HOOK ABOVE
        '\u1ebb': b'\xc0e', # LATIN SMALL LETTER E WITH HOOK ABOVE
        '\u1ebc': b'\xc4E', # LATIN CAPITAL LETTER E WITH TILDE
        '\u1ebd': b'\xc4e', # LATIN SMALL LETTER E WITH TILDE
        '\u1ebe': b'\xc2\xc3E', # LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND ACUTE
        '\u1ebf': b'\xc2\xc3e', # LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE
        '\u1ec0': b'\xc1\xc3E', # LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND GRAVE
        '\u1ec1': b'\xc1\xc3e', # LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRAVE
        '\u1ec2': b'\xc0\xc3E', # LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND HOOK ABOVE
        '\u1ec3': b'\xc0\xc3e', # LATIN SMALL LETTER E WITH CIRCUMFLEX AND HOOK ABOVE
        '\u1ec4': b'\xc4\xc3E', # LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND TILDE
        '\u1ec5': b'\xc4\xc3e', # LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE
        '\u1ec6': b'\xc3\xd6E', # LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND DOT BELOW
        '\u1ec7': b'\xc3\xd6e', # LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
        '\u1ec8': b'\xc0I', # LATIN CAPITAL LETTER I WITH HOOK ABOVE
        '\u1ec9': b'\xc0i', # LATIN SMALL LETTER I WITH HOOK ABOVE
        '\u1eca': b'\xd6I', # LATIN CAPITAL LETTER I WITH DOT BELOW
        '\u1ecb': b'\xd6i', # LATIN SMALL LETTER I WITH DOT BELOW
        '\u1ecc': b'\xd6O', # LATIN CAPITAL LETTER O WITH DOT BELOW
        '\u1ecd': b'\xd6o', # LATIN SMALL LETTER O WITH DOT BELOW
        '\u1ece': b'\xc0O', # LATIN CAPITAL LETTER O WITH HOOK ABOVE
        '\u1ecf': b'\xc0o', # LATIN SMALL LETTER O WITH HOOK ABOVE
        '\u1ed0': b'\xc2\xc3O', # LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND ACUTE
        '\u1ed1': b'\xc2\xc3o', # LATIN SMALL LETTER O WITH CIRCUMFLEX AND ACUTE
        '\u1ed2': b'\xc1\xc3O', # LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND GRAVE
        '\u1ed3': b'\xc1\xc3o', # LATIN SMALL LETTER O WITH CIRCUMFLEX AND GRAVE
        '\u1ed4': b'\xc0\xc3O', # LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE
        '\u1ed5': b'\xc0\xc3o', # LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE
        '\u1ed6': b'\xc4\xc3O', # LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND TILDE
        '\u1ed7': b'\xc4\xc3o', # LATIN SMALL LETTER O WITH CIRCUMFLEX AND TILDE
        '\u1ed8': b'\xc3\xd6O', # LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND DOT BELOW
        '\u1ed9': b'\xc3\xd6o', # LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW
        '\u1eda': b'\xc2\xceO', # LATIN CAPITAL LETTER O WITH HORN AND ACUTE
        '\u1edb': b'\xc2\xceo', # LATIN SMALL LETTER O WITH HORN AND ACUTE
        '\u1edc': b'\xc1\xceO', # LATIN CAPITAL LETTER O WITH HORN AND GRAVE
        '\u1edd': b'\xc1\xceo', # LATIN SMALL LETTER O WITH HORN AND GRAVE
        '\u1ede': b'\xc0\xceO', # LATIN CAPITAL LETTER O WITH HORN AND HOOK ABOVE
        '\u1edf': b'\xc0\xceo', # LATIN SMALL LETTER O WITH HORN AND HOOK ABOVE
        '\u1ee0': b'\xc4\xceO', # LATIN CAPITAL LETTER O WITH HORN AND TILDE
        '\u1ee1': b'\xc4\xceo', # LATIN SMALL LETTER O WITH HORN AND TILDE
        '\u1ee2': b'\xd6\xceO', # LATIN CAPITAL LETTER O WITH HORN AND DOT BELOW
        '\u1ee3': b'\xd6\xceo', # LATIN SMALL LETTER O WITH HORN AND DOT BELOW
        '\u1ee4': b'\xd6U', # LATIN CAPITAL LETTER U WITH DOT BELOW
        '\u1ee5': b'\xd6u', # LATIN SMALL LETTER U WITH DOT BELOW
        '\u1ee6': b'\xc0U', # LATIN CAPITAL LETTER U WITH HOOK ABOVE
        '\u1ee7': b'\xc0u', # LATIN SMALL LETTER U WITH HOOK ABOVE
        '\u1ee8': b'\xc2\xceU', # LATIN CAPITAL LETTER U WITH HORN AND ACUTE
        '\u1ee9': b'\xc2\xceu', # LATIN SMALL LETTER U WITH HORN AND ACUTE
        '\u1eea': b'\xc1\xceU', # LATIN CAPITAL LETTER U WITH HORN AND GRAVE
        '\u1eeb': b'\xc1\xceu', # LATIN SMALL LETTER U WITH HORN AND GRAVE
        '\u1eec': b'\xc0\xceU', # LATIN CAPITAL LETTER U WITH HORN AND HOOK ABOVE
        '\u1eed': b'\xc0\xceu', # LATIN SMALL LETTER U WITH HORN AND HOOK ABOVE
        '\u1eee': b'\xc4\xceU', # LATIN CAPITAL LETTER U WITH HORN AND TILDE
        '\u1eef': b'\xc4\xceu', # LATIN SMALL LETTER U WITH HORN AND TILDE
        '\u1ef0': b'\xd6\xceU', # LATIN CAPITAL LETTER U WITH HORN AND DOT BELOW
        '\u1ef1': b'\xd6\xceu', # LATIN SMALL LETTER U WITH HORN AND DOT BELOW
        '\u1ef2': b'\xc1Y', # LATIN CAPITAL LETTER Y WITH GRAVE
        '\u1ef3': b'\xc1y', # LATIN SMALL LETTER Y WITH GRAVE
        '\u1ef4': b'\xd6Y', # LATIN CAPITAL LETTER Y WITH DOT BELOW
        '\u1ef5': b'\xd6y', # LATIN SMALL LETTER Y WITH DOT BELOW
        '\u1ef6': b'\xc0Y', # LATIN CAPITAL LETTER Y WITH HOOK ABOVE
        '\u1ef7': b'\xc0y', # LATIN SMALL LETTER Y WITH HOOK ABOVE
        '\u1ef8': b'\xc4Y', # LATIN CAPITAL LETTER Y WITH TILDE
        '\u1ef9': b'\xc4y', # LATIN SMALL LETTER Y WITH TILDE
        '\u1fef': b'`', # GREEK VARIA
        '\u2018': b'\xa9', # LEFT SINGLE QUOTATION MARK
        '\u2019': b'\xb9', # RIGHT SINGLE QUOTATION MARK
        '\u201a': b'\xb2', # SINGLE LOW-9 QUOTATION MARK
        '\u201c': b'\xaa', # LEFT DOUBLE QUOTATION MARK
        '\u201d': b'\xba', # RIGHT DOUBLE QUOTATION MARK
        '\u201e': b'\xa2', # DOUBLE LOW-9 QUOTATION MARK
        '\u2020': b'\xa6', # DAGGER
        '\u2021': b'\xb6', # DOUBLE DAGGER
        '\u2032': b'\xa8', # PRIME
        '\u2033': b'\xb8', # DOUBLE PRIME
        '\u2117': b'\xae', # SOUND RECORDING COPYRIGHT
        #'\u212a': b'K', # KELVIN SIGN
        '\u212b': b'\xcaA', # ANGSTROM SIGN
        '\u266d': b'\xac', # MUSIC FLAT SIGN
        '\u266f': b'\xbc', # MUSIC SHARP SIGN
        '\ufe20': b'\xdd', # COMBINING LIGATURE LEFT HALF
        '\ufe21': b'\xde', # COMBINING LIGATURE RIGHT HALF
        '\ufe23': b'\xdf', # COMBINING DOUBLE TILDE RIGHT HALF
    }
     
     
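    # Build the reverse table (ISO 5426 bytes -> Unicode). Several code points
    # share one byte sequence (e.g. U+0300 and U+0340 both map to b'\xc1'),
    # so only the first one encountered is kept.
    charmap = {}
    for uni, char in unicodemap.items():
        if char in charmap:
            continue
        charmap[char] = uni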
    decode(b'la for\xc3et')
    The code can surely be optimized: my to_bytes calls may not be the best way to make it work.
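    To illustrate what the reverse table is for, here is a minimal sketch of a decoder. It is not the decode function from this thread: `unicodemap_excerpt`, `charmap_excerpt` and `decode_sketch` are hypothetical names, and the table is a five-entry excerpt of the one above. It assumes that a diacritic byte precedes its base letter, as in the entries listed in the table, and relies on the standard `unicodedata` module to recompose the result.

```python
import unicodedata

# Small excerpt of the Unicode -> ISO 5426 table (same shape as unicodemap).
unicodemap_excerpt = {
    '\u0300': b'\xc1',  # COMBINING GRAVE ACCENT
    '\u0340': b'\xc1',  # COMBINING GRAVE TONE MARK (same byte as U+0300)
    '\u0301': b'\xc2',  # COMBINING ACUTE ACCENT
    '\u0302': b'\xc3',  # COMBINING CIRCUMFLEX ACCENT
    '\u0153': b'\xfa',  # LATIN SMALL LIGATURE OE
}

# Invert the table, keeping the first code point seen for each byte sequence.
charmap_excerpt = {}
for uni, raw in unicodemap_excerpt.items():
    charmap_excerpt.setdefault(raw, uni)


def decode_sketch(data):
    """Decode bytes in which diacritic bytes precede their base letter."""
    out = []
    pending = []  # combining marks waiting for their base letter
    for i in range(len(data)):
        raw = data[i:i + 1]  # slicing keeps a bytes object, not an int
        # Unknown high bytes become U+FFFD; ASCII bytes map to themselves.
        char = charmap_excerpt.get(raw, raw.decode('ascii', 'replace'))
        if unicodedata.combining(char):
            pending.append(char)
        else:
            # Marks are emitted in reverse byte order, which matches table
            # entries such as b'\xc2\xc3A' for U+1EA4.
            out.append(char + ''.join(reversed(pending)))
            pending.clear()
    out.extend(pending)  # dangling marks at the end of the input
    return unicodedata.normalize('NFC', ''.join(out))


decoded = decode_sketch(b'la for\xc3et')
print(decoded)  # la forêt
```

    The final NFC normalization is what turns `e` followed by a combining circumflex into the single character `ê`.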

  17. #17
    papajoker
    On line 18, you can use a plain return instead of the raise.
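    For context: since Python 3.7 (PEP 479), a StopIteration raised inside a generator body is converted into a RuntimeError, whereas a bare return ends the generator cleanly. The sketch below contrasts the two; `bytes_with_raise` and `bytes_with_return` are hypothetical helper names, not code from the thread.

```python
def bytes_with_raise(data):
    """Generator that ends by raising StopIteration (the old habit)."""
    pos = 0
    while True:
        if pos >= len(data):
            raise StopIteration  # Python 3.7+: surfaces as RuntimeError
        yield data[pos:pos + 1]
        pos += 1


def bytes_with_return(data):
    """Same generator, ended with a plain return."""
    pos = 0
    while True:
        if pos >= len(data):
            return
        yield data[pos:pos + 1]
        pos += 1


try:
    list(bytes_with_raise(b'ab'))
    raised = False
except RuntimeError:
    raised = True

safe = list(bytes_with_return(b'ab'))
print(raised, safe)
```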

    Originally posted by jurassic pork
    to be optimized
    As mentioned above, I think it is enough to convert the output of the yield in DecodeIterator.__iter__() and the result of peek(), inside those two methods.
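    The conversion is needed because indexing a bytes object in Python 3 returns an int, while a table keyed by bytes (such as charmap) needs bytes for the lookup. A short sketch of three equivalent ways to get a one-byte bytes object; the variable names are only illustrative:

```python
data = b'for\xc3et'

as_int = data[3]                            # indexing gives an int
via_to_bytes = data[3].to_bytes(1, 'big')   # int -> one-byte bytes
via_constructor = bytes([data[3]])          # same result from a list of ints
via_slice = data[3:4]                       # slicing never leaves bytes

print(as_int, via_to_bytes, via_constructor, via_slice)
```

    Slicing avoids the round trip through int, which is why it is often the simplest option.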

  18. #18
    jurassic pork
    Hello,
    Thanks, papajoker, for the expert advice. Here is what I ended up with.
    The DecodeIterator class is modified as follows:
    Code:
    class DecodeIterator(object):
        """Decoding iterator with peek and evolve
        """
        __slots__ = ("_data", "_length", "_pos")
        def __init__(self, data):
            self._data = data
            self._length = len(data)
            self._pos = 0
     
        def __iter__(self):
            while True:
                pos = self._pos
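                if pos >= self._length:
                    # A bare return ends the generator; raising StopIteration
                    # here becomes a RuntimeError since Python 3.7 (PEP 479).
                    return
                # Indexing bytes yields an int in Python 3, so convert it
                # back to a one-byte bytes object.
                yield self._data[pos].to_bytes(1, 'big')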
                self._pos += 1
     
        def __len__(self):
            return self._length
     
        #def __getitem__(self, item):
        #    return self._data.__getitem__(item)
     
        @property
        def position(self):
            return self._pos
     
        def peek(self, amount=2):
            nextpos = self._pos + 1
            result = [x.to_bytes(1, 'big') for x in self._data[nextpos:nextpos + amount]]
            if len(result) < amount:
                result.extend([None] * (amount - len(result)))
            return result
     
        def evolve(self, amount=1):
            self._pos += amount
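    To show how iteration, peek and evolve interact, here is a compact, self-contained variant of the class with a short usage example. It is a sketch, not the poster's code: slicing is swapped in for to_bytes, and the loop that pairs a diacritic byte with the following base letter is only an assumed use of the iterator.

```python
class DecodeIterator:
    """Byte iterator with look-ahead (peek) and skip (evolve)."""
    __slots__ = ("_data", "_length", "_pos")

    def __init__(self, data):
        self._data = data
        self._length = len(data)
        self._pos = 0

    def __iter__(self):
        while self._pos < self._length:
            # Slicing yields a one-byte bytes object directly.
            yield self._data[self._pos:self._pos + 1]
            self._pos += 1

    def peek(self, amount=2):
        """Return the next `amount` bytes, padded with None at the end."""
        nextpos = self._pos + 1
        chunk = self._data[nextpos:nextpos + amount]
        result = [chunk[i:i + 1] for i in range(len(chunk))]
        result.extend([None] * (amount - len(result)))
        return result

    def evolve(self, amount=1):
        """Skip `amount` bytes that were consumed through peek()."""
        self._pos += amount


it = DecodeIterator(b'for\xc3et')
seen = []
for byte in it:
    if byte == b'\xc3':            # circumflex: look at the base letter
        base, _ = it.peek(2)
        seen.append((byte, base))
        it.evolve(1)               # the base letter has been consumed
    else:
        seen.append(byte)

print(seen)
print(it.peek(2))  # past the end of the data: [None, None]
```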
    And here is the script decIso5626.py, which contains the decodeIso5426 function for decoding strings in ISO-5426 format:
    Code:
    731
    732
    733
    734
    735
    736
    737
    738
    739
    740
    741
    742
    743
    744
    745
    746
    747
    748
    749
    750
    751
    752
    753
    754
    755
    756
    757
    758
    759
    760
    761
    762
    763
    764
    765
    766
    767
    768
    769
    770
    771
    772
    773
    774
    775
    776
    777
    778
    779
    780
    781
    782
    783
    784
    785
    786
    787
    788
    789
    790
    791
    792
    793
    794
    795
    796
    797
    798
    799
    800
    801
    802
    803
    804
    805
    806
    807
    808
    809
    810
    811
    812
    813
    814
    815
    816
    817
    818
    819
    820
    821
    822
    823
    824
    825
    826
    827
    828
    829
    830
    831
    832
    833
    834
    835
    836
    837
    838
    839
    840
    841
    842
    843
    844
    845
    846
    from __future__ import unicode_literals, print_function
     
    class DecodeIterator(object):
        """Decoding iterator with peek and evolve
        """
     
        __slots__ = ("_data", "_length", "_pos")
        def __init__(self, data):
            self._data = data
            self._length = len(data)
            self._pos = 0
     
     
        def __iter__(self):
            while True:
                pos = self._pos
                if pos >= self._length:
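                    # End the generator with a plain return: raising StopIteration
                    # inside a generator is a RuntimeError since Python 3.7 (PEP 479).
                    return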
                yield self._data[pos].to_bytes(1,'big')
                self._pos += 1
     
     
        def __len__(self):
            return self._length
     
     
        #def __getitem__(self, item):
        #    return self._data.__getitem__(item)
     
        @property
        def position(self):
            return self._pos
     
        def peek(self, amount=2):
            nextpos = self._pos + 1
            result = [x.to_bytes(1,'big') for x in list(self._data[nextpos:nextpos + amount])]
            if len(result) < amount:
                result.extend([None] * (amount - len(result)))
            return result
     
        def evolve(self, amount=1):
            self._pos += amount
     
     
        #def residual(self, amount=1):
        #    return self._length - self._pos > amount
     
     
    def decodeIso5426(input, errors='strict', special=None):
        """Decode unicode from ISO-5426
        """
        if errors not in set(['strict', 'replace', 'ignore', 'repr']):
            raise ValueError("Invalid errors argument %s" % errors)
        result = []
        di = DecodeIterator(input)
        # optimizations
        rappend = result.append
        cget = charmap.get
        for c in di:
            o = ord(c) # first error here
            # ASCII chars
            if c < b'\x7f':
                rappend(chr(o))
                #i += 1
                continue
     
     
     
     
            c1, c2 = di.peek(2)
            ccc2 = None
            # 0xc0 to 0xdf signals a combined char
            if 0xc0 <= o <= 0xdf and c1 is not None:
                # special case 0xc9: both 0xc8 and 0xc9 are combining diaeresis
                # use 0xc8 in favor of 0xc9
                if c == b'\xc9':
                    c = b'\xc8'
                if c1 == b'\xc9':
                    c1 = b'\xc8'
                # double combined char
                if 0xc0 <= ord(c1) <= 0xdf and c2 is not None:
                    ccc2 = c + c1 + c2
                    r = cget(ccc2)
                    if r is not None:
                        # double combined found in table
                        rappend(r)
                        di.evolve(2)
                        continue
                    # build combining unicode
                    dc1 = cget(c)
                    dc2 = cget(c1 + c2)
                    if dc1 is not None and dc2 is not None: # pragma: no branch
                        # reverse order, in unicode, the combining char comes after the char
                        rappend(dc2 + dc1)
                        di.evolve(2)
                        continue
                else:
                    cc1 = c + c1
                    r = cget(cc1)
                    if r is not None:
                        rappend(r)
                        di.evolve(1)
                        continue
                    # denormalized unicode: char + combining
                    r = cget(c)
                    rn = cget(c1)
                    if r is not None and rn is not None: # pragma: no branch
                        rappend(rn + r)
                        di.evolve(1)
                        continue
     
     
     
     
                # just the combining
                #r = cget(c)
                #if r is not None:
                #    result.append(r)
                #    continue
     
     
     
     
            # other chars, 0x80 <= o <= 0xbf or o >= 0xe0 or last combining
            if special is not None:
                r = special.get(c)
                if r is not None:
                    rappend(r)
                    continue
     
     
     
     
            r = cget(c)
            if r is not None:
                rappend(r)
                continue
     
     
     
     
            # only reached when no result was found
            if errors == "strict":
                p = di.position
                raise UnicodeError("Can't decode byte%s %r at position %i (context %r)" %
                                   ("" if ccc2 is None else "s",
                                    c if ccc2 is None else ccc2,
                                    p, input[max(0, p - 3):p + 3]))
            elif errors == "replace":
                rappend('\ufffd')
            elif errors == "ignore":
                pass
            elif errors == "repr":
                rappend('\\x%x' % o)
            else: # pragma: no cover
                # should never be reached
                raise ValueError("Invalid errors argument %s" % errors)
     
        return "".join(result)  #, di.position
     
     
     
     
    # special identity mapping for 0xa4, 0xe0-0xff
    special_xe0_map = {
        b'\xa4': '\xa4',
        b'\xe0': '\xe0',
        b'\xe1': '\xe1',
        b'\xe2': '\xe2',
        b'\xe3': '\xe3',
        b'\xe4': '\xe4',
        b'\xe5': '\xe5',
        b'\xe6': '\xe6',
        b'\xe7': '\xe7',
        b'\xe8': '\xe8',
        b'\xe9': '\xe9',
        b'\xea': '\xea',
        b'\xeb': '\xeb',
        b'\xec': '\xec',
        b'\xed': '\xed',
        b'\xee': '\xee',
        b'\xef': '\xef',
        b'\xf0': '\xf0',
        b'\xf1': '\xf1',
        b'\xf2': '\xf2',
        b'\xf3': '\xf3',
        b'\xf4': '\xf4',
        b'\xf5': '\xf5',
        b'\xf6': '\xf6',
        b'\xf7': '\xf7',
        b'\xf8': '\xf8',
        b'\xf9': '\xf9',
        b'\xfa': '\xfa',
        b'\xfb': '\xfb',
        b'\xfc': '\xfc',
        b'\xfd': '\xfd',
        b'\xfe': '\xfe',
        b'\xff': '\xff'}
     
     
     
     
    unicodemap = {
        '\u001d': b'\x1d', # <control>
        '\u001e': b'\x1e', # <control>
        '\u001f': b'\x1f', # <control>
        '\u0020': b' ', # SPACE
        '\u0021': b'!', # EXCLAMATION MARK
        '\u0022': b'"', # QUOTATION MARK
        '\u0023': b'#', # NUMBER SIGN
        '\u0024': b'\xa4', # DOLLAR SIGN
        '\u0025': b'%', # PERCENT SIGN
        '\u0026': b'&', # AMPERSAND
        '\u0027': b"'", # APOSTROPHE
        '\u0028': b'(', # LEFT PARENTHESIS
        '\u0029': b')', # RIGHT PARENTHESIS
        '\u002a': b'*', # ASTERISK
        '\u002b': b'+', # PLUS SIGN
        '\u002c': b',', # COMMA
        '\u002d': b'-', # HYPHEN-MINUS
        '\u002e': b'.', # FULL STOP
        '\u002f': b'/', # SOLIDUS
        '\u0030': b'0', # DIGIT ZERO
        '\u0031': b'1', # DIGIT ONE
        '\u0032': b'2', # DIGIT TWO
        '\u0033': b'3', # DIGIT THREE
        '\u0034': b'4', # DIGIT FOUR
        '\u0035': b'5', # DIGIT FIVE
        '\u0036': b'6', # DIGIT SIX
        '\u0037': b'7', # DIGIT SEVEN
        '\u0038': b'8', # DIGIT EIGHT
        '\u0039': b'9', # DIGIT NINE
        '\u003a': b':', # COLON
        '\u003b': b';', # SEMICOLON
        '\u003c': b'<', # LESS-THAN SIGN
        '\u003d': b'=', # EQUALS SIGN
        '\u003e': b'>', # GREATER-THAN SIGN
        '\u003f': b'?', # QUESTION MARK
        '\u0040': b'@', # COMMERCIAL AT
        '\u0041': b'A', # LATIN CAPITAL LETTER A
        '\u0042': b'B', # LATIN CAPITAL LETTER B
        '\u0043': b'C', # LATIN CAPITAL LETTER C
        '\u0044': b'D', # LATIN CAPITAL LETTER D
        '\u0045': b'E', # LATIN CAPITAL LETTER E
        '\u0046': b'F', # LATIN CAPITAL LETTER F
        '\u0047': b'G', # LATIN CAPITAL LETTER G
        '\u0048': b'H', # LATIN CAPITAL LETTER H
        '\u0049': b'I', # LATIN CAPITAL LETTER I
        '\u004a': b'J', # LATIN CAPITAL LETTER J
        '\u004b': b'K', # LATIN CAPITAL LETTER K
        '\u004c': b'L', # LATIN CAPITAL LETTER L
        '\u004d': b'M', # LATIN CAPITAL LETTER M
        '\u004e': b'N', # LATIN CAPITAL LETTER N
        '\u004f': b'O', # LATIN CAPITAL LETTER O
        '\u0050': b'P', # LATIN CAPITAL LETTER P
        '\u0051': b'Q', # LATIN CAPITAL LETTER Q
        '\u0052': b'R', # LATIN CAPITAL LETTER R
        '\u0053': b'S', # LATIN CAPITAL LETTER S
        '\u0054': b'T', # LATIN CAPITAL LETTER T
        '\u0055': b'U', # LATIN CAPITAL LETTER U
        '\u0056': b'V', # LATIN CAPITAL LETTER V
        '\u0057': b'W', # LATIN CAPITAL LETTER W
        '\u0058': b'X', # LATIN CAPITAL LETTER X
        '\u0059': b'Y', # LATIN CAPITAL LETTER Y
        '\u005a': b'Z', # LATIN CAPITAL LETTER Z
        '\u005b': b'[', # LEFT SQUARE BRACKET
        '\u005c': b'\\', # REVERSE SOLIDUS
        '\u005d': b']', # RIGHT SQUARE BRACKET
        '\u005e': b'^', # CIRCUMFLEX ACCENT
        '\u005f': b'_', # LOW LINE
        '\u0060': b'`', # GRAVE ACCENT
        '\u0061': b'a', # LATIN SMALL LETTER A
        '\u0062': b'b', # LATIN SMALL LETTER B
        '\u0063': b'c', # LATIN SMALL LETTER C
        '\u0064': b'd', # LATIN SMALL LETTER D
        '\u0065': b'e', # LATIN SMALL LETTER E
        '\u0066': b'f', # LATIN SMALL LETTER F
        '\u0067': b'g', # LATIN SMALL LETTER G
        '\u0068': b'h', # LATIN SMALL LETTER H
        '\u0069': b'i', # LATIN SMALL LETTER I
        '\u006a': b'j', # LATIN SMALL LETTER J
        '\u006b': b'k', # LATIN SMALL LETTER K
        '\u006c': b'l', # LATIN SMALL LETTER L
        '\u006d': b'm', # LATIN SMALL LETTER M
        '\u006e': b'n', # LATIN SMALL LETTER N
        '\u006f': b'o', # LATIN SMALL LETTER O
        '\u0070': b'p', # LATIN SMALL LETTER P
        '\u0071': b'q', # LATIN SMALL LETTER Q
        '\u0072': b'r', # LATIN SMALL LETTER R
        '\u0073': b's', # LATIN SMALL LETTER S
        '\u0074': b't', # LATIN SMALL LETTER T
        '\u0075': b'u', # LATIN SMALL LETTER U
        '\u0076': b'v', # LATIN SMALL LETTER V
        '\u0077': b'w', # LATIN SMALL LETTER W
        '\u0078': b'x', # LATIN SMALL LETTER X
        '\u0079': b'y', # LATIN SMALL LETTER Y
        '\u007a': b'z', # LATIN SMALL LETTER Z
        '\u007b': b'{', # LEFT CURLY BRACKET
        '\u007c': b'|', # VERTICAL LINE
        '\u007d': b'}', # RIGHT CURLY BRACKET
        '\u007e': b'~', # TILDE
        '\u0088': b'\x88', # <control>
        '\u0089': b'\x89', # <control>
        # XXX not part of the standard but MARC equivalent of \x88, \x89
        #'\u0098': b'\x98', # <control>
        #'\u009c': b'\x9c', # <control>
        '\u00a0': b'\xa0', # NO-BREAK SPACE (added by J.P., 05/2023)
        '\u00a1': b'\xa1', # INVERTED EXCLAMATION MARK
        '\u00a3': b'\xa3', # POUND SIGN
        '\u00a5': b'\xa5', # YEN SIGN
        '\u00a7': b'\xa7', # SECTION SIGN
        '\u00a9': b'\xad', # COPYRIGHT SIGN
        '\u00ab': b'\xab', # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
        '\u00ae': b'\xaf', # REGISTERED SIGN
        '\u00b7': b'\xb7', # MIDDLE DOT
        '\u00bb': b'\xbb', # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
        '\u00bf': b'\xbf', # INVERTED QUESTION MARK
        '\u00c0': b'\xc1A', # LATIN CAPITAL LETTER A WITH GRAVE
        '\u00c1': b'\xc2A', # LATIN CAPITAL LETTER A WITH ACUTE
        '\u00c2': b'\xc3A', # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
        '\u00c3': b'\xc4A', # LATIN CAPITAL LETTER A WITH TILDE
        '\u00c4': b'\xc8A', # LATIN CAPITAL LETTER A WITH DIAERESIS
        '\u00c5': b'\xcaA', # LATIN CAPITAL LETTER A WITH RING ABOVE
        '\u00c6': b'\xe1', # LATIN CAPITAL LETTER AE
        '\u00c7': b'\xd0C', # LATIN CAPITAL LETTER C WITH CEDILLA
        '\u00c8': b'\xc1E', # LATIN CAPITAL LETTER E WITH GRAVE
        '\u00c9': b'\xc2E', # LATIN CAPITAL LETTER E WITH ACUTE
        '\u00ca': b'\xc3E', # LATIN CAPITAL LETTER E WITH CIRCUMFLEX
        '\u00cb': b'\xc8E', # LATIN CAPITAL LETTER E WITH DIAERESIS
        '\u00cc': b'\xc1I', # LATIN CAPITAL LETTER I WITH GRAVE
        '\u00cd': b'\xc2I', # LATIN CAPITAL LETTER I WITH ACUTE
        '\u00ce': b'\xc3I', # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
        '\u00cf': b'\xc8I', # LATIN CAPITAL LETTER I WITH DIAERESIS
        '\u00d1': b'\xc4N', # LATIN CAPITAL LETTER N WITH TILDE
        '\u00d2': b'\xc1O', # LATIN CAPITAL LETTER O WITH GRAVE
        '\u00d3': b'\xc2O', # LATIN CAPITAL LETTER O WITH ACUTE
        '\u00d4': b'\xc3O', # LATIN CAPITAL LETTER O WITH CIRCUMFLEX
        '\u00d5': b'\xc4O', # LATIN CAPITAL LETTER O WITH TILDE
        '\u00d6': b'\xc8O', # LATIN CAPITAL LETTER O WITH DIAERESIS
        '\u00d8': b'\xe9', # LATIN CAPITAL LETTER O WITH STROKE
        '\u00d9': b'\xc1U', # LATIN CAPITAL LETTER U WITH GRAVE
        '\u00da': b'\xc2U', # LATIN CAPITAL LETTER U WITH ACUTE
        '\u00db': b'\xc3U', # LATIN CAPITAL LETTER U WITH CIRCUMFLEX
        '\u00dc': b'\xc8U', # LATIN CAPITAL LETTER U WITH DIAERESIS
        '\u00dd': b'\xc2Y', # LATIN CAPITAL LETTER Y WITH ACUTE
        '\u00de': b'\xec', # LATIN CAPITAL LETTER THORN
        '\u00df': b'\xfb', # LATIN SMALL LETTER SHARP S
        '\u00e0': b'\xc1a', # LATIN SMALL LETTER A WITH GRAVE
        '\u00e1': b'\xc2a', # LATIN SMALL LETTER A WITH ACUTE
        '\u00e2': b'\xc3a', # LATIN SMALL LETTER A WITH CIRCUMFLEX
        '\u00e3': b'\xc4a', # LATIN SMALL LETTER A WITH TILDE
        '\u00e4': b'\xc8a', # LATIN SMALL LETTER A WITH DIAERESIS
        '\u00e5': b'\xcaa', # LATIN SMALL LETTER A WITH RING ABOVE
        '\u00e6': b'\xf1', # LATIN SMALL LETTER AE
        '\u00e7': b'\xd0c', # LATIN SMALL LETTER C WITH CEDILLA
        '\u00e8': b'\xc1e', # LATIN SMALL LETTER E WITH GRAVE
        '\u00e9': b'\xc2e', # LATIN SMALL LETTER E WITH ACUTE
        '\u00ea': b'\xc3e', # LATIN SMALL LETTER E WITH CIRCUMFLEX
        '\u00eb': b'\xc8e', # LATIN SMALL LETTER E WITH DIAERESIS
        '\u00ec': b'\xc1i', # LATIN SMALL LETTER I WITH GRAVE
        '\u00ed': b'\xc2i', # LATIN SMALL LETTER I WITH ACUTE
        '\u00ee': b'\xc3i', # LATIN SMALL LETTER I WITH CIRCUMFLEX
        '\u00ef': b'\xc8i', # LATIN SMALL LETTER I WITH DIAERESIS
        '\u00f0': b'\xf3', # LATIN SMALL LETTER ETH
        '\u00f1': b'\xc4n', # LATIN SMALL LETTER N WITH TILDE
        '\u00f2': b'\xc1o', # LATIN SMALL LETTER O WITH GRAVE
        '\u00f3': b'\xc2o', # LATIN SMALL LETTER O WITH ACUTE
        '\u00f4': b'\xc3o', # LATIN SMALL LETTER O WITH CIRCUMFLEX
        '\u00f5': b'\xc4o', # LATIN SMALL LETTER O WITH TILDE
        '\u00f6': b'\xc8o', # LATIN SMALL LETTER O WITH DIAERESIS
        '\u00f8': b'\xf9', # LATIN SMALL LETTER O WITH STROKE
        '\u00f9': b'\xc1u', # LATIN SMALL LETTER U WITH GRAVE
        '\u00fa': b'\xc2u', # LATIN SMALL LETTER U WITH ACUTE
        '\u00fb': b'\xc3u', # LATIN SMALL LETTER U WITH CIRCUMFLEX
        '\u00fc': b'\xc8u', # LATIN SMALL LETTER U WITH DIAERESIS
        '\u00fd': b'\xc2y', # LATIN SMALL LETTER Y WITH ACUTE
        '\u00fe': b'\xfc', # LATIN SMALL LETTER THORN
        '\u00ff': b'\xc8y', # LATIN SMALL LETTER Y WITH DIAERESIS
        '\u0100': b'\xc5A', # LATIN CAPITAL LETTER A WITH MACRON
        '\u0101': b'\xc5a', # LATIN SMALL LETTER A WITH MACRON
        '\u0102': b'\xc6A', # LATIN CAPITAL LETTER A WITH BREVE
        '\u0103': b'\xc6a', # LATIN SMALL LETTER A WITH BREVE
        '\u0104': b'\xd3A', # LATIN CAPITAL LETTER A WITH OGONEK
        '\u0105': b'\xd3a', # LATIN SMALL LETTER A WITH OGONEK
        '\u0106': b'\xc2C', # LATIN CAPITAL LETTER C WITH ACUTE
        '\u0107': b'\xc2c', # LATIN SMALL LETTER C WITH ACUTE
        '\u0108': b'\xc3C', # LATIN CAPITAL LETTER C WITH CIRCUMFLEX
        '\u0109': b'\xc3c', # LATIN SMALL LETTER C WITH CIRCUMFLEX
        '\u010a': b'\xc7C', # LATIN CAPITAL LETTER C WITH DOT ABOVE
        '\u010b': b'\xc7c', # LATIN SMALL LETTER C WITH DOT ABOVE
        '\u010c': b'\xcfC', # LATIN CAPITAL LETTER C WITH CARON
        '\u010d': b'\xcfc', # LATIN SMALL LETTER C WITH CARON
        '\u010e': b'\xcfD', # LATIN CAPITAL LETTER D WITH CARON
        '\u010f': b'\xcfd', # LATIN SMALL LETTER D WITH CARON
        '\u0110': b'\xe2', # LATIN CAPITAL LETTER D WITH STROKE
        '\u0111': b'\xf2', # LATIN SMALL LETTER D WITH STROKE
        '\u0112': b'\xc5E', # LATIN CAPITAL LETTER E WITH MACRON
        '\u0113': b'\xc5e', # LATIN SMALL LETTER E WITH MACRON
        '\u0114': b'\xc6E', # LATIN CAPITAL LETTER E WITH BREVE
        '\u0115': b'\xc6e', # LATIN SMALL LETTER E WITH BREVE
        '\u0116': b'\xc7E', # LATIN CAPITAL LETTER E WITH DOT ABOVE
        '\u0117': b'\xc7e', # LATIN SMALL LETTER E WITH DOT ABOVE
        '\u0118': b'\xd3E', # LATIN CAPITAL LETTER E WITH OGONEK
        '\u0119': b'\xd3e', # LATIN SMALL LETTER E WITH OGONEK
        '\u011a': b'\xcfE', # LATIN CAPITAL LETTER E WITH CARON
        '\u011b': b'\xcfe', # LATIN SMALL LETTER E WITH CARON
        '\u011c': b'\xc3G', # LATIN CAPITAL LETTER G WITH CIRCUMFLEX
        '\u011d': b'\xc3g', # LATIN SMALL LETTER G WITH CIRCUMFLEX
        '\u011e': b'\xc6G', # LATIN CAPITAL LETTER G WITH BREVE
        '\u011f': b'\xc6g', # LATIN SMALL LETTER G WITH BREVE
        '\u0120': b'\xc7G', # LATIN CAPITAL LETTER G WITH DOT ABOVE
        '\u0121': b'\xc7g', # LATIN SMALL LETTER G WITH DOT ABOVE
        '\u0122': b'\xd0G', # LATIN CAPITAL LETTER G WITH CEDILLA
        '\u0123': b'\xd0g', # LATIN SMALL LETTER G WITH CEDILLA
        '\u0124': b'\xc3H', # LATIN CAPITAL LETTER H WITH CIRCUMFLEX
        '\u0125': b'\xc3h', # LATIN SMALL LETTER H WITH CIRCUMFLEX
        '\u0128': b'\xc4I', # LATIN CAPITAL LETTER I WITH TILDE
        '\u0129': b'\xc4i', # LATIN SMALL LETTER I WITH TILDE
        '\u012a': b'\xc5I', # LATIN CAPITAL LETTER I WITH MACRON
        '\u012b': b'\xc5i', # LATIN SMALL LETTER I WITH MACRON
        '\u012c': b'\xc6I', # LATIN CAPITAL LETTER I WITH BREVE
        '\u012d': b'\xc6i', # LATIN SMALL LETTER I WITH BREVE
        '\u012e': b'\xd3I', # LATIN CAPITAL LETTER I WITH OGONEK
        '\u012f': b'\xd3i', # LATIN SMALL LETTER I WITH OGONEK
        '\u0130': b'\xc7I', # LATIN CAPITAL LETTER I WITH DOT ABOVE
        '\u0131': b'\xf5', # LATIN SMALL LETTER DOTLESS I
        '\u0132': b'\xe6', # LATIN CAPITAL LIGATURE IJ
        '\u0133': b'\xf6', # LATIN SMALL LIGATURE IJ
        '\u0134': b'\xc3J', # LATIN CAPITAL LETTER J WITH CIRCUMFLEX
        '\u0135': b'\xc3j', # LATIN SMALL LETTER J WITH CIRCUMFLEX
        '\u0136': b'\xd0K', # LATIN CAPITAL LETTER K WITH CEDILLA
        '\u0137': b'\xd0k', # LATIN SMALL LETTER K WITH CEDILLA
        '\u0139': b'\xc2L', # LATIN CAPITAL LETTER L WITH ACUTE
        '\u013a': b'\xc2l', # LATIN SMALL LETTER L WITH ACUTE
        '\u013b': b'\xd0L', # LATIN CAPITAL LETTER L WITH CEDILLA
        '\u013c': b'\xd0l', # LATIN SMALL LETTER L WITH CEDILLA
        '\u013d': b'\xcfL', # LATIN CAPITAL LETTER L WITH CARON
        '\u013e': b'\xcfl', # LATIN SMALL LETTER L WITH CARON
        '\u0141': b'\xe8', # LATIN CAPITAL LETTER L WITH STROKE
        '\u0142': b'\xf8', # LATIN SMALL LETTER L WITH STROKE
        '\u0143': b'\xc2N', # LATIN CAPITAL LETTER N WITH ACUTE
        '\u0144': b'\xc2n', # LATIN SMALL LETTER N WITH ACUTE
        '\u0145': b'\xd0N', # LATIN CAPITAL LETTER N WITH CEDILLA
        '\u0146': b'\xd0n', # LATIN SMALL LETTER N WITH CEDILLA
        '\u0147': b'\xcfN', # LATIN CAPITAL LETTER N WITH CARON
        '\u0148': b'\xcfn', # LATIN SMALL LETTER N WITH CARON
        '\u014c': b'\xc5O', # LATIN CAPITAL LETTER O WITH MACRON
        '\u014d': b'\xc5o', # LATIN SMALL LETTER O WITH MACRON
        '\u014e': b'\xc6O', # LATIN CAPITAL LETTER O WITH BREVE
        '\u014f': b'\xc6o', # LATIN SMALL LETTER O WITH BREVE
        '\u0150': b'\xcdO', # LATIN CAPITAL LETTER O WITH DOUBLE ACUTE
        '\u0151': b'\xcdo', # LATIN SMALL LETTER O WITH DOUBLE ACUTE
        '\u0152': b'\xea', # LATIN CAPITAL LIGATURE OE
        '\u0153': b'\xfa', # LATIN SMALL LIGATURE OE
        '\u0154': b'\xc2R', # LATIN CAPITAL LETTER R WITH ACUTE
        '\u0155': b'\xc2r', # LATIN SMALL LETTER R WITH ACUTE
        '\u0156': b'\xd0R', # LATIN CAPITAL LETTER R WITH CEDILLA
        '\u0157': b'\xd0r', # LATIN SMALL LETTER R WITH CEDILLA
        '\u0158': b'\xcfR', # LATIN CAPITAL LETTER R WITH CARON
        '\u0159': b'\xcfr', # LATIN SMALL LETTER R WITH CARON
        '\u015a': b'\xc2S', # LATIN CAPITAL LETTER S WITH ACUTE
        '\u015b': b'\xc2s', # LATIN SMALL LETTER S WITH ACUTE
        '\u015c': b'\xc3S', # LATIN CAPITAL LETTER S WITH CIRCUMFLEX
        '\u015d': b'\xc3s', # LATIN SMALL LETTER S WITH CIRCUMFLEX
        '\u015e': b'\xd0S', # LATIN CAPITAL LETTER S WITH CEDILLA
        '\u015f': b'\xd0s', # LATIN SMALL LETTER S WITH CEDILLA
        '\u0160': b'\xcfS', # LATIN CAPITAL LETTER S WITH CARON
        '\u0161': b'\xcfs', # LATIN SMALL LETTER S WITH CARON
        '\u0162': b'\xd0T', # LATIN CAPITAL LETTER T WITH CEDILLA
        '\u0163': b'\xd0t', # LATIN SMALL LETTER T WITH CEDILLA
        '\u0164': b'\xcfT', # LATIN CAPITAL LETTER T WITH CARON
        '\u0165': b'\xcft', # LATIN SMALL LETTER T WITH CARON
        '\u0168': b'\xc4U', # LATIN CAPITAL LETTER U WITH TILDE
        '\u0169': b'\xc4u', # LATIN SMALL LETTER U WITH TILDE
        '\u016a': b'\xc5U', # LATIN CAPITAL LETTER U WITH MACRON
        '\u016b': b'\xc5u', # LATIN SMALL LETTER U WITH MACRON
        '\u016c': b'\xc6U', # LATIN CAPITAL LETTER U WITH BREVE
        '\u016d': b'\xc6u', # LATIN SMALL LETTER U WITH BREVE
        '\u016e': b'\xcaU', # LATIN CAPITAL LETTER U WITH RING ABOVE
        '\u016f': b'\xcau', # LATIN SMALL LETTER U WITH RING ABOVE
        '\u0170': b'\xcdU', # LATIN CAPITAL LETTER U WITH DOUBLE ACUTE
        '\u0171': b'\xcdu', # LATIN SMALL LETTER U WITH DOUBLE ACUTE
        '\u0172': b'\xd3U', # LATIN CAPITAL LETTER U WITH OGONEK
        '\u0173': b'\xd3u', # LATIN SMALL LETTER U WITH OGONEK
        '\u0174': b'\xc3W', # LATIN CAPITAL LETTER W WITH CIRCUMFLEX
        '\u0175': b'\xc3w', # LATIN SMALL LETTER W WITH CIRCUMFLEX
        '\u0176': b'\xc3Y', # LATIN CAPITAL LETTER Y WITH CIRCUMFLEX
        '\u0177': b'\xc3y', # LATIN SMALL LETTER Y WITH CIRCUMFLEX
        '\u0178': b'\xc8Y', # LATIN CAPITAL LETTER Y WITH DIAERESIS
        '\u0179': b'\xc2Z', # LATIN CAPITAL LETTER Z WITH ACUTE
        '\u017a': b'\xc2z', # LATIN SMALL LETTER Z WITH ACUTE
        '\u017b': b'\xc7Z', # LATIN CAPITAL LETTER Z WITH DOT ABOVE
        '\u017c': b'\xc7z', # LATIN SMALL LETTER Z WITH DOT ABOVE
        '\u017d': b'\xcfZ', # LATIN CAPITAL LETTER Z WITH CARON
        '\u017e': b'\xcfz', # LATIN SMALL LETTER Z WITH CARON
        '\u01a0': b'\xceO', # LATIN CAPITAL LETTER O WITH HORN
        '\u01a1': b'\xceo', # LATIN SMALL LETTER O WITH HORN
        '\u01af': b'\xceU', # LATIN CAPITAL LETTER U WITH HORN
        '\u01b0': b'\xceu', # LATIN SMALL LETTER U WITH HORN
        '\u01cd': b'\xcfA', # LATIN CAPITAL LETTER A WITH CARON
        '\u01ce': b'\xcfa', # LATIN SMALL LETTER A WITH CARON
        '\u01cf': b'\xcfI', # LATIN CAPITAL LETTER I WITH CARON
        '\u01d0': b'\xcfi', # LATIN SMALL LETTER I WITH CARON
        '\u01d1': b'\xcfO', # LATIN CAPITAL LETTER O WITH CARON
        '\u01d2': b'\xcfo', # LATIN SMALL LETTER O WITH CARON
        '\u01d3': b'\xcfU', # LATIN CAPITAL LETTER U WITH CARON
        '\u01d4': b'\xcfu', # LATIN SMALL LETTER U WITH CARON
        '\u01d5': b'\xc5\xc8U', # LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON
        '\u01d6': b'\xc5\xc8u', # LATIN SMALL LETTER U WITH DIAERESIS AND MACRON
        '\u01d7': b'\xc2\xc8U', # LATIN CAPITAL LETTER U WITH DIAERESIS AND ACUTE
        '\u01d8': b'\xc2\xc8u', # LATIN SMALL LETTER U WITH DIAERESIS AND ACUTE
        '\u01d9': b'\xcf\xc8U', # LATIN CAPITAL LETTER U WITH DIAERESIS AND CARON
        '\u01da': b'\xcf\xc8u', # LATIN SMALL LETTER U WITH DIAERESIS AND CARON
        '\u01db': b'\xc1\xc8U', # LATIN CAPITAL LETTER U WITH DIAERESIS AND GRAVE
        '\u01dc': b'\xc1\xc8u', # LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE
        '\u01de': b'\xc5\xc8A', # LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON
        '\u01df': b'\xc5\xc8a', # LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
        '\u01e0': b'\xc5\xc7A', # LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON
        '\u01e1': b'\xc5\xc7a', # LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
        '\u01e2': b'\xc5\xe1', # LATIN CAPITAL LETTER AE WITH MACRON
        '\u01e3': b'\xc5\xf1', # LATIN SMALL LETTER AE WITH MACRON
        '\u01e6': b'\xcfG', # LATIN CAPITAL LETTER G WITH CARON
        '\u01e7': b'\xcfg', # LATIN SMALL LETTER G WITH CARON
        '\u01e8': b'\xcfK', # LATIN CAPITAL LETTER K WITH CARON
        '\u01e9': b'\xcfk', # LATIN SMALL LETTER K WITH CARON
        '\u01ea': b'\xd3O', # LATIN CAPITAL LETTER O WITH OGONEK
        '\u01eb': b'\xd3o', # LATIN SMALL LETTER O WITH OGONEK
        '\u01ec': b'\xc5\xd3O', # LATIN CAPITAL LETTER O WITH OGONEK AND MACRON
        '\u01ed': b'\xc5\xd3o', # LATIN SMALL LETTER O WITH OGONEK AND MACRON
        '\u01f0': b'\xcfj', # LATIN SMALL LETTER J WITH CARON
        '\u01f4': b'\xc2G', # LATIN CAPITAL LETTER G WITH ACUTE
        '\u01f5': b'\xc2g', # LATIN SMALL LETTER G WITH ACUTE
        '\u01f8': b'\xc1N', # LATIN CAPITAL LETTER N WITH GRAVE
        '\u01f9': b'\xc1n', # LATIN SMALL LETTER N WITH GRAVE
        '\u01fa': b'\xc2\xcaA', # LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE
        '\u01fb': b'\xc2\xcaa', # LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
        '\u01fc': b'\xc2\xe1', # LATIN CAPITAL LETTER AE WITH ACUTE
        '\u01fd': b'\xc2\xf1', # LATIN SMALL LETTER AE WITH ACUTE
        '\u01fe': b'\xc2\xe9', # LATIN CAPITAL LETTER O WITH STROKE AND ACUTE
        '\u01ff': b'\xc2\xf9', # LATIN SMALL LETTER O WITH STROKE AND ACUTE
        '\u0218': b'\xd2S', # LATIN CAPITAL LETTER S WITH COMMA BELOW
        '\u0219': b'\xd2s', # LATIN SMALL LETTER S WITH COMMA BELOW
        '\u021a': b'\xd2T', # LATIN CAPITAL LETTER T WITH COMMA BELOW
        '\u021b': b'\xd2t', # LATIN SMALL LETTER T WITH COMMA BELOW
        '\u021e': b'\xcfH', # LATIN CAPITAL LETTER H WITH CARON
        '\u021f': b'\xcfh', # LATIN SMALL LETTER H WITH CARON
        '\u0226': b'\xc7A', # LATIN CAPITAL LETTER A WITH DOT ABOVE
        '\u0227': b'\xc7a', # LATIN SMALL LETTER A WITH DOT ABOVE
        '\u0228': b'\xd0E', # LATIN CAPITAL LETTER E WITH CEDILLA
        '\u0229': b'\xd0e', # LATIN SMALL LETTER E WITH CEDILLA
        '\u022a': b'\xc5\xc8O', # LATIN CAPITAL LETTER O WITH DIAERESIS AND MACRON
        '\u022b': b'\xc5\xc8o', # LATIN SMALL LETTER O WITH DIAERESIS AND MACRON
        '\u022c': b'\xc5\xc4O', # LATIN CAPITAL LETTER O WITH TILDE AND MACRON
        '\u022d': b'\xc5\xc4o', # LATIN SMALL LETTER O WITH TILDE AND MACRON
        '\u022e': b'\xc7O', # LATIN CAPITAL LETTER O WITH DOT ABOVE
        '\u022f': b'\xc7o', # LATIN SMALL LETTER O WITH DOT ABOVE
        '\u0230': b'\xc5\xc7O', # LATIN CAPITAL LETTER O WITH DOT ABOVE AND MACRON
        '\u0231': b'\xc5\xc7o', # LATIN SMALL LETTER O WITH DOT ABOVE AND MACRON
        '\u0232': b'\xc5Y', # LATIN CAPITAL LETTER Y WITH MACRON
        '\u0233': b'\xc5y', # LATIN SMALL LETTER Y WITH MACRON
        '\u02b9': b'\xbd', # MODIFIER LETTER PRIME
        '\u02ba': b'\xbe', # MODIFIER LETTER DOUBLE PRIME
        '\u02bb': b'\xb0', # MODIFIER LETTER TURNED COMMA
        '\u02bc': b'\xb1', # MODIFIER LETTER APOSTROPHE
        '\u0300': b'\xc1', # COMBINING GRAVE ACCENT
        '\u0301': b'\xc2', # COMBINING ACUTE ACCENT
        '\u0302': b'\xc3', # COMBINING CIRCUMFLEX ACCENT
        '\u0303': b'\xc4', # COMBINING TILDE
        '\u0304': b'\xc5', # COMBINING MACRON
        '\u0306': b'\xc6', # COMBINING BREVE
        '\u0307': b'\xc7', # COMBINING DOT ABOVE
        '\u0308': b'\xc8', # COMBINING DIAERESIS
        '\u0309': b'\xc0', # COMBINING HOOK ABOVE
        '\u030a': b'\xca', # COMBINING RING ABOVE
        '\u030b': b'\xcd', # COMBINING DOUBLE ACUTE ACCENT
        '\u030c': b'\xcf', # COMBINING CARON
        '\u0312': b'\xcc', # COMBINING TURNED COMMA ABOVE
        '\u0315': b'\xcb', # COMBINING COMMA ABOVE RIGHT
        '\u031b': b'\xce', # COMBINING HORN
        '\u031c': b'\xd1', # COMBINING LEFT HALF RING BELOW
        '\u0323': b'\xd6', # COMBINING DOT BELOW
        '\u0324': b'\xd7', # COMBINING DIAERESIS BELOW
        '\u0325': b'\xd4', # COMBINING RING BELOW
        '\u0326': b'\xd2', # COMBINING COMMA BELOW
        '\u0327': b'\xd0', # COMBINING CEDILLA
        '\u0328': b'\xd3', # COMBINING OGONEK
        '\u0329': b'\xda', # COMBINING VERTICAL LINE BELOW
        '\u032d': b'\xdb', # COMBINING CIRCUMFLEX ACCENT BELOW
        '\u032e': b'\xd5', # COMBINING BREVE BELOW
        '\u0332': b'\xd8', # COMBINING LOW LINE
        '\u0333': b'\xd9', # COMBINING DOUBLE LOW LINE
        '\u0340': b'\xc1', # COMBINING GRAVE TONE MARK
        '\u0341': b'\xc2', # COMBINING ACUTE TONE MARK
        '\u0344': b'\xc2\xc8', # COMBINING GREEK DIALYTIKA TONOS
        '\u0374': b'\xbd', # GREEK NUMERAL SIGN
        '\u037e': b';', # GREEK QUESTION MARK
        '\u0387': b'\xb7', # GREEK ANO TELEIA
        '\u1e00': b'\xd4A', # LATIN CAPITAL LETTER A WITH RING BELOW
        '\u1e01': b'\xd4a', # LATIN SMALL LETTER A WITH RING BELOW
        '\u1e02': b'\xc7B', # LATIN CAPITAL LETTER B WITH DOT ABOVE
        '\u1e03': b'\xc7b', # LATIN SMALL LETTER B WITH DOT ABOVE
        '\u1e04': b'\xd6B', # LATIN CAPITAL LETTER B WITH DOT BELOW
        '\u1e05': b'\xd6b', # LATIN SMALL LETTER B WITH DOT BELOW
        '\u1e08': b'\xc2\xd0C', # LATIN CAPITAL LETTER C WITH CEDILLA AND ACUTE
        '\u1e09': b'\xc2\xd0c', # LATIN SMALL LETTER C WITH CEDILLA AND ACUTE
        '\u1e0a': b'\xc7D', # LATIN CAPITAL LETTER D WITH DOT ABOVE
        '\u1e0b': b'\xc7d', # LATIN SMALL LETTER D WITH DOT ABOVE
        '\u1e0c': b'\xd6D', # LATIN CAPITAL LETTER D WITH DOT BELOW
        '\u1e0d': b'\xd6d', # LATIN SMALL LETTER D WITH DOT BELOW
        '\u1e10': b'\xd0D', # LATIN CAPITAL LETTER D WITH CEDILLA
        '\u1e11': b'\xd0d', # LATIN SMALL LETTER D WITH CEDILLA
        '\u1e12': b'\xdbD', # LATIN CAPITAL LETTER D WITH CIRCUMFLEX BELOW
        '\u1e13': b'\xdbd', # LATIN SMALL LETTER D WITH CIRCUMFLEX BELOW
        '\u1e14': b'\xc1\xc5E', # LATIN CAPITAL LETTER E WITH MACRON AND GRAVE
        '\u1e15': b'\xc1\xc5e', # LATIN SMALL LETTER E WITH MACRON AND GRAVE
        '\u1e16': b'\xc2\xc5E', # LATIN CAPITAL LETTER E WITH MACRON AND ACUTE
        '\u1e17': b'\xc2\xc5e', # LATIN SMALL LETTER E WITH MACRON AND ACUTE
        '\u1e18': b'\xdbE', # LATIN CAPITAL LETTER E WITH CIRCUMFLEX BELOW
        '\u1e19': b'\xdbe', # LATIN SMALL LETTER E WITH CIRCUMFLEX BELOW
        '\u1e1c': b'\xc6\xd0E', # LATIN CAPITAL LETTER E WITH CEDILLA AND BREVE
        '\u1e1d': b'\xc6\xd0e', # LATIN SMALL LETTER E WITH CEDILLA AND BREVE
        '\u1e1e': b'\xc7F', # LATIN CAPITAL LETTER F WITH DOT ABOVE
        '\u1e1f': b'\xc7f', # LATIN SMALL LETTER F WITH DOT ABOVE
        '\u1e20': b'\xc5G', # LATIN CAPITAL LETTER G WITH MACRON
        '\u1e21': b'\xc5g', # LATIN SMALL LETTER G WITH MACRON
        '\u1e22': b'\xc7H', # LATIN CAPITAL LETTER H WITH DOT ABOVE
        '\u1e23': b'\xc7h', # LATIN SMALL LETTER H WITH DOT ABOVE
        '\u1e24': b'\xd6H', # LATIN CAPITAL LETTER H WITH DOT BELOW
        '\u1e25': b'\xd6h', # LATIN SMALL LETTER H WITH DOT BELOW
        '\u1e26': b'\xc8H', # LATIN CAPITAL LETTER H WITH DIAERESIS
        '\u1e27': b'\xc8h', # LATIN SMALL LETTER H WITH DIAERESIS
        '\u1e28': b'\xd0H', # LATIN CAPITAL LETTER H WITH CEDILLA
        '\u1e29': b'\xd0h', # LATIN SMALL LETTER H WITH CEDILLA
        '\u1e2a': b'\xd5H', # LATIN CAPITAL LETTER H WITH BREVE BELOW
        '\u1e2b': b'\xd5h', # LATIN SMALL LETTER H WITH BREVE BELOW
        '\u1e2e': b'\xc2\xc8I', # LATIN CAPITAL LETTER I WITH DIAERESIS AND ACUTE
        '\u1e2f': b'\xc2\xc8i', # LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
        '\u1e30': b'\xc2K', # LATIN CAPITAL LETTER K WITH ACUTE
        '\u1e31': b'\xc2k', # LATIN SMALL LETTER K WITH ACUTE
        '\u1e32': b'\xd6K', # LATIN CAPITAL LETTER K WITH DOT BELOW
        '\u1e33': b'\xd6k', # LATIN SMALL LETTER K WITH DOT BELOW
        '\u1e36': b'\xd6L', # LATIN CAPITAL LETTER L WITH DOT BELOW
        '\u1e37': b'\xd6l', # LATIN SMALL LETTER L WITH DOT BELOW
        '\u1e38': b'\xc5\xd6L', # LATIN CAPITAL LETTER L WITH DOT BELOW AND MACRON
        '\u1e39': b'\xc5\xd6l', # LATIN SMALL LETTER L WITH DOT BELOW AND MACRON
        '\u1e3c': b'\xdbL', # LATIN CAPITAL LETTER L WITH CIRCUMFLEX BELOW
        '\u1e3d': b'\xdbl', # LATIN SMALL LETTER L WITH CIRCUMFLEX BELOW
        '\u1e3e': b'\xc2M', # LATIN CAPITAL LETTER M WITH ACUTE
        '\u1e3f': b'\xc2m', # LATIN SMALL LETTER M WITH ACUTE
        '\u1e40': b'\xc7M', # LATIN CAPITAL LETTER M WITH DOT ABOVE
        '\u1e41': b'\xc7m', # LATIN SMALL LETTER M WITH DOT ABOVE
        '\u1e42': b'\xd6M', # LATIN CAPITAL LETTER M WITH DOT BELOW
        '\u1e43': b'\xd6m', # LATIN SMALL LETTER M WITH DOT BELOW
        '\u1e44': b'\xc7N', # LATIN CAPITAL LETTER N WITH DOT ABOVE
        '\u1e45': b'\xc7n', # LATIN SMALL LETTER N WITH DOT ABOVE
        '\u1e46': b'\xd6N', # LATIN CAPITAL LETTER N WITH DOT BELOW
        '\u1e47': b'\xd6n', # LATIN SMALL LETTER N WITH DOT BELOW
        '\u1e4a': b'\xdbN', # LATIN CAPITAL LETTER N WITH CIRCUMFLEX BELOW
        '\u1e4b': b'\xdbn', # LATIN SMALL LETTER N WITH CIRCUMFLEX BELOW
        '\u1e4c': b'\xc2\xc4O', # LATIN CAPITAL LETTER O WITH TILDE AND ACUTE
        '\u1e4d': b'\xc2\xc4o', # LATIN SMALL LETTER O WITH TILDE AND ACUTE
        '\u1e4e': b'\xc8\xc4O', # LATIN CAPITAL LETTER O WITH TILDE AND DIAERESIS
        '\u1e4f': b'\xc8\xc4o', # LATIN SMALL LETTER O WITH TILDE AND DIAERESIS
        '\u1e50': b'\xc1\xc5O', # LATIN CAPITAL LETTER O WITH MACRON AND GRAVE
        '\u1e51': b'\xc1\xc5o', # LATIN SMALL LETTER O WITH MACRON AND GRAVE
        '\u1e52': b'\xc2\xc5O', # LATIN CAPITAL LETTER O WITH MACRON AND ACUTE
        '\u1e53': b'\xc2\xc5o', # LATIN SMALL LETTER O WITH MACRON AND ACUTE
        '\u1e54': b'\xc2P', # LATIN CAPITAL LETTER P WITH ACUTE
        '\u1e55': b'\xc2p', # LATIN SMALL LETTER P WITH ACUTE
        '\u1e56': b'\xc7P', # LATIN CAPITAL LETTER P WITH DOT ABOVE
        '\u1e57': b'\xc7p', # LATIN SMALL LETTER P WITH DOT ABOVE
        '\u1e58': b'\xc7R', # LATIN CAPITAL LETTER R WITH DOT ABOVE
        '\u1e59': b'\xc7r', # LATIN SMALL LETTER R WITH DOT ABOVE
        '\u1e5a': b'\xd6R', # LATIN CAPITAL LETTER R WITH DOT BELOW
        '\u1e5b': b'\xd6r', # LATIN SMALL LETTER R WITH DOT BELOW
        '\u1e5c': b'\xc5\xd6R', # LATIN CAPITAL LETTER R WITH DOT BELOW AND MACRON
        '\u1e5d': b'\xc5\xd6r', # LATIN SMALL LETTER R WITH DOT BELOW AND MACRON
        '\u1e60': b'\xc7S', # LATIN CAPITAL LETTER S WITH DOT ABOVE
        '\u1e61': b'\xc7s', # LATIN SMALL LETTER S WITH DOT ABOVE
        '\u1e62': b'\xd6S', # LATIN CAPITAL LETTER S WITH DOT BELOW
        '\u1e63': b'\xd6s', # LATIN SMALL LETTER S WITH DOT BELOW
        '\u1e64': b'\xc7\xc2S', # LATIN CAPITAL LETTER S WITH ACUTE AND DOT ABOVE
        '\u1e65': b'\xc7\xc2s', # LATIN SMALL LETTER S WITH ACUTE AND DOT ABOVE
        '\u1e66': b'\xc7\xcfS', # LATIN CAPITAL LETTER S WITH CARON AND DOT ABOVE
        '\u1e67': b'\xc7\xcfs', # LATIN SMALL LETTER S WITH CARON AND DOT ABOVE
        '\u1e68': b'\xc7\xd6S', # LATIN CAPITAL LETTER S WITH DOT BELOW AND DOT ABOVE
        '\u1e69': b'\xc7\xd6s', # LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE
        '\u1e6a': b'\xc7T', # LATIN CAPITAL LETTER T WITH DOT ABOVE
        '\u1e6b': b'\xc7t', # LATIN SMALL LETTER T WITH DOT ABOVE
        '\u1e6c': b'\xd6T', # LATIN CAPITAL LETTER T WITH DOT BELOW
        '\u1e6d': b'\xd6t', # LATIN SMALL LETTER T WITH DOT BELOW
        '\u1e70': b'\xdbT', # LATIN CAPITAL LETTER T WITH CIRCUMFLEX BELOW
        '\u1e71': b'\xdbt', # LATIN SMALL LETTER T WITH CIRCUMFLEX BELOW
        '\u1e72': b'\xd7U', # LATIN CAPITAL LETTER U WITH DIAERESIS BELOW
        '\u1e73': b'\xd7u', # LATIN SMALL LETTER U WITH DIAERESIS BELOW
        '\u1e76': b'\xdbU', # LATIN CAPITAL LETTER U WITH CIRCUMFLEX BELOW
        '\u1e77': b'\xdbu', # LATIN SMALL LETTER U WITH CIRCUMFLEX BELOW
        '\u1e78': b'\xc2\xc4U', # LATIN CAPITAL LETTER U WITH TILDE AND ACUTE
        '\u1e79': b'\xc2\xc4u', # LATIN SMALL LETTER U WITH TILDE AND ACUTE
        '\u1e7a': b'\xc8\xc5U', # LATIN CAPITAL LETTER U WITH MACRON AND DIAERESIS
        '\u1e7b': b'\xc8\xc5u', # LATIN SMALL LETTER U WITH MACRON AND DIAERESIS
        '\u1e7c': b'\xc4V', # LATIN CAPITAL LETTER V WITH TILDE
        '\u1e7d': b'\xc4v', # LATIN SMALL LETTER V WITH TILDE
        '\u1e7e': b'\xd6V', # LATIN CAPITAL LETTER V WITH DOT BELOW
        '\u1e7f': b'\xd6v', # LATIN SMALL LETTER V WITH DOT BELOW
        '\u1e80': b'\xc1W', # LATIN CAPITAL LETTER W WITH GRAVE
        '\u1e81': b'\xc1w', # LATIN SMALL LETTER W WITH GRAVE
        '\u1e82': b'\xc2W', # LATIN CAPITAL LETTER W WITH ACUTE
        '\u1e83': b'\xc2w', # LATIN SMALL LETTER W WITH ACUTE
        '\u1e84': b'\xc8W', # LATIN CAPITAL LETTER W WITH DIAERESIS
        '\u1e85': b'\xc8w', # LATIN SMALL LETTER W WITH DIAERESIS
        '\u1e86': b'\xc7W', # LATIN CAPITAL LETTER W WITH DOT ABOVE
        '\u1e87': b'\xc7w', # LATIN SMALL LETTER W WITH DOT ABOVE
        '\u1e88': b'\xd6W', # LATIN CAPITAL LETTER W WITH DOT BELOW
        '\u1e89': b'\xd6w', # LATIN SMALL LETTER W WITH DOT BELOW
        '\u1e8a': b'\xc7X', # LATIN CAPITAL LETTER X WITH DOT ABOVE
        '\u1e8b': b'\xc7x', # LATIN SMALL LETTER X WITH DOT ABOVE
        '\u1e8c': b'\xc8X', # LATIN CAPITAL LETTER X WITH DIAERESIS
        '\u1e8d': b'\xc8x', # LATIN SMALL LETTER X WITH DIAERESIS
        '\u1e8e': b'\xc7Y', # LATIN CAPITAL LETTER Y WITH DOT ABOVE
        '\u1e8f': b'\xc7y', # LATIN SMALL LETTER Y WITH DOT ABOVE
        '\u1e90': b'\xc3Z', # LATIN CAPITAL LETTER Z WITH CIRCUMFLEX
        '\u1e91': b'\xc3z', # LATIN SMALL LETTER Z WITH CIRCUMFLEX
        '\u1e92': b'\xd6Z', # LATIN CAPITAL LETTER Z WITH DOT BELOW
        '\u1e93': b'\xd6z', # LATIN SMALL LETTER Z WITH DOT BELOW
        '\u1e97': b'\xc8t', # LATIN SMALL LETTER T WITH DIAERESIS
        '\u1e98': b'\xcaw', # LATIN SMALL LETTER W WITH RING ABOVE
        '\u1e99': b'\xcay', # LATIN SMALL LETTER Y WITH RING ABOVE
        '\u1ea0': b'\xd6A', # LATIN CAPITAL LETTER A WITH DOT BELOW
        '\u1ea1': b'\xd6a', # LATIN SMALL LETTER A WITH DOT BELOW
        '\u1ea2': b'\xc0A', # LATIN CAPITAL LETTER A WITH HOOK ABOVE
        '\u1ea3': b'\xc0a', # LATIN SMALL LETTER A WITH HOOK ABOVE
        '\u1ea4': b'\xc2\xc3A', # LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE
        '\u1ea5': b'\xc2\xc3a', # LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
        '\u1ea6': b'\xc1\xc3A', # LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE
        '\u1ea7': b'\xc1\xc3a', # LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
        '\u1ea8': b'\xc0\xc3A', # LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
        '\u1ea9': b'\xc0\xc3a', # LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
        '\u1eaa': b'\xc4\xc3A', # LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE
        '\u1eab': b'\xc4\xc3a', # LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
        '\u1eac': b'\xc3\xd6A', # LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW
        '\u1ead': b'\xc3\xd6a', # LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
        '\u1eae': b'\xc2\xc6A', # LATIN CAPITAL LETTER A WITH BREVE AND ACUTE
        '\u1eaf': b'\xc2\xc6a', # LATIN SMALL LETTER A WITH BREVE AND ACUTE
        '\u1eb0': b'\xc1\xc6A', # LATIN CAPITAL LETTER A WITH BREVE AND GRAVE
        '\u1eb1': b'\xc1\xc6a', # LATIN SMALL LETTER A WITH BREVE AND GRAVE
        '\u1eb2': b'\xc0\xc6A', # LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE
        '\u1eb3': b'\xc0\xc6a', # LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
        '\u1eb4': b'\xc4\xc6A', # LATIN CAPITAL LETTER A WITH BREVE AND TILDE
        '\u1eb5': b'\xc4\xc6a', # LATIN SMALL LETTER A WITH BREVE AND TILDE
        '\u1eb6': b'\xc6\xd6A', # LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW
        '\u1eb7': b'\xc6\xd6a', # LATIN SMALL LETTER A WITH BREVE AND DOT BELOW
        '\u1eb8': b'\xd6E', # LATIN CAPITAL LETTER E WITH DOT BELOW
        '\u1eb9': b'\xd6e', # LATIN SMALL LETTER E WITH DOT BELOW
        '\u1eba': b'\xc0E', # LATIN CAPITAL LETTER E WITH HOOK ABOVE
        '\u1ebb': b'\xc0e', # LATIN SMALL LETTER E WITH HOOK ABOVE
        '\u1ebc': b'\xc4E', # LATIN CAPITAL LETTER E WITH TILDE
        '\u1ebd': b'\xc4e', # LATIN SMALL LETTER E WITH TILDE
        '\u1ebe': b'\xc2\xc3E', # LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND ACUTE
        '\u1ebf': b'\xc2\xc3e', # LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE
        '\u1ec0': b'\xc1\xc3E', # LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND GRAVE
        '\u1ec1': b'\xc1\xc3e', # LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRAVE
        '\u1ec2': b'\xc0\xc3E', # LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND HOOK ABOVE
        '\u1ec3': b'\xc0\xc3e', # LATIN SMALL LETTER E WITH CIRCUMFLEX AND HOOK ABOVE
        '\u1ec4': b'\xc4\xc3E', # LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND TILDE
        '\u1ec5': b'\xc4\xc3e', # LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE
        '\u1ec6': b'\xc3\xd6E', # LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND DOT BELOW
        '\u1ec7': b'\xc3\xd6e', # LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
        '\u1ec8': b'\xc0I', # LATIN CAPITAL LETTER I WITH HOOK ABOVE
        '\u1ec9': b'\xc0i', # LATIN SMALL LETTER I WITH HOOK ABOVE
        '\u1eca': b'\xd6I', # LATIN CAPITAL LETTER I WITH DOT BELOW
        '\u1ecb': b'\xd6i', # LATIN SMALL LETTER I WITH DOT BELOW
        '\u1ecc': b'\xd6O', # LATIN CAPITAL LETTER O WITH DOT BELOW
        '\u1ecd': b'\xd6o', # LATIN SMALL LETTER O WITH DOT BELOW
        '\u1ece': b'\xc0O', # LATIN CAPITAL LETTER O WITH HOOK ABOVE
        '\u1ecf': b'\xc0o', # LATIN SMALL LETTER O WITH HOOK ABOVE
        '\u1ed0': b'\xc2\xc3O', # LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND ACUTE
        '\u1ed1': b'\xc2\xc3o', # LATIN SMALL LETTER O WITH CIRCUMFLEX AND ACUTE
        '\u1ed2': b'\xc1\xc3O', # LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND GRAVE
        '\u1ed3': b'\xc1\xc3o', # LATIN SMALL LETTER O WITH CIRCUMFLEX AND GRAVE
        '\u1ed4': b'\xc0\xc3O', # LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE
        '\u1ed5': b'\xc0\xc3o', # LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE
        '\u1ed6': b'\xc4\xc3O', # LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND TILDE
        '\u1ed7': b'\xc4\xc3o', # LATIN SMALL LETTER O WITH CIRCUMFLEX AND TILDE
        '\u1ed8': b'\xc3\xd6O', # LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND DOT BELOW
        '\u1ed9': b'\xc3\xd6o', # LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW
        '\u1eda': b'\xc2\xceO', # LATIN CAPITAL LETTER O WITH HORN AND ACUTE
        '\u1edb': b'\xc2\xceo', # LATIN SMALL LETTER O WITH HORN AND ACUTE
        '\u1edc': b'\xc1\xceO', # LATIN CAPITAL LETTER O WITH HORN AND GRAVE
        '\u1edd': b'\xc1\xceo', # LATIN SMALL LETTER O WITH HORN AND GRAVE
        '\u1ede': b'\xc0\xceO', # LATIN CAPITAL LETTER O WITH HORN AND HOOK ABOVE
        '\u1edf': b'\xc0\xceo', # LATIN SMALL LETTER O WITH HORN AND HOOK ABOVE
        '\u1ee0': b'\xc4\xceO', # LATIN CAPITAL LETTER O WITH HORN AND TILDE
        '\u1ee1': b'\xc4\xceo', # LATIN SMALL LETTER O WITH HORN AND TILDE
        '\u1ee2': b'\xd6\xceO', # LATIN CAPITAL LETTER O WITH HORN AND DOT BELOW
        '\u1ee3': b'\xd6\xceo', # LATIN SMALL LETTER O WITH HORN AND DOT BELOW
        '\u1ee4': b'\xd6U', # LATIN CAPITAL LETTER U WITH DOT BELOW
        '\u1ee5': b'\xd6u', # LATIN SMALL LETTER U WITH DOT BELOW
        '\u1ee6': b'\xc0U', # LATIN CAPITAL LETTER U WITH HOOK ABOVE
        '\u1ee7': b'\xc0u', # LATIN SMALL LETTER U WITH HOOK ABOVE
        '\u1ee8': b'\xc2\xceU', # LATIN CAPITAL LETTER U WITH HORN AND ACUTE
        '\u1ee9': b'\xc2\xceu', # LATIN SMALL LETTER U WITH HORN AND ACUTE
        '\u1eea': b'\xc1\xceU', # LATIN CAPITAL LETTER U WITH HORN AND GRAVE
        '\u1eeb': b'\xc1\xceu', # LATIN SMALL LETTER U WITH HORN AND GRAVE
        '\u1eec': b'\xc0\xceU', # LATIN CAPITAL LETTER U WITH HORN AND HOOK ABOVE
        '\u1eed': b'\xc0\xceu', # LATIN SMALL LETTER U WITH HORN AND HOOK ABOVE
        '\u1eee': b'\xc4\xceU', # LATIN CAPITAL LETTER U WITH HORN AND TILDE
        '\u1eef': b'\xc4\xceu', # LATIN SMALL LETTER U WITH HORN AND TILDE
        '\u1ef0': b'\xd6\xceU', # LATIN CAPITAL LETTER U WITH HORN AND DOT BELOW
        '\u1ef1': b'\xd6\xceu', # LATIN SMALL LETTER U WITH HORN AND DOT BELOW
        '\u1ef2': b'\xc1Y', # LATIN CAPITAL LETTER Y WITH GRAVE
        '\u1ef3': b'\xc1y', # LATIN SMALL LETTER Y WITH GRAVE
        '\u1ef4': b'\xd6Y', # LATIN CAPITAL LETTER Y WITH DOT BELOW
        '\u1ef5': b'\xd6y', # LATIN SMALL LETTER Y WITH DOT BELOW
        '\u1ef6': b'\xc0Y', # LATIN CAPITAL LETTER Y WITH HOOK ABOVE
        '\u1ef7': b'\xc0y', # LATIN SMALL LETTER Y WITH HOOK ABOVE
        '\u1ef8': b'\xc4Y', # LATIN CAPITAL LETTER Y WITH TILDE
        '\u1ef9': b'\xc4y', # LATIN SMALL LETTER Y WITH TILDE
        '\u1fef': b'`', # GREEK VARIA
        '\u2018': b'\xa9', # LEFT SINGLE QUOTATION MARK
        '\u2019': b'\xb9', # RIGHT SINGLE QUOTATION MARK
        '\u201a': b'\xb2', # SINGLE LOW-9 QUOTATION MARK
        '\u201c': b'\xaa', # LEFT DOUBLE QUOTATION MARK
        '\u201d': b'\xba', # RIGHT DOUBLE QUOTATION MARK
        '\u201e': b'\xa2', # DOUBLE LOW-9 QUOTATION MARK
        '\u2020': b'\xa6', # DAGGER
        '\u2021': b'\xb6', # DOUBLE DAGGER
        '\u2032': b'\xa8', # PRIME
        '\u2033': b'\xb8', # DOUBLE PRIME
        '\u2117': b'\xae', # SOUND RECORDING COPYRIGHT
        #'\u212a': b'K', # KELVIN SIGN
        '\u212b': b'\xcaA', # ANGSTROM SIGN
        '\u266d': b'\xac', # MUSIC FLAT SIGN
        '\u266f': b'\xbc', # MUSIC SHARP SIGN
        '\ufe20': b'\xdd', # COMBINING LIGATURE LEFT HALF
        '\ufe21': b'\xde', # COMBINING LIGATURE RIGHT HALF
        '\ufe23': b'\xdf', # COMBINING DOUBLE TILDE RIGHT HALF
    }
     
     
    charmap = {}
    for uni, char in getattr(unicodemap, "iteritems", unicodemap.items)():
        if char in charmap:
            continue
        charmap[char] = uni
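The loop above inverts the table so that bytes can be looked up during decoding. Several Unicode code points share one byte sequence (for example U+0300 and U+0340 both map to `b'\xc1'`), and the `continue` keeps the first one encountered. A minimal sketch of that behaviour, using a hypothetical three-entry excerpt rather than the full table:

```python
# Hypothetical three-entry excerpt of the table above (not the full table).
unicodemap = {
    '\u0300': b'\xc1',  # COMBINING GRAVE ACCENT
    '\u0301': b'\xc2',  # COMBINING ACUTE ACCENT
    '\u0340': b'\xc1',  # COMBINING GRAVE TONE MARK (same byte as U+0300)
}

charmap = {}
for uni, char in unicodemap.items():
    # setdefault keeps the first code point seen for each byte sequence,
    # which matches the "if char in charmap: continue" test above.
    charmap.setdefault(char, uni)

print(charmap)
```

Because dictionaries preserve insertion order in Python 3.7+, the entry that wins is the one listed first in the table.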
    I added one line to the mapping table because it was missing (other lines may be missing too):
    '\u00a0': b'\xa0', # NO-BREAK SPACE, added by J.P 05/2023
    and here is some code to test the decoding:
    Code:
    from decIso5426 import decodeIso5426 as dec5426
    chaineAdecoder = b"""L'humanit\xc2e sait qu'il lui reste quatre si\xc1ecles
    avant que la flotte trisolarienne n'envahisse le syst\xc1eme solaire.
    Les sciences fondamentales se retrouvant verrouill\xc2ees par les intellectrons,
    la Terre doit se pr\xc2eparer du mieux qu'elle peut.
    Le Conseil de D\xc2efense Plan\xc2etaire lance un nouveau\xa0projet\xa0:
    le programme \xab\xa0Colmateur\xa0\xbb, qui consiste \xc1a faire appel \xc1a quatre individus\xa0
    charg\xc2es d'envisager des strat\xc2egies secr\xc1etes pour contrer l'invasion ennemie.
    Car s'ils peuvent espionner toutes les conversations et tous les ordinateurs humains
    gr\xc3ace aux intellectrons, les Trisolariens sont en revanche incapables de lire
    dans leurs pens\xc2ees. Apr\xc1es \xabLe Probl\xc1eme \xc1a trois corps, \xbbLiu Cixin revient
    avec une suite magistrale et haletante.
    """
     
     
    print(dec5426(b'la for\xc3et'))
    print(dec5426( b"Abr\xc2eg\xc2e Historique De L'Origine"))
    print(dec5426(chaineAdecoder))
    Result:
    la forêt
    Abrégé Historique De L'Origine
    L'humanité sait qu'il lui reste quatre siècles
    avant que la flotte trisolarienne n'envahisse le système solaire.
    Les sciences fondamentales se retrouvant verrouillées par les intellectrons,
    la Terre doit se préparer du mieux qu'elle peut.
    Le Conseil de Défense Planétaire lance un nouveau projet :
    le programme « Colmateur », qui consiste à faire appel à quatre individus
    chargés d'envisager des stratégies secrètes pour contrer l'invasion ennemie.
    Car s'ils peuvent espionner toutes les conversations et tous les ordinateurs humains
    grâce aux intellectrons, les Trisolariens sont en revanche incapables de lire
    dans leurs pensées. Après «Le Problème à trois corps, »Liu Cixin revient
    avec une suite magistrale et haletante.
    Kind regards, J.P
    Jurassic computer: Sinclair ZX81 - Zilog Z80A at 3.25 MHz - 1 KB RAM - 8 KB ROM
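The test strings above show the key property of ISO 5426: a diacritic is stored as a byte placed before its base letter (`\xc3e` for "ê"), whereas Unicode puts combining marks after the letter. The following is only a sketch of that idea, not the actual `decIso5426` implementation. The names `COMBINING`, `SPACING` and `decode_sketch` are hypothetical, the tables are tiny excerpts, and unknown high bytes simply fall back to their Latin-1 character. Reversing stacked marks is an assumption based on the two-mark entries of the table above.

```python
import unicodedata

# Hypothetical excerpts of the ISO 5426 tables (the real module has many more).
COMBINING = {0xC1: '\u0300', 0xC2: '\u0301', 0xC3: '\u0302', 0xC8: '\u0308'}
SPACING = {0xA0: '\u00a0', 0xAB: '\u00ab', 0xBB: '\u00bb'}


def decode_sketch(data: bytes) -> str:
    out = []
    pending = []  # combining marks waiting for their base letter
    for byte in data:
        if byte in COMBINING:
            pending.append(COMBINING[byte])
            continue
        # Base character: use the excerpt table, else the Latin-1 character.
        out.append(SPACING.get(byte, chr(byte)))
        # Assumption: stacked marks are stored outermost first, so they are
        # emitted in reverse order after the base letter.
        out.extend(reversed(pending))
        pending.clear()
    out.extend(pending)  # marks left without a base letter at end of input
    # NFC composes "e" + U+0302 into the single character "ê".
    return unicodedata.normalize('NFC', ''.join(out))


print(decode_sketch(b'la for\xc3et'))
print(decode_sketch(b"Abr\xc2eg\xc2e Historique De L'Origine"))
```

A full decoder would also need the remaining table entries and a policy for bytes that have no mapping.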

  19. #19
    Thanks to all the specialists who helped solve my problem.

    Hats off to "jurassic pork" for his help. (I live in the Jura and I started computing with the ZX81.)

    Code:
    #!/usr/bin/python3
    # -*- coding: utf-8 -*-
     
    """
    Detects the file type and its MIME type.
    Detects the encoding of the ISO 2709 (UNIMARC) file,
    then
    displays the title if present in field 200$a.
     
    chardet (4.0.0)
    pymarc (5.0.0)
    decIso5426 (jurassic pork's version - https://www.developpez.net/forums/d2151569/autres-langages/python/general-python/encodage-p-fichier-unimarc-iso2709/)
     
     
    argument 1 -> UNIMARC file
    """
     
     
    import sys
     
    import chardet
    import encodings
    import magic
    import pathlib
    from decIso5426 import decodeIso5426 as dec5426
    from pymarc import MARCReader
     
    def analyse(fichier):
        print("file type: " + magic.from_file(fichier))
        print("file MIME type: " + magic.from_file(fichier, mime=True))
     
        result = chardet.detect(pathlib.Path(fichier).read_bytes())
        charenc = str(result)
        charencencoding = result['encoding']
        print("the probable encoding is " + charenc)
     
        with open(fichier, 'rb') as fh:
            if charencencoding == "utf-8":
                reader = MARCReader(fh, to_unicode=True, force_utf8=True)
                analyseurunimarc(reader, charencencoding)
            else:
                try:
                    reader = MARCReader(fh, file_encoding=charencencoding)
                    analyseurunimarc(reader, charencencoding)
                except Exception as exc:
                    # Report the failure instead of silently ignoring it.
                    print("analysis failed: " + str(exc))
     
     
    def analyseurunimarc(reader, charencencoding):
        numnotice = 0    # record counter
        print("UNIMARC analyzer")
        print(charencencoding)
        for record in reader:
            numnotice += 1
            print("----------------------------")
            print("record number: " + str(numnotice))
            for field in record.get_fields('200'):
                if field['a'] is not None:
                    if charencencoding != "utf-8":
                        titre = field['a']
                        titrelen = len(titre)
                        print("the number of characters in the title is: " + str(titrelen))
                        print("The undecoded title is: \033[41m" + titre + "\033[0m")
                        # Recover the raw bytes, then decode them as ISO 5426.
                        titrebyteencoded = titre.encode(encoding=charencencoding)
                        print(titrebyteencoded)
     
                        titredecode = dec5426(titrebyteencoded)
                        print("the decoded title is \033[1;32m" + titredecode.upper() + "\033[0m")
     
                    # the title is already in UTF-8
                    else:
                        print("the title is \033[1;32m" + field['a'].upper())
                        print("\033[0m")
     
                # no 200$a subfield in the UNIMARC record
                else:
                    print('no title in 200$a')
     
     
    if __name__ == "__main__":
        fichier = sys.argv[1]
        print("the file to analyze is: " + fichier)
        analyse(fichier)
    I am attaching a UNIMARC sample data file from the BnF so that the missing mappings can be spotted.
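The script gets the raw bytes back with `titre.encode(encoding=charencencoding)` before passing them to `dec5426`. That only works if the codec used for reading can map every byte to a character and back. A small standard-library check of that assumption (the variable names are illustrative; `cp1252` is used here only as a counter-example, it does not come from the thread):

```python
raw = bytes(range(256))  # every possible byte value

# ISO 8859-1 maps each byte to the code point with the same number,
# so decoding and re-encoding returns the original bytes unchanged.
text = raw.decode('iso8859-1')
roundtrip_ok = text.encode('iso8859-1') == raw

# Not every codec behaves this way: cp1252 leaves a few bytes undefined,
# so decoding arbitrary bytes with it can fail.
try:
    raw.decode('cp1252')
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False

print(roundtrip_ok, cp1252_ok)
```

If the encoding guessed by chardet is one that cannot round-trip every byte, re-encoding the title may fail or return different bytes, so a byte-preserving codec such as ISO 8859-1 is the safer carrier for this trick.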

  20. #20
    Sve@r
    Quote: originally posted by jurassic pork
    thank you papajoker for this expert advice, here is what I came up with:
    You are ready to take over the lib. All that is left is to contact this "Tiran" (the current maintainer), who has 40 projects on his plate, and offer it to him...

This discussion is resolved.