UTF-8 encoding problem

**roline** · 09/07/2018, 13h25

Hello everyone,
I have a problem encoding accented characters in Python 2. I have tried every solution that was suggested, but none of them works. I need to adapt Norvig's algorithm to French. To do that, it has to take accents into account, so I modified it as follows (my changes are in bold):

Code:

# -*- coding: utf-8 -*-
import re
from collections import Counter
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import codecs
def words(text): return re.findall(r'\w+', text.lower())

f     = codecs.open('big.txt',encoding='utf-8')
WORDS = Counter(words(f.read()))

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters     = ('abcdeéèfghijklmnopqrstuvwxyz')
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

But this gives me:
'dif\xc3\xa9rent'

I also tried changing the letters variable, since that is where the accented characters are handled, but nothing works...
Thanks in advance :-)
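
For anyone hitting the same symptom, here is a minimal sketch of one possible approach, not a confirmed fix for the script above. It assumes the cause is that `letters` is a byte string in Python 2, so iterating over it yields the two UTF-8 bytes of 'é' separately, and that `'dif\xc3\xa9rent'` is simply the repr of a byte string. The names `LETTERS`, `SAMPLE_TEXT`, `TOTAL` and `probability` are hypothetical, and `SAMPLE_TEXT` stands in for the contents of big.txt. The sketch leaves out `edits2` to stay short.

```python
# -*- coding: utf-8 -*-
from __future__ import division, unicode_literals

import re
from collections import Counter

# With unicode_literals this is a text string on Python 2 as well, so
# iterating over it yields 'é' and 'è' as single characters, not bytes.
LETTERS = 'abcdefghijklmnopqrstuvwxyzéè'

# Hypothetical stand-in for the text read from big.txt.
SAMPLE_TEXT = 'Un avis différent reste un avis. Chaque été est différent.'


def words(text):
    # re.UNICODE lets \w match accented letters on Python 2
    # (text patterns already behave this way on Python 3).
    return re.findall(r'\w+', text.lower(), flags=re.UNICODE)


WORDS = Counter(words(SAMPLE_TEXT))
TOTAL = sum(WORDS.values())


def probability(word):
    # True division thanks to the __future__ import; plain "/" on two
    # integers would return 0 for every word on Python 2.
    return WORDS[word] / TOTAL


def edits1(word):
    # Same edit operations as in the post, applied to text strings.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(candidates):
    return set(w for w in candidates if w in WORDS)


def correction(word):
    # Only one edit away here; the original also falls back to edits2.
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=probability)
```

If the word to correct comes from a Python 2 terminal as a byte string, it would probably need decoding first, for example `word.decode('utf-8')`, and the result should be shown with `print` rather than echoed in the interpreter, since the echo displays escape sequences instead of the accented character. The `codecs.open(..., encoding='utf-8')` call in the post already returns text, so that part should not need to change.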
