re.compile(r'[\x80-\xFF]')
[ for our case, replace with re.compile('^([^\r\n]+)[\r\n]{1,2}[^\r\n]+?-programme'+str(x)) ]
This pre-compiles a regular expression designed to find non-ASCII characters in the range 128–255 (0x80–0xFF).
Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII bytes in the range 128–255.
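As a quick sketch of what that means in Python 3 (my own example, not taken from the code under discussion), a str pattern like this one matches characters in the range U+0080 through U+00FF, so it happily finds the 'é' in a string:

import re

char_pattern = re.compile(r'[\x80-\xFF]')   # a *string* pattern
print(char_pattern.search('café'))          # matches 'é' (U+00E9): a character, not a byte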
And therein lies the problem.
In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead.
In Python 3, a string is always what Python 2 called a Unicode string - that is, an array of Unicode characters (of possibly varying byte lengths).
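A tiny illustration of that split (again, a hypothetical example of my own, not part of the original program):

text = 'café'                  # str: 4 Unicode characters
raw = text.encode('utf-8')     # bytes: 5 bytes, because 'é' takes 2 bytes in UTF-8
print(len(text), len(raw))     # 4 5
print(type(text), type(raw))   # <class 'str'> <class 'bytes'>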
Since this regular expression is defined by a string pattern, it can only be used to search a string - again, an array of characters. But what we’re searching is not a string, it’s a byte array.
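Here is roughly what that mismatch looks like in Python 3, along with one possible way out (a sketch under my own assumptions, not necessarily how the original code was eventually fixed):

import re

str_pattern = re.compile(r'[\x80-\xFF]')     # compiled from a str pattern
data = b'caf\xc3\xa9'                        # but the data we have is bytes

try:
    str_pattern.search(data)
except TypeError as exc:
    print(exc)   # cannot use a string pattern on a bytes-like object

# Compiling the pattern from bytes lets it search byte values 0x80-0xFF directly:
bytes_pattern = re.compile(rb'[\x80-\xFF]')
print(bytes_pattern.search(data))            # matches the 0xC3 byte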
Look carefully at the parameters used to open the file: 'rb'. 'r' is for "read"; OK, big deal, we’re reading the file. Ah, but 'b' is for "binary." Without the 'b' flag, [the program] would read the file, line by line, and convert each line into a string - an array of Unicode characters - according to the system default character encoding.
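The practical difference is easy to demonstrate (the file name here is just a throwaway example created for the occasion):

with open('sample.txt', 'w', encoding='utf-8') as f:   # create a small test file
    f.write('café')

with open('sample.txt', 'rb') as f:                    # binary mode: read() returns bytes
    raw = f.read()
print(type(raw), raw)                                  # <class 'bytes'> b'caf\xc3\xa9'

with open('sample.txt', 'r', encoding='utf-8') as f:   # text mode: read() returns str
    text = f.read()
print(type(text), repr(text))                          # <class 'str'> 'café'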