File parsing

**bioinfornatics** · 06/07/2013, 19h42

Hello,

To read FASTA files I wrote an iterable object. Since the operations I want to run on the data vary, I use lazy evaluation: the parser takes a function and applies it to each line as it is read.
This lets me reuse the same parser for every task, which avoids extra code to maintain.

Here is the result:

Code:

PAGESIZE            = 4096
DOUBLE_PAGESIZE     = PAGESIZE * 2
HUGE_PAGESIZE_2M    = 2097152
HUGE_PAGESIZE_1G    = 1073741824
 
# An iterable object for fasta file format
class Fasta(object):
 
    def __skipComment( self ):
        # Skip comment lines (starting with '#') and blank lines.
        # __eof is a method: it has to be called, a bare reference is always truthy.
        while not self.__eof() and ( self.__line.startswith( b'#' ) or len( self.__line.strip() ) == 0 ):
            self.__line = self.__descriptor.readline()

    def __eof( self ):
        # readline() returns an empty bytes object only at end of file
        return len( self.__line ) == 0

    def __init__( self, filepath, buffering = DOUBLE_PAGESIZE, func = lambda x : x.sequence ):
        self.__descriptor   = open( filepath, 'rb', buffering = buffering )
        # bytes, not str: the header and sequence properties decode them with str( ..., 'ascii' )
        self.__header       = b''
        self.__sequence     = b''
        self.__position     = 0
        self.__line         = b''
        self.__isNext       = False
        self.__func         = func

    def __iter__( self ):
        return self

    def __next__( self ):
        self.__line = self.__descriptor.readline()
        self.__skipComment()
        if self.__eof():
            raise StopIteration

        if self.__line.startswith( b'>' ):
            # Header line: remember it and expose an empty sequence,
            # so that the header text is never counted as bases.
            self.__isNext   = True
            self.__header   = self.__line.rstrip()
            self.__sequence = b''
        else:
            self.__isNext   = False
            self.__sequence = self.__line.rstrip()
        return self.__func(self)
 
    # return true when a header sequence is found
    @property
    def isNext( self ):
        return self.__isNext
 
    @property
    def header_bytes( self ):
        return self.__header
 
    @property
    def sequence_bytes( self ):
        return self.__sequence
 
    @property
    def header( self ):
        return str( self.__header, 'ascii' )
 
    @property
    def sequence( self ):
        return str( self.__sequence, 'ascii' )

It is used like this (here I compute the GC percentage):

Code:

def main():
    fa = Fasta(
                        'Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa', 
                        HUGE_PAGESIZE_2M,
                        lambda x : ( x.sequence_bytes.count(b'A'), x.sequence_bytes.count(b'C'), x.sequence_bytes.count(b'G'), x.sequence_bytes.count(b'T'), x.sequence_bytes.count(b'N') )
                    )
    base = [ 0, 0, 0, 0, 0 ] # A C G T N, same order as the tuple returned by the lambda
    for counter in fa:
        base[0] += counter[0]
        base[1] += counter[1]
        base[2] += counter[2]
        base[3] += counter[3]
        base[4] += counter[4]
    totalBase   = base[0] +  base[1] + base[2] + base[3] + base[4]
    gcBase      = base[1] +  base[2] # C + G
    print(  gcBase / totalBase * 100 )
 
if __name__ == '__main__':
    main()
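
For reference, the GC percentage is (G + C) divided by the number of bases counted, times 100. The sketch below uses two hypothetical helpers (`base_counts` and `gc_percent` are illustrative names, not part of the code above). It shows that the result depends on whether `N` is included in the denominator; the `main()` above includes `N` in `totalBase`.

```python
def base_counts(seq):
    """Count A, C, G, T and N in a bytes sequence (upper case only)."""
    return {base: seq.count(base.encode("ascii")) for base in "ACGTN"}


def gc_percent(counts, include_n=True):
    """Return the GC percentage from a dict of base counts.

    include_n decides whether N bases are part of the denominator.
    """
    gc = counts["G"] + counts["C"]
    total = gc + counts["A"] + counts["T"]
    if include_n:
        total += counts["N"]
    # Avoid a ZeroDivisionError on an empty sequence.
    return 100.0 * gc / total if total else 0.0


counts = base_counts(b"GGCCAATTNNNNNNNN")
print(gc_percent(counts, include_n=True))   # N counted in the total
print(gc_percent(counts, include_n=False))  # N ignored
```

On a heavily masked sequence the two conventions can give very different percentages, so it helps to state which one is used when comparing results.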

Code:

python3 -m cProfile  test.py 
7.2123760525895975
         21770358 function calls in 16.420 seconds
 
   Ordered by: standard name
 
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1979122    0.685    0.000    0.685    0.000 test3.py:10(__skipComment)
  1979123    1.015    0.000    1.256    0.000 test3.py:14(__eof)
        1    0.000    0.000    0.000    0.000 test3.py:17(__init__)
        1    0.000    0.000   16.420   16.420 test3.py:2(<module>)
        1    0.000    0.000    0.000    0.000 test3.py:26(__iter__)
   989562    4.587    0.000   14.908    0.000 test3.py:29(__next__)
  4947805    1.061    0.000    1.061    0.000 test3.py:54(sequence_bytes)
        1    1.512    1.512   16.420   16.420 test3.py:67(main)
   989561    3.899    0.000    6.977    0.000 test3.py:71(<lambda>)
        1    0.000    0.000    0.000    0.000 test3.py:8(Fasta)
        1    0.000    0.000    0.000    0.000 {built-in method __build_class__}
        1    0.000    0.000   16.420   16.420 {built-in method exec}
  1979124    0.242    0.000    0.242    0.000 {built-in method len}
        1    0.000    0.000    0.000    0.000 {built-in method open}
        1    0.000    0.000    0.000    0.000 {built-in method print}
  4947805    2.017    0.000    2.017    0.000 {method 'count' of 'bytes' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
   989562    0.491    0.000    0.491    0.000 {method 'readline' of '_io.BufferedReader' objects}
   989562    0.188    0.000    0.188    0.000 {method 'rstrip' of 'bytes' objects}
  1979122    0.725    0.000    0.725    0.000 {method 'startswith' of 'bytes' objects}
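
The report above is ordered by standard name, which makes the hot spots harder to spot. As a minimal sketch, the standard `cProfile` and `pstats` modules can produce the same kind of report sorted by cumulative time; `busy` and `profile_top` are placeholder names used only for this illustration.

```python
import cProfile
import io
import pstats


def busy(n):
    """Placeholder workload standing in for the real parsing code."""
    return sum(i * i for i in range(n))


def profile_top(func, *args, limit=5):
    """Run func under cProfile and return (result, report text).

    The report is sorted by cumulative time and limited to `limit` rows.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(limit)
    return result, stream.getvalue()


result, report = profile_top(busy, 1000)
print(report)
```

From the command line, `python3 -m cProfile -s cumtime test.py` should give the same ordering.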

Everything seems to work and it is pleasant to use, but the performance is not there: it takes about 10 seconds to parse 58 MB (16.4 s when run under cProfile, as shown above).

According to other users, it should be possible to get down to 3 seconds or less. Source: http://saml.rilspace.org/calculating...x-speedup-in-d

Is it possible to improve performance while keeping this flexibility (iterable object and lazy evaluation)?

For comparison, here is a simple version:

Code:

def main():
    file = open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa", "rb")
    gcCount = 0
    totalBaseCount = 0
 
    for line in file:
        if line.startswith(b">"):
            continue
        gc = line.count(b"G") + line.count(b"C")
        ta = line.count(b"T") + line.count(b"A")
        gcCount += gc
        totalBaseCount += gc + ta
 
    print(gcCount , totalBaseCount)
    gcFraction = float(gcCount) / totalBaseCount
    print( gcFraction * 100 )
 
 
if __name__ == '__main__':
    main()

Code:

python3 -m cProfile  test2.py 
3228502 8581482
37.62173013938618
         4947808 function calls in 3.598 seconds
 
   Ordered by: standard name
 
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    3.598    3.598 test2.py:1(<module>)
        1    1.863    1.863    3.598    3.598 test2.py:1(main)
        1    0.000    0.000    3.598    3.598 {built-in method exec}
        1    0.000    0.000    0.000    0.000 {built-in method open}
        2    0.000    0.000    0.000    0.000 {built-in method print}
  3958240    1.438    0.000    1.438    0.000 {method 'count' of 'bytes' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
   989561    0.297    0.000    0.297    0.000 {method 'startswith' of 'bytes' objects}
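
One possible direction, as a sketch that has not been timed on the 58 MB file: a generator function keeps the iterable, lazy interface while avoiding the per-line method and property calls that dominate the first profile. `fasta_lines` and `count_gc` are hypothetical names introduced for this illustration.

```python
def fasta_lines(path, func=lambda line: line):
    """Lazily yield func(line) for each sequence line of a FASTA file.

    Blank lines, '#' comment lines and '>' header lines are skipped.
    """
    with open(path, "rb") as handle:
        for raw in handle:
            line = raw.rstrip()
            if not line or line[:1] in (b"#", b">"):
                continue
            yield func(line)


def count_gc(path):
    """Return (gc_count, acgt_count), counting like the simple version above."""
    gc_count = 0
    total = 0
    counter = lambda s: (s.count(b"G") + s.count(b"C"),
                         s.count(b"A") + s.count(b"T"))
    for gc, at in fasta_lines(path, counter):
        gc_count += gc
        total += gc + at
    return gc_count, total
```

The function passed as `func` is only evaluated when the generator is consumed, so the same `fasta_lines` can be reused with a different function for another task.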
