Hello,
To read FASTA files, I wrote an iterable object. Because the operations I want to perform on the data vary, I use lazy evaluation.
Lazy evaluation lets me reuse the parser, which avoids maintaining several copies of the parsing code.
This gives the class below, followed by an example of how it is used (here I compute the GC percentage):
Code:

    PAGESIZE = 4096
    DOUBLE_PAGESIZE = PAGESIZE * 2
    HUGE_PAGESIZE_2M = 2097152
    HUGE_PAGESIZE_1G = 1073741824


    # An iterable object for the FASTA file format
    class Fasta(object):

        def __skipComment(self):
            # Skip comment lines (starting with '#') and blank lines.
            # __eof must be called: the bare attribute is always truthy.
            while not self.__eof() and (self.__line.startswith(b'#')
                                        or len(self.__line.lstrip()) == 0):
                self.__line = self.__descriptor.readline()

        def __eof(self):
            # readline() returns b'' only at end of file
            return len(self.__line) == 0

        def __init__(self, filepath, buffering=DOUBLE_PAGESIZE,
                     func=lambda x: x.sequence):
            self.__descriptor = open(filepath, 'rb', buffering=buffering)
            self.__header = b''
            self.__sequence = b''
            self.__position = 0
            self.__line = b''
            self.__isNext = False
            self.__func = func  # applied to the parser after each line

        def __iter__(self):
            return self

        def __next__(self):
            self.__line = self.__descriptor.readline()
            self.__skipComment()
            if self.__eof():
                raise StopIteration
            if self.__line.startswith(b'>'):
                # Header line: remember it, no sequence data on this line
                self.__isNext = True
                self.__header = self.__line.rstrip()
                self.__sequence = b''
            else:
                self.__isNext = False
                self.__sequence = self.__line.rstrip()
            return self.__func(self)

        # True when the current line is a sequence header
        @property
        def isNext(self):
            return self.__isNext

        @property
        def header_bytes(self):
            return self.__header

        @property
        def sequence_bytes(self):
            return self.__sequence

        @property
        def header(self):
            return str(self.__header, 'ascii')

        @property
        def sequence(self):
            return str(self.__sequence, 'ascii')
Code:

    def main():
        fa = Fasta('Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa',
                   HUGE_PAGESIZE_2M,
                   lambda x: (x.sequence_bytes.count(b'A'),
                              x.sequence_bytes.count(b'C'),
                              x.sequence_bytes.count(b'G'),
                              x.sequence_bytes.count(b'T'),
                              x.sequence_bytes.count(b'N')))
        base = [0, 0, 0, 0, 0]  # A C G T N, same order as the lambda
        for counter in fa:
            base[0] += counter[0]
            base[1] += counter[1]
            base[2] += counter[2]
            base[3] += counter[3]
            base[4] += counter[4]
        totalBase = base[0] + base[1] + base[2] + base[3] + base[4]
        gcBase = base[1] + base[2]  # C (index 1) + G (index 2)
        print(gcBase / totalBase * 100)


    if __name__ == '__main__':
        main()

Everything seems to work, and in the end it is pleasant to use. But the performance is not there: it takes me 10 seconds to parse 58 MB.
Code:

    python3 -m cProfile test.py
    7.2123760525895975
             21770358 function calls in 16.420 seconds

       Ordered by: standard name

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      1979122    0.685    0.000    0.685    0.000 test3.py:10(__skipComment)
      1979123    1.015    0.000    1.256    0.000 test3.py:14(__eof)
            1    0.000    0.000    0.000    0.000 test3.py:17(__init__)
            1    0.000    0.000   16.420   16.420 test3.py:2(<module>)
            1    0.000    0.000    0.000    0.000 test3.py:26(__iter__)
       989562    4.587    0.000   14.908    0.000 test3.py:29(__next__)
      4947805    1.061    0.000    1.061    0.000 test3.py:54(sequence_bytes)
            1    1.512    1.512   16.420   16.420 test3.py:67(main)
       989561    3.899    0.000    6.977    0.000 test3.py:71(<lambda>)
            1    0.000    0.000    0.000    0.000 test3.py:8(Fasta)
            1    0.000    0.000    0.000    0.000 {built-in method __build_class__}
            1    0.000    0.000   16.420   16.420 {built-in method exec}
      1979124    0.242    0.000    0.242    0.000 {built-in method len}
            1    0.000    0.000    0.000    0.000 {built-in method open}
            1    0.000    0.000    0.000    0.000 {built-in method print}
      4947805    2.017    0.000    2.017    0.000 {method 'count' of 'bytes' objects}
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       989562    0.491    0.000    0.491    0.000 {method 'readline' of '_io.BufferedReader' objects}
       989562    0.188    0.000    0.188    0.000 {method 'rstrip' of 'bytes' objects}
      1979122    0.725    0.000    0.725    0.000 {method 'startswith' of 'bytes' objects}
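A listing ordered by name makes the expensive calls hard to spot. As a minimal sketch, the standard `cProfile` and `pstats` modules can sort the same kind of report by cumulative time; `work` here is a hypothetical stand-in for `main()`, not the real parser.

```python
import cProfile
import io
import pstats


def work():
    # Hypothetical stand-in for main(): a few bytes.count calls
    data = b"ACGTN" * 1000
    return sum(data.count(base) for base in (b"G", b"C"))


profiler = cProfile.Profile()
profiler.enable()
result = work()
profiler.disable()

# Collect the report in a string, sorted by cumulative time,
# and keep only the five most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

From the command line, `python3 -m cProfile -s cumulative test.py` applies the same sort order.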
According to other users, it is possible to get down to 3 seconds or less. Source: http://saml.rilspace.org/calculating...x-speedup-in-d
Is it possible to improve the performance while keeping this flexibility (iterable object and lazy evaluation)?
For comparison, here is a simple version and its profile:
Code:

    def main():
        file = open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa", "rb")
        gcCount = 0
        totalBaseCount = 0
        for line in file:
            # Skip the FASTA header line
            if line.startswith(b">"):
                continue
            gc = line.count(b"G") + line.count(b"C")
            ta = line.count(b"T") + line.count(b"A")
            gcCount += gc
            totalBaseCount += gc + ta  # N bases are not counted
        print(gcCount, totalBaseCount)
        gcFraction = float(gcCount) / totalBaseCount
        print(gcFraction * 100)


    if __name__ == '__main__':
        main()
Code:

    python3 -m cProfile test2.py
    3228502 8581482
    37.62173013938618
             4947808 function calls in 3.598 seconds

       Ordered by: standard name

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000    3.598    3.598 test2.py:1(<module>)
            1    1.863    1.863    3.598    3.598 test2.py:1(main)
            1    0.000    0.000    3.598    3.598 {built-in method exec}
            1    0.000    0.000    0.000    0.000 {built-in method open}
            2    0.000    0.000    0.000    0.000 {built-in method print}
      3958240    1.438    0.000    1.438    0.000 {method 'count' of 'bytes' objects}
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       989561    0.297    0.000    0.297    0.000 {method 'startswith' of 'bytes' objects}
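Comparing the two profiles, `count` and `readline` cost about the same in both. The extra time in the class version goes to Python-level calls: `__next__`, the property lookups and the lambda. One direction that might keep the iterable, lazy interface with less per-line overhead is a generator function. The following is only a sketch under assumptions: `fasta_lines` and `gc_percent` are hypothetical names, headers are skipped rather than exposed, and it has not been timed on the chromosome Y file.

```python
import os
import tempfile


def fasta_lines(filepath, func=lambda seq: seq, buffering=8192):
    """Lazily yield func(line) for each sequence line of a FASTA file.

    Header ('>'), comment ('#') and blank lines are skipped.
    Hypothetical helper, not part of the original Fasta class.
    """
    with open(filepath, 'rb', buffering=buffering) as handle:
        for line in handle:
            if line.startswith((b'>', b'#')):
                continue
            line = line.rstrip()
            if line:
                yield func(line)


def gc_percent(filepath):
    """GC percentage over A, C, G and T (N is ignored)."""
    gc = total = 0
    # func returns (G + C count, A + T count) for one line
    counts = fasta_lines(
        filepath,
        lambda s: (s.count(b'G') + s.count(b'C'),
                   s.count(b'A') + s.count(b'T')))
    for gc_line, at_line in counts:
        gc += gc_line
        total += gc_line + at_line
    return gc / total * 100


# Small self-contained demonstration on a temporary file
with tempfile.NamedTemporaryFile('wb', suffix='.fa', delete=False) as tmp:
    tmp.write(b'>demo\nGGCC\nAATT\nNNNN\n')
demo_path = tmp.name

demo_gc = gc_percent(demo_path)                 # GC percentage
demo_lengths = list(fasta_lines(demo_path, len))  # same parser, other function
os.remove(demo_path)
print(demo_gc, demo_lengths)
```

The parser is still written once and the operation is still passed in as `func`, but each line now costs one generator step and one call to `func` instead of several method and property calls.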