Analyse XML - Problème de performance

**rambc** · 30/10/2012, 23h23

Bonsoir.

J'utilise actuellement un script pour extraire des infos de mes playlists iTunes à partir du fichier XML propre à iTunes. Or ce fichier étant volumineux, d'une taille de 52,6 Mo, l'utilisation de mon script est très consommatrice de mémoire vive. Je souhaite donc améliorer ceci.

Voici la structure générale du fichier XML. Tout d'abord les chansons sont stockées ainsi.

Code xml :

Sélectionner tout - Visualiser dans une fenêtre à part

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
<key>6642</key>
<dict>
    <key>Track ID</key><integer>6642</integer>
    <key>Name</key><string>...</string>
    <key>Artist</key><string>...</string>
    <key>Album Artist</key><string>...</string>
    <key>Composer</key><string>...</string>
    <key>Album</key><string>...</string>
    <key>Genre</key><string>...</string>
    <key>Kind</key><string>...</string>
    <key>Size</key><integer>...</integer>
    <key>Total Time</key><integer>...</integer>
    <key>Track Number</key><integer>...</integer>
    <key>Year</key><integer>...</integer>
    <key>Date Modified</key><date>...</date>
    <key>Date Added</key><date>...</date>
    <key>Bit Rate</key><integer>...</integer>
    <key>Sample Rate</key><integer>...</integer>
    <key>Persistent ID</key><string>...</string>
    <key>Track Type</key><string>...</string>
    <key>Location</key><string>...</string>
    <key>File Folder Count</key><integer>...</integer>
    <key>Library Folder Count</key><integer>...</integer>
</dict>

Les playlists sont stockées comme suit.

Code xml :

Sélectionner tout - Visualiser dans une fenêtre à part

1
2
3
4
5
6
7
8
9
10
11
12
13
14
<dict>
    <key>Name</key><string>nameOfThPlaylist</string>
    <key>Playlist ID</key><integer>87635</integer>
    <key>Playlist Persistent ID</key><string>9A4C8A8C60D68000</string>
    <key>Parent Persistent ID</key><string>60F56F7F0F05D2B3</string>
    <key>All Items</key><true/>
    <key>Playlist Items</key>
    <array>
        <dict>
            <key>Track ID</key><integer>12010</integer>
        </dict>
        ...
    </array>
</dict>

Globalement, le fichier XML ressemble à ceci.

Code xml :

Sélectionner tout - Visualiser dans une fenêtre à part

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
    <dict>
        <key>Major Version</key><integer>1</integer>
        <key>Minor Version</key><integer>1</integer>
        <key>Application Version</key><string>9.2</string>
        <key>Features</key><integer>5</integer>
        <key>Show Content Ratings</key><true/>
        <key>Music Folder</key><string>...</string>
        <key>Library Persistent ID</key><string>44A5D91826563D8D</string>
 
        <key>Tracks</key>
        <dict>
            The definitions of the song (see before)
        </dict>
 
        <key>Playlists</key>
        <array>
            <dict>
                <key>Name</key><string>...</string>
                <key>Playlist ID</key><integer>...</integer>
                <key>Playlist Persistent ID</key><string>AF0AD7E90AB4E728</string>
                <key>Parent Persistent ID</key><string>B03891ABA2B5AB4C</string>
                <key>All Items</key><true/>
                <key>Folder</key><true/>
                <key>Playlist Items</key>
                <array>
                    <dict>
                        <key>Track ID</key><integer>...</integer>
                    </dict>
                    <dict>
                        <key>Track ID</key><integer>...</integer>
                    </dict>
                </array>
            </dict>
            <dict>
                The definitions of the playlists in the folder.
            </dict>
        </array>
    </dict>
</plist>

Mon code Python ressemble à l'horreur qui suit.

Code python :

Sélectionner tout - Visualiser dans une fenêtre à part

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
#! /usr/bin/env python
 
import xml.etree.ElementTree as etree
import urllib.request
 
SONGS = {}
PLAYLISTS_ID_CONTAINER = {}
PLAYLISTS_FOLDER = {}
FOLDERS = {'': {}} # This is for playlists that are not in a folder.
 
ALIAS = {
    'Track ID': "newSong",
    'Name' : "songName",
    'Artist' : "artistName",
    'Composer': "composer",
    'Album': "album",
    'Genre': "style",
    'Total Time': "time_ms",
    'Location': "path"
}
 
 
def getSongs(elem):
    lastKey = ''
    newSong = False
    text = elem.text or ""
 
    for e in elem:
        textInsideCurrentTag = getSongs(e).strip()
# A new song
        if e.tag == 'key':
            if textInsideCurrentTag in ALIAS:
                lastKey = ALIAS[textInsideCurrentTag]
 
        if e.tag == 'integer':
            if lastKey == "newSong":
                currentIndexSong = cleanSpecialCharacters(textInsideCurrentTag)
                SONGS[currentIndexSong] = {}
                lastKey = ""
 
            elif lastKey in [ "time_ms" ]:
                SONGS[currentIndexSong][lastKey] = cleanSpecialCharacters(textInsideCurrentTag)
                lastKey = ""
 
        elif e.tag == 'string' and lastKey == "path":
            textInsideCurrentTag = cleanSpecialCharacters(textInsideCurrentTag)
            if textInsideCurrentTag.startswith(r'file://localhost'):
                textInsideCurrentTag = textInsideCurrentTag [len(r'file://localhost'):]
 
            SONGS[currentIndexSong][lastKey] = cleanSpecialCharacters(textInsideCurrentTag)
            lastKey = ""
 
        elif e.tag == 'string' \
        and lastKey in [
            "songName",
            "artistName" ,
            "composer",
            "album",
            "style"
        ]:
            SONGS[currentIndexSong][lastKey] = cleanSpecialCharacters(textInsideCurrentTag)
            lastKey = ""
 
        if e.tail:
            text += e.tail
 
    return text
 
 
def getPlaylist(elem):
    global THE_PLAYLIST_IDS
    global PLAYLIST_NAME
    global WE_HAVE_FOLDER
    global ID_CONTAINER
    global ID_OBJECT
 
    lastKey = ''
    text = elem.text or ""
 
    for e in elem:
        textInsideCurrentTag = getPlaylist(e).strip()
# A new song
        if e.tag == 'key':
            if textInsideCurrentTag == 'Track ID':
                lastKey = ALIAS[textInsideCurrentTag]
 
            elif textInsideCurrentTag == 'Folder':
                WE_HAVE_FOLDER = True
 
            elif textInsideCurrentTag == 'Name':
               lastKey = 'playlistName'
 
            elif textInsideCurrentTag == 'Parent Persistent ID':
               lastKey = 'inFolder'
 
            elif textInsideCurrentTag == 'Playlist Persistent ID':
               lastKey = 'identity'
 
        elif e.tag == 'integer' and  lastKey == "newSong":
            THE_PLAYLIST_IDS.append(textInsideCurrentTag)
            lastKey = ""
 
        elif  e.tag == 'string':
            if PLAYLIST_NAME == '' and lastKey == 'playlistName':
                PLAYLIST_NAME = e.text.strip()
                lastKey = ""
            elif lastKey == 'inFolder':
                ID_CONTAINER = e.text.strip()
                lastKey = ""
            elif lastKey == 'identity':
                ID_OBJECT = e.text.strip()
                lastKey = ""
 
        if e.tail:
            text += e.tail
 
    return text
 
 
def printFolder(idObject):
    global FOLDERS
 
    if 'container' in FOLDERS[idObject]:
        if FOLDERS[idObject]['container']:
            return printFolder(FOLDERS[idObject]['container']) + '/' + FOLDERS[idObject]['name']
        return FOLDERS[idObject]['name']
 
    return ''
 
 
# Here we go...
print("",
      "===== READING iTunes Library INFOS =====",
      "1) Building the tree of the XML iTunes file. This can take a long time...",
      sep = '\n')
tree = etree.parse(pathOfThePlaylist)
 
print("2) Looking for all the songs. This can take a long time...")
getSongs(tree.getroot().find('dict/dict'))
 
print("3) Looking for all the playlists...")
folders = []
for onePlaylist in tree.getroot().findall('dict/array/dict'):
    THE_PLAYLIST_IDS = []
    PLAYLIST_NAME = ''
    WE_HAVE_FOLDER = False
    ID_CONTAINER = ''
    ID_OBJECT = ''
 
    getPlaylist(onePlaylist)
 
    if WE_HAVE_FOLDER:
        FOLDERS[ID_OBJECT] = { 'name': PLAYLIST_NAME,
                               'container': ID_CONTAINER
                                 }
    else:
        if ID_CONTAINER in PLAYLISTS_ID_CONTAINER:
            PLAYLISTS_ID_CONTAINER[ID_CONTAINER][PLAYLIST_NAME]= THE_PLAYLIST_IDS
        else:
            PLAYLISTS_ID_CONTAINER[ID_CONTAINER] = {PLAYLIST_NAME: THE_PLAYLIST_IDS}
 
for oneIdContainer in PLAYLISTS_ID_CONTAINER:
    PLAYLISTS_FOLDER[printFolder(oneIdContainer)] = PLAYLISTS_ID_CONTAINER[oneIdContainer]
 
print("4) Reading iTunes Library infos is finished.")

Dans le code ci-dessus, c'est l'utilisation de etree.parse(pathOfThePlaylist) qui pose problème. Comment éviter son emploi ?

**Stopher** · 04/11/2012, 12h18

Salut,

je suppose que ton script consomme plus de 52.6 Mo de ram dés le départ due au chargement du fichier XML par ET.

J'ai cherché un peu sur le net des solutions à cette problématique, et je suis tombé sur cet article qui peut selon moi répondre à ton besoin:

http://www.ibm.com/developerworks/xm...x-hiperfparse/

http://lxml.de/

D’après ce que j'ai compris, cela permet de spécifier ( et donc charger en mémoire ) uniquement la portion du document qui t’intéresse.

Ch.

**rambc** · 04/11/2012, 12h33

Bonjour.

Envoyé par Stopher

je suppose que ton script consomme plus de 52.6 Mo de ram dés le départ due au chargement du fichier XML par ET.

Oui. C'est la catastrophe.

lxml, je connais mais étant sous Mac, c'est un peu pénible à utiliser. Je vais m'y mettre car là je n'ai plus le choix.

Dès que je trouve une solution, je la poste ici.

**wiztricks** · 04/11/2012, 13h15

Salut,

Si les soucis sont dus à la taille du fichier, SAX permettra de le parcourir sans tout charger (d'un coup) en mémoire. Cette méthode est supportée par les bibliothèques standards (voir sur Google pour les exemples d'utilisation).

Un des gros avantages de lxml, bibliothèque externe C, était la performance.
La bibliothèque Python "standard" dispose maintenant (2.7 et +) aussi de sa mouture C via le module cElementTree.

note: Je n'ai pas fait de tests de performances récemment mais s'il faut chercher des différences ce sera plutôt côté API/fonctionnalités/réutilisation de savoir faire.

Ceci dit si SAX a l'avantage de ne pas exploser la mémoire des le départ, il faudra bien balayer le fichier et en sortir les informations recherchées: il est probable qu'on "lisse" la charge mais la mise a disposition des informations recherchées pour les traitements suivants risque de durer aussi longtemps.

La bonne nouvelle est que ces fichiers XML d'OSX sont des "plists". Python inclus le module plistlib qui permet de lire ces fichiers via le module xml.parsers.expat qui est un parseur SAX écrit en C.
=> En gros vous n'avez pas grand chose à coder pour vous mettre dans une configuration "performante".

- W

**rambc** · 04/11/2012, 13h59

Il va falloir que je fasse une formation Compétence Google car je suis un peu nul.

Merci pour plistlib qui fait le boulot proprement sans trop manger de mémoire vive.

wiztricks, merci pour cette info très précieuse.

**wiztricks** · 04/11/2012, 14h30

Salut,

Envoyé par rambc

Merci pour plistlib qui fait le boulot proprement sans trop manger de mémoire vive.
wiztricks, merci pour cette info très précieuse.

Je ne sais pas si plistlib va résoudre ces soucis de performances: tel que construit, çà devrait aider (un peu).
Sa mise en œuvre étant "simple", vous devriez pouvoir nous dire assez rapidement si c'est satisfaisant ou pas.

Il va falloir que je fasse une formation Compétence Google car je suis un peu nul.

Je n'ai pas mis ces liens car plistlib faisant déjà le boulot, voir comment faire du SAX avec les librairies standards n'est plus le sujet.

- W

Analyse XML - Problème de performance

Python

Vue hybride

Discussions similaires

Partager

Partager