Problem generating a CSV

**Hypnosys** · 05/01/2018, 17h21

Hello everyone,

I'm trying to build a CSV file that I can import into a link analysis tool (Cytoscape).

My analysis is about the links between film actors. Basically, I consider that a link exists between two actors as soon as they have appeared in the same film.
To collect this information I use a script that pulls the data I need from an HTML page.

For now my data is generated in this form: Title, Actor1, Actor2, Actor3
Example: Toy Story , Tom Hanks, Tim Allen, Don Rickles
Jumanji , Robin Williams, Kirsten Dunst, Bonnie Hunt
...

The problem is that even though the structure of my file is not too bad, I can't build a complete network from it in my software.

So I would rather have code that generates the links like this:
Tom Hanks - Tim Allen
Tom Hanks - Don Rickles
Tim Allen - Don Rickles
Robin Williams - Kirsten Dunst
Robin Williams - Bonnie Hunt
Kirsten Dunst - Bonnie Hunt
....
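
For illustration, the expansion described above (every pair of actors from the same film) could be sketched as follows. This is only a sketch under assumptions: the rows are assumed to already be available as lists of the form [title, actor1, actor2, ...], and the helper name actor_pairs is hypothetical, not part of the script below.

```python
from itertools import combinations


def actor_pairs(rows):
    """Yield (actor_a, actor_b) for every pair of actors sharing a film.

    Each row is assumed to look like [title, actor1, actor2, ...].
    """
    for row in rows:
        # Drop the title, strip stray spaces and ignore empty cells
        actors = [name.strip() for name in row[1:] if name.strip()]
        # combinations() yields each unordered pair exactly once
        for pair in combinations(actors, 2):
            yield pair


# Sample rows copied from the example above
rows = [
    ["Toy Story ", " Tom Hanks", " Tim Allen", " Don Rickles"],
    ["Jumanji ", " Robin Williams", " Kirsten Dunst", " Bonnie Hunt"],
]

for a, b in actor_pairs(rows):
    print(a + " - " + b)
```

With three actors per film this gives three links per film; a film with n actors gives n * (n - 1) / 2 links.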

I'm attaching my full code (it is also meant to do a word-frequency analysis and other things later on, but that is not the point here).

If you have any idea how to change the CSV generation so that it outputs a table like the one described above, you would be a lifesaver:

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-


# --------------------------------------------------------------------------
# Packages to import (Python 2)
import codecs
import csv
import urllib2
from HTMLParser import HTMLParser
from collections import Counter
import json
import elasticsearch
import re
import string

# Accumulates the CSV rows (one per film)
data = ""
 
 
# --------------------------------------------------------------------------
# Class that parses the content of an IMDb page
class IMDBHTMLParser(HTMLParser):
 
    def __init__(self):
        HTMLParser.__init__(self)
        self.curDepth = 0
        self.highestScriptTag = 0  # Depth of the outermost open script tag (0 means we are not inside a script)
        self.content = ""
 
    def handle_starttag(self, tag, attrs):
        self.curDepth += 1
        if ((self.highestScriptTag == 0) and (tag == "script")):
            self.highestScriptTag = self.curDepth
 
    def handle_endtag(self, tag):
        if ((self.highestScriptTag == self.curDepth) and (tag == "script")):
            self.highestScriptTag = 0
        self.curDepth -= 1
 
    def handle_data(self, data):
        if (self.highestScriptTag != 0):  # We are inside a script
            return
        self.content += data.strip()  # strip leading and trailing whitespace
 
 
# --------------------------------------------------------------------------
# Read the file that maps internal ids to IMDb ids
ids = dict()
with open('ml-latest-small/links.csv', 'rb') as csvfile:
    csvfile.readline()  # Skip the header line
    links = csv.reader(csvfile, delimiter=',')
    for row in links:
        ids[row[0]] = row[1]
 
# --------------------------------------------------------------------------
# Go through the movies file and analyse the corresponding HTML page
nbTreats = 0
with open('ml-latest-small/movies.csv', 'rb') as csvfile:
    csvfile.readline()  # Skip the header line
    movies = csv.reader(csvfile, delimiter=',')
    for row in movies:
        nbTreats = nbTreats + 1
        imdbId = ids[row[0]]
        imdbURL = 'http://www.imdb.com/title/tt' + imdbId
        response = urllib2.urlopen(imdbURL)
        html = response.read()
 
        # working code to extract the film title (example)
        var = "meta property='og:title' content=\""

        indice = html.index(var)
        texte_provisoire = html[indice:]
        indice_fin = texte_provisoire.index("(")
        texte1 = texte_provisoire[len(var):indice_fin]
        print("Film title:")
        print(texte1)

        # code to extract the film's director
        var = "<meta name=\"description\" content=\"Directed by "

        indice = html.index(var)
        texte_provisoire = html[indice:]
        indice_fin = texte_provisoire.index(".")
        texte2 = texte_provisoire[len(var):indice_fin]
        print("Director:")
        print(texte2)

        # code to extract the actors
        var = "With"

        indice = html.index(var)
        texte_provisoire = html[indice:]
        indice_fin = texte_provisoire.index(".")
        texte3 = texte_provisoire[len(var):indice_fin]

        # removes the commas between the actor names
        texte3 = texte3.replace(',', '')
        print("Actors:")
        print(texte3)
 
        # code to extract the budget

        var = "<h4 class=\"inline\">Budget:</h4>"
        try:
            indice = html.index(var)
            texte_provisoire = html[indice:]
            indice_fin = texte_provisoire.index("<span class=\"attribute\">(estimated)")
            texte4 = texte_provisoire[len(var):indice_fin]
            texte4 = texte4.replace(',', '')
            print("Budget:")
            print(texte4)
        except ValueError:  # str.index raises ValueError when a marker is missing
            texte4 = "/"
            print("Budget:")
            print(texte4)
 
        # code to extract the film's running time

        var = "<time itemprop=\"duration\" datetime=\"PT"

        indice = html.index(var)
        texte_provisoire = html[indice:]
        indice_fin = texte_provisoire.index("M")
        texte5 = texte_provisoire[len(var):indice_fin]
        print("Duration in minutes:")
        print(texte5)

        # code to extract the storyline

        var = "<div class=\"inline canwrap\" itemprop=\"description\">" + "\n"

        indice = html.index(var)
        texte_provisoire = html[indice:]
        indice_fin = texte_provisoire.index("<em")
        texte6 = texte_provisoire[len(var):indice_fin]
        print("Storyline:")
        print(texte6+"\n")
 
        # remove stop words
        stopwords = ['-','<p>','does','especially', 'Movies', 'TV', 'IMDb', 'a', 'about', 'above', 'above', 'across', 'after',
                     'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although',
                     'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow',
                     'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became',
                     'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'bein',
                     'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call',
                     'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'do',
                     'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere',
                     'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere',
                     'except', 'few', 'fifteen', 'fify', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former',
                     'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go',
                     'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein',
                     'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'ie', 'if',
                     'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter',
                     'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill',
                     'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name',
                     'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone',
                     'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only',
                     'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own',
                     'part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same', 'see', 'seem', 'seemed',
                     'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'sincere',
                     'six', 'sixty', 'so', 'some', 'somehow', 'someone', 'something', 'sometime', 'sometimes',
                     'somewhere', 'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 'them',
                     'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein',
                     'thereupon', 'these', 'they', 'thickv', 'thin', 'third', 'this', 'those', 'though', 'three',
                     'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards',
                     'twelve', 'twenty', 'two', 'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we',
                     'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas',
                     'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who',
                     'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet',
                     'you', 'your', 'yours', 'yourself', 'yourselves', 'the']
 
        # lowercase the storyline and replace . , ) ( with spaces
        text4 = texte6.lower()
        for char in ".,)(":
            text4 = text4.replace(char, " ")

        text = ' '.join([word for word in text4.split() if word not in stopwords])
 
        # build the word list used to count occurrences
        maListe = text.split(" ")

        # Occurrences
        X = Counter(maListe)
        print("Most frequent words:")
        print(X.most_common(15))
 
        # prints a separator line between films
        print("-------------------------")

        # code to convert to CSV: append this film and rewrite the file
        data = data + texte1 + "," + texte2 + "," + texte3 + "," + texte4 + "," + texte5 + "," + texte6 + "\n"

        with open('data.csv', 'wb') as file:
            file.write(data)

        parser = IMDBHTMLParser()
        parser.feed(html)

        # code to convert to JSON (files are closed by the with block)
        fieldnames = ("Title", "Director", "Actors", "Budget", "Durée", "Storyline")
        with open('data.csv', 'r') as csv_in, open('file.json', 'w') as jsonfile:
            reader = csv.DictReader(csv_in, fieldnames)
            out = json.dumps([row for row in reader])
            jsonfile.write(out)

        if (nbTreats == 1):  # For the first tests, start by fetching a single file. Remove afterwards
            break
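
As a rough idea of what the CSV step could write instead, here is a sketch of an edge list with one row per pair of actors, plus a count of how many films they share. It is not a tested replacement for the code above, and it rests on assumptions: it targets Python 3 (the script above is Python 2), the actors of each film are assumed to be available as a list (the script currently strips the commas that separate them), and the names write_edges, films and the Source/Target/Weight headers are hypothetical.

```python
import csv
import io
from collections import Counter
from itertools import combinations


def write_edges(films, out):
    """Write one 'Source,Target,Weight' row per pair of co-starring actors.

    films: iterable of (title, [actor, ...]) tuples (assumed structure).
    out: a text file object opened for writing.
    """
    weights = Counter()
    for _title, actors in films:
        # Sort so that (A, B) and (B, A) count as the same link
        for pair in combinations(sorted(set(actors)), 2):
            weights[pair] += 1

    # csv.writer takes care of quoting names that contain commas
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(weights.items()):
        writer.writerow([source, target, weight])


# Illustrative sample data, not output of the script above
films = [
    ("Toy Story", ["Tom Hanks", "Tim Allen", "Don Rickles"]),
    ("Toy Story 2", ["Tom Hanks", "Tim Allen", "Joan Cusack"]),
]

buffer = io.StringIO()
write_edges(films, buffer)
print(buffer.getvalue())
```

To write to disk, the same function can be given a real file, for example one opened with open('edges.csv', 'w', newline=''). The first two columns should then be usable as source and target nodes when importing the table as a network.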

Thanks in advance!
