I'm taking a little break with Python/MapReduce while waiting to get back to Spark, which is as big a subject as Hadoop.
I adapted a Python program for Hadoop 2.6 in streaming mode.
If anyone is interested, having an example helps a lot.
Here is the first part of my TF-IDF; the second part may come this weekend.
Here is the content of my lettre1.txt file, which will be copied to HDFS: a document ID followed by the words to count.
The mapper, WCMapper.py:

Code:
#!/usr/bin/env python
# WCMapper.py -- reads "docID,word word word ..." lines and emits, for
# each word: word \t docID \t 1 \t <number of words on the line>
import sys

for enr in sys.stdin:
    enr = enr.rstrip('\n')
    fields = enr.split(",")
    documentID = fields[0]
    mots = fields[1].split(" ")
    for unMot in mots:
        print >>sys.stdout, "%s\t%s\t1\t%i" % (unMot, documentID, len(mots))
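For a quick check without a cluster, the mapper's per-line logic can be exercised as a plain function. This is only a local sketch; the sample line is made up in the expected "docID,words" format, not taken from the real lettre1.txt:

```python
# Local sketch of the mapper's per-line logic (hypothetical input line).
def map_line(line):
    doc_id, text = line.rstrip('\n').split(',', 1)
    mots = text.split(' ')
    # one record per word: word \t docID \t 1 \t words-on-the-line
    return ['%s\t%s\t1\t%d' % (mot, doc_id, len(mots)) for mot in mots]

for record in map_line('H1,Useful data is the thing.'):
    print(record)
```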
The reducer, WCReducer.py:

Code:
#!/usr/bin/env python
# WCReducer.py -- input arrives sorted on (word, docID).  Emits one line
# per (word, document) with the count and the term frequency, plus one
# line per word ("word \t 0000 \t n") giving the number of documents
# that contain it (needed later for the IDF).
import sys

motCourrant = ''
documentCourrant = ''
docCourrantCpt = 0     # occurrences of the word in the current document
nbDocs = 0             # documents containing the current word
totalMotsInDoc = 0.0

for enr in sys.stdin:
    fields = enr.rstrip('\n').split('\t')
    if len(fields) < 4:
        continue
    mot, doc = fields[0], fields[1]
    if mot != motCourrant or doc != documentCourrant:
        if docCourrantCpt > 0:
            # flush the previous (word, document) group
            print >>sys.stdout, "%s\t%s\t%i\t%f" % \
                (motCourrant, documentCourrant, docCourrantCpt,
                 docCourrantCpt / totalMotsInDoc)
        if mot != motCourrant:
            if motCourrant != '':
                # flush the previous word's document frequency
                print >>sys.stdout, "%s\t0000\t%i" % (motCourrant, nbDocs)
            motCourrant = mot
            nbDocs = 0
        documentCourrant = doc
        docCourrantCpt = 0
        nbDocs += 1
    docCourrantCpt += 1
    totalMotsInDoc = float(fields[3])

# flush the last (word, document) group and the last word
if docCourrantCpt > 0:
    print >>sys.stdout, "%s\t%s\t%i\t%f" % \
        (motCourrant, documentCourrant, docCourrantCpt,
         docCourrantCpt / totalMotsInDoc)
    print >>sys.stdout, "%s\t0000\t%i" % (motCourrant, nbDocs)
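The reducer relies on hadoop-streaming sorting the records on the first two fields, so all lines for one (word, document) pair arrive consecutively. The same grouping can be sketched in plain Python on toy records (made-up data, not the real job output):

```python
from itertools import groupby

# toy mapper output, already sorted on (word, doc) as the shuffle would do;
# each record is (word, doc, 1, words-on-the-source-line)
records = [
    ('A', 'H1', 1, 5), ('A', 'H2', 1, 5),
    ('is', 'H1', 1, 5), ('is', 'H2', 1, 5), ('is', 'H2', 1, 5),
]

tf = {}   # (word, doc) -> term frequency
df = {}   # word -> number of documents containing it
for (mot, doc), grp in groupby(records, key=lambda r: (r[0], r[1])):
    grp = list(grp)
    tf[(mot, doc)] = len(grp) / float(grp[0][3])
    df[mot] = df.get(mot, 0) + 1

print(tf[('is', 'H2')])   # 2 occurrences out of 5 words -> 0.4
print(df['A'])            # 'A' appears in 2 documents
```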
The launch script, wc.sh:

Code:
clean() {
  echo -e "removing directories on HDFS \n"
  hadoop fs -rm -r tempwc
  hadoop fs -rm -r output-streaming
  hadoop fs -rm -r Lettres
}

load() {
  echo -e "\n creating directories and loading data\n"
  hadoop fs -mkdir Lettres
  hadoop fs -copyFromLocal lettre1.txt Lettres
}

clean
load

echo -e "\nRunning Wordcount python hadoop 2.6 \n"
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
  -input Lettres -output tempwc \
  -mapper WCMapper.py -reducer WCReducer.py \
  -jobconf stream.num.map.output.key.fields=2 \
  -jobconf stream.num.reduce.output.key.fields=2 \
  -jobconf mapred.reduce.tasks=1 \
  -file WCMapper.py -file WCReducer.py

echo -e "\nTF done, displaying the contents of the result directory\n"
hadoop fs -cat tempwc/*
Job output (excerpt):

Code:
hduser@stargate:~/Resource-Bundle/jbepy$ ./wc.sh
removing directories on HDFS

15/09/02 22:40:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
rm: `tempwc': No such file or directory
15/09/02 22:40:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
rm: `output-streaming': No such file or directory
15/09/02 22:40:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/02 22:40:51 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted Lettres

creating directories and loading data

15/09/02 22:40:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/02 22:40:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Running Wordcount python hadoop 2.6

15/09/02 22:40:56 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
15/09/02 22:40:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/02 22:40:56 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
15/09/02 22:40:56 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
packageJobJar: [WCMapper.py, WCReducer.py, /tmp/hadoop-unjar5220630186872127756/] [] /tmp/streamjob529516712459132480.jar tmpDir=null
15/09/02 22:40:57 INFO client.RMProxy: Connecting to ResourceManager at stargate/192.168.0.11:8032
15/09/02 22:40:57 INFO client.RMProxy: Connecting to ResourceManager at stargate/192.168.0.11:8032
15/09/02 22:40:58 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/02 22:40:58 INFO mapreduce.JobSubmitter: number of splits:2
15/09/02 22:40:58 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1439891988258_0041
15/09/02 22:40:58 INFO impl.YarnClientImpl: Submitted application application_1439891988258_0041
15/09/02 22:40:58 INFO mapreduce.Job: The url to track the job: http://stargate:8088/proxy/applicati...91988258_0041/
15/09/02 22:40:58 INFO mapreduce.Job: Running job: job_1439891988258_0041
15/09/02 22:41:03 INFO mapreduce.Job: Job job_1439891988258_0041 running in uber mode : false
15/09/02 22:41:03 INFO mapreduce.Job: map 0% reduce 0%
15/09/02 22:41:09 INFO mapreduce.Job: map 100% reduce 0%
15/09/02 22:41:22 INFO mapreduce.Job: map 100% reduce 100%
15/09/02 22:41:22 INFO mapreduce.Job: Job job_1439891988258_0041 completed successfully
15/09/02 22:41:22 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=913
                FILE: Number of bytes written=337096
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=882
                HDFS: Number of bytes written=1871
                HDFS: Number of read operations=9
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=2
                Launched reduce tasks=1
                Data-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=6714
                Total time spent by all reduces in occupied slots (ms)=21450
                Total time spent by all map tasks (ms)=6714
                Total time spent by all reduce tasks (ms)=10725
                Total vcore-seconds taken by all map tasks=6714
                Total vcore-seconds taken by all reduce tasks=10725
                Total megabyte-seconds taken by all map tasks=10312704
                Total megabyte-seconds taken by all reduce tasks=32947200
        Map-Reduce Framework
                Map input records=15
                Map output records=62
                Map output bytes=783
                Map output materialized bytes=919
                Input split bytes=216
                Combine input records=0
                Combine output records=0
                Reduce input groups=62
                Reduce shuffle bytes=919
                Reduce input records=62
                Reduce output records=113
                Spilled Records=124
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=143
                CPU time spent (ms)=4900
                Physical memory (bytes) snapshot=1906573312
                Virtual memory (bytes) snapshot=7287844864
                Total committed heap usage (bytes)=2355101696
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=666
        File Output Format Counters
                Bytes Written=1871
15/09/02 22:41:22 INFO streaming.StreamJob: Output directory: tempwc

TF done, displaying the contents of the result directory

15/09/02 22:41:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
A	H1	1	0.200000
A	H2	1	0.200000
A	0000	2
An	H1	1	0.250000
An	H3	1	0.200000
An	0000	2
And	H2	1	0.200000
And	0000	1
Are	H2	1	0.500000
Are	0000	1
Because	H3	1	1.000000
Because	0000	1
But	H2	1	0.166667
But	0000	1
Elephant	H1	1	0.200000
Elephant	0000	1
HDFS	H2	1	0.200000
HDFS	0000	1
Hadoop	H1	1	0.200000
Hadoop	H3	1	0.200000
Hadoop	0000	2
Hadoop.	H2	1	0.200000
Hadoop.	0000	1
He	H1	1	0.333333
He	H3	1	0.250000
He	0000	2
Impala	H2	1	0.500000
Impala	0000	1
King!	H1	1	0.200000
King!	0000	1
Or	H3	1	0.250000
Or	0000	1
Sqoop.	H2	1	0.166667
Sqoop.	0000	1
The	H2	1	0.166667
The	0000	1
Useful	H1	1	0.500000
Useful	0000	1
an	H3	1	0.200000
an	0000	1
and	H1	1	0.200000
and	H3	1	0.200000
and	0000	2
anything	H3	1	0.250000
anything	0000	1
bad	H3	1	0.250000
bad	0000	1
cling!	H1	1	0.250000
cling!	0000	1
data	H1	1	0.500000
data	0000	1
does	H3	1	0.250000
does	0000	1
elegant	H1	1	0.200000
elegant	H3	1	0.200000
elegant	0000	2
element	H1	1	0.250000
element	0000	1
elephant	H2	1	0.166667
elephant	H3	1	0.200000
elephant	0000	2
extraneous	H1	1	0.250000
extraneous	0000	1
fellow.	H3	1	0.200000
fellow.	0000	1
forgets	H1	1	0.333333
forgets	0000	1
gentle	H3	1	0.200000
gentle	0000	1
gets	H3	1	0.250000
gets	0000	1
group.	H2	1	0.200000
group.	0000	1
helps	H2	1	0.166667
helps	0000	1
him	H2	1	0.166667
him	0000	1
in	H2	1	0.200000
in	0000	1
is	H1	1	0.200000
is	H2	1	0.200000
is	H3	1	0.200000
is	0000	3
king	H2	1	0.200000
king	0000	1
mad	H3	1	0.250000
mad	0000	1
mellow.	H3	1	0.200000
mellow.	0000	1
never	H1	1	0.333333
never	H3	1	0.250000
never	0000	2
plays	H2	1	0.166667
plays	0000	1
the	H1	1	0.200000
the	H2	1	0.200000
the	0000	2
thing.	H1	1	0.200000
thing.	0000	1
thrive	H2	1	0.166667
thrive	0000	1
to	H2	1	0.166667
to	0000	1
well	H2	1	0.166667
well	0000	1
what	H2	1	0.166667
what	0000	1
with	H2	1	0.166667
with	0000	1
wonderful	H2	1	0.200000
wonderful	0000	1
yellow	H1	1	0.200000
yellow	0000	1
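The "word \t 0000 \t n" lines above give, for each word, the number of documents that contain it; that is the ingredient the second part will need for the IDF. As a teaser, here is one common formula, idf = log(N/df) (only a guess at what part 2 will use), applied to a few document frequencies taken from the output above:

```python
import math

N = 3.0                                 # three documents here: H1, H2, H3
df = {'is': 3, 'A': 2, 'Because': 1}    # document frequencies from the output above

idf = dict((mot, math.log(N / n)) for mot, n in df.items())
print(idf['is'])   # log(3/3) = 0.0: a word present everywhere carries no signal
```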