Hadoop & co Discussion:

hadoop streaming - python
  1. #1
    I am taking a short Python/MapReduce break before getting back to Spark, which is as big a topic as Hadoop itself.

    I adapted a Python program for Hadoop 2.6 in streaming mode. If anyone is interested, having an example helps a lot.

    Here is the first part of my TF-IDF; the second part may come this weekend.

    Here is the content of my file lettre1.txt, which will be copied to HDFS. Each line is a document id followed by the words to count:

    H1,Hadoop is the Elephant King!
    H1,A yellow and elegant thing.
    H1,He never forgets
    H1,Useful data, or lets
    H1,An extraneous element cling!
    H2,A wonderful king is Hadoop.
    H2,The elephant plays well with Sqoop.
    H2,But what helps him to thrive
    H2,Are Impala, and Hive,
    H2,And HDFS in the group.
    H3,Hadoop is an elegant fellow.
    H3,An elephant gentle and mellow.
    H3,He never gets mad,
    H3,Or does anything bad,
    H3,Because, at his core, he is yellow.


    The mapper, WCMapper.py:
    Code:

    #!/usr/bin/env python
    # WCMapper.py (Python 2): for each input line "docID,text", emit one
    # record per word:
    #     word \t docID \t 1 \t <number of words in the text field>
    # The last field lets the reducer compute the term frequency (TF).
    # Note: split(",") keeps only the text up to the next comma, so any
    # words after a second comma on the line are dropped.

    import sys

    for enr in sys.stdin:
        enr = enr.translate(None, '\n')   # strip the newline (Python 2 str API)
        fields = enr.split(",")
        documentID = fields[0]
        mots = fields[1].split(" ")
        for unMot in mots:
            print >> sys.stdout, "%s\t%s\t1\t%i" % (unMot, documentID, len(mots))
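
    The script uses Python 2 idioms (the print statement, str.translate(None, ...)). For anyone on Python 3, here is a minimal, hypothetical port of the mapper; streaming runs it exactly like the original since Hadoop only cares about stdin/stdout:
    Code:

    #!/usr/bin/env python3
    # Hypothetical Python 3 port of WCMapper.py: same output format;
    # rstrip("\n") replaces the Python 2-only str.translate(None, ...).
    import sys

    for enr in sys.stdin:
        fields = enr.rstrip("\n").split(",")
        documentID = fields[0]
        mots = fields[1].split(" ")
        for unMot in mots:
            print("%s\t%s\t1\t%i" % (unMot, documentID, len(mots)))
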
    The reducer, WCReducer.py:
    Code:

    #!/usr/bin/env python
    # WCReducer.py (Python 2). Input arrives sorted by (word, docID) thanks
    # to stream.num.map.output.key.fields=2. Two kinds of records are emitted:
    #     word \t docID \t count \t TF    one per (word, document) pair
    #     word \t 0000  \t nbDocs         one per word: its document count,
    #                                     consumed by the IDF job later

    import sys

    motCourrant = ''        # word currently being aggregated
    motCourrantCpt = 0      # (word, document) boundaries seen for that word
    documentCourrant = ''   # document currently being aggregated
    docCourrantCpt = 0      # occurrences of the word in that document
    totalMotsInDoc = 0.0    # word count of the source line (mapper field 4)

    for enr in sys.stdin:

        enr = enr.translate(None, '\n')
        fields = enr.split("\t")

        if len(fields) < 2:
            continue

        # (word, document) boundary: flush the TF line of the previous pair.
        if motCourrant != fields[0] or documentCourrant != fields[1]:
            if docCourrantCpt > 0:
                print >> sys.stdout, "%s\t%s\t%i\t%f" % \
                    (motCourrant, documentCourrant, docCourrantCpt,
                     docCourrantCpt / totalMotsInDoc)
            documentCourrant = fields[1]
            docCourrantCpt = 0
            motCourrantCpt += 1

        # word boundary: flush the document count of the previous word.
        if motCourrant != fields[0]:
            if motCourrant != '':
                print >> sys.stdout, "%s\t0000\t%i" % (motCourrant, motCourrantCpt)
            motCourrant = fields[0]
            motCourrantCpt = 0

        docCourrantCpt += 1
        totalMotsInDoc = float(fields[3])

    # Flush the last pair and the last word. The stream ends without a final
    # word boundary, so the last word's document count is motCourrantCpt + 1
    # (the original printed the previous word's count here, an off-by-one).
    if motCourrant != '':
        print >> sys.stdout, "%s\t%s\t%i\t%f" % \
            (motCourrant, documentCourrant, docCourrantCpt,
             docCourrantCpt / totalMotsInDoc)
        print >> sys.stdout, "%s\t0000\t%i" % (motCourrant, motCourrantCpt + 1)
    The launch script, wc.sh:
    Code:

    #!/bin/bash
    # wc.sh: reload the input into HDFS, then run the TF job in streaming mode.

    clean() {
    echo -e "suppression repertoire sur hdfs  \n"
    hadoop fs -rm -r tempwc
    hadoop fs -rm -r output-streaming
    hadoop fs -rm -r Lettres
    }

    load () {
    echo -e "\n creation des repertoires et chargement des donnees\n"
    hadoop fs -mkdir Lettres
    hadoop fs -copyFromLocal lettre1.txt Lettres
    }

    clean
    load

    echo -e "\nRunning Wordcount python  hadoop 2.6 \n"

    hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
            -input Lettres -output tempwc \
            -mapper WCMapper.py -reducer WCReducer.py \
            -jobconf stream.num.map.output.key.fields=2 \
            -jobconf stream.num.reduce.output.key.fields=2 \
            -jobconf mapred.reduce.tasks=1 \
            -file WCMapper.py -file WCReducer.py

    echo -e "\nTF fini affichage du contenu du dossier de esultat \t\n"
    hadoop fs -cat tempwc/*
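
    Before going through YARN, the pair can also be checked locally: the shuffle is just a sort, so cat lettre1.txt | ./WCMapper.py | sort | ./WCReducer.py should print the same records. The small harness below does the same thing from Python; it assumes both scripts sit in the current directory and that a "python2" interpreter is on the PATH (hypothetical names, adjust to your setup):
    Code:

    #!/usr/bin/env python3
    # Hypothetical local harness: emulate map -> shuffle (sort) -> reduce
    # without Hadoop.
    import subprocess

    with open("lettre1.txt") as inp:
        mapped = subprocess.run(["python2", "WCMapper.py"], stdin=inp,
                                capture_output=True, text=True, check=True).stdout

    # Hadoop sorts map output by key before the reduce phase; sorting whole
    # lines is equivalent here because the key fields come first.
    shuffled = "".join(sorted(mapped.splitlines(keepends=True)))

    reduced = subprocess.run(["python2", "WCReducer.py"], input=shuffled,
                             capture_output=True, text=True, check=True).stdout
    print(reduced, end="")
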
    Output excerpt:
    Code:
    hduser@stargate:~/Resource-Bundle/jbepy$ ./wc.sh
    suppression repertoire sur hdfs
    
    15/09/02 22:40:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    rm: `tempwc': No such file or directory
    15/09/02 22:40:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    rm: `output-streaming': No such file or directory
    15/09/02 22:40:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    15/09/02 22:40:51 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
    Deleted Lettres
    
     creation des repertoires et chargement des donnees
    
    15/09/02 22:40:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    15/09/02 22:40:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    
    Running Wordcount python  hadoop 2.6
    
    15/09/02 22:40:56 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
    15/09/02 22:40:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    15/09/02 22:40:56 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
    15/09/02 22:40:56 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
    packageJobJar: [WCMapper.py, WCReducer.py, /tmp/hadoop-unjar5220630186872127756/] [] /tmp/streamjob529516712459132480.jar tmpDir=null
    15/09/02 22:40:57 INFO client.RMProxy: Connecting to ResourceManager at stargate/192.168.0.11:8032
    15/09/02 22:40:57 INFO client.RMProxy: Connecting to ResourceManager at stargate/192.168.0.11:8032
    15/09/02 22:40:58 INFO mapred.FileInputFormat: Total input paths to process : 1
    15/09/02 22:40:58 INFO mapreduce.JobSubmitter: number of splits:2
    15/09/02 22:40:58 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1439891988258_0041
    15/09/02 22:40:58 INFO impl.YarnClientImpl: Submitted application application_1439891988258_0041
    15/09/02 22:40:58 INFO mapreduce.Job: The url to track the job: http://stargate:8088/proxy/applicati...91988258_0041/
    15/09/02 22:40:58 INFO mapreduce.Job: Running job: job_1439891988258_0041
    15/09/02 22:41:03 INFO mapreduce.Job: Job job_1439891988258_0041 running in uber mode : false
    15/09/02 22:41:03 INFO mapreduce.Job:  map 0% reduce 0%
    15/09/02 22:41:09 INFO mapreduce.Job:  map 100% reduce 0%
    15/09/02 22:41:22 INFO mapreduce.Job:  map 100% reduce 100%
    15/09/02 22:41:22 INFO mapreduce.Job: Job job_1439891988258_0041 completed successfully
    15/09/02 22:41:22 INFO mapreduce.Job: Counters: 49
            File System Counters
                    FILE: Number of bytes read=913
                    FILE: Number of bytes written=337096
                    FILE: Number of read operations=0
                    FILE: Number of large read operations=0
                    FILE: Number of write operations=0
                    HDFS: Number of bytes read=882
                    HDFS: Number of bytes written=1871
                    HDFS: Number of read operations=9
                    HDFS: Number of large read operations=0
                    HDFS: Number of write operations=2
            Job Counters
                    Launched map tasks=2
                    Launched reduce tasks=1
                    Data-local map tasks=2
                    Total time spent by all maps in occupied slots (ms)=6714
                    Total time spent by all reduces in occupied slots (ms)=21450
                    Total time spent by all map tasks (ms)=6714
                    Total time spent by all reduce tasks (ms)=10725
                    Total vcore-seconds taken by all map tasks=6714
                    Total vcore-seconds taken by all reduce tasks=10725
                    Total megabyte-seconds taken by all map tasks=10312704
                    Total megabyte-seconds taken by all reduce tasks=32947200
            Map-Reduce Framework
                    Map input records=15
                    Map output records=62
                    Map output bytes=783
                    Map output materialized bytes=919
                    Input split bytes=216
                    Combine input records=0
                    Combine output records=0
                    Reduce input groups=62
                    Reduce shuffle bytes=919
                    Reduce input records=62
                    Reduce output records=113
                    Spilled Records=124
                    Shuffled Maps =2
                    Failed Shuffles=0
                    Merged Map outputs=2
                    GC time elapsed (ms)=143
                    CPU time spent (ms)=4900
                    Physical memory (bytes) snapshot=1906573312
                    Virtual memory (bytes) snapshot=7287844864
                    Total committed heap usage (bytes)=2355101696
            Shuffle Errors
                    BAD_ID=0
                    CONNECTION=0
                    IO_ERROR=0
                    WRONG_LENGTH=0
                    WRONG_MAP=0
                    WRONG_REDUCE=0
            File Input Format Counters
                    Bytes Read=666
            File Output Format Counters
                    Bytes Written=1871
    15/09/02 22:41:22 INFO streaming.StreamJob: Output directory: tempwc
    
    TF fini affichage du contenu du dossier de esultat
    
    15/09/02 22:41:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    A       H1      1       0.200000
    A       H2      1       0.200000
    A       0000    2
    An      H1      1       0.250000
    An      H3      1       0.200000
    An      0000    2
    And     H2      1       0.200000
    And     0000    1
    Are     H2      1       0.500000
    Are     0000    1
    Because H3      1       1.000000
    Because 0000    1
    But     H2      1       0.166667
    But     0000    1
    Elephant        H1      1       0.200000
    Elephant        0000    1
    HDFS    H2      1       0.200000
    HDFS    0000    1
    Hadoop  H1      1       0.200000
    Hadoop  H3      1       0.200000
    Hadoop  0000    2
    Hadoop. H2      1       0.200000
    Hadoop. 0000    1
    He      H1      1       0.333333
    He      H3      1       0.250000
    He      0000    2
    Impala  H2      1       0.500000
    Impala  0000    1
    King!   H1      1       0.200000
    King!   0000    1
    Or      H3      1       0.250000
    Or      0000    1
    Sqoop.  H2      1       0.166667
    Sqoop.  0000    1
    The     H2      1       0.166667
    The     0000    1
    Useful  H1      1       0.500000
    Useful  0000    1
    an      H3      1       0.200000
    an      0000    1
    and     H1      1       0.200000
    and     H3      1       0.200000
    and     0000    2
    anything        H3      1       0.250000
    anything        0000    1
    bad     H3      1       0.250000
    bad     0000    1
    cling!  H1      1       0.250000
    cling!  0000    1
    data    H1      1       0.500000
    data    0000    1
    does    H3      1       0.250000
    does    0000    1
    elegant H1      1       0.200000
    elegant H3      1       0.200000
    elegant 0000    2
    element H1      1       0.250000
    element 0000    1
    elephant        H2      1       0.166667
    elephant        H3      1       0.200000
    elephant        0000    2
    extraneous      H1      1       0.250000
    extraneous      0000    1
    fellow. H3      1       0.200000
    fellow. 0000    1
    forgets H1      1       0.333333
    forgets 0000    1
    gentle  H3      1       0.200000
    gentle  0000    1
    gets    H3      1       0.250000
    gets    0000    1
    group.  H2      1       0.200000
    group.  0000    1
    helps   H2      1       0.166667
    helps   0000    1
    him     H2      1       0.166667
    him     0000    1
    in      H2      1       0.200000
    in      0000    1
    is      H1      1       0.200000
    is      H2      1       0.200000
    is      H3      1       0.200000
    is      0000    3
    king    H2      1       0.200000
    king    0000    1
    mad     H3      1       0.250000
    mad     0000    1
    mellow. H3      1       0.200000
    mellow. 0000    1
    never   H1      1       0.333333
    never   H3      1       0.250000
    never   0000    2
    plays   H2      1       0.166667
    plays   0000    1
    the     H1      1       0.200000
    the     H2      1       0.200000
    the     0000    2
    thing.  H1      1       0.200000
    thing.  0000    1
    thrive  H2      1       0.166667
    thrive  0000    1
    to      H2      1       0.166667
    to      0000    1
    well    H2      1       0.166667
    well    0000    1
    what    H2      1       0.166667
    what    0000    1
    with    H2      1       0.166667
    with    0000    1
    wonderful       H2      1       0.200000
    wonderful       0000    1
    yellow  H1      1       0.200000
    yellow  0000    1

  2. #2
    Here is the second part of the TF-IDF. The first part was the TF (term frequency), counting word frequencies: the famous wordcount, run by wc.sh.

    We will now compute the weight of each word in each document, giving more importance to the less frequent words: the IDF (inverse document frequency).

    The mapper, IdMapper.py, is a simple pass-through; keep in mind the layout of the output records from the previous post. Its real job is to push those records back through the shuffle, which re-sorts them by (word, docID) so that each word's "0000" document-count line reaches the reducer before its per-document TF lines.
    Code:

    #!/usr/bin/env python
    # IdMapper.py (Python 2): identity mapper, passes each record through
    # unchanged so the shuffle can re-sort by (word, docID).

    import sys

    for enr in sys.stdin:
        print(enr.translate(None, '\n'))
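
    A quick check of why the "0000" record reaches the reducer first for each word: it sorts before any document id, since "0" precedes "H":
    Code:

    print(sorted(["Hadoop\tH1", "Hadoop\tH3", "Hadoop\t0000"]))
    # ['Hadoop\t0000', 'Hadoop\tH1', 'Hadoop\tH3']
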
    The reducer, IDFReducer.py:
    Code:

    #!/usr/bin/env python
    # IDFReducer.py (Python 2). For each word, the "0000" record (its
    # document count) arrives first and sets the idf used for the TF
    # records that follow. totalDocuments is hard-coded: the corpus has
    # 3 documents.

    import math
    import sys

    totalDocuments = 3
    motCourantDansDocuments = 0
    idf = 0

    for enr in sys.stdin:

        fields = enr.translate(None, '\n').split("\t")

        if len(fields) < 2:
            continue

        # Document-count record: compute the idf for this word. Note that
        # totalDocuments / motCourantDansDocuments is an integer division
        # in Python 2, so df=2 and df=3 both yield log1p(1) = ln 2.
        if fields[1] == "0000":
            motCourantDansDocuments = int(fields[2])
            idf = math.log1p(totalDocuments / motCourantDansDocuments)
            continue

        # TF record: append tf * idf.
        if motCourantDansDocuments > 0:
            print >> sys.stdout, "%s\t%s\t%i\t%f\t%f" % \
                (fields[0], fields[1], int(fields[2]), float(fields[3]),
                 float(fields[3]) * idf)
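
    A sanity check of the idf values seen in the output below, for the three possible document frequencies (the // mimics Python 2's integer division so the snippet behaves the same under Python 3):
    Code:

    import math

    totalDocuments = 3
    for df in (1, 2, 3):
        print(df, "%f" % math.log1p(totalDocuments // df))
    # df=1 -> 1.386294 (e.g. "Because"), df=2 -> 0.693147 (e.g. "A"),
    # df=3 -> 0.693147 (e.g. "is"). With true division, df=2 would give
    # log1p(1.5) = 0.916291 instead.
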


    The script tfidf.sh. Note: wc.sh must be run first to produce the input data.

    Code:

    #!/bin/bash
    # tfidf.sh: compute TF-IDF from the TF output left in tempwc by wc.sh.

    echo -e "\nTF (wordcount) affichage resultat - apres avoir executer wc.sh \n"
    hadoop fs -cat tempwc/*

    hadoop fs -rm -r tempidf

    echo -e "\n TF-IDF \n"

    hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
            -input tempwc -output tempidf \
            -mapper IdMapper.py -reducer IDFReducer.py \
            -jobconf stream.num.map.output.key.fields=2 \
            -jobconf stream.num.reduce.output.key.fields=2 \
            -jobconf mapred.reduce.tasks=1 \
            -file IdMapper.py  -file IDFReducer.py

    echo -e "\n affichage resultat TF-IDF. \n"
    hadoop fs -cat tempidf/*
    Log excerpt:
    Code:
    
     affichage resultat TF-IDF.
    
    15/09/03 19:16:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    A       H1      1       0.200000        0.138629
    A       H2      1       0.200000        0.138629
    An      H1      1       0.250000        0.173287
    An      H3      1       0.200000        0.138629
    And     H2      1       0.200000        0.277259
    Are     H2      1       0.500000        0.693147
    Because H3      1       1.000000        1.386294
    But     H2      1       0.166667        0.231050
    Elephant        H1      1       0.200000        0.277259
    HDFS    H2      1       0.200000        0.277259
    Hadoop  H1      1       0.200000        0.138629
    Hadoop  H3      1       0.200000        0.138629
    Hadoop. H2      1       0.200000        0.277259
    He      H1      1       0.333333        0.231049
    He      H3      1       0.250000        0.173287
    Impala  H2      1       0.500000        0.693147
    King!   H1      1       0.200000        0.277259
    Or      H3      1       0.250000        0.346574
    Sqoop.  H2      1       0.166667        0.231050
    The     H2      1       0.166667        0.231050
    Useful  H1      1       0.500000        0.693147
    an      H3      1       0.200000        0.277259
    and     H1      1       0.200000        0.138629
    and     H3      1       0.200000        0.138629
    anything        H3      1       0.250000        0.346574
    bad     H3      1       0.250000        0.346574
    cling!  H1      1       0.250000        0.346574
    data    H1      1       0.500000        0.693147
    does    H3      1       0.250000        0.346574
    elegant H1      1       0.200000        0.138629
    elegant H3      1       0.200000        0.138629
    element H1      1       0.250000        0.346574
    elephant        H2      1       0.166667        0.115525
    elephant        H3      1       0.200000        0.138629
    extraneous      H1      1       0.250000        0.346574
    fellow. H3      1       0.200000        0.277259
    forgets H1      1       0.333333        0.462098
    gentle  H3      1       0.200000        0.277259
    gets    H3      1       0.250000        0.346574
    group.  H2      1       0.200000        0.277259
    helps   H2      1       0.166667        0.231050
    him     H2      1       0.166667        0.231050
    in      H2      1       0.200000        0.277259
    is      H1      1       0.200000        0.138629
    is      H2      1       0.200000        0.138629
    is      H3      1       0.200000        0.138629
    king    H2      1       0.200000        0.277259
    mad     H3      1       0.250000        0.346574
    mellow. H3      1       0.200000        0.277259
    never   H1      1       0.333333        0.231049
    never   H3      1       0.250000        0.173287
    plays   H2      1       0.166667        0.231050
    the     H1      1       0.200000        0.138629
    the     H2      1       0.200000        0.138629
    thing.  H1      1       0.200000        0.277259
    thrive  H2      1       0.166667        0.231050
    to      H2      1       0.166667        0.231050
    well    H2      1       0.166667        0.231050
    what    H2      1       0.166667        0.231050
    with    H2      1       0.166667        0.231050
    wonderful       H2      1       0.200000        0.277259
    yellow  H1      1       0.200000        0.277259

  3. #3
    One last one for the road, very simple, to wrap up Hadoop streaming in Python:

    a per-city total for the "accessoires" and "lavage" services.

    venteservices.csv
    Code:
    33000,Bordeaux,Reparation,1300
    75000,Paris,accessoires,170
    13000,Marseille,accessoires,400
    13000,Marseille,lavage,75
    33000,Bordeaux,accessoires,700
    13000,Marseille,Reparation,1300
    33000,Bordeaux,Reparation,540
    13000,Marseille,accessoires,1200
    75000,Paris,lavage,75
    13000,Marseille,Reparation,600
    33000,Bordeaux,accessoires,700
    13000,Marseille,Reparation,500
    The mapper, FiltreMapper.py:
    Code:

    #!/usr/bin/env python
    # FiltreMapper.py (Python 2): keep only the "accessoires" and "lavage"
    # services and emit  ville \t cp \t montant.

    import sys

    for line in sys.stdin:
        fields = line.translate(None, '\n').split(",")
        cp = fields[0]
        ville = fields[1]
        service = fields[2]
        montant = float(fields[3])
        if service in ('accessoires', 'lavage'):
            print >> sys.stdout, "%s\t%s\t%6.2f" % (ville, cp, montant)
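
    For real CSV data (quoted fields, embedded commas, like the one that bit the TF mapper earlier), the csv module is safer than split(","). A hypothetical Python 3 variant under that assumption:
    Code:

    #!/usr/bin/env python3
    # Hypothetical Python 3 variant of FiltreMapper.py using the csv module,
    # which also copes with quoted fields and embedded commas.
    import csv
    import sys

    for cp, ville, service, montant in csv.reader(sys.stdin):
        if service in ("accessoires", "lavage"):
            print("%s\t%s\t%6.2f" % (ville, cp, float(montant)))
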
    The reducer, StatReducer.py:
    Code:

    #!/usr/bin/env python
    # StatReducer.py (Python 2). Input arrives sorted by (ville, cp); sum
    # the amounts per city and flush the total on each city boundary.

    import sys

    ville = ""
    total = 0.0

    for line in sys.stdin:
        fields = line.translate(None, '\n').split("\t")
        if len(fields) > 1:
            if fields[0] != ville:
                if ville != "":
                    print >> sys.stdout, "%s\t%6.2f" % (ville, total)
                ville = fields[0]
                total = 0.0
            total += float(fields[2])

    # Flush the last city.
    if ville != "":
        print >> sys.stdout, "%s\t%6.2f" % (ville, total)
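
    The three totals can be cross-checked without Hadoop by summing the CSV directly; a small sketch (it assumes venteservices.csv is in the current directory):
    Code:

    #!/usr/bin/env python3
    # Cross-check of the per-city totals (accessoires + lavage only).
    import csv
    from collections import defaultdict

    totals = defaultdict(float)
    with open("venteservices.csv") as f:
        for cp, ville, service, montant in csv.reader(f):
            if service in ("accessoires", "lavage"):
                totals[ville] += float(montant)

    for ville in sorted(totals):
        print("%s\t%6.2f" % (ville, totals[ville]))
    # Bordeaux 1400.00, Marseille 1675.00, Paris 245.00: matches the job output.
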
    The shell script, rapportvente.sh:
    Code:

    #!/bin/bash
    # rapportvente.sh: load venteservices.csv into HDFS and run the job.

    hadoop fs -rm -r tempstat
    hadoop fs -rm -r rapports

    echo -e "\n initialisation \n"
    hadoop fs -mkdir rapports
    hadoop fs -copyFromLocal venteservices.csv rapports

    echo -e "\n start Ventes Map Reduce\n"

    hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
            -input rapports -output tempstat \
            -mapper FiltreMapper.py -reducer StatReducer.py \
            -jobconf stream.num.map.output.key.fields=2 \
            -jobconf stream.num.reduce.output.key.fields=2 \
            -jobconf mapred.reduce.tasks=1 \
            -file FiltreMapper.py -file StatReducer.py

    echo -e "\nresultats \n"
    hadoop fs -cat tempstat/*
    Result log:
    Code:
     ./rapportvente.sh
    15/09/04 20:34:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    15/09/04 20:34:50 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
    Deleted tempstat
    15/09/04 20:34:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    15/09/04 20:34:51 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
    Deleted rapports
    
    Creating input directory and copying input data
    
    15/09/04 20:34:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    15/09/04 20:34:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    
     start Ventes Map Reduce
    
    15/09/04 20:34:56 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
    15/09/04 20:34:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    15/09/04 20:34:56 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
    15/09/04 20:34:56 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
    packageJobJar: [FiltreMapper.py, StatReducer.py, /tmp/hadoop-unjar2124791153853851193/] [] /tmp/streamjob6319431278732850672.jar tmpDir=null
    15/09/04 20:34:57 INFO client.RMProxy: Connecting to ResourceManager at stargate/192.168.0.11:8032
    15/09/04 20:34:57 INFO client.RMProxy: Connecting to ResourceManager at stargate/192.168.0.11:8032
    15/09/04 20:34:58 INFO mapred.FileInputFormat: Total input paths to process : 1
    15/09/04 20:34:58 INFO mapreduce.JobSubmitter: number of splits:2
    15/09/04 20:34:58 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1439891988258_0091
    15/09/04 20:34:58 INFO impl.YarnClientImpl: Submitted application application_1439891988258_0091
    15/09/04 20:34:58 INFO mapreduce.Job: The url to track the job: http://stargate:8088/proxy/application_1439891988258_0091/
    15/09/04 20:34:58 INFO mapreduce.Job: Running job: job_1439891988258_0091
    15/09/04 20:35:03 INFO mapreduce.Job: Job job_1439891988258_0091 running in uber mode : false
    15/09/04 20:35:03 INFO mapreduce.Job:  map 0% reduce 0%
    15/09/04 20:35:08 INFO mapreduce.Job:  map 100% reduce 0%
    15/09/04 20:35:14 INFO mapreduce.Job:  map 100% reduce 100%
    15/09/04 20:35:14 INFO mapreduce.Job: Job job_1439891988258_0091 completed successfully
    15/09/04 20:35:15 INFO mapreduce.Job: Counters: 49
            File System Counters
                    FILE: Number of bytes read=172
                    FILE: Number of bytes written=335713
                    FILE: Number of read operations=0
                    FILE: Number of large read operations=0
                    FILE: Number of write operations=0
                    HDFS: Number of bytes read=767
                    HDFS: Number of bytes written=51
                    HDFS: Number of read operations=9
                    HDFS: Number of large read operations=0
                    HDFS: Number of write operations=2
            Job Counters
                    Launched map tasks=2
                    Launched reduce tasks=1
                    Data-local map tasks=2
                    Total time spent by all maps in occupied slots (ms)=6390
                    Total time spent by all reduces in occupied slots (ms)=5556
                    Total time spent by all map tasks (ms)=6390
                    Total time spent by all reduce tasks (ms)=2778
                    Total vcore-seconds taken by all map tasks=6390
                    Total vcore-seconds taken by all reduce tasks=2778
                    Total megabyte-seconds taken by all map tasks=9815040
                    Total megabyte-seconds taken by all reduce tasks=8534016
            Map-Reduce Framework
                    Map input records=12
                    Map output records=7
                    Map output bytes=152
                    Map output materialized bytes=178
                    Input split bytes=230
                    Combine input records=0
                    Combine output records=0
                    Reduce input groups=3
                    Reduce shuffle bytes=178
                    Reduce input records=7
                    Reduce output records=3
                    Spilled Records=14
                    Shuffled Maps =2
                    Failed Shuffles=0
                    Merged Map outputs=2
                    GC time elapsed (ms)=97
                    CPU time spent (ms)=2680
                    Physical memory (bytes) snapshot=1932500992
                    Virtual memory (bytes) snapshot=7280640000
                    Total committed heap usage (bytes)=2460483584
            Shuffle Errors
                    BAD_ID=0
                    CONNECTION=0
                    IO_ERROR=0
                    WRONG_LENGTH=0
                    WRONG_MAP=0
                    WRONG_REDUCE=0
            File Input Format Counters
                    Bytes Read=537
            File Output Format Counters
                    Bytes Written=51
    15/09/04 20:35:15 INFO streaming.StreamJob: Output directory: tempstat
    
    resultats
    
    15/09/04 20:35:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Bordeaux        1400.00
    Marseille       1675.00
    Paris   245.00

This discussion is marked as resolved.