Hadoop & co - Discussion: Hadoop ecosystem - Hive - HBase - Pig - MapReduce
  1. #1
    Hadoop ecosystem - Hive - HBase - Pig - MapReduce
    Hello,

    here I am again.

    I'm currently looking at Hive 0.14. So far I haven't had much trouble with the DML: since 0.14, tables stored in the ORC format can be made transactional,
    which allows UPDATE, DELETE and INSERT ... VALUES; each of these statements triggers a MapReduce job to apply the changes on the nodes.
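
    A minimal sketch of the kind of ACID-enabled ORC table this relies on (the table name agence_acid is just an illustration, and it assumes the transaction manager has been configured, e.g. hive.support.concurrency=true and hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager):

    -- ACID tables must be bucketed, stored as ORC and flagged transactional
    CREATE TABLE agence_acid (
    nom STRING,
    ville STRING,
    telephone STRING,
    code_departement STRING
    )
    CLUSTERED BY (code_departement) INTO 3 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    -- each of these statements launches its own MapReduce job
    INSERT INTO TABLE agence_acid VALUES ('TOTAL', 'BORDEAUX', '05.01.01.01.02', '33');
    UPDATE agence_acid SET telephone = '05.00.00.00.00' WHERE nom = 'TOTAL';
    DELETE FROM agence_acid WHERE nom = 'TOTAL';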

    Before that, you could only load data into a table and run queries on it; there were a few workarounds but it was fairly limited.
    I have also used Hive to map and create HBase tables from temporary tables fed by CSV files.

    But I'm wondering how the data actually gets split up at the partition and/or bucket level.

    Large Hive tables can be split into partitions, OK: Hive organizes the data into a hierarchy of directories.

    create table agence (
    nom String,
    ville String,
    telephone String,
    code_departement String
    ) PARTITIONED BY (region String);

    In this case, one directory will be created per region for the agences.

    create table agence (
    nom String,
    ville String,
    telephone String,
    region String,
    code_departement String
    )
    clustered by (code_departement) into 3 buckets;

    We can have several departements in a single bucket file, but all the agences of a given code_departement will end up in one and the same bucket.

    But I don't really see how the two mechanisms interact. Any info?

    If there are too many partitions, do buckets make it possible to split the data up faster by running a larger number of MapReduce tasks
    in parallel?

    Edit: after a few tests.

    create table agence (
    nom String,
    ville String,
    telephone String,
    code_departement String
    )
    PARTITIONED BY (region String)
    CLUSTERED BY (code_departement) INTO 3 BUCKETS
    STORED AS ORC tblproperties ("orc.compress"="NONE");


    INSERT INTO TABLE agence PARTITION (region='Aquitaine') VALUES ('TOTAL', 'BORDEAUX', '05.01.01.01.02','33'), ('ESSO','TOULOUSE','09.09.09.09', '31') ;
    INSERT INTO TABLE agence PARTITION (region='IDF') VALUES ('ELF','PARIS','09.88.83.33.45','75');
    INSERT INTO TABLE agence PARTITION (region='PACA') VALUES ('BP','PARIS','06.32.54.31.53','13');


    I thought it would create three partitions for Aquitaine / IDF / PACA, and then, inside each of these partitions, manage the 3 buckets by departement,
    but curiously it seems to put everything into the same partition; something about the mechanism escapes me.

    Any idea? Well, I'll dig into it with a few more tests while rereading the doc; I must have missed something.

    Edit: I was simply not looking in the right place; it does exactly that and respects the hierarchy. Sometimes you just don't see what's right in front of you.

    /user/hive/warehouse/jbedb.db/agence

    My 3 partitions, each with 3 buckets that do contain my data. It seemed odd that the partition directory name contains the whole expression I put in PARTITION( ) even though the syntax is correct;
    suspicious at first glance, but that is simply Hive's standard key=value naming for partition directories:

    region=Aquitaine
    region=IDF
    region=PACA
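
    To double-check how the rows were spread over the buckets, a sketch like this should work (SHOW PARTITIONS lists the partition directories, and TABLESAMPLE reads back a single bucket; bucket numbers are 1-based; note that on a non-transactional table you may also need SET hive.enforce.bucketing=true before inserting so the bucket files actually get created):

    SHOW PARTITIONS agence;

    SELECT * FROM agence
    TABLESAMPLE(BUCKET 1 OUT OF 3 ON code_departement)
    WHERE region = 'Aquitaine';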

    JP

  2. #2
    I ran a few tests with Hive maps and arrays; I fumbled around a bit,
    but it's interesting.

    Here is the sample from my test file map.csv: '-' separates collection items, ':' separates key:value pairs, ',' separates the fields.

    Garfield-Odie,001:pizzat-000:Lasagne
    Mermal,002:leplusmignon
    Liz,003:veto
    Squeak,004:Souris


    The separators matter for loading the file's data into the right fields; I put an array as the first field and a map as the second field.

    CREATE TABLE test_map( monarray array<string>, mymap map<int,string> )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '-' MAP KEYS TERMINATED BY ':' LINES TERMINATED BY '\n';

    LOAD DATA LOCAL inpath '/tmp/map.csv' INTO TABLE test_map;

    0: jdbc:hive2://stargate:10000/jbedb> select * from test_map;
    +----------------------+---------------------------+--+
    | test_map.monarray | test_map.mymap |
    +----------------------+---------------------------+--+
    | ["Garfield","Odie"] | {1:"pizzat",0:"Lasagne"} |
    | ["Mermal"] | {2:"leplusmignon"} |
    | ["Liz"] | {3:"veto"} |
    | ["Squeak"] | {4:"Souris"} |
    +----------------------+---------------------------+--+
    4 rows selected (0,4 seconds)


    describe test_map;
    +-----------+------------------+----------+--+
    | col_name | data_type | comment |
    +-----------+------------------+----------+--+
    | monarray | array<string> | |
    | mymap | map<int,string> | |
    +-----------+------------------+----------+--+
    2 rows selected (0,689 seconds)


    I'm trying to find out whether the same thing can be done with structs; that would be great. To be continued.

  3. #3
    After some digging, I finally got structs working.

    So I initialized a CSV file as follows, respecting the separator mechanism:

    Paul Dufilo,47000,Mr E,Ass.Maladie:1400-Ass.Vieillesse:1900,lesueur-Tours-France-:41000
    Jacques Dupond,31000,,Ass.Maladie:1100-Ass.Vieillesse:1300,deshommes-Paris-France-75000
    Marcel Martin,19000,,Ass.Maladie:600-Ass.Vieillesse:800,delasomme-Bordeaux-France-33000

    Creation of a table mixing simple fields, an array, a map and a struct:

    CREATE TABLE salaries (
    nom STRING, salaire FLOAT,
    subordonnes ARRAY<STRING>,
    cotisations MAP<STRING, FLOAT>,
    adresse STRUCT<rue:STRING, ville:STRING, pays:STRING, cp:INT>
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY '-'
    MAP KEYS TERMINATED BY ':' LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

    load data local inpath '/tmp/employees.csv' OVERWRITE into table salaries;


    0: jdbc:hive2://stargate:10000/jbedb> select * from salaries;
    +-----------------+-------------------+-----------------------+-------------------------------------------------+--------------------------------------------------------------------+--+
    | salaries.nom | salaries.salaire | salaries.subordonnes | salaries.cotisations | salaries.adresse |
    +-----------------+-------------------+-----------------------+-------------------------------------------------+--------------------------------------------------------------------+--+
    | Paul Dufilo | 47000.0 | ["Mr E"] | {"Ass.Maladie":1400.0,"Ass.Vieillesse":1900.0} | {"rue":"lesueur","ville":"Tours","pays":"France","cp":null} |
    | Jacques Dupond | 31000.0 | [] | {"Ass.Maladie":1100.0,"Ass.Vieillesse":1300.0} | {"rue":"deshommes","ville":"Paris","pays":"France","cp":75000} |
    | Marcel Martin | 19000.0 | [] | {"Ass.Maladie":600.0,"Ass.Vieillesse":800.0} | {"rue":"delasomme","ville":"Bordeaux","pays":"France","cp":33000} |
    +-----------------+-------------------+-----------------------+-------------------------------------------------+--------------------------------------------------------------------+--+


    Computation on elements of the map:

    0: jdbc:hive2://stargate:10000/jbedb>
    0: jdbc:hive2://stargate:10000/jbedb> select nom,round( salaire-(cotisations['Ass.Maladie']+cotisations['Ass.Vieillesse'])) from salaries;
    INFO : Number of reduce tasks is set to 0 since there's no reduce operator
    WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
    INFO : number of splits:1
    INFO : Submitting tokens for job: job_1433351652657_0011
    INFO : The url to track the job: http://stargate:8088/proxy/applicati...51652657_0011/
    INFO : Starting Job = job_1433351652657_0011, Tracking URL = http://stargate:8088/proxy/applicati...51652657_0011/
    INFO : Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1433351652657_0011
    INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    INFO : 2015-06-03 22:58:42,790 Stage-1 map = 0%, reduce = 0%
    INFO : 2015-06-03 22:58:47,953 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.7 sec
    INFO : MapReduce Total cumulative CPU time: 1 seconds 700 msec
    INFO : Ended Job = job_1433351652657_0011
    +-----------------+----------+--+
    | nom | _c1 |
    +-----------------+----------+--+
    | Paul Dufilo | 43700.0 |
    | Jacques Dupond | 28600.0 |
    | Marcel Martin | 17600.0 |
    +-----------------+----------+--+

    A quick CASE test on the salary value along the way:

    0: jdbc:hive2://stargate:10000/jbedb> select nom,round( (salaire-(cotisations['Ass.Maladie']+ cotisations['Ass.Vieillesse'])) ) ,
    . . . . . . . . . . . . . . . . . . > case
    . . . . . . . . . . . . . . . . . . > when salaire < 20000.0 THEN 'petit'
    . . . . . . . . . . . . . . . . . . > when salaire >= 20000.0 AND salaire<40000.0 THEN 'moyen'
    . . . . . . . . . . . . . . . . . . > when salaire >= 40000.0 THEN 'mieux'
    . . . . . . . . . . . . . . . . . . > else 'secret'
    . . . . . . . . . . . . . . . . . . > END AS categorie FROM salaries;
    INFO : Number of reduce tasks is set to 0 since there's no reduce operator
    WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
    INFO : number of splits:1
    INFO : Submitting tokens for job: job_1433351652657_0014
    INFO : The url to track the job: http://stargate:8088/proxy/applicati...51652657_0014/
    INFO : Starting Job = job_1433351652657_0014, Tracking URL = http://stargate:8088/proxy/applicati...51652657_0014/
    INFO : Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1433351652657_0014
    INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    INFO : 2015-06-03 23:20:30,916 Stage-1 map = 0%, reduce = 0%
    INFO : 2015-06-03 23:20:37,087 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.05 sec
    INFO : MapReduce Total cumulative CPU time: 2 seconds 50 msec
    INFO : Ended Job = job_1433351652657_0014
    +-----------------+----------+------------+--+
    | nom | _c1 | categorie |
    +-----------------+----------+------------+--+
    | Paul Dufilo | 43700.0 | mieux |
    | Jacques Dupond | 28600.0 | moyen |
    | Marcel Martin | 17600.0 | petit |
    +-----------------+----------+------------+--+
    3 rows selected (12,482 seconds)


    Querying a field of the struct:

    0: jdbc:hive2://stargate:10000/jbedb> select nom, adresse.ville from salaries;
    INFO : Number of reduce tasks is set to 0 since there's no reduce operator
    WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
    INFO : number of splits:1
    INFO : Submitting tokens for job: job_1433351652657_0019
    INFO : The url to track the job: http://stargate:8088/proxy/applicati...51652657_0019/
    INFO : Starting Job = job_1433351652657_0019, Tracking URL = http://stargate:8088/proxy/applicati...51652657_0019/
    INFO : Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1433351652657_0019
    INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    INFO : 2015-06-03 23:47:41,156 Stage-1 map = 0%, reduce = 0%
    INFO : 2015-06-03 23:47:54,817 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.97 sec
    +-----------------+-----------+--+
    | nom | ville |
    +-----------------+-----------+--+
    | Paul Dufilo | Tours |
    | Jacques Dupond | Paris |
    | Marcel Martin | Bordeaux |
    +-----------------+-----------+--+
    3 rows selected (30,401 seconds)
    INFO : MapReduce Total cumulative CPU time: 3 seconds 970 msec
    INFO : Ended Job = job_1433351652657_0019


    I haven't yet found how to sum the cotisations map values for one employee. To be continued.
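
    A sketch that should do it, by exploding the map into (key, value) rows and then aggregating (the aliases type_cotisation and montant are made up for the example):

    SELECT nom, SUM(montant) AS total_cotisations
    FROM salaries
    LATERAL VIEW explode(cotisations) c AS type_cotisation, montant
    GROUP BY nom;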

    I also need to look at whether a result can be computed into a temporary variable, if that is even possible
    in Hive.
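
    If that just means reusing a scalar value across statements, Hive's variable substitution may be enough; note that it only substitutes a literal, not the result of a query (the variable name plafond is made up):

    SET hivevar:plafond=40000;
    SELECT nom FROM salaries WHERE salaire > ${hivevar:plafond};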

    Note that there is no validation at all of what gets loaded into the table: you can put anything in, or end up with fields
    shifted relative to one another, e.g. the salary landing in the pays field.
    Not great.

  4. #4
    I looked at an interesting capability of Hive: UDFs (user-defined functions). You can add your own functions to it very easily.

    I used the example from the wiki, but it wasn't very explicit; I found an old script that was no longer suitable and brought it up to date
    for my Hadoop 2.6.0 setup.

    https://cwiki.apache.org/confluence/...ve/HivePlugins

    The UDF class we want to add:

    hduser@stargate:~/hive/function$ cat /home/hduser/udf/com/example/hive/udf/Lower.java

    package com.example.hive.udf;

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    public final class Lower extends UDF {
        public Text evaluate(final Text s) {
            if (s == null) { return null; }
            return new Text(s.toString().toLowerCase());
        }
    }
    The script I hacked together for Hadoop 2.6.0: it builds the classpath from the hadoop/hive jars, compiles the UDF class, builds the jar, and finally prints the usage for a test
    under the beeline interpreter.


    compile.sh

    #!/bin/bash
    set -x
    if [ "$1" == "" ]; then
        echo "Usage: $0 <java file>"
        exit 1
    fi

    CNAME=${1%.java}
    JARNAME=$CNAME.jar
    JARDIR=/tmp/hive_jars/$CNAME
    CLASSPATH=$(ls $HIVE_HOME/lib/hive-serde-*.jar):$(ls $HIVE_HOME/lib/hive-exec-*.jar):$(ls $HADOOP_HOME/share/hadoop/common/hadoop-common-?.?.?.jar)

    function tell {
        echo
        echo "$1 successfully compiled. In Hive run:"
        echo "$> add jar $JARNAME;"
        echo "$> create temporary function $CNAME as 'com.example.hive.udf.$CNAME';"
        echo
    }

    mkdir -p $JARDIR
    javac -classpath $CLASSPATH -d $JARDIR/ $1 && jar -cf $JARNAME -C $JARDIR/ . && tell $1
    It gives the result below.


    hduser@stargate:~/hive/function$ ./compile.sh ~/udf/com/example/hive/udf/Lower.java

    /home/hduser/udf/com/example/hive/udf/Lower.java successfully compiled. In Hive run:
    $> add jar /home/hduser/udf/com/example/hive/udf/Lower.jar;
    $> create temporary function /home/hduser/udf/com/example/hive/udf/Lower as 'com.example.hive.udf./home/hduser/udf/com/example/hive/udf/Lower';


    Under the beeline interpreter:

    Adding the resource:
    0: jdbc:hive2://stargate:10000/jbedb> add jar /home/hduser/udf/com/example/hive/udf/Lower.jar;
    INFO : Added [/home/hduser/udf/com/example/hive/udf/Lower.jar] to class path
    INFO : Added resources: [/home/hduser/udf/com/example/hive/udf/Lower.jar]
    No rows affected (0,017 seconds)

    Creating the function for the session:

    0: jdbc:hive2://stargate:10000/jbedb> create temporary function my_lower as 'com.example.hive.udf.Lower';
    No rows affected (0,013 seconds)

    Running the UDF in a query:

    0: jdbc:hive2://stargate:10000/jbedb> select my_lower(nom) from salaries;
    INFO : Number of reduce tasks is set to 0 since there's no reduce operator
    WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
    INFO : number of splits:1
    INFO : Submitting tokens for job: job_1433437271446_0007
    INFO : The url to track the job: http://stargate:8088/proxy/applicati...37271446_0007/
    INFO : Starting Job = job_1433437271446_0007, Tracking URL = http://stargate:8088/proxy/applicati...37271446_0007/
    INFO : Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1433437271446_0007
    INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    INFO : 2015-06-04 21:42:55,902 Stage-1 map = 0%, reduce = 0%
    INFO : 2015-06-04 21:43:10,586 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.86 sec
    INFO : MapReduce Total cumulative CPU time: 4 seconds 860 msec
    INFO : Ended Job = job_1433437271446_0007
    +-----------------+--+
    | _c0 |
    +-----------------+--+
    | paul dufilo |
    | jacques dupond |
    | marcel martin |
    +-----------------+--+
    3 rows selected (32,719 seconds)
    0: jdbc:hive2://stargate:10000/jbedb> select nom from salaries;
    +-----------------+--+
    | nom |
    +-----------------+--+
    | Paul Dufilo |
    | Jacques Dupond |
    | Marcel Martin |
    +-----------------+--+

    With the UDF applied, the uppercase letters are converted to lowercase. This example opens up a lot of possibilities in Hive, which is very extensible.
    I like it. Streaming inserts look nice too; I'll go find a concrete example.

    To be continued.

  5. #5
    Hive SerDe is a serialization/deserialization mechanism that lets Hive import data in formats it does not support natively;
    Hive provides a SerDe interface that the user has to implement.

    Example with a CSV file:

    http://ogrodnek.github.io/csv-serde/

    hduser@stargate:~$ cat /tmp/csv.txt
    aaa,bbb
    ccc,tddd
    eee,fff

    Under the beeline interpreter:

    Registering the implementation jar as a resource, then creating the table:
    add jar /tmp/csv-serde-1.1.2.jar;
    create table my_table_csv(a string, b string)
    row format serde 'com.bizo.hive.serde.csv.CSVSerde'
    with serdeproperties ( "separatorChar" = ",", "quoteChar" = "'", "escapeChar" = "\\" )
    stored as textfile;
    load data local inpath "/tmp/csv.txt" into table my_table_csv;
    No rows affected (0,574 seconds)


    0: jdbc:hive2://stargate:10000/jbedb> select * from my_table_csv;
    +-----------------+-----------------+--+
    | my_table_csv.a | my_table_csv.b |
    +-----------------+-----------------+--+
    | aaa | bbb |
    | ccc | tddd |
    | eee | fff |
    +-----------------+-----------------+--+

    Same thing with JSON:

    https://code.google.com/p/hive-json-...GettingStarted

    hduser@stargate:~$ cat /tmp/json.txt
    {"field1":"data1","field2":100,"field3":"more data1","field4":123.001}
    {"field1":"data2","field2":200,"field3":"more data2","field4":123.002}
    {"field1":"data3","field2":300,"field3":"more data3","field4":123.003}
    {"field1":"data4","field2":400,"field3":"more data4","field4":123.004}

    In beeline, we add the resource:

    0: jdbc:hive2://stargate:10000/jbedb>
    add jar /tmp/hive-json-serde-0.2.jar
    INFO : Added [/tmp/hive-json-serde-0.2.jar] to class path
    INFO : Added resources: [/tmp/hive-json-serde-0.2.jar]
    No rows affected (0,06 seconds)

    0: jdbc:hive2://stargate:10000/jbedb>

    CREATE EXTERNAL TABLE IF NOT EXISTS my_table (field1 string, field2 int, field3 string, field4 double )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';

    0: jdbc:hive2://stargate:10000/jbedb> load data local inpath "/tmp/json.txt" into table my_table;
    INFO : Loading data to table jbedb.my_table from file:/tmp/json.txt
    INFO : Table jbedb.my_table stats: [numFiles=0, totalSize=0]
    0: jdbc:hive2://stargate:10000/jbedb> select * from my_table;
    +------------------+------------------+------------------+------------------+--+
    | my_table.field1 | my_table.field2 | my_table.field3 | my_table.field4 |
    +------------------+------------------+------------------+------------------+--+
    | data1 | 100 | more data1 | 123.001 |
    | data2 | 200 | more data2 | 123.002 |
    | data3 | 300 | more data3 | 123.003 |
    | data4 | 400 | more data4 | 123.004 |
    +------------------+------------------+------------------+------------------+--+
    4 rows selected (0,595 seconds)


    You can implement the initialize(), serialize() and deserialize() methods of the SerDe interface yourself to support yet another format; I find that pretty neat.


    How-to: Use a SerDe in Apache Hive

    http://blog.cloudera.com/blog/2012/1...n-apache-hive/

    That was my last chapter on this; I know enough for now, so I'm moving on to Pig & Hive & HBase.

    To be continued.

  6. #6
    OK, Pig and Hive.

    To sum up, there are roughly a handful of main keywords (LOAD, DUMP, FILTER, GROUP, JOIN, FOREACH, STORE),
    plus a few extra operators (CROSS, SPLIT, ...).

    Not so simple: I ran into several problems tied to the configuration.

    Pig has to use the Hive metastore, but for that the metastore has to agree to talk to Pig.

    I had to rework the metastore configuration to finally be able to use the Hive tables; it may come in handy:


    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://stargate:9083</value>
        <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
    </property>

    Then you have to set the environment variables in your .bashrc, not forgetting the pig adapter:

    export PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-*.jar:\
    $HCAT_HOME/share/hcatalog/hcatalog-pig-adapter*.jar:\
    $HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:\
    $HIVE_HOME/lib/hive-exec-*.jar:$HIVE_HOME/lib/libfb303-*.jar:\
    $HIVE_HOME/lib/jdo2-api-*-ec.jar:$HIVE_HOME/conf:$HADOOP_HOME/etc/hadoop:\
    $HIVE_HOME/lib/slf4j-api-*.jar
    export PIG_OPTS=-Dhive.metastore.uris=thrift://stargate:9083

    And in /usr/local/pig/conf you have to point pig.properties at the .pigbootup file; oddly it doesn't seem
    to pick up the environment variables, I'll look into that later.

    pig.load.default.statements=/usr/local/pig/.pigbootup

    hduser@stargate:/usr/local/pig$ cat .pigbootup
    REGISTER /usr/local/hive/hcatalog/share/hcatalog/hive-hcatalog-core-0.14.0.jar;
    REGISTER /usr/local/hive//lib/hive-exec-0.14.0.jar;
    REGISTER /usr/local/hive/lib/hive-metastore-0.14.0.jar;

    Very important: the Hive metastore must be started before hiveserver2 to avoid problems.

    First the metastore:

    nohup hive --service metastore &

    Check that the metastore is indeed listening on its port:
    netstat -an | grep 9083

    Then:
    nohup /usr/local/hive/bin/hiveserver2 >$HIVE_LOG_DIR/hiveServer2.out 2>$HIVE_LOG_DIR/hiveServer2.log &


    Icing on the cake: you have to be careful about the package name used in the LOAD to reach the metastore, otherwise it fails with
    ERROR 1070 (problem resolving the import); it simply doesn't know the class.

    ventes = LOAD 'jbedb.vente' USING org.apache.hive.hcatalog.pig.HCatLoader();


    First step, in Hive: I load a dummy sales CSV file into a table that Pig will then use through the metastore;
    Pig will build a recap with a total per customer and store it in a result table.

    1001,Platini,Menage,2000
    1002,Zidane,Menage,500
    1001,Platini,Menage,600
    1002,Zidane,Ordinateur,1000
    1001,Platini,Ordinateur,500
    1002,Zidane,Menage,1000
    1002,Zidane,Ordinateur,600
    1001,Platini,Menage,700
    1002,Zidane,Ordinateur,800



    beeline> !connect jdbc:hive2://stargate:10000/jbedb hduser servus
    scan complete in 4ms
    Connecting to jdbc:hive2://stargate:10000/jbedb
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/local/apache-hive-0.14.0-bin/lib/hive-jdbc-0.14.0-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    Connected to: Apache Hive (version 0.14.0)
    Driver: Hive JDBC (version 0.14.0)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    0: jdbc:hive2://stargate:10000/jbedb> truncate table ventes;
    No rows affected (0,656 seconds)
    0: jdbc:hive2://stargate:10000/jbedb> drop table ventes;
    No rows affected (0,889 seconds)
    0: jdbc:hive2://stargate:10000/jbedb> CREATE TABLE ventes ( custId int, custName String, productType String, value Int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
    No rows affected (0,512 seconds)
    0: jdbc:hive2://stargate:10000/jbedb>LOAD DATA LOCAL inpath '/usr/local/pig/SalesData.csv' overwrite into table ventes;
    No rows affected (0,793 seconds)
    INFO : Loading data to table jbedb.ventes from file:/usr/local/pig/SalesData.csv
    INFO : Table jbedb.ventes stats: [numFiles=1, numRows=0, totalSize=230, rawDataSize=0]
    0: jdbc:hive2://stargate:10000/jbedb> select * from ventes;
    +----------------+------------------+---------------------+---------------+--+
    | ventes.custid | ventes.custname | ventes.producttype | ventes.value |
    +----------------+------------------+---------------------+---------------+--+
    | 1001 | Platini | Menage | 2000 |
    | 1002 | Zidane | Menage | 500 |
    | 1001 | Platini | Menage | 600 |
    | 1002 | Zidane | Ordinateur | 1000 |
    | 1001 | Platini | Ordinateur | 500 |
    | 1002 | Zidane | Menage | 1000 |
    | 1002 | Zidane | Ordinateur | 600 |
    | 1001 | Platini | Menage | 700 |
    | 1002 | Zidane | Ordinateur | 800 |
    +----------------+------------------+---------------------+---------------+--+
    9 rows selected (0,469 seconds)

    Now I launch the second step with Pig in distributed mode, on top of the Hive metastore.


    hduser@stargate:/usr/local/pig/bin$ pig -x mapreduce -useHCatalog
    ls: impossible d'accéder à /usr/local/hive/lib/slf4j-api-*.jar: Aucun fichier ou dossier de ce type
    ls: impossible d'accéder à /usr/local/hive/hcatalog/lib/*hbase-storage-handler-*.jar: Aucun fichier ou dossier de ce type
    15/06/07 16:59:20 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
    15/06/07 16:59:20 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
    15/06/07 16:59:20 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
    2015-06-07 16:59:21,031 [main] INFO org.apache.pig.Main - Apache Pig version 0.14.0 (r1640057) compiled Nov 16 2014, 18:02:05
    2015-06-07 16:59:21,032 [main] INFO org.apache.pig.Main - Logging error messages to: /usr/local/pig-0.14.0/bin/pig_1433689161031.log
    2015-06-07 16:59:21,527 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 16:59:21,530 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 16:59:21,530 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://stargate:9000
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/local/hbase-0.98.4-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    2015-06-07 16:59:21,718 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    2015-06-07 16:59:22,179 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 16:59:22,179 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: stargate:8050
    2015-06-07 16:59:22,179 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 16:59:22,240 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 16:59:22,241 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 16:59:22,275 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 16:59:22,277 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 16:59:22,307 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 16:59:22,308 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 16:59:22,340 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 16:59:22,342 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 16:59:22,371 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 16:59:22,373 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 16:59:22,402 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 16:59:22,404 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 16:59:22,431 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 16:59:22,432 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 16:59:22,458 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 16:59:22,460 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    grunt> REGISTER /usr/local/hive/hcatalog/share/hcatalog/hive-hcatalog-core-0.14.0.jar;
    2015-06-07 16:59:22,655 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 16:59:22,655 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    grunt> REGISTER /usr/local/hive//lib/hive-exec-0.14.0.jar;
    2015-06-07 16:59:22,678 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 16:59:22,679 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    grunt> REGISTER /usr/local/hive/lib/hive-metastore-0.14.0.jar;
    2015-06-07 16:59:22,701 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 16:59:22,701 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    grunt> ventes = LOAD 'jbedb.vente' USING org.apache.hive.hcatalog.pig.HCatLoader();
    2015-06-07 17:00:40,503 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 17:00:40,504 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 17:00:40,873 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 17:00:40,875 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 17:00:40,875 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 17:00:40,875 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    2015-06-07 17:00:40,941 [main] INFO hive.metastore - Trying to connect to metastore with URI thrift://stargate:9083
    2015-06-07 17:00:40,994 [main] INFO hive.metastore - Connected to metastore.

    grunt> ventes = LOAD 'jbedb.ventes' USING org.apache.hive.hcatalog.pig.HCatLoader();
    2015-06-07 17:02:49,988 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 17:02:49,988 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 17:02:50,012 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 17:02:50,012 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 17:02:50,061 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 17:02:50,061 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 17:02:50,062 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 17:02:50,062 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    2015-06-07 17:02:50,063 [main] INFO hive.metastore - Trying to connect to metastore with URI thrift://stargate:9083
    2015-06-07 17:02:50,063 [main] INFO hive.metastore - Connected to metastore.
    2015-06-07 17:02:50,227 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 17:02:50,228 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 17:02:50,228 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 17:02:50,228 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    2015-06-07 17:02:50,308 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 17:02:50,308 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 17:02:50,341 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 17:02:50,341 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 17:02:50,342 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 17:02:50,342 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    grunt> dump
    2015-06-07 17:03:10,146 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 17:03:10,146 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 17:03:10,179 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 17:03:10,180 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 17:03:10,180 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 17:03:10,180 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    2015-06-07 17:03:10,199 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 17:03:10,199 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 17:03:10,204 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
    2015-06-07 17:03:10,228 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 17:03:10,228 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 17:03:10,232 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
    2015-06-07 17:03:10,262 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
    2015-06-07 17:03:10,371 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
    2015-06-07 17:03:10,394 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
    2015-06-07 17:03:10,394 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
    2015-06-07 17:03:10,416 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 17:03:10,440 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 17:03:10,596 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
    2015-06-07 17:03:10,600 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent
    2015-06-07 17:03:10,600 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
    2015-06-07 17:03:10,600 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
    2015-06-07 17:03:10,631 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 17:03:10,632 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 17:03:10,632 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 17:03:10,632 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    2015-06-07 17:03:10,760 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 17:03:10,761 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 17:03:10,761 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
    2015-06-07 17:03:10,761 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
    2015-06-07 17:03:11,107 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/hive-metastore-0.14.0.jar to DistributedCache through /tmp/temp-623380625/tmp1308536430/hive-metastore-0.14.0.jar
    2015-06-07 17:03:11,137 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/libthrift-0.9.0.jar to DistributedCache through /tmp/temp-623380625/tmp409944007/libthrift-0.9.0.jar
    2015-06-07 17:03:11,304 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/hive-exec-0.14.0.jar to DistributedCache through /tmp/temp-623380625/tmp-1125580366/hive-exec-0.14.0.jar
    2015-06-07 17:03:11,404 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/libfb303-0.9.0.jar to DistributedCache through /tmp/temp-623380625/tmp655893709/libfb303-0.9.0.jar
    2015-06-07 17:03:11,503 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/jdo-api-3.0.1.jar to DistributedCache through /tmp/temp-623380625/tmp-125161463/jdo-api-3.0.1.jar
    2015-06-07 17:03:11,595 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/hive-hbase-handler-0.14.0.jar to DistributedCache through /tmp/temp-623380625/tmp-1525857397/hive-hbase-handler-0.14.0.jar
    2015-06-07 17:03:11,628 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/hcatalog/share/hcatalog/hive-hcatalog-core-0.14.0.jar to DistributedCache through /tmp/temp-623380625/tmp1869088516/hive-hcatalog-core-0.14.0.jar
    2015-06-07 17:03:11,661 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/hcatalog/share/hcatalog/hive-hcatalog-pig-adapter-0.14.0.jar to DistributedCache through /tmp/temp-623380625/tmp570675543/hive-hcatalog-pig-adapter-0.14.0.jar
    2015-06-07 17:03:11,703 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/pig-0.14.0-core-h2.jar to DistributedCache through /tmp/temp-623380625/tmp-1688609819/pig-0.14.0-core-h2.jar
    2015-06-07 17:03:11,736 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp-623380625/tmp596228608/automaton-1.11-8.jar
    2015-06-07 17:03:11,761 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp-623380625/tmp-1959640922/antlr-runtime-3.4.jar
    2015-06-07 17:03:11,795 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/lib/joda-time-2.1.jar to DistributedCache through /tmp/temp-623380625/tmp73397238/joda-time-2.1.jar
    2015-06-07 17:03:11,835 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
    2015-06-07 17:03:11,914 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
    2015-06-07 17:03:11,915 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker.http.address is deprecated. Instead, use mapreduce.jobtracker.http.address
    2015-06-07 17:03:11,915 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 17:03:11,918 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 17:03:11,934 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 17:03:12,003 [JobControl] WARN org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
    2015-06-07 17:03:12,090 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
    2015-06-07 17:03:12,109 [JobControl] INFO org.apache.hadoop.mapred.FileInputFormat - Total input paths to process : 1
    2015-06-07 17:03:12,117 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
    2015-06-07 17:03:12,244 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
    2015-06-07 17:03:12,344 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1433663908007_0005
    2015-06-07 17:03:12,459 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
    2015-06-07 17:03:12,567 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1433663908007_0005
    2015-06-07 17:03:12,634 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://stargate:8088/proxy/applicati...63908007_0005/
    2015-06-07 17:03:12,634 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1433663908007_0005
    2015-06-07 17:03:12,634 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases ventes
    2015-06-07 17:03:12,635 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: ventes[1,9] C: R:
    2015-06-07 17:03:12,644 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
    2015-06-07 17:03:12,644 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1433663908007_0005]
    2015-06-07 17:03:34,714 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
    2015-06-07 17:03:34,714 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1433663908007_0005]
    2015-06-07 17:03:37,725 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 17:03:37,733 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 17:03:37,920 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 17:03:37,926 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 17:03:37,965 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
    2015-06-07 17:03:37,967 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 17:03:37,971 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 17:03:38,020 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
    2015-06-07 17:03:38,022 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

    HadoopVersion PigVersion UserId StartedAt FinishedAt Features
    2.6.0 0.14.0 hduser 2015-06-07 17:03:10 2015-06-07 17:03:38 UNKNOWN

    Success!

    Job Stats (time in seconds):
    JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
    job_1433663908007_0005 1 0 14 14 14 14 0 0 0 0 ventes MAP_ONLY hdfs://stargate:9000/tmp/temp-623380625/tmp-301340857,

    Input(s):
    Successfully read 9 records (12422 bytes) from: "jbedb.ventes"

    Output(s):
    Successfully stored 9 records (272 bytes) in: "hdfs://stargate:9000/tmp/temp-623380625/tmp-301340857"

    Counters:
    Total records written : 9
    Total bytes written : 272
    Spillable Memory Manager spill count : 0
    Total bags proactively spilled: 0
    Total records proactively spilled: 0

    Job DAG:
    job_1433663908007_0005


    2015-06-07 17:03:38,024 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 17:03:38,028 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 17:03:38,060 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 17:03:38,063 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 17:03:38,087 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 17:03:38,090 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 17:03:38,122 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
    2015-06-07 17:03:38,125 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 17:03:38,125 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 17:03:38,125 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
    2015-06-07 17:03:38,134 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
    2015-06-07 17:03:38,134 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
    (1001,Platini,Menage,2000)
    (1002,Zidane,Menage,500)
    (1001,Platini,Menage,600)
    (1002,Zidane,Ordinateur,1000)
    (1001,Platini,Ordinateur,500)
    (1002,Zidane,Menage,1000)
    (1002,Zidane,Ordinateur,600)
    (1001,Platini,Menage,700)
    (1002,Zidane,Ordinateur,800)


    All that just to be able to process a file in Pig starting from Hive.
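
    For the recap step announced above, as far as I know the result table has to already exist in Hive before Pig can store into it with org.apache.hive.hcatalog.pig.HCatStorer; a possible sketch (table and column names are my own, not taken from the actual run):

    CREATE TABLE ventes_recap (
    custname STRING,
    total INT
    )
    STORED AS TEXTFILE;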

  7. #7
    We saw the LOAD earlier, which needed access to the Hive metastore. I could have read the CSV file directly by using
    USING PigStorage(',') instead of the metastore API, but going through the db is more fun.

    Example: the file is loaded directly, but then you have to describe the separator and the list of fields with their types:
    ventes= LOAD 'SalesData.csv' using PigStorage (',') as (custId:int, custName:chararray, producttype:chararray, value:int );

    Now we can apply a filter with a condition, which has the effect of reducing the list.

    We apply the filter:

    grunt> ordiVendus = FILTER ventes BY producttype == 'Ordinateur';

    2015-06-07 17:42:54,204 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 17:42:54,204 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 17:42:54,235 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 17:42:54,235 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 17:42:54,235 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 17:42:54,236 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    2015-06-07 17:42:54,267 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 17:42:54,267 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 17:42:54,292 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 17:42:54,292 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 17:42:54,292 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 17:42:54,292 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist

    We process the content to display it; the list will be reduced:
    grunt> dump ordiVendus;
    2015-06-07 17:43:02,464 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 17:43:02,464 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 17:43:02,491 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 17:43:02,491 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 17:43:02,491 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 17:43:02,491 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    2015-06-07 17:43:02,512 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 17:43:02,512 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 17:43:02,514 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: FILTER
    2015-06-07 17:43:02,534 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 17:43:02,534 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 17:43:02,534 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
    2015-06-07 17:43:02,535 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
    2015-06-07 17:43:02,541 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
    2015-06-07 17:43:02,542 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
    2015-06-07 17:43:02,542 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
    2015-06-07 17:43:02,551 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 17:43:02,552 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 17:43:02,554 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
    2015-06-07 17:43:02,555 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
    2015-06-07 17:43:02,577 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 17:43:02,577 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 17:43:02,577 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 17:43:02,577 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    2015-06-07 17:43:02,614 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 17:43:02,614 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 17:43:02,614 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
    2015-06-07 17:43:02,692 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/hive-metastore-0.14.0.jar to DistributedCache through /tmp/temp-623380625/tmp1468266506/hive-metastore-0.14.0.jar
    2015-06-07 17:43:02,725 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/libthrift-0.9.0.jar to DistributedCache through /tmp/temp-623380625/tmp-654533098/libthrift-0.9.0.jar
    2015-06-07 17:43:02,809 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/hive-exec-0.14.0.jar to DistributedCache through /tmp/temp-623380625/tmp-1389471049/hive-exec-0.14.0.jar
    2015-06-07 17:43:02,842 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/libfb303-0.9.0.jar to DistributedCache through /tmp/temp-623380625/tmp1345463028/libfb303-0.9.0.jar
    2015-06-07 17:43:02,875 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/jdo-api-3.0.1.jar to DistributedCache through /tmp/temp-623380625/tmp-93693194/jdo-api-3.0.1.jar
    2015-06-07 17:43:02,908 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/hive-hbase-handler-0.14.0.jar to DistributedCache through /tmp/temp-623380625/tmp796598022/hive-hbase-handler-0.14.0.jar
    2015-06-07 17:43:02,942 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/hcatalog/share/hcatalog/hive-hcatalog-core-0.14.0.jar to DistributedCache through /tmp/temp-623380625/tmp-2114612693/hive-hcatalog-core-0.14.0.jar
    2015-06-07 17:43:02,967 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/hcatalog/share/hcatalog/hive-hcatalog-pig-adapter-0.14.0.jar to DistributedCache through /tmp/temp-623380625/tmp-1599892601/hive-hcatalog-pig-adapter-0.14.0.jar
    2015-06-07 17:43:03,017 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/pig-0.14.0-core-h2.jar to DistributedCache through /tmp/temp-623380625/tmp1524629778/pig-0.14.0-core-h2.jar
    2015-06-07 17:43:03,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp-623380625/tmp63877365/automaton-1.11-8.jar
    2015-06-07 17:43:03,092 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp-623380625/tmp800259486/antlr-runtime-3.4.jar
    2015-06-07 17:43:03,125 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/lib/joda-time-2.1.jar to DistributedCache through /tmp/temp-623380625/tmp-161986306/joda-time-2.1.jar
    2015-06-07 17:43:03,144 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
    2015-06-07 17:43:03,181 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
    2015-06-07 17:43:03,183 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 17:43:03,216 [JobControl] WARN org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
    2015-06-07 17:43:03,291 [JobControl] INFO org.apache.hadoop.mapred.FileInputFormat - Total input paths to process : 1
    2015-06-07 17:43:03,292 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
    2015-06-07 17:43:03,399 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
    2015-06-07 17:43:03,449 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1433663908007_0006
    2015-06-07 17:43:03,455 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
    2015-06-07 17:43:03,519 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1433663908007_0006
    2015-06-07 17:43:03,523 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://stargate:8088/proxy/applicati...63908007_0006/
    2015-06-07 17:43:03,682 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1433663908007_0006
    2015-06-07 17:43:03,682 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases ordiVendus,ventes
    2015-06-07 17:43:03,682 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: ventes[1,9],ordiVendus[2,13] C: R:
    2015-06-07 17:43:03,688 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
    2015-06-07 17:43:03,688 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1433663908007_0006]
    2015-06-07 17:43:25,718 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
    2015-06-07 17:43:25,718 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1433663908007_0006]
    2015-06-07 17:43:28,726 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 17:43:28,733 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 17:43:28,861 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 17:43:28,864 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 17:43:28,890 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
    2015-06-07 17:43:28,890 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 17:43:28,894 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 17:43:28,929 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
    2015-06-07 17:43:28,929 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

    HadoopVersion PigVersion UserId StartedAt FinishedAt Features
    2.6.0 0.14.0 hduser 2015-06-07 17:43:02 2015-06-07 17:43:28 FILTER

    Success!

    Job Stats (time in seconds):
    JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
    job_1433663908007_0006 1 0 14 14 14 14 0 0 0 0 ordiVendus,ventes MAP_ONLY hdfs://stargate:9000/tmp/temp-623380625/tmp1336548572,

    Input(s):
    Successfully read 9 records (12422 bytes) from: "jbedb.ventes"

    Output(s):
    Successfully stored 4 records (129 bytes) in: "hdfs://stargate:9000/tmp/temp-623380625/tmp1336548572"

    Counters:
    Total records written : 4
    Total bytes written : 129
    Spillable Memory Manager spill count : 0
    Total bags proactively spilled: 0
    Total records proactively spilled: 0

    Job DAG:
    job_1433663908007_0006


    2015-06-07 17:43:28,930 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 17:43:28,935 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 17:43:28,964 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 17:43:28,968 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 17:43:28,991 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 17:43:28,995 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 17:43:29,026 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
    2015-06-07 17:43:29,027 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 17:43:29,027 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 17:43:29,027 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
    2015-06-07 17:43:29,035 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
    2015-06-07 17:43:29,035 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
    (1002,Zidane,Ordinateur,1000)
    (1001,Platini,Ordinateur,500)
    (1002,Zidane,Ordinateur,600)
    (1002,Zidane,Ordinateur,800)


    We have reduced the list. It is verbose; Pig is not exactly economical with its logging.
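
    (As an aside, and untested on this setup: Pig can be pointed at a custom Log4j configuration to quiet this down, e.g. pig -4 my-log4j.properties, where my-log4j.properties is a file you supply with a higher log level.)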

  8. #8
    Still after the LOAD, with the FILTER still active on the selected data, we apply a GROUP BY on the customer name and then dump the result.

    grunt> groupeVendus = GROUP ordiVendus BY custname;
    2015-06-07 18:04:29,026 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 18:04:29,026 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 18:04:29,057 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 18:04:29,057 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 18:04:29,057 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 18:04:29,057 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    2015-06-07 18:04:29,081 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 18:04:29,081 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 18:04:29,108 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 18:04:29,108 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 18:04:29,108 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 18:04:29,108 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    grunt> dump groupeVendus;
    2015-06-07 18:04:41,725 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 18:04:41,725 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 18:04:41,752 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 18:04:41,752 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 18:04:41,752 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 18:04:41,752 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    2015-06-07 18:04:41,774 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 18:04:41,774 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 18:04:41,776 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY,FILTER
    2015-06-07 18:04:41,798 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 18:04:41,798 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 18:04:41,798 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
    2015-06-07 18:04:41,798 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
    2015-06-07 18:04:41,804 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
    2015-06-07 18:04:41,808 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
    2015-06-07 18:04:41,808 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
    2015-06-07 18:04:41,816 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 18:04:41,817 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 18:04:41,819 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
    2015-06-07 18:04:41,820 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
    2015-06-07 18:04:41,842 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 18:04:41,843 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 18:04:41,843 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 18:04:41,843 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    2015-06-07 18:04:41,873 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 18:04:41,874 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 18:04:41,874 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers.
    2015-06-07 18:04:41,874 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
    2015-06-07 18:04:41,900 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=0
    2015-06-07 18:04:41,900 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
    2015-06-07 18:04:41,900 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
    2015-06-07 18:04:41,900 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
    2015-06-07 18:04:41,975 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/hive-metastore-0.14.0.jar to DistributedCache through /tmp/temp-623380625/tmp-759461806/hive-metastore-0.14.0.jar
    2015-06-07 18:04:42,000 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/libthrift-0.9.0.jar to DistributedCache through /tmp/temp-623380625/tmp-2001077575/libthrift-0.9.0.jar
    2015-06-07 18:04:42,091 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/hive-exec-0.14.0.jar to DistributedCache through /tmp/temp-623380625/tmp1993579607/hive-exec-0.14.0.jar
    2015-06-07 18:04:42,116 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/libfb303-0.9.0.jar to DistributedCache through /tmp/temp-623380625/tmp1525816211/libfb303-0.9.0.jar
    2015-06-07 18:04:42,141 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/jdo-api-3.0.1.jar to DistributedCache through /tmp/temp-623380625/tmp1838452442/jdo-api-3.0.1.jar
    2015-06-07 18:04:42,175 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/hive-hbase-handler-0.14.0.jar to DistributedCache through /tmp/temp-623380625/tmp-693606662/hive-hbase-handler-0.14.0.jar
    2015-06-07 18:04:42,208 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/hcatalog/share/hcatalog/hive-hcatalog-core-0.14.0.jar to DistributedCache through /tmp/temp-623380625/tmp-1953607127/hive-hcatalog-core-0.14.0.jar
    2015-06-07 18:04:42,233 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/hcatalog/share/hcatalog/hive-hcatalog-pig-adapter-0.14.0.jar to DistributedCache through /tmp/temp-623380625/tmp-1306112620/hive-hcatalog-pig-adapter-0.14.0.jar
    2015-06-07 18:04:42,275 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/pig-0.14.0-core-h2.jar to DistributedCache through /tmp/temp-623380625/tmp-2120105410/pig-0.14.0-core-h2.jar
    2015-06-07 18:04:42,308 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp-623380625/tmp-248233088/automaton-1.11-8.jar
    2015-06-07 18:04:42,341 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp-623380625/tmp942003173/antlr-runtime-3.4.jar
    2015-06-07 18:04:42,375 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/lib/joda-time-2.1.jar to DistributedCache through /tmp/temp-623380625/tmp1959606950/joda-time-2.1.jar
    2015-06-07 18:04:42,385 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
    2015-06-07 18:04:42,436 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
    2015-06-07 18:04:42,437 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 18:04:42,439 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 18:04:42,447 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 18:04:42,474 [JobControl] WARN org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
    2015-06-07 18:04:42,548 [JobControl] INFO org.apache.hadoop.mapred.FileInputFormat - Total input paths to process : 1
    2015-06-07 18:04:42,548 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
    2015-06-07 18:04:42,716 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
    2015-06-07 18:04:42,774 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1433663908007_0007
    2015-06-07 18:04:42,778 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
    2015-06-07 18:04:42,827 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1433663908007_0007
    2015-06-07 18:04:42,830 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://stargate:8088/proxy/applicati...63908007_0007/
    2015-06-07 18:04:42,937 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1433663908007_0007
    2015-06-07 18:04:42,937 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases groupeVendus,ordiVendus,ventes
    2015-06-07 18:04:42,938 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: ventes[1,9],ordiVendus[2,13],groupeVendus[3,15] C: R:
    2015-06-07 18:04:42,945 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
    2015-06-07 18:04:42,945 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1433663908007_0007]
    2015-06-07 18:05:17,174 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
    2015-06-07 18:05:17,174 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1433663908007_0007]
    2015-06-07 18:05:32,710 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1433663908007_0007]
    2015-06-07 18:05:33,215 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 18:05:33,223 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 18:05:33,353 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 18:05:33,357 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 18:05:33,392 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 18:05:33,396 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 18:05:33,431 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
    2015-06-07 18:05:33,432 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

    HadoopVersion PigVersion UserId StartedAt FinishedAt Features
    2.6.0 0.14.0 hduser 2015-06-07 18:04:41 2015-06-07 18:05:33 GROUP_BY,FILTER

    Success!

    Job Stats (time in seconds):
    JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
    job_1433663908007_0007 1 1 15 15 15 15 11 11 11 11 groupeVendus,ordiVendus,ventes GROUP_BY hdfs://stargate:9000/tmp/temp-623380625/tmp-669030939,

    Input(s):
    Successfully read 9 records (12422 bytes) from: "jbedb.ventes"

    Output(s):
    Successfully stored 2 records (148 bytes) in: "hdfs://stargate:9000/tmp/temp-623380625/tmp-669030939"

    Counters:
    Total records written : 2
    Total bytes written : 148
    Spillable Memory Manager spill count : 0
    Total bags proactively spilled: 0
    Total records proactively spilled: 0

    Job DAG:
    job_1433663908007_0007


    2015-06-07 18:05:33,433 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 18:05:33,437 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 18:05:33,465 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 18:05:33,469 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 18:05:33,501 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 18:05:33,504 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 18:05:33,536 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
    2015-06-07 18:05:33,536 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 18:05:33,536 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 18:05:33,537 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
    2015-06-07 18:05:33,544 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
    2015-06-07 18:05:33,544 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
    (Zidane,{(1002,Zidane,Ordinateur,800),(1002,Zidane,Ordinateur,600),(1002,Zidane,Ordinateur,1000)})
    (Platini,{(1001,Platini,Ordinateur,500)})

    grunt>
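
    Each tuple of the grouped relation is the group key followed by a bag holding every matching input tuple, which is exactly what the two lines above show. DESCRIBE makes that structure explicit; a minimal sketch, assuming the jbedb.ventes columns are custid int, custname string, producttype string and value int (the real names and types come from the Hive table):
    grunt> DESCRIBE groupeVendus;
    groupeVendus: {group: chararray, ordiVendus: {(custid: int, custname: chararray, producttype: chararray, value: int)}}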

  9. #9
    Using FOREACH to compute the total of the grouped sales per customer.

    We apply the FOREACH:

    grunt> custTotalVendus = FOREACH groupeVendus GENERATE group as custname, SUM( ordiVendus.(value)) as value;
    2015-06-07 18:20:59,717 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 18:20:59,718 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 18:20:59,747 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 18:20:59,748 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 18:20:59,748 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 18:20:59,748 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    2015-06-07 18:20:59,806 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 18:20:59,806 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 18:20:59,831 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 18:20:59,831 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 18:20:59,831 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 18:20:59,831 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist

    We dump the result:
    grunt> dump custTotalVendus;
    2015-06-07 18:24:28,776 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 18:24:28,776 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 18:24:28,776 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 18:24:28,776 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    2015-06-07 18:24:28,798 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 18:24:28,798 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 18:24:28,800 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY,FILTER
    2015-06-07 18:24:28,816 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 18:24:28,816 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 18:24:28,816 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
    2015-06-07 18:24:28,816 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
    2015-06-07 18:24:28,819 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
    2015-06-07 18:24:28,820 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.CombinerOptimizerUtil - Choosing to move algebraic foreach to combiner
    2015-06-07 18:24:28,822 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
    2015-06-07 18:24:28,823 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
    2015-06-07 18:24:28,830 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 18:24:28,831 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 18:24:28,833 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
    2015-06-07 18:24:28,834 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
    2015-06-07 18:24:28,854 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 18:24:28,854 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 18:24:28,854 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 18:24:28,854 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    2015-06-07 18:24:28,880 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 18:24:28,880 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 18:24:28,880 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers.
    2015-06-07 18:24:28,880 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
    2015-06-07 18:24:28,902 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=0
    2015-06-07 18:24:28,902 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
    2015-06-07 18:24:28,902 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
    2015-06-07 18:24:28,902 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
    2015-06-07 18:24:28,974 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/hive-metastore-0.14.0.jar to DistributedCache through /tmp/temp-623380625/tmp-37665561/hive-metastore-0.14.0.jar
    2015-06-07 18:24:29,008 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/libthrift-0.9.0.jar to DistributedCache through /tmp/temp-623380625/tmp-137153112/libthrift-0.9.0.jar
    2015-06-07 18:24:29,099 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/hive-exec-0.14.0.jar to DistributedCache through /tmp/temp-623380625/tmp389340284/hive-exec-0.14.0.jar
    2015-06-07 18:24:29,132 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/libfb303-0.9.0.jar to DistributedCache through /tmp/temp-623380625/tmp-1697339981/libfb303-0.9.0.jar
    2015-06-07 18:24:29,166 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/jdo-api-3.0.1.jar to DistributedCache through /tmp/temp-623380625/tmp1397486110/jdo-api-3.0.1.jar
    2015-06-07 18:24:29,191 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/hive-hbase-handler-0.14.0.jar to DistributedCache through /tmp/temp-623380625/tmp-17183688/hive-hbase-handler-0.14.0.jar
    2015-06-07 18:24:29,224 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/hcatalog/share/hcatalog/hive-hcatalog-core-0.14.0.jar to DistributedCache through /tmp/temp-623380625/tmp185634080/hive-hcatalog-core-0.14.0.jar
    2015-06-07 18:24:29,249 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/hcatalog/share/hcatalog/hive-hcatalog-pig-adapter-0.14.0.jar to DistributedCache through /tmp/temp-623380625/tmp1947741449/hive-hcatalog-pig-adapter-0.14.0.jar
    2015-06-07 18:24:29,291 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/pig-0.14.0-core-h2.jar to DistributedCache through /tmp/temp-623380625/tmp-541595554/pig-0.14.0-core-h2.jar
    2015-06-07 18:24:29,324 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp-623380625/tmp2124757516/automaton-1.11-8.jar
    2015-06-07 18:24:29,357 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp-623380625/tmp2107909947/antlr-runtime-3.4.jar
    2015-06-07 18:24:29,391 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/lib/joda-time-2.1.jar to DistributedCache through /tmp/temp-623380625/tmp-833660242/joda-time-2.1.jar
    2015-06-07 18:24:29,399 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
    2015-06-07 18:24:29,400 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
    2015-06-07 18:24:29,400 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
    2015-06-07 18:24:29,400 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
    2015-06-07 18:24:29,438 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
    2015-06-07 18:24:29,438 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 18:24:29,440 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 18:24:29,446 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 18:24:29,474 [JobControl] WARN org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
    2015-06-07 18:24:29,539 [JobControl] INFO org.apache.hadoop.mapred.FileInputFormat - Total input paths to process : 1
    2015-06-07 18:24:29,539 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
    2015-06-07 18:24:29,640 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
    2015-06-07 18:24:29,690 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1433663908007_0009
    2015-06-07 18:24:29,695 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
    2015-06-07 18:24:29,750 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1433663908007_0009
    2015-06-07 18:24:29,753 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://stargate:8088/proxy/applicati...63908007_0009/
    2015-06-07 18:24:29,939 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1433663908007_0009
    2015-06-07 18:24:29,939 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases custTotalVendus,groupeVendus,ordiVendus,ventes
    2015-06-07 18:24:29,939 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: ventes[1,9],ordiVendus[2,13],custTotalVendus[4,18],groupeVendus[3,15] C: custTotalVendus[4,18],groupeVendus[3,15] R: custTotalVendus[4,18]
    2015-06-07 18:24:29,945 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
    2015-06-07 18:24:29,945 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1433663908007_0009]
    2015-06-07 18:25:02,469 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
    2015-06-07 18:25:02,469 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1433663908007_0009]
    2015-06-07 18:25:07,479 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1433663908007_0009]
    2015-06-07 18:25:10,489 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 18:25:10,494 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 18:25:10,618 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 18:25:10,621 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 18:25:10,671 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 18:25:10,675 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 18:25:10,707 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
    2015-06-07 18:25:10,707 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

    HadoopVersion PigVersion UserId StartedAt FinishedAt Features
    2.6.0 0.14.0 hduser 2015-06-07 18:24:28 2015-06-07 18:25:10 GROUP_BY,FILTER

    Success!

    Job Stats (time in seconds):
    JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
    job_1433663908007_0009 1 1 13 13 13 13 3 3 3 3 custTotalVendus,groupeVendus,ordiVendus,ventes GROUP_BY,COMBINER hdfs://stargate:9000/tmp/temp-623380625/tmp-830833194,

    Input(s):
    Successfully read 9 records (12422 bytes) from: "jbedb.ventes"

    Output(s):
    Successfully stored 2 records (33 bytes) in: "hdfs://stargate:9000/tmp/temp-623380625/tmp-830833194"

    Counters:
    Total records written : 2
    Total bytes written : 33
    Spillable Memory Manager spill count : 0
    Total bags proactively spilled: 0
    Total records proactively spilled: 0

    Job DAG:
    job_1433663908007_0009


    2015-06-07 18:25:10,709 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 18:25:10,713 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 18:25:10,740 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 18:25:10,743 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 18:25:10,775 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 18:25:10,778 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 18:25:10,813 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
    2015-06-07 18:25:10,814 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 18:25:10,814 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 18:25:10,814 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
    2015-06-07 18:25:10,821 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
    2015-06-07 18:25:10,821 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
    (Zidane,2400)
    (Platini,500)
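
    As a quick check against the earlier dump: Zidane's three Ordinateur rows sum to 1000 + 600 + 800 = 2400, and Platini's single row gives 500.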

  10. #10
    The loop is now closed: after the LOAD, the FILTER, the GROUP BY and the FOREACH, we save the result into a Hive table, which must match the format and the types produced by Pig; in this particular case, a set of rows of type (string, bigint).

    One surprise: at STORE time you cannot name the fields; it has to be done beforehand in the FOREACH, for example group AS custname, SUM(x) AS value; otherwise the STORE is rejected by Hive. That is not clearly documented.
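
    To make that concrete, a minimal sketch reusing the aliases from this thread (the DESCRIBE line is what Pig should report, assuming the value column is an int on the Hive side, since SUM over an int yields a long, which maps to bigint in Hive):
    grunt> custTotalVendus = FOREACH groupeVendus GENERATE group AS custname, SUM(ordiVendus.value) AS value;
    grunt> DESCRIBE custTotalVendus;
    custTotalVendus: {custname: chararray, value: long}
    grunt> STORE custTotalVendus INTO 'jbedb.ventesclient' USING org.apache.hive.hcatalog.pig.HCatStorer();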

    Before launching the Pig STORE, a result table has to be created on the Hive side:
    0: jdbc:hive2://stargate:10000/jbedb> create table ventesclient ( custname String, value bigint) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
    No rows affected (1,041 seconds)
    0: jdbc:hive2://stargate:10000/jbedb> select * from ventesclient;
    +------------------------+---------------------+--+
    | ventesclient.custname | ventesclient.value |
    +------------------------+---------------------+--+

    Now we can launch the STORE on the Pig side:
    grunt> STORE custTotalVendus INTO 'jbedb.ventesclient' USING org.apache.hive.hcatalog.pig.HCatStorer();
    2015-06-07 19:28:09,295 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 19:28:09,295 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 19:28:09,328 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 19:28:09,328 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 19:28:09,328 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 19:28:09,328 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    2015-06-07 19:28:09,329 [main] INFO hive.metastore - Trying to connect to metastore with URI thrift://stargate:9083
    2015-06-07 19:28:09,329 [main] INFO hive.metastore - Connected to metastore.
    2015-06-07 19:28:09,367 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 19:28:09,367 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 19:28:09,385 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 19:28:09,385 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 19:28:09,421 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 19:28:09,422 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 19:28:09,422 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 19:28:09,422 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    2015-06-07 19:28:09,446 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 19:28:09,446 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 19:28:09,472 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 19:28:09,472 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 19:28:09,472 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 19:28:09,472 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    2015-06-07 19:28:09,512 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
    2015-06-07 19:28:09,528 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 19:28:09,531 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 19:28:09,572 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 19:28:09,572 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 19:28:09,572 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 19:28:09,572 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    2015-06-07 19:28:09,589 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 19:28:09,590 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 19:28:09,632 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY,FILTER
    2015-06-07 19:28:09,652 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 19:28:09,652 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 19:28:09,653 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
    2015-06-07 19:28:09,653 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
    2015-06-07 19:28:09,665 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
    2015-06-07 19:28:09,668 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.CombinerOptimizerUtil - Choosing to move algebraic foreach to combiner
    2015-06-07 19:28:09,680 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
    2015-06-07 19:28:09,680 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
    2015-06-07 19:28:09,689 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 19:28:09,690 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 19:28:09,693 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
    2015-06-07 19:28:09,693 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
    2015-06-07 19:28:09,717 [main] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 19:28:09,718 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 19:28:09,718 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 19:28:09,718 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    2015-06-07 19:28:09,751 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 19:28:09,751 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 19:28:09,751 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers.
    2015-06-07 19:28:09,751 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
    2015-06-07 19:28:09,778 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=0
    2015-06-07 19:28:09,778 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
    2015-06-07 19:28:09,778 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
    2015-06-07 19:28:09,778 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
    2015-06-07 19:28:09,869 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/hive-metastore-0.14.0.jar to DistributedCache through /tmp/temp1318958339/tmp-782314409/hive-metastore-0.14.0.jar
    2015-06-07 19:28:09,902 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/libthrift-0.9.0.jar to DistributedCache through /tmp/temp1318958339/tmp-237449792/libthrift-0.9.0.jar
    2015-06-07 19:28:09,986 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/hive-exec-0.14.0.jar to DistributedCache through /tmp/temp1318958339/tmp-1645679962/hive-exec-0.14.0.jar
    2015-06-07 19:28:10,019 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/libfb303-0.9.0.jar to DistributedCache through /tmp/temp1318958339/tmp754678357/libfb303-0.9.0.jar
    2015-06-07 19:28:10,052 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/jdo-api-3.0.1.jar to DistributedCache through /tmp/temp1318958339/tmp50696003/jdo-api-3.0.1.jar
    2015-06-07 19:28:10,086 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/hive-hbase-handler-0.14.0.jar to DistributedCache through /tmp/temp1318958339/tmp461403586/hive-hbase-handler-0.14.0.jar
    2015-06-07 19:28:10,119 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/hcatalog/share/hcatalog/hive-hcatalog-core-0.14.0.jar to DistributedCache through /tmp/temp1318958339/tmp1556016879/hive-hcatalog-core-0.14.0.jar
    2015-06-07 19:28:10,144 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/hcatalog/share/hcatalog/hive-hcatalog-pig-adapter-0.14.0.jar to DistributedCache through /tmp/temp1318958339/tmp-703920126/hive-hcatalog-pig-adapter-0.14.0.jar
    2015-06-07 19:28:10,186 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/pig-0.14.0-core-h2.jar to DistributedCache through /tmp/temp1318958339/tmp1338067857/pig-0.14.0-core-h2.jar
    2015-06-07 19:28:10,219 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp1318958339/tmp-185389387/automaton-1.11-8.jar
    2015-06-07 19:28:10,252 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp1318958339/tmp-1567315164/antlr-runtime-3.4.jar
    2015-06-07 19:28:10,285 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/lib/joda-time-2.1.jar to DistributedCache through /tmp/temp1318958339/tmp-1207475514/joda-time-2.1.jar
    2015-06-07 19:28:10,297 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
    2015-06-07 19:28:10,298 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
    2015-06-07 19:28:10,298 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
    2015-06-07 19:28:10,299 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
    2015-06-07 19:28:10,382 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
    2015-06-07 19:28:10,382 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 19:28:10,384 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 19:28:10,392 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 19:28:10,427 [JobControl] WARN org.apache.hadoop.hive.conf.HiveConf - DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    2015-06-07 19:28:10,427 [JobControl] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.attempts does not exist
    2015-06-07 19:28:10,427 [JobControl] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.ds.retry.interval does not exist
    2015-06-07 19:28:10,427 [JobControl] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.stats.map.parallelism does not exist
    2015-06-07 19:28:10,440 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-07 19:28:10,440 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-07 19:28:10,441 [JobControl] INFO hive.metastore - Trying to connect to metastore with URI thrift://stargate:9083
    2015-06-07 19:28:10,441 [JobControl] INFO hive.metastore - Connected to metastore.
    2015-06-07 19:28:10,454 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
    2015-06-07 19:28:10,485 [JobControl] WARN org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
    2015-06-07 19:28:10,555 [JobControl] INFO org.apache.hadoop.mapred.FileInputFormat - Total input paths to process : 1
    2015-06-07 19:28:10,555 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
    2015-06-07 19:28:10,651 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
    2015-06-07 19:28:10,710 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1433663908007_0016
    2015-06-07 19:28:10,715 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
    2015-06-07 19:28:10,783 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1433663908007_0016
    2015-06-07 19:28:10,786 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://stargate:8088/proxy/applicati...63908007_0016/
    2015-06-07 19:28:10,883 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1433663908007_0016
    2015-06-07 19:28:10,883 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases custTotalVendus,groupeVendus,ordiVendus,ventes
    2015-06-07 19:28:10,883 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: ventes[1,9],ordiVendus[2,13],custTotalVendus[5,18],groupeVendus[3,15] C: custTotalVendus[5,18],groupeVendus[3,15] R: custTotalVendus[5,18]
    2015-06-07 19:28:10,889 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
    2015-06-07 19:28:10,890 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1433663908007_0016]
    2015-06-07 19:28:35,693 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
    2015-06-07 19:28:35,693 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1433663908007_0016]
    2015-06-07 19:28:53,232 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1433663908007_0016]
    2015-06-07 19:28:56,241 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 19:28:56,248 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 19:28:56,372 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 19:28:56,377 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 19:28:56,415 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 19:28:56,419 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 19:28:56,447 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
    2015-06-07 19:28:56,447 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

    HadoopVersion PigVersion UserId StartedAt FinishedAt Features
    2.6.0 0.14.0 hduser 2015-06-07 19:28:09 2015-06-07 19:28:56 GROUP_BY,FILTER

    Success!

    Job Stats (time in seconds):
    JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
    job_1433663908007_0016 1 1 4 4 4 4 14 14 14 14 custTotalVendus,groupeVendus,ordiVendus,ventes GROUP_BY,COMBINER jbedb.ventesclient,

    Input(s):
    Successfully read 9 records (12422 bytes) from: "jbedb.ventes"

    Output(s):
    Successfully stored 2 records (24 bytes) in: "jbedb.ventesclient"

    Counters:
    Total records written : 2
    Total bytes written : 24
    Spillable Memory Manager spill count : 0
    Total bags proactively spilled: 0
    Total records proactively spilled: 0

    Job DAG:
    job_1433663908007_0016


    2015-06-07 19:28:56,448 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 19:28:56,452 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 19:28:56,481 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 19:28:56,485 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 19:28:56,512 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-07 19:28:56,517 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-07 19:28:56,548 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
    grunt>


    In Hive, this gives a recap of the sales with a total per customer:

    0: jdbc:hive2://stargate:10000/jbedb> select * from ventesclient;
    +------------------------+---------------------+--+
    | ventesclient.custname | ventesclient.value |
    +------------------------+---------------------+--+
    | Zidane | 2400 |
    | Platini | 500 |
    +------------------------+---------------------+--+
    2 rows selected (0,581 seconds)
    0: jdbc:hive2://stargate:10000/jbedb>

    The complete Pig script; it's not very big:
    ventes = LOAD 'jbedb.ventes' USING org.apache.hive.hcatalog.pig.HCatLoader();
    ordiVendus = FILTER ventes BY producttype == 'Ordinateur';
    dump ordiVendus;
    groupeVendus = GROUP ordiVendus BY custname;
    dump groupeVendus;
    custTotalVendus = FOREACH groupeVendus GENERATE group as custname, SUM( ordiVendus.(value)) as value;
    dump custTotalVendus;
    STORE custTotalVendus INTO 'jbedb.ventesclient' USING org.apache.hive.hcatalog.pig.HCatStorer();
    Conclusion: Pig is a simplified language and relatively easy to learn once you have a correctly installed and configured platform; it's an alternative to the more complex MapReduce programming in Java.

    I still have a few functions to look at before tackling the big piece, Java MapReduce.

    To be continued

  11. #11
    We can do the same thing with HBase. I'll skip reposting the logs, which are almost identical, but there is one subtlety: you need an id column as the primary key, with a unique, sequence-like value, to avoid data being overwritten through custId. Hive does this as an append, whereas HBase behaves as insert/update. With HBase my example works less well, because custId is only 1001 or 1002, so we end up with just 2 records instead of 10: the last two writes overwrite all the other records sharing the same row key (illustrated by the small Java sketch below).
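
    To see this overwrite outside of Pig, here is a minimal Java sketch (my own illustration, assuming the old HBase 0.98 client API used on this cluster; it is not part of the Pig/Hive pipeline above): two Puts on the same row key, where the second silently replaces the first because the column family keeps a single version.

    package jbetest;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch: demonstrates why a non-unique row key (custId) loses rows in HBase.
    public class HBaseOverwriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // picks up hbase-site.xml
            HTable table = new HTable(conf, "hbventes");         // 0.98-style client API
            try {
                Put first = new Put(Bytes.toBytes("1002"));      // row key = custId
                first.add(Bytes.toBytes("f"), Bytes.toBytes("value"), Bytes.toBytes("1000"));
                table.put(first);

                Put second = new Put(Bytes.toBytes("1002"));     // same row key...
                second.add(Bytes.toBytes("f"), Bytes.toBytes("value"), Bytes.toBytes("800"));
                table.put(second);                               // ...so only this value remains visible
            } finally {
                table.close();
            }
        }
    }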

    Hive: creation, from Hive, of an HBase (row, column) table with a Hive mapping, so the HBase data can be manipulated in an SQL-like way.

    CREATE TABLE hbventes( custId int, custName String, productType String, value Int ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:custName,f:productType,f:value') TBLPROPERTIES ('hbase.table.name' = 'hbventes');


    pig:

    we use HBaseStorage with the HBase/Pig field mapping:

    ventes = LOAD 'hbase://hbventes' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('f:custName f:productType f:value', '-loadKey true -limit 5') as ( custId:bytearray, custName:chararray, producttype:chararray, value:int );


    The row key is implicitly placed in the first mapped field, custId in this case. Field names are case sensitive; if the names are wrong, the fields come back empty.

    What follows is the same as what was done for Hive. We could also treat it as a Hive-HBase table via HCatalog, but from what I observed it is much slower than going direct.

    For the STORE, same approach: we create a result table from Hive on top of HBase.

    Hive
    CREATE TABLE hbventesclient( custName String, value Int) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:value') TBLPROPERTIES ('hbase.table.name' = 'hbventesclient');
    No rows affected (0,983 seconds)

    pig:
    custTotalVendus = FOREACH groupeVendus GENERATE group as custName, SUM( ordiVendus.(value)) as value;

    STORE custTotalVendus INTO 'hbase://hbventesclient' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage( 'f:value');

    The custName key is stored implicitly.

    ./hbase shell

    Scan of the tables with t.scan in the hbase shell:

    hbase(main):008:0> t=get_table 'hbventes'
    base(main):011:0> t.describe
    DESCRIPTION ENABLED
    'hbventes', {NAME => 'f', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS true
    => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'false', BLOCKSIZE =
    > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
    1 row(s) in 0.0900 seconds
    hbase(main):012:0> t.scan
    ROW COLUMN+CELL
    1001 column=f:custName, timestamp=1433789800767, value=Platini
    1001 column=f:productType, timestamp=1433789800767, value=Menage
    1001 column=f:value, timestamp=1433789800767, value=700
    1002 column=f:custName, timestamp=1433789800767, value=Zidane
    1002 column=f:productType, timestamp=1433789800767, value=Ordinateur
    1002 column=f:value, timestamp=1433789800767, value=800
    2 row(s) in 0.0840 seconds
    hbase(main):013:0>

    hbase(main):008:0> t=get_table 'hbventesclient'
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/local/hbase-0.98.4-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    2015-06-08 23:13:31,897 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    0 row(s) in 0.8650 seconds

    => Hbase::Table - hbventesclient
    hbase(main):009:0> t.scan
    ROW COLUMN+CELL
    Zidane column=f:value, timestamp=1433797707423, value=800
    1 row(s) in 0.2330 seconds

    To be continued

  12. #12
    Joins in Pig. I made two dummy CSV files, no tables: employe and service.

    The join is done on serviceId.
    /tmp/employe.csv
    1,Bordenave,10,M
    2,Dupond,12,M
    3,Lexa,11,F
    4,Doig,10,F

    /tmp/service.csv
    10,sales
    11,purchase
    12,inventory

    The two files are copied to HDFS with the hadoop command:
    hadoop fs -put -f /tmp/employe.csv /tmp
    hadoop fs -put -f /tmp/service.csv /tmp

    pig -x mapreduce -useHCatalog
    grunt> empData = LOAD '/tmp/employe.csv' USING PigStorage(',') AS ( empId:int, empNom:chararray, serviceId:int, genre:chararray);
    2015-06-10 20:20:49,949 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-10 20:20:49,949 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    grunt> serviceData = LOAD '/tmp/service.csv' USING PigStorage(',') AS ( serviceId:int, serviceNom:chararray);
    2015-06-10 20:20:57,628 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-10 20:20:57,628 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    grunt> joinEmpService = JOIN serviceData by serviceId, empData by serviceId;
    grunt> describe joinEmpService;
    joinEmpService: {serviceData::serviceId: int,serviceData::serviceNom: chararray,empData::empId: int,empData::empNom: chararray,empData::serviceId: int,empData::genre: chararray}
    grunt>
    grunt> dump joinEmpService;
    2015-06-10 20:21:15,190 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: HASH_JOIN
    2015-06-10 20:21:15,210 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-10 20:21:15,210 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-10 20:21:15,210 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
    2015-06-10 20:21:15,210 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
    2015-06-10 20:21:15,215 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
    2015-06-10 20:21:15,217 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer - Rewrite: POPackage->POForEach to POPackage(JoinPackager)
    2015-06-10 20:21:15,217 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
    2015-06-10 20:21:15,217 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
    2015-06-10 20:21:15,225 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-10 20:21:15,227 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-10 20:21:15,228 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
    2015-06-10 20:21:15,229 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
    2015-06-10 20:21:15,229 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers.
    2015-06-10 20:21:15,229 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
    2015-06-10 20:21:15,233 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=91
    2015-06-10 20:21:15,233 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
    2015-06-10 20:21:15,233 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
    2015-06-10 20:21:15,299 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/hive-metastore-0.14.0.jar to DistributedCache through /tmp/temp103144561/tmp-1252284296/hive-metastore-0.14.0.jar
    2015-06-10 20:21:15,332 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/libthrift-0.9.0.jar to DistributedCache through /tmp/temp103144561/tmp-364917371/libthrift-0.9.0.jar
    2015-06-10 20:21:15,424 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/hive-exec-0.14.0.jar to DistributedCache through /tmp/temp103144561/tmp47921849/hive-exec-0.14.0.jar
    2015-06-10 20:21:15,457 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/libfb303-0.9.0.jar to DistributedCache through /tmp/temp103144561/tmp227102308/libfb303-0.9.0.jar
    2015-06-10 20:21:15,491 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/jdo-api-3.0.1.jar to DistributedCache through /tmp/temp103144561/tmp-209625498/jdo-api-3.0.1.jar
    2015-06-10 20:21:15,524 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/lib/hive-hbase-handler-0.14.0.jar to DistributedCache through /tmp/temp103144561/tmp1304119021/hive-hbase-handler-0.14.0.jar
    2015-06-10 20:21:15,557 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/hcatalog/share/hcatalog/hive-hcatalog-core-0.14.0.jar to DistributedCache through /tmp/temp103144561/tmp-828728074/hive-hcatalog-core-0.14.0.jar
    2015-06-10 20:21:15,582 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/hive/hcatalog/share/hcatalog/hive-hcatalog-pig-adapter-0.14.0.jar to DistributedCache through /tmp/temp103144561/tmp1761929676/hive-hcatalog-pig-adapter-0.14.0.jar
    2015-06-10 20:21:15,624 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/pig-0.14.0-core-h2.jar to DistributedCache through /tmp/temp103144561/tmp-1160520263/pig-0.14.0-core-h2.jar
    2015-06-10 20:21:15,657 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp103144561/tmp-1409813105/automaton-1.11-8.jar
    2015-06-10 20:21:15,682 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/lib/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp103144561/tmp-747331071/antlr-runtime-3.4.jar
    2015-06-10 20:21:15,715 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.14.0/lib/joda-time-2.1.jar to DistributedCache through /tmp/temp103144561/tmp-1024208495/joda-time-2.1.jar
    2015-06-10 20:21:15,724 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
    2015-06-10 20:21:15,725 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
    2015-06-10 20:21:15,725 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
    2015-06-10 20:21:15,725 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
    2015-06-10 20:21:15,743 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
    2015-06-10 20:21:15,745 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-10 20:21:15,782 [JobControl] WARN org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
    2015-06-10 20:21:15,845 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
    2015-06-10 20:21:15,845 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
    2015-06-10 20:21:15,847 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
    2015-06-10 20:21:15,854 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
    2015-06-10 20:21:15,854 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
    2015-06-10 20:21:15,856 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
    2015-06-10 20:21:15,957 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:2
    2015-06-10 20:21:16,007 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1433956492399_0009
    2015-06-10 20:21:16,011 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
    2015-06-10 20:21:16,066 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1433956492399_0009
    2015-06-10 20:21:16,070 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://stargate:8088/proxy/applicati...56492399_0009/
    2015-06-10 20:21:16,244 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1433956492399_0009
    2015-06-10 20:21:16,244 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases empData,joinEmpService,serviceData
    2015-06-10 20:21:16,244 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: serviceData[8,14],serviceData[-1,-1],joinEmpService[9,17],empData[7,10],empData[-1,-1],joinEmpService[9,17] C: R:
    2015-06-10 20:21:16,249 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
    2015-06-10 20:21:16,249 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1433956492399_0009]
    2015-06-10 20:21:28,265 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
    2015-06-10 20:21:28,265 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1433956492399_0009]
    2015-06-10 20:21:43,285 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1433956492399_0009]
    2015-06-10 20:21:46,293 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-10 20:21:46,300 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-10 20:21:46,458 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-10 20:21:46,463 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-10 20:21:46,505 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-10 20:21:46,510 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-10 20:21:46,542 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
    2015-06-10 20:21:46,542 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

    HadoopVersion PigVersion UserId StartedAt FinishedAt Features
    2.6.0 0.14.0 hduser 2015-06-10 20:21:15 2015-06-10 20:21:46 HASH_JOIN

    Success!

    Job Stats (time in seconds):
    JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
    job_1433956492399_0009 2 1 3 3 3 3 12 12 12 12 empData,joinEmpService,serviceData HASH_JOIN hdfs://stargate:9000/tmp/temp103144561/tmp-1484661101,

    Input(s):
    Successfully read 5 records from: "/tmp/employe.csv"
    Successfully read 3 records from: "/tmp/service.csv"

    Output(s):
    Successfully stored 4 records (131 bytes) in: "hdfs://stargate:9000/tmp/temp103144561/tmp-1484661101"

    Counters:
    Total records written : 4
    Total bytes written : 131
    Spillable Memory Manager spill count : 0
    Total bags proactively spilled: 0
    Total records proactively spilled: 0

    Job DAG:
    job_1433956492399_0009


    2015-06-10 20:21:46,543 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-10 20:21:46,547 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-10 20:21:46,579 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-10 20:21:46,582 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-10 20:21:46,610 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at stargate/192.168.0.11:8032
    2015-06-10 20:21:46,614 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
    2015-06-10 20:21:46,649 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 3 time(s).
    2015-06-10 20:21:46,649 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
    2015-06-10 20:21:46,649 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
    2015-06-10 20:21:46,650 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    2015-06-10 20:21:46,650 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
    2015-06-10 20:21:46,657 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
    2015-06-10 20:21:46,658 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
    (10,sales ,4,Doig,10,F)
    (10,sales ,1,Bordenave,10,M)
    (11,purchase,3,Lexa,11,F)
    (12,inventory,2,Dupond,12,M)

  13. #13
    ORDER BY in Pig, on a Hive table via the HCatalog metastore:

    ventes = LOAD 'jbedb.ventes' USING org.apache.hive.hcatalog.pig.HCatLoader;
    ordreVentes = ORDER ventes by value DESC;
    dump ordreVentes;
    Using the DataFu toolbox:

    http://datafu.incubator.apache.org/d...tatistics.html

    Creation of an input file:
    0
    1
    2
    3
    4
    3
    2
    Copy to HDFS with hadoop fs -put -f input /tmp

    Computing the median:
    REGISTER /usr/local/datafu/lib/datafu-1.2.0.jar
    DEFINE Median datafu.pig.stats.StreamingMedian();
    data = LOAD '/tmp/input' using PigStorage() as (val:int);
    data = FOREACH (GROUP data ALL) GENERATE Median(data);
    dump data;
    Computing quantiles:
    ventes = LOAD 'jbedb.ventes' USING org.apache.hive.hcatalog.pig.HCatLoader;
    ordreVentes = ORDER ventes by value DESC;
    dump ordreVentes;
    2015-06-10 21:45:47,821 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
    (1001,Platini,Menage,2000,)
    (1002,Zidane,Menage,1000,)
    (1002,Zidane,Ordinateur,1000,)
    (1002,Zidane,Ordinateur,800,)
    (1001,Platini,Menage,700,)
    (1002,Zidane,Ordinateur,600,)
    (1001,Platini,Menage,600,)
    (1001,Platini,Ordinateur,500,)
    (1002,Zidane,Menage,500,)
    REGISTER /usr/local/datafu/lib/datafu-1.2.0.jar
    DEFINE Quantile datafu.pig.stats.Quantile( '0.0', '0.25', '0.5', '0.75', '1.0' );
    quantData = FOREACH ( GROUP ordreVentes ALL ) GENERATE Quantile(ordreVentes.value);
    describe quantData;
    quantData: {(quantile_0_0: double,quantile_0_25: double,quantile_0_5: double,quantile_0_75: double,quantile_1_0: double)}

    dump quantData;
    Log extract after the computation:
    2015-06-10 21:30:40,541 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
    ((500.0,600.0,700.0,1000.0,2000.0))

  14. #14
    Using UDFs (user-defined functions) in Pig

    http://pig.apache.org/docs/r0.14.0/udf.html


    REGISTER /home/hduser/udf/jbetest/genreLibelle.jar
    empData = LOAD '/tmp/employe.csv' USING PigStorage(',') AS ( empId:int, empNom:chararray, serviceId:int, genre:chararray);
    dump empData
    (1,Bordenave,10,M)
    (2,Dupond,12,M)
    (3,Lexa,11,F)
    (4,Doig,10,F)
    (,,,)
    We're going to apply a UDF to display Homme/Femme instead of M/F.

    genre = FOREACH empData generate empId, jbetest.genreLibelle(genre);

    dump genre
    2015-06-10 23:09:44,158 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
    (1,Homme)
    (2,Homme)
    (3,Femme)
    (4,Femme)
    (,)
    grunt>
    The Java program. One subtlety: I have an empty line in my file, so I'm forced to test for null.

    package jbetest;

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Pig EvalFunc: turns the genre code (F / anything else) into a label (Femme / Homme).
    public class genreLibelle extends EvalFunc<String> {

        public String exec(Tuple input) throws IOException {
            // Nothing in the tuple: nothing to translate.
            if (input == null || input.size() == 0) return null;

            try {
                String str = (String) input.get(0);
                // Null field (the blank line in the CSV): keep it null.
                if (str == null) return null;
                if (str.equals("F")) {
                    return "Femme";
                } else {
                    return "Homme";
                }
            } catch (Exception e) {
                throw new IOException("caught exception processing input row", e);
            }
        }
    }
    ./compilepig.sh jbetest/genreLibelle.java

    The compilepig.sh script:

    #!/bin/bash
    set -x
    if [ "$1" == "" ]; then
    echo "Usage: $0 <java file>"
    exit 1
    fi

    CNAME=${1%.java}
    JARNAME=$CNAME.jar
    JARDIR=/tmp/pig_jars/$CNAME
    CLASSPATH=$(ls $PIG_HOME/pig*h1.jar):$(ls $PIG_HOME/pig*h2.jar):$(ls $HADOOP_HOME/share/hadoop/common/hadoop-common-?.?.?.jar)

    mkdir -p $JARDIR
    javac -classpath $CLASSPATH -d $JARDIR/ $1 && jar -cf $JARNAME -C $JARDIR/ .
    To be continued

  15. #15
    First Java MapReduce. I didn't want to do a word count like the ones you find everywhere, done to death.

    I made a test CSV file, one quiz per candidate with a score per question; I'm going to aggregate these results in the output file. I use ':' as the separator.

    cat quizcandidat.csv
    1:S DeSuza:Q4:0
    2:JP Durand:Q2:2
    3 dupond:Q3:0
    2:JP Durand:Q4:1
    2:JP Durand:Q5:5
    3:C. Martin:Q1:0
    2:JP Durand:Q2:4
    2:JP Durand:Q3:1
    1:S DeSuza:Q1:1
    2:JP Durand:Q4:0
    2:JP Durand:Q5:1
    3 dupond:Q1:2
    3 dupond:Q2:3
    2:JP Durand:Q3:0
    3 dupond:Q4:1
    3 dupond:Q5:3
    1:S DeSuza:Q2:2
    2:JP Durand:Q2:4
    1:S DeSuza:Q3:3
    1:S DeSuza:Q5:1
    2:JP Durand:Q1:3
    The code. To understand it, when looking at the mapper you have to reason in the order of the type parameters, KEYIN, VALUEIN, KEYOUT, VALUEOUT, with their types; the reducer is a bit trickier.
    package jbetest;
    
    import java.io.IOException;
    import java.util.StringTokenizer;
    
    import jbetest.QuizScore.QuizMapper;
    import jbetest.QuizScore.QuizReducer;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    public class QuizScore {
    
    	public static class QuizMapper extends
    			Mapper<LongWritable, Text, Text, LongWritable> {
    
    		private Text word = new Text();
    
    		private LongWritable scoreQ = new LongWritable();
    
    		public void map(LongWritable key, Text value, Context context)
    				throws IOException, InterruptedException {
    
    			if (value.getLength() > 0) {
    				String[] line = value.toString().split(":");
    				try {
    					word.set(line[1]);
    					scoreQ.set(Long.parseLong(line[3]));
    
    					context.write(word, scoreQ);
    				} catch (NumberFormatException e) {
    					// cannot parse - ignore
    				}
    			}
    		}
    	}
    
    	public static class QuizReducer<KEY> extends
    			Reducer<KEY, LongWritable, KEY, LongWritable> {
    
    		private LongWritable result = new LongWritable();
    
    		public void reduce(KEY key, Iterable<LongWritable> values,
    				Context context) throws IOException, InterruptedException {
    			long sum = 0;
    			for (LongWritable val : values) {
    				sum += val.get();
    			}
    			result.set(sum);
    			context.write(key, result);
    		}
    
    	}
    
    	public static void main(String[] args) throws Exception {
    
    		if (args.length != 2) {
    			System.err.println("Usage: QuizScore  <input path> <output path>");
    			System.exit(-1);
    		}
    
    		for (String arg : args) {
    			System.out.println("arg=" + arg);
    		}
    
    		Configuration conf = new Configuration();
    		Job job = Job.getInstance(conf, "quizscoring");
    
    		job.setJarByClass(QuizScore.class);
    
    		job.setMapperClass(QuizMapper.class);
    		job.setCombinerClass(QuizReducer.class);
    		job.setReducerClass(QuizReducer.class);
    
    		job.setOutputKeyClass(Text.class);
    		job.setOutputValueClass(LongWritable.class);
    
    		FileInputFormat.addInputPath(job, new Path(args[0]));
    
    		FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
    		boolean result = job.waitForCompletion(true);
    
    		System.exit(result ? 0 : 1);
    
    	}
    
    }
    The compilemr.sh script:
    #!/bin/bash
    set -x
    if [ "$1" == "" ]; then
    echo "Usage: $0 <java file>"
    exit 1
    fi

    CNAME=${1%.java}
    JARNAME=$CNAME.jar
    JARDIR=/tmp/mr_jars/$CNAME
    CLASSPATH=$(ls $HADOOP_HOME/share/hadoop/common/hadoop-common-?.?.?.jar):$(ls $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-?.?.?.jar):$(ls $HADOOP_HOME/share/hadoop/common/lib/hadoop-annotations-?.?.?.jar)
    echo $CLASSPATH
    mkdir -p $JARDIR
    javac -classpath $CLASSPATH -d $JARDIR/ $1 && jar -cf $JARNAME -C $JARDIR/ .
    Execution: the quizcandidat.csv file must first have been copied to HDFS; the output directory will be created, and the job is rejected if it already exists on HDFS.
    hadoop jar jbetest/QuizScore.jar jbetest.QuizScore /tmp/mapreduce/input/quizcandidat.csv /tmp/mapreduce/output
    arg=/tmp/mapreduce/input/quizcandidat.csv
    arg=/tmp/mapreduce/output
    15/06/15 23:48:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    15/06/15 23:48:12 INFO client.RMProxy: Connecting to ResourceManager at stargate/192.168.0.11:8032
    15/06/15 23:48:12 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
    15/06/15 23:48:13 INFO input.FileInputFormat: Total input paths to process : 1
    15/06/15 23:48:13 INFO mapreduce.JobSubmitter: number of splits:1
    15/06/15 23:48:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1434388221382_0018
    15/06/15 23:48:13 INFO impl.YarnClientImpl: Submitted application application_1434388221382_0018
    15/06/15 23:48:13 INFO mapreduce.Job: The url to track the job: http://stargate:8088/proxy/applicati...88221382_0018/
    15/06/15 23:48:13 INFO mapreduce.Job: Running job: job_1434388221382_0018
    15/06/15 23:48:29 INFO mapreduce.Job: Job job_1434388221382_0018 running in uber mode : false
    15/06/15 23:48:29 INFO mapreduce.Job:  map 0% reduce 0%
    15/06/15 23:48:34 INFO mapreduce.Job:  map 100% reduce 0%
    15/06/15 23:48:40 INFO mapreduce.Job:  map 100% reduce 100%
    15/06/15 23:48:40 INFO mapreduce.Job: Job job_1434388221382_0018 completed successfully
    15/06/15 23:48:40 INFO mapreduce.Job: Counters: 49
            File System Counters
                    FILE: Number of bytes read=86
                    FILE: Number of bytes written=214259
                    FILE: Number of read operations=0
                    FILE: Number of large read operations=0
                    FILE: Number of write operations=0
                    HDFS: Number of bytes read=484
                    HDFS: Number of bytes written=49
                    HDFS: Number of read operations=6
                    HDFS: Number of large read operations=0
                    HDFS: Number of write operations=2
            Job Counters
                    Launched map tasks=1
                    Launched reduce tasks=1
                    Data-local map tasks=1
                    Total time spent by all maps in occupied slots (ms)=3168
                    Total time spent by all reduces in occupied slots (ms)=5862
                    Total time spent by all map tasks (ms)=3168
                    Total time spent by all reduce tasks (ms)=2931
                    Total vcore-seconds taken by all map tasks=3168
                    Total vcore-seconds taken by all reduce tasks=2931
                    Total megabyte-seconds taken by all map tasks=4866048
                    Total megabyte-seconds taken by all reduce tasks=9004032
            Map-Reduce Framework
                    Map input records=22
                    Map output records=21
                    Map output bytes=378
                    Map output materialized bytes=86
                    Input split bytes=126
                    Combine input records=21
                    Combine output records=4
                    Reduce input groups=4
                    Reduce shuffle bytes=86
                    Reduce input records=4
                    Reduce output records=4
                    Spilled Records=8
                    Shuffled Maps =1
                    Failed Shuffles=0
                    Merged Map outputs=1
                    GC time elapsed (ms)=57
                    CPU time spent (ms)=1610
                    Physical memory (bytes) snapshot=1070522368
                    Virtual memory (bytes) snapshot=5298221056
                    Total committed heap usage (bytes)=1404043264
            Shuffle Errors
                    BAD_ID=0
                    CONNECTION=0
                    IO_ERROR=0
                    WRONG_LENGTH=0
                    WRONG_MAP=0
                    WRONG_REDUCE=0
            File Input Format Counters
                    Bytes Read=358
            File Output Format Counters
                    Bytes Written=49
    Output

    hduser@stargate:~/mapreduce$ hadoop fs -ls /tmp/mapreduce/output/
    Found 2 items
    -rw-r--r-- 1 hduser supergroup 0 2015-06-15 23:48 /tmp/mapreduce/output/_SUCCESS
    -rw-r--r-- 1 hduser supergroup 49 2015-06-15 23:48 /tmp/mapreduce/output/part-r-00000

    The result file:
    hduser@stargate:~/mapreduce$ hadoop fs -cat /tmp/mapreduce/output/part-r-00000
    C. Martin 0
    JP Durand 21
    P dupond 9
    S DeSuza 7

    hduser@stargate:~/mapreduce$

    Now I'll have to look at how I can filter/sort; a first idea is sketched below.
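
    For the filtering part, a first idea (just a sketch, not something I have run yet) is to filter directly in the mapper, driven by a parameter placed on the Configuration; the property name quiz.minScore below is one I made up, it would be set in main() with conf.setLong("quiz.minScore", ...). For the sorting part, MapReduce only sorts the keys that reach the reducers, so sorting by total score would need either a single reducer or a second job.

    // Sketch: a variant of QuizMapper that filters inside the mapper.
    // It would sit next to QuizMapper inside the QuizScore class and reuses the same imports.
    public static class FilteringQuizMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {

        private long minScore;                               // threshold, read once per task
        private final Text word = new Text();
        private final LongWritable scoreQ = new LongWritable();

        @Override
        protected void setup(Context context) {
            // "quiz.minScore" is an invented property name, not a standard Hadoop one.
            minScore = context.getConfiguration().getLong("quiz.minScore", 0L);
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] line = value.toString().split(":");
            if (line.length < 4) {
                return;                                       // skip empty or malformed lines
            }
            try {
                long score = Long.parseLong(line[3]);
                if (score >= minScore) {                      // the actual filter
                    word.set(line[1]);
                    scoreQ.set(score);
                    context.write(word, scoreQ);
                }
            } catch (NumberFormatException e) {
                // non-numeric score: ignore the line
            }
        }
    }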

    To be continued

  16. #16
    Well, I'm hitting a new wall. My MapReduce jobs work fine when they are executed directly on the cluster, but now I'm trying to run a MapReduce as a client from Eclipse, and it crashes with a class not found.

    I've seen some plugins, but I only half like that approach: it requires a local Hadoop installation, and I'm on Windows 7 in client mode, working remotely against my Linux cluster.

    The communication with the HDFS/YARN cluster is established fine: it sees my files, it creates my result directory, and then bang!

    It crashes with class not found on the local mapper class of my main class. Huh? Yet they are defined as static inside my main class. Weird.

    Something escapes me; I tried putting them in separate classes and took the same slap.

    At first sight, I thought I could send my serialized classes to my Linux cluster by using job.setJarByClass(QuizScore.class), the class that contains my mapper/reducer classes; they are after all defined as static and public, hence visible from outside.

    I hope I won't have to ship the jar to the cluster, because that would not be very convenient to use.

    Job job = Job.getInstance(conf, "quizscoring");

    job.setJarByClass(QuizScore.class);
    // job.setJar("quizScore.jar");

    job.setJobName("quizscoring");

    job.setUser("hduser");
    job.setNumReduceTasks(4);

    job.setMapperClass(QuizMapper.class);
    // job.setCombinerClass(QuizReducer.class);
    job.setReducerClass(QuizReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));

    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    boolean result = job.waitForCompletion(true);
    System.out.println("End result=" + result);
    arg=/usr/hadoop/
    arg=/usr/hadoop/result
    log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration.deprecation).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class jbetest.QuizScore$QuizMapper not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
    at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:186)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:742)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
    Caused by: java.lang.ClassNotFoundException: Class jbetest.QuizScore$QuizMapper not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2072)
    ... 8 more
    Either there's something I haven't understood about running as a client, or I have a configuration problem somewhere.

    Any ideas? As usual, I'll have to dig to find out where this comes from; I'll come back once I've found it.


    To be continued

  17. #17
    Here I am again.

    It's exactly what I feared: I have to feed it the jar, otherwise it's not happy. It doesn't seem able to make do with just the classes, which is a pity, but it is understandable if other classes have to be shipped besides the mapper/reducer implementations.

    With Eclipse, you must add the project jar to the job configuration in addition to the mapper/reducer classes. At execution time, also watch the permissions of the hadoop directories, otherwise the job is rejected with permission denied (see the example command right below).
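
    For example (illustration only, adapt the path to your own layout), opening up the working directory on HDFS is usually enough:

    hadoop fs -chmod -R 775 /tmp/mapreduce

    Another option is to submit under the owner of the directories, for instance by setting the HADOOP_USER_NAME environment variable on the client (simple authentication only).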

    So here is the infamous hard-coded call in the job description; all that remains is to make it more configurable (a small sketch follows the line below). Without it my client does not work. I'm not using any plugin, this is raw, direct code.

    job.setJar("/Users/bordi/.m2/repository/QuizScore/QuizScore/0.0.1-SNAPSHOT/QuizScore-0.0.1-SNAPSHOT.jar");
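
    To avoid the hard-coded path, a small sketch of what I have in mind (the quizscore.jar property name is my own invention, not a Hadoop setting): read the path from a JVM system property set in the Eclipse launch configuration, with the current path as a fallback.

    // e.g. add -Dquizscore.jar=C:/path/to/QuizScore-0.0.1-SNAPSHOT.jar to the VM arguments
    String jarPath = System.getProperty("quizscore.jar",
            "/Users/bordi/.m2/repository/QuizScore/QuizScore/0.0.1-SNAPSHOT/QuizScore-0.0.1-SNAPSHOT.jar");
    job.setJar(jarPath);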

    I set a single reducer, otherwise the output is spread over as many result files as there are reducers; here I get a single result file.
    My input file on hdfs:

    hduser@stargate:/usr/local/hadoop/logs$ hadoop fs -cat /tmp/mapreduce/input/quizcandidat.csv
    1:S DeSuza:Q4:0
    2:JP Durand:Q2:2
    3:P dupond:Q3:0
    2:JP Durand:Q4:1
    2:JP Durand:Q5:5
    3:C. Martin:Q1:0
    2:JP Durand:Q2:4
    2:JP Durand:Q3:1
    1:S DeSuza:Q1:1
    2:JP Durand:Q4:0
    2:JP Durand:Q5:1
    3:P dupond:Q1:2
    3:P dupond:Q2:3
    2:JP Durand:Q3:0
    3:P dupond:Q4:1
    3:P dupond:Q5:3
    1:S DeSuza:Q2:2
    2:JP Durand:Q2:4
    1:S DeSuza:Q3:3
    1:S DeSuza:Q5:1
    2:JP Durand:Q1:3


    hduser@stargate:/usr/local/hadoop/logs$ hadoop fs -ls /tmp/mapreduce/result
    Found 2 items
    -rw-r--r-- 3 bordi supergroup 0 2015-06-19 17:37 /tmp/mapreduce/result/_SUCCESS
    -rw-r--r-- 3 bordi supergroup 49 2015-06-19 17:37 /tmp/mapreduce/result/part-r-00000

    hduser@stargate:/usr/local/hadoop/logs$ hadoop fs -cat /tmp/mapreduce/result/part-r-00000
    C. Martin 0
    JP Durand 21
    P dupond 9
    S DeSuza 7


    Here is the code to run from Eclipse on Windows 7 as a client against a Linux cluster.
    package jbetest;
    
    import java.io.IOException;
    
    import java.util.StringTokenizer;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.Reducer.Context;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    
    public class QuizScore {
    
    	public static class QuizMapper extends
    			Mapper<LongWritable, Text, Text, LongWritable> {
    
    		private Text word = new Text();
    
    		private LongWritable scoreQ = new LongWritable();
    
    		@Override
    		public void map(LongWritable key, Text value, Context context)
    				throws IOException, InterruptedException {
    
    			if (value.getLength() > 0) {
    				String[] line = value.toString().split(":");
    				try {
    					word.set(line[1]);
    					scoreQ.set(Long.parseLong(line[3]));
    
    					context.write(word, scoreQ);
    				} catch (NumberFormatException e) {
    					// cannot parse - ignore
    				}
    			}
    		}
    	}
    
    	public static class QuizReducer<KEY> extends
    			Reducer<KEY, LongWritable, KEY, LongWritable> {
    
    		private LongWritable result = new LongWritable();
    
    		public void reduce(KEY key, Iterable<LongWritable> values,
    				Context context) throws IOException, InterruptedException {
    			long sum = 0;
    			for (LongWritable val : values) {
    				sum += val.get();
    			}
    			result.set(sum);		
    			context.write(key, result);
    		}
    
    	}
    
    	public static void main(String[] args) throws Exception {
    
    		if (args.length != 2) {
    			System.err.println("Usage: QuizScore  <input path> <output path>");
    			System.exit(-1);
    		}
    
    		for (String arg : args) {
    			System.out.println("arg=" + arg);
    		}
    
    		Configuration conf = new Configuration();
    
    		// conf.set("fs.defaultFS", "hdfs://192.168.0.11:9000");
    
    		conf.set("fs.default.name", "hdfs://192.168.0.11:9000");
    		conf.set("yarn.resourcemanager.scheduler.address", "192.168.0.11:8030");
    		conf.set("yarn.resourcemanager.resource-tracker.address",
    				"192.168.0.11:8031");
    		conf.set("yarn.resourcemanager.address", "192.168.0.11:8032");
    
    		conf.set("mapreduce.jobhistory.address", "192.168.0.11:10020");
    
    		conf.set("mapred.job.tracker", "192.168.0.11:54311");
    
    		conf.set("mapreduce.framework.name", "yarn");
    
    		conf.set("mapreduce.app-submission.cross-platform", "true");
    		conf.set("hadoop.job.ugi", "hduser");
    
    
    		Job job = Job.getInstance(conf, "quizscoring");
    
    		job.setJarByClass(QuizScore.class); // not sufficient on its own for remote submission; setJar must also be called
    	
    		job.setJobName("quizscoring");
    
    		job.setUser("hduser");
    		job.setNumReduceTasks(1);
    
    		job.setMapperClass(QuizMapper.class);
    		job.setCombinerClass(QuizReducer.class);
    		job.setReducerClass(QuizReducer.class);
    
    		job.setOutputKeyClass(Text.class);
    		job.setOutputValueClass(LongWritable.class);
    		
    		job.setMapOutputKeyClass(Text.class);
    		job.setMapOutputValueClass(LongWritable.class);
    		
    		job.setInputFormatClass(TextInputFormat.class);
    		job.setOutputFormatClass(TextOutputFormat.class);
    		job.setJar("/Users/bordi/.m2/repository/QuizScore/QuizScore/0.0.1-SNAPSHOT/QuizScore-0.0.1-SNAPSHOT.jar"); // hard-coded path, really not nice
    			
    		FileInputFormat.addInputPath(job, new Path(args[0]));
    
    		FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
    		boolean result = job.waitForCompletion(true);
    		System.out.println("End result=" + result);
    		System.exit(result ? 0 : 1);
    
    	}
    
    }
    job history

    http://192.168.0.11:19888/jobhistory/app

    2015.06.19 17:36:36 CEST 2015.06.19 17:36:40 CEST 2015.06.19 17:37:04 CEST job_1434718393926_0017 quizscoring bordi default SUCCEEDED 1 1 1 1


    Another version of the code, this time using ToolRunner, so the job can eventually be handled through an oozie workflow and not only seen in the jobhistory.

    
    
    import java.security.PrivilegedExceptionAction;
    
    import jbetest.QuizScore.QuizMapper;
    import jbetest.QuizScore.QuizReducer;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.security.UserGroupInformation;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    
    
    public class QuizScoreJob extends Configured implements Tool{
    
    	 static Configuration conf = new Configuration();
    	 
        public static void main(String[] args) throws Exception {
        	
           
            
            int res = ToolRunner.run(conf, new QuizScoreJob(), args);
            System.exit(res);       
        }
    
        
        public int run(String[] args) throws Exception {
    
        		try {
        				UserGroupInformation ugi
        				= UserGroupInformation.createRemoteUser("hduser");
      		      	
    			    	ugi.doAs(new PrivilegedExceptionAction<Void>() {
    			
    			        	public Void run() throws Exception {
    			        		
    			        	System.out.println("debut");
    			        			
    			        	
    					        String input="/tmp/mapreduce/input/quizcandidat.csv";
    					        String output="/tmp/mapreduce/result2";
    					        
    
    					        conf.set("fs.default.name", "hdfs://192.168.0.11:9000");
    							conf.set("yarn.resourcemanager.scheduler.address", "192.168.0.11:8030");
    							conf.set("yarn.resourcemanager.resource-tracker.address",
    									"192.168.0.11:8031");
    							conf.set("yarn.resourcemanager.address", "192.168.0.11:8032");
    
    							conf.set("mapreduce.jobhistory.address", "192.168.0.11:10020");
    
    							conf.set("mapred.job.tracker", "192.168.0.11:54311");
    
    							conf.set("mapreduce.framework.name", "yarn");
    
    							conf.set("mapreduce.app-submission.cross-platform", "true");
    							conf.set("hadoop.job.ugi", "hduser");
    
    					  
    					        Job job =  Job.getInstance(conf);
    					        job.setJarByClass(QuizScoreJob.class);
    					        job.setJobName("Job QuizScore");
    					        
    					    	job.setUser("hduser");
    							job.setNumReduceTasks(1);
    
    							job.setMapperClass(QuizMapper.class);
    							job.setCombinerClass(QuizReducer.class);
    							job.setReducerClass(QuizReducer.class);
    
    							job.setOutputKeyClass(Text.class);
    							job.setOutputValueClass(LongWritable.class);
    							
    							job.setMapOutputKeyClass(Text.class);
    							job.setMapOutputValueClass(LongWritable.class);
    							
    							job.setInputFormatClass(TextInputFormat.class);
    							job.setOutputFormatClass(TextOutputFormat.class);
    							
    					        // Job Input path
    					        FileInputFormat.addInputPath(job, new  
    					        Path(input)); 	
    					        // Job Output path
    					        FileOutputFormat.setOutputPath(job, new 
    					        Path(output)); 
    					        job.setJar("/Users/bordi/.m2/repository/QuizScore/QuizScore/0.0.1-SNAPSHOT/QuizScore-0.0.1-SNAPSHOT.jar");
    				
    					        System.out.println("call submit");
    					
    					        boolean bool=job.waitForCompletion(true);
    					        
    					        System.out.println(job.getSchedulingInfo());
    					        
    					        System.out.println("find boolean job="+bool);
    							return null; /// return run
    						      
    					        }} );// run
    			     	
    			              
    			    } catch (Exception e) {
    			             e.printStackTrace();
    			    }
    
        		return 0;
        }
    }
    One important thing: here are the dependencies I use to run my project, the rest is standard.

    
    
    	<dependencies>
    		<dependency>
    			<groupId>org.apache.hadoop</groupId>
    			<artifactId>hadoop-common</artifactId>
    			<version>2.6.0</version>
    		</dependency>
    		<dependency>
    			<groupId>org.apache.hadoop</groupId>
    			<artifactId>hadoop-mapreduce-client-core</artifactId>
    			<version>2.6.0</version>
    		</dependency>
    		<dependency>
    			<groupId>org.apache.hadoop</groupId>
    			<artifactId>hadoop-mapreduce-client-jobclient</artifactId>
    			<version>2.6.0</version>
    		</dependency>
    			<dependency>
    				<groupId>org.apache.hadoop</groupId>
    				<artifactId>hadoop-mapreduce-client-shuffle</artifactId>
    				<version>2.6.0</version>
    			</dependency>
    
    		<dependency>
    			<groupId>org.apache.hadoop</groupId>
    			<artifactId>hadoop-hdfs</artifactId>
    			<version>2.6.0</version>
    		</dependency>
    
    	</dependencies>
    To be continued

  18. #18
    Using hive-jdbc from Eclipse to send queries to the hive server: it works quite well, and it's easier than going through map reduce.

    Watch out for version consistency, hadoop does not forgive.

    Hive version 0.14
    hive jdbc driver 1.1.1 (org.apache.hive.jdbc.HiveDriver)


    
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    
    public class TestHiveRemote {
    	private static String driverName = "org.apache.hive.jdbc.HiveDriver";
    
    	public static void main(String[] args) throws SQLException {
    
    		try {
    			// Register driver and create driver instance
    			Class.forName(driverName);
    		} catch (ClassNotFoundException ex) {
    
    		}
    
    		// get connection
    		System.out.println("before trying to connect");
    		Connection con = DriverManager.getConnection(
    				"jdbc:hive2://192.168.0.11:10000/jbedb", "hduser", "servus");
    		System.out.println("connected");
    
    		// create statement
    		Statement stmt = con.createStatement();
    		// show tables
    
    		drop_table_consultant(stmt);
    		
    		create_table_consultant(stmt);
    		
    		describe_table(stmt);
    		
    		load_data_consultant(stmt);
    		
    		select_consultant(stmt);
    		
    		con.close();
    		System.out.println("=============================");
    		System.out.println("fini");
    
    	}
    
    	static void show_tables(Statement stmt) throws SQLException {
    
    		String sql = "show tables";
    		System.out.println("Running: " + sql);
    		ResultSet res = stmt.executeQuery(sql);
    
    		while (res.next()) {
    
    			System.out.println("str=" + res.getString("tab_name"));
    		}
    
    	}
    	static void drop_table_consultant( Statement stmt) throws SQLException {
    		System.out.println("===========DROP==============");
    		// execute statement
    		stmt.execute("DROP TABLE IF EXISTS "
    				+ " consultant");
    
    		System.out.println("Table employee droped.");
    	}
    	static void create_table_consultant( Statement stmt) throws SQLException {
    		System.out.println("===========CREATE============");
    		// execute statement
    		stmt.execute("CREATE TABLE IF NOT EXISTS "
    				+ " consultant ( eid int, nom String, salaire String, job String, serviceId String )"
    				+ " ROW FORMAT DELIMITED" + " FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'" 
    				+ " STORED AS TEXTFILE ");
    
    		System.out.println("Table employee created.");
    	}
    	
    	
    	static void describe_table(Statement stmt) throws SQLException {
    
    		System.out.println("=========DESCRIBE============");
    		String sql = "desc consultant";
    		System.out.println("describe : " + sql);
    		ResultSet res = stmt.executeQuery(sql);
    
    		while (res.next()) {
    
    			System.out.println(res.getString("col_name")+" "+res.getString("data_type"));
    		}
    
    	}
    	
    	static void load_data_consultant( Statement stmt) throws SQLException {
    		System.out.println("===========LOAD==============");
    		stmt.execute("LOAD DATA LOCAL INPATH '/tmp/consultant.txt' INTO TABLE  consultant");
    	}
    	
    	static void select_consultant( Statement stmt) throws SQLException {
    		System.out.println("===========SELECT============");
    		System.out.println("list consultant");
    		ResultSet res=stmt.executeQuery("SELECT * FROM consultant");
    		
    		while (res.next() ) {
    			
    			System.out.println( res.getInt("eid")+" "+res.getString("nom")+" "+res.getString("salaire")+" "+res.getString("job"));
    		}
    	}
    }
    dependencies
    <dependencies>
    		<dependency>
    			<groupId>org.apache.hadoop</groupId>
    			<artifactId>hadoop-common</artifactId>
    			<version>2.6.0</version>
    		</dependency>
    		<dependency>
    			<groupId>org.apache.hadoop</groupId>
    			<artifactId>hadoop-mapreduce-client-core</artifactId>
    			<version>2.6.0</version>
    		</dependency>
    		<dependency>
    			<groupId>org.apache.hadoop</groupId>
    			<artifactId>hadoop-mapreduce-client-jobclient</artifactId>
    			<version>2.6.0</version>
    		</dependency>
    			<dependency>
    				<groupId>org.apache.hadoop</groupId>
    				<artifactId>hadoop-mapreduce-client-shuffle</artifactId>
    				<version>2.6.0</version>
    			</dependency>
    
    		<dependency>
    			<groupId>org.apache.hadoop</groupId>
    			<artifactId>hadoop-hdfs</artifactId>
    			<version>2.6.0</version>
    		</dependency>
    		
    		<dependency>
    			<groupId>org.apache.hive</groupId>
    			<artifactId>hive-jdbc</artifactId>
    			<version>1.1.1</version>
    		</dependency>
    Console output
    connected
    ===========DROP==============
    Table employee droped.
    ===========CREATE============
    Table employee created.
    =========DESCRIBE============
    describe : desc consultant
    eid int
    nom string
    salaire string
    job string
    serviceid string
    ===========LOAD==============
    ===========SELECT============
    list consultant
    1 S.Dupont 450000 Developpeur
    1 J.Milou 350000 integrateur
    1 V.Marin 370000 Manager
    =============================
    fini

    Pig and hbase remain to be looked at.

    To be continued

  19. #19
    As you can guess, I work with Eclipse in remote mode against my Linux cluster; that works fine for java map reduce jobs and for java jobs using hive jdbc.

    Now I'm working on pig jobs from Eclipse and it's a bit more complicated; quite a bit of tinkering is required.

    First, pig has to be compiled for the hadoop version you are working with, otherwise it assumes it is running on hadoop1 and everything blows up when my program executes, because the interfaces are incompatible.

    You can see the errors by adding a log4j.properties, otherwise you see nothing at all.

    You have to fetch the sources on Windows and build at the root of the project with the following command; it can generate different jars depending on the version:

    ant hadoopversion=23 jar

    Then you have to pick up the generated pig snapshot; in my case I installed it manually into my local m2 repository and added the dependency to my pom.xml (a possible install command is sketched below).
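
    For the record, the manual install into the local repository can be done with the standard Maven command, along these lines (jar path to adapt; the coordinates match the dependency declared in the excerpt below):

    mvn install:install-file -Dfile=pig-0.14.0-SNAPSHOT.jar -DgroupId=org.apache.pig -DartifactId=pig -Dversion=0.14.0-SNAPSHOT -Dpackaging=jar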

    pom.xml excerpt
    <dependencies>
    		<dependency>
    			<groupId>org.apache.hadoop</groupId>
    			<artifactId>hadoop-common</artifactId>
    			<version>2.6.0</version>
    		</dependency>
    		<dependency>
    			<groupId>org.apache.hadoop</groupId>
    			<artifactId>hadoop-mapreduce-client-core</artifactId>
    			<version>2.6.0</version>
    		</dependency>
    		<dependency>
    			<groupId>org.apache.hadoop</groupId>
    			<artifactId>hadoop-mapreduce-client-jobclient</artifactId>
    			<version>2.6.0</version>
    		</dependency>
    		<dependency>
    			<groupId>org.apache.hadoop</groupId>
    			<artifactId>hadoop-mapreduce-client-shuffle</artifactId>
    			<version>2.6.0</version>
    		</dependency>
    
    		<dependency>
    			<groupId>org.apache.hadoop</groupId>
    			<artifactId>hadoop-hdfs</artifactId>
    			<version>2.6.0</version>
    		</dependency>
    
    		<dependency>
    			<groupId>org.apache.hive</groupId>
    			<artifactId>hive-jdbc</artifactId>
    			<version>1.1.1</version>
    		</dependency>
    
    		<dependency>
    			<groupId>org.apache.pig</groupId>
    			<artifactId>pig</artifactId>
    			<version>0.14.0-SNAPSHOT</version>
    		</dependency>
    
    		<dependency>
    			<groupId>joda-time</groupId>
    			<artifactId>joda-time</artifactId>
    			<version>2.8.1</version>
    		</dependency>
    
    		<dependency>
    			<groupId>dk.brics.automaton</groupId>
    			<artifactId>automaton</artifactId>
    			<version>1.11-8</version>
    		</dependency>
    
    	</dependencies>
    Note that I'm using pig 0.14 and the build fails on a jetty dependency; this was fixed in pig 0.15, so I had to back-port the change from the 0.15 build.xml into the 0.14 one by comparing the two files. After that I was able to produce the 0.14.0-SNAPSHOT jar.

    Now that this problem is solved, I was finally able to launch a remote job on my cluster with yarn... and hit the next snag.

    Note that the application history now sees my job, so that's progress; before, I couldn't even get the task to launch in the cluster. Pig seems more demanding in remote mode than plain map reduce.

    Strange; something is missing on the hadoop side.

    Cluster log (yarn log): it fails on "/bin/bash: ligne 0 : fg: pas de contrôle de tâche" (no job control).

    2015-06-21 11:48:52,718 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application Attempt appattempt_1434868270867_0003_000002 is done. finalState=FAILED
    2015-06-21 11:48:52,718 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1434868270867_0003 failed 2 times due to AM Container for appattempt_1434868270867_0003_000002 exited with exitCode: 1
    For more detailed output, check application tracking page:http://stargate:8088/proxy/applicati...0867_0003/Then, click on links to logs of each attempt.
    Diagnostics: Exception from container-launch.
    Container id: container_1434868270867_0003_02_000001
    Exit code: 1
    Exception message: /bin/bash: ligne 0 : fg: pas de contrôle de tâche

    Stack trace: ExitCodeException exitCode=1: /bin/bash: ligne 0 : fg: pas de contrôle de tâche

    at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
    at org.apache.hadoop.util.Shell.run(Shell.java:455)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)


    Container exited with a non-zero exit code 1
    Failing this attempt. Failing the application.
    2015-06-21 11:48:52,719 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: Application application_1434868270867_0003 requests cleared
    My code
    import java.io.IOException;
    
    import java.util.List;
    import java.util.Properties;
    
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;
    import org.apache.pig.backend.executionengine.ExecException;
    import org.apache.pig.backend.executionengine.ExecJob;
    	
    public class TestPigRemote {
    
    	public static class idmapreduce{
    	   public static void main(String[] args) throws IOException {
    	   
    	   System.out.println("début");
    		
    		 Properties props = new Properties();
    		 props.setProperty("fs.default.name", "hdfs://192.168.0.11:9000");
    		 props.setProperty("mapred.job.tracker", "192.168.0.11:54311");
    	
    		 
    		 PigServer pigServer = new PigServer(ExecType.MAPREDUCE, props);
    		  
    		  
    	   try {
    		   
    	   
    		  runIdQueryBatch(pigServer, "/tmp/consultant.txt");
    		
    	     Thread.sleep(30000);
    
    	   }
    	   catch(Exception e) {	
    		   e.printStackTrace();
    	   }
    	   if (pigServer.isBatchOn() ) {
    		   pigServer.shutdown();
    	   }
    	   
    	   System.out.println("fin");
    	}
    	   
    
    	   
    	public static void runIdQueryBatch(PigServer pigServer, String inputFile) throws IOException {
    
    		System.out.println("traitement lancement");
    		pigServer.setJobName("test pig remoting");
    		
    		 pigServer.setBatchOn();
    		    pigServer.debugOn();
    		    pigServer.setValidateEachStatement(true);
    
    	     runIdQuery( pigServer,  inputFile);
    	     
    	     List<ExecJob> jobs=pigServer.executeBatch(true);
    	     
    	     System.out.println("wait end job complete");
    	     for ( ExecJob job : jobs  ) {
    	    	 
    		     while ( job.hasCompleted() == false ) {
    		    	 try {
    					Thread.sleep(10000);
    				} catch (InterruptedException e) {
    					continue;
    				}
    		    	
    		     }
    
    	    	 System.out.println("job="+job.getStatus());
    	    	 System.out.println("job="+job.getConfiguration());
    	    	 System.out.println("job="+job.hasCompleted());
    	    	 
    	    	 if (  job.getException() != null ) {
    	    		 job.getException().printStackTrace();
    	    	 }
    	    	 
    	     }
    
    	     
    	     System.out.println("traitement terminé");
    	   }
    	}
    	
    	   
    	   public static void runIdQuery(PigServer pigServer, String inputFile) throws IOException {
    			pigServer.setValidateEachStatement(true);
    		     pigServer.registerQuery("A = load '"+inputFile+"' using PigStorage(',');");
    		     pigServer.registerQuery("B = foreach A generate $0 as id;");
    		     pigServer.store("B", "/tmp/idout.txt");
    		   
    		}
    }
    Eclipse console log after running the program.

    début
    log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration.deprecation).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    2    [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine  - Connecting to hadoop file system at: hdfs://192.168.0.11:9000
    711  [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine  - Connecting to map-reduce job tracker at: 192.168.0.11:54311
    traitement lancement
    854  [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Original macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))))
    
    854  [main] DEBUG org.apache.pig.parser.QueryParserDriver  - macro AST after import:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))))
    
    854  [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Resulting macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))))
    
    5672 [main] DEBUG org.apache.pig.builtin.JsonMetadata  - Could not find schema file for /tmp/consultant.txt
    5695 [main] DEBUG org.apache.pig.JVMReuseImpl  - Method cleanupStaticData in class class org.apache.pig.impl.util.UDFContext registered for static data cleanup
    5696 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Original macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))))
    
    5696 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - macro AST after import:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))))
    
    5696 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Resulting macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))))
    
    5708 [main] DEBUG org.apache.pig.builtin.JsonMetadata  - Could not find schema file for /tmp/consultant.txt
    5713 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Original macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    5713 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - macro AST after import:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    5713 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Resulting macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    5731 [main] DEBUG org.apache.pig.builtin.JsonMetadata  - Could not find schema file for /tmp/consultant.txt
    5758 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Original macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    5758 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - macro AST after import:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    5759 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Resulting macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    5770 [main] DEBUG org.apache.pig.builtin.JsonMetadata  - Could not find schema file for /tmp/consultant.txt
    5771 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Original macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    5771 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - macro AST after import:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    5771 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Resulting macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    5782 [main] DEBUG org.apache.pig.builtin.JsonMetadata  - Could not find schema file for /tmp/consultant.txt
    5795 [main] INFO  org.apache.pig.tools.pigstats.ScriptState  - Pig features used in the script: UNKNOWN
    5832 [main] INFO  org.apache.pig.data.SchemaTupleBackend  - Key [pig.schematuple] was not set... will not generate code.
    5858 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer  - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
    5903 [main] DEBUG org.apache.pig.JVMReuseImpl  - Method staticDataCleanup in class class org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator registered for static data cleanup
    5922 [main] DEBUG org.apache.pig.impl.util.SpillableMemoryManager  - Found heap (Code Cache) of type Non-heap memory
    5922 [main] DEBUG org.apache.pig.impl.util.SpillableMemoryManager  - Found heap (PS Eden Space) of type Heap memory
    5923 [main] DEBUG org.apache.pig.impl.util.SpillableMemoryManager  - Found heap (PS Survivor Space) of type Heap memory
    5923 [main] DEBUG org.apache.pig.impl.util.SpillableMemoryManager  - Found heap (PS Old Gen) of type Heap memory
    5923 [main] DEBUG org.apache.pig.impl.util.SpillableMemoryManager  - Found heap (PS Perm Gen) of type Non-heap memory
    5923 [main] DEBUG org.apache.pig.impl.util.SpillableMemoryManager  - Selected heap to monitor (PS Old Gen)
    5925 [main] DEBUG org.apache.pig.impl.logicalLayer.schema.Schema$FieldSchema  - t: 50 Bag: 120 tuple: 110
    5927 [main] DEBUG org.apache.pig.impl.logicalLayer.schema.Schema$FieldSchema  - t: 50 Bag: 120 tuple: 110
    5929 [main] DEBUG org.apache.pig.data.SchemaTupleFrontend  - Registering Schema for generation [{bytearray}] with id [0] and context: FOREACH
    5932 [main] DEBUG org.apache.pig.impl.logicalLayer.schema.Schema$FieldSchema  - t: 50 Bag: 120 tuple: 110
    5950 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler  - File concatenation threshold: 100 optimistic? false
    5962 [main] DEBUG org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SampleOptimizer  - Not a sampling job.
    5967 [main] DEBUG org.apache.pig.backend.hadoop.executionengine.util.SecondaryKeyOptimizerUtil  - Cannot find POLocalRearrange or POUnion in map leaf, skip secondary key optimizing
    5974 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer  - MR plan size before optimization: 1
    5974 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer  - MR plan size after optimization: 1
    6182 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.MRScriptState  - Pig script settings are added to the job
    6187 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler  - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
    6189 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler  - This job cannot be converted run in-process
    6528 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler  - Added jar file:/C:/Users/bordi/.m2/repository/org/apache/pig/pig/0.14.0-SNAPSHOT/pig-0.14.0-SNAPSHOT.jar to DistributedCache through /tmp/temp625569534/tmp-921934698/pig-0.14.0-SNAPSHOT.jar
    6566 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler  - Added jar file:/C:/Users/bordi/.m2/repository/dk/brics/automaton/automaton/1.11-8/automaton-1.11-8.jar to DistributedCache through /tmp/temp625569534/tmp1321168983/automaton-1.11-8.jar
    6599 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler  - Added jar file:/C:/Users/bordi/.m2/repository/org/antlr/antlr-runtime/3.4/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp625569534/tmp180981958/antlr-runtime-3.4.jar
    6649 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler  - Added jar file:/C:/Users/bordi/.m2/repository/com/google/guava/guava/11.0.2/guava-11.0.2.jar to DistributedCache through /tmp/temp625569534/tmp484504265/guava-11.0.2.jar
    6691 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler  - Added jar file:/C:/Users/bordi/.m2/repository/joda-time/joda-time/2.8.1/joda-time-2.8.1.jar to DistributedCache through /tmp/temp625569534/tmp512577559/joda-time-2.8.1.jar
    6739 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler  - Setting up single store job
    6744 [main] DEBUG org.apache.pig.data.SchemaTupleFrontend  - Temporary directory for generated code created: C:\Users\bordi\AppData\Local\Temp\1434884627289-0
    6744 [main] INFO  org.apache.pig.data.SchemaTupleFrontend  - Key [pig.schematuple] is false, will not generate code.
    6744 [main] INFO  org.apache.pig.data.SchemaTupleFrontend  - Starting process to move generated code to distributed cacche
    6745 [main] INFO  org.apache.pig.data.SchemaTupleFrontend  - Setting key [pig.schematuple.classes] with classes to deserialize []
    6769 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - 1 map-reduce job(s) waiting for submission.
    6770 [JobControl] DEBUG org.apache.pig.backend.hadoop23.PigJobControl  - Checking state of job job name:	
    job id:	job_pigexec_0
    job state:	WAITING
    job mapred id:	null
    job message:	just initialized
    job has no depending job:	
    
    7026 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil  - Total input paths to process : 1
    7040 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil  - Total input paths (combined) to process : 1
    7463 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - HadoopJobId: job_1434868270867_0006
    7464 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - Processing aliases A,B
    7464 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - detailed locations: M: A[1,4],B[1,57] C:  R: 
    7470 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - 0% complete
    7470 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - Running jobs are [job_1434868270867_0006]
    12463 [JobControl] DEBUG org.apache.pig.backend.hadoop23.PigJobControl  - Checking state of job job name:	N/A
    job id:	job_pigexec_0
    job state:	RUNNING
    job mapred id:	job_1434868270867_0006
    job message:	just initialized
    job has no depending job:	
    
    12480 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
    12480 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - job job_1434868270867_0006 has failed! Stop running all dependent jobs
    12480 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - 100% complete
    12554 [main] DEBUG org.apache.pig.tools.pigstats.PigStats  - unable to set backend exception
    12554 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil  - 1 map reduce job(s) failed!
    12559 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats  - Script Statistics: 
    
    HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
    2.6.0	0.14.0-SNAPSHOT	bordi	2015-06-21 13:03:46	2015-06-21 13:03:53	UNKNOWN
    
    Failed!
    
    Failed Jobs:
    JobId	Alias	Feature	Message	Outputs
    job_1434868270867_0006	A,B	MAP_ONLY	Message: Job failed!	/tmp/idout.txt,
    
    Input(s):
    Failed to read data from "/tmp/consultant.txt"
    
    Output(s):
    Failed to produce result in "/tmp/idout.txt"
    
    Counters:
    Total records written : 0
    Total bytes written : 0
    Spillable Memory Manager spill count : 0
    Total bags proactively spilled: 0
    Total records proactively spilled: 0
    
    Job DAG:
    job_1434868270867_0006
    
    
    12559 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - Failed!
    12562 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Original macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    12562 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - macro AST after import:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    12562 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Resulting macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    12569 [main] DEBUG org.apache.pig.builtin.JsonMetadata  - Could not find schema file for /tmp/consultant.txt
    wait end job complete
    job=FAILED
    job=true
    traitement terminé
    fin
    42606 [Thread-0] DEBUG org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - Receive kill signal
    I feel I'm not far off; more digging needed, but I'm used to it. I'll be back.

    To be continued

  20. #20
    Here I am again.

    Well, I finally found it. It's not that simple to explain in the context of pig jobs: it is indeed a configuration problem, but one that has to be defined correctly on the client side and mirrored on the cluster side, notably for the classpath. I had already hit this with remote java map reduce; for remote pig jobs even more information has to be provided.

    First, you need to have locally the configuration files of the cluster you want to connect to (a sketch of how the client can load them follows the XML below):

    core-site.xml
    hdfs-site.xml
    log4j.properties
    mapred-site.xml
    yarn-site.xml

    But a few extra elements have to be added to the local/cluster configurations to allow remote tasks to be executed in the cluster.

    mapred-site.xml

    The classpath (local/cluster) and cross-platform support (local client) have to be added:

    <property>
    <name>mapreduce.application.classpath</name>
    <value>
    /usr/local/hadoop/etc/hadoop/*,
    /usr/local/hadoop/share/hadoop/common/*,
    /usr/local/hadoop/share/hadoop/common/lib/*,
    /usr/local/hadoop/share/hadoop/hdfs/*,
    /usr/local/hadoop/share/hadoop/hdfs/lib/*,
    /usr/local/hadoop/share/hadoop/mapreduce/*,
    /usr/local/hadoop/share/hadoop/mapreduce/lib/*,
    /usr/local/hadoop/share/hadoop/yarn/*,
    /usr/local/hadoop/share/hadoop/yarn/lib/*
    </value>
    </property>

    <property>
    <name>mapred.remote.os</name>
    <value>Linux</value>
    <description>Remote MapReduce framework's OS, can be either Linux or
    Windows
    </description>
    </property>

    <property>
    <name>mapreduce.app-submission.cross-platform</name>
    <value>true</value>
    </property>
    yarn-site.xml

    The classpath has to be added:

    <property>
    <name>yarn.application.classpath</name>
    <value>
    /usr/local/hadoop/etc/hadoop/*,
    /usr/local/hadoop/share/hadoop/common/*,
    /usr/local/hadoop/share/hadoop/common/lib/*,
    /usr/local/hadoop/share/hadoop/hdfs/*,
    /usr/local/hadoop/share/hadoop/hdfs/lib/*,
    /usr/local/hadoop/share/hadoop/mapreduce/*,
    /usr/local/hadoop/share/hadoop/mapreduce/lib/*,
    /usr/local/hadoop/share/hadoop/yarn/*,
    /usr/local/hadoop/share/hadoop/yarn/lib/*
    </value>
    </property>
    </configuration>
    When the yarn container runs, it must be able to find the jars it needs.

    Also make sure the job history server is started on all the nodes of the cluster, otherwise the tasks fail.
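
    As a side note, here is a minimal sketch (my own wiring; the C:/hadoop-conf paths are hypothetical) of how the client can load those copied files explicitly instead of repeating conf.set(...) calls. Putting the folder on the Eclipse project classpath also works, since Hadoop picks up the *-site.xml files from the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class ClusterConf {

        // Load the cluster configuration files copied onto the client machine.
        // The paths below are hypothetical; adapt them to where the copies live.
        public static Configuration load() {
            Configuration conf = new Configuration();
            conf.addResource(new Path("file:///C:/hadoop-conf/core-site.xml"));
            conf.addResource(new Path("file:///C:/hadoop-conf/hdfs-site.xml"));
            conf.addResource(new Path("file:///C:/hadoop-conf/mapred-site.xml"));
            conf.addResource(new Path("file:///C:/hadoop-conf/yarn-site.xml"));
            return conf;
        }
    }

    The resulting Configuration can then be passed to Job.getInstance(conf, ...), and for pig the same key/value pairs can be copied into the Properties given to PigServer.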

    Here is my code, slightly modified.

    	import java.io.IOException;
    	
    	import java.util.List;
    	import java.util.Properties;
    	
    	import org.apache.pig.ExecType;
    	import org.apache.pig.PigServer;
    	import org.apache.pig.backend.executionengine.ExecException;
    	import org.apache.pig.backend.executionengine.ExecJob;
    		
    	public class TestPigRemote {
    	
    		public static class idmapreduce{
    		   public static void main(String[] args) throws IOException {
    		   
    		   System.out.println("début");
    			
    			 Properties props = new Properties();
    			 props.setProperty("fs.default.name", "hdfs://192.168.0.11:9000");
    			 props.setProperty("mapred.job.tracker", "192.168.0.11:54311");
    			 props.setProperty("mapreduce.jobhistory.address", "192.168.0.11:10020");
    		
    			 
    			 PigServer pigServer = new PigServer(ExecType.MAPREDUCE, props);
    			  
    			  
    		   try {
    			   
    		   
    			  runIdQueryBatch(pigServer, "/tmp/consultant.txt");
    			
    		     Thread.sleep(30000);
    	
    		   }
    		   catch(Exception e) {	
    			   e.printStackTrace();
    		   }
    		   if (pigServer.isBatchOn() ) {
    			   pigServer.shutdown();
    		   }
    		   
    		   System.out.println("fin");
    		}
    		   
    	
    		   
    		public static void runIdQueryBatch(PigServer pigServer, String inputFile) throws IOException {
    	
    			System.out.println("traitement lancement");
    			pigServer.setJobName("testPigRemoting");
    			
    			 pigServer.setBatchOn();
    			    pigServer.debugOn();
    			    pigServer.setValidateEachStatement(true);
    	
    		     runIdQuery( pigServer,  inputFile);
    		     
    		     List<ExecJob> jobs=pigServer.executeBatch(true);
    		     
    		     System.out.println("wait end job complete");
    		     for ( ExecJob job : jobs  ) {
    		    	 
    			     while ( job.hasCompleted() == false ) {
    			    	 try {
    						Thread.sleep(10000);
    					} catch (InterruptedException e) {
    						continue;
    					}
    			    	
    			     }
    	
    		    	 System.out.println("job="+job.getStatus());
    		    	 System.out.println("job="+job.getConfiguration());
    		    	 System.out.println("job="+job.hasCompleted());
    		    	 
    		    	 if (  job.getException() != null ) {
    		    		 job.getException().printStackTrace();
    		    	 }
    		    	 
    		     }
    	
    		     
    		     System.out.println("traitement terminé");
    		   }
    		}
    		
    		   
    		   public static void runIdQuery(PigServer pigServer, String inputFile) throws IOException {
    				pigServer.setValidateEachStatement(true);
    			     pigServer.registerQuery("A = load '"+inputFile+"' using PigStorage(',');");
    			     pigServer.registerQuery("B = foreach A generate $0 as id;");
    			     pigServer.store("B", "/tmp/idout");
    			   
    			}
    	}

    Success logs
    début
    log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration.deprecation).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    2    [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine  - Connecting to hadoop file system at: hdfs://192.168.0.11:9000
    696  [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine  - Connecting to map-reduce job tracker at: 192.168.0.11:54311
    traitement lancement
    838  [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Original macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))))
    
    838  [main] DEBUG org.apache.pig.parser.QueryParserDriver  - macro AST after import:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))))
    
    838  [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Resulting macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))))
    
    6002 [main] DEBUG org.apache.pig.builtin.JsonMetadata  - Could not find schema file for /tmp/consultant.txt
    6025 [main] DEBUG org.apache.pig.JVMReuseImpl  - Method cleanupStaticData in class class org.apache.pig.impl.util.UDFContext registered for static data cleanup
    6025 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Original macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))))
    
    6025 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - macro AST after import:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))))
    
    6025 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Resulting macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))))
    
    6037 [main] DEBUG org.apache.pig.builtin.JsonMetadata  - Could not find schema file for /tmp/consultant.txt
    6043 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Original macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    6043 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - macro AST after import:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    6043 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Resulting macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    6060 [main] DEBUG org.apache.pig.builtin.JsonMetadata  - Could not find schema file for /tmp/consultant.txt
    6087 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Original macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    6087 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - macro AST after import:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    6087 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Resulting macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    6099 [main] DEBUG org.apache.pig.builtin.JsonMetadata  - Could not find schema file for /tmp/consultant.txt
    6100 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Original macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    6100 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - macro AST after import:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    6100 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Resulting macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    6110 [main] DEBUG org.apache.pig.builtin.JsonMetadata  - Could not find schema file for /tmp/consultant.txt
    6123 [main] INFO  org.apache.pig.tools.pigstats.ScriptState  - Pig features used in the script: UNKNOWN
    6160 [main] INFO  org.apache.pig.data.SchemaTupleBackend  - Key [pig.schematuple] was not set... will not generate code.
    6186 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer  - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
    6232 [main] DEBUG org.apache.pig.JVMReuseImpl  - Method staticDataCleanup in class class org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator registered for static data cleanup
    6251 [main] DEBUG org.apache.pig.impl.util.SpillableMemoryManager  - Found heap (Code Cache) of type Non-heap memory
    6251 [main] DEBUG org.apache.pig.impl.util.SpillableMemoryManager  - Found heap (PS Eden Space) of type Heap memory
    6251 [main] DEBUG org.apache.pig.impl.util.SpillableMemoryManager  - Found heap (PS Survivor Space) of type Heap memory
    6251 [main] DEBUG org.apache.pig.impl.util.SpillableMemoryManager  - Found heap (PS Old Gen) of type Heap memory
    6251 [main] DEBUG org.apache.pig.impl.util.SpillableMemoryManager  - Found heap (PS Perm Gen) of type Non-heap memory
    6251 [main] DEBUG org.apache.pig.impl.util.SpillableMemoryManager  - Selected heap to monitor (PS Old Gen)
    6253 [main] DEBUG org.apache.pig.impl.logicalLayer.schema.Schema$FieldSchema  - t: 50 Bag: 120 tuple: 110
    6256 [main] DEBUG org.apache.pig.impl.logicalLayer.schema.Schema$FieldSchema  - t: 50 Bag: 120 tuple: 110
    6258 [main] DEBUG org.apache.pig.data.SchemaTupleFrontend  - Registering Schema for generation [{bytearray}] with id [0] and context: FOREACH
    6261 [main] DEBUG org.apache.pig.impl.logicalLayer.schema.Schema$FieldSchema  - t: 50 Bag: 120 tuple: 110
    6278 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler  - File concatenation threshold: 100 optimistic? false
    6291 [main] DEBUG org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SampleOptimizer  - Not a sampling job.
    6296 [main] DEBUG org.apache.pig.backend.hadoop.executionengine.util.SecondaryKeyOptimizerUtil  - Cannot find POLocalRearrange or POUnion in map leaf, skip secondary key optimizing
    6303 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer  - MR plan size before optimization: 1
    6303 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer  - MR plan size after optimization: 1
    6511 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.MRScriptState  - Pig script settings are added to the job
    6517 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler  - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
    6518 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler  - This job cannot be converted run in-process
    6788 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler  - Added jar file:/C:/Users/bordi/.m2/repository/org/apache/pig/pig/0.14.0-SNAPSHOT/pig-0.14.0-SNAPSHOT.jar to DistributedCache through /tmp/temp321181890/tmp-517751113/pig-0.14.0-SNAPSHOT.jar
    6826 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler  - Added jar file:/C:/Users/bordi/.m2/repository/dk/brics/automaton/automaton/1.11-8/automaton-1.11-8.jar to DistributedCache through /tmp/temp321181890/tmp-488374958/automaton-1.11-8.jar
    6859 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler  - Added jar file:/C:/Users/bordi/.m2/repository/org/antlr/antlr-runtime/3.4/antlr-runtime-3.4.jar to DistributedCache through /tmp/temp321181890/tmp-1730105890/antlr-runtime-3.4.jar
    6917 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler  - Added jar file:/C:/Users/bordi/.m2/repository/com/google/guava/guava/11.0.2/guava-11.0.2.jar to DistributedCache through /tmp/temp321181890/tmp796875788/guava-11.0.2.jar
    6959 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler  - Added jar file:/C:/Users/bordi/.m2/repository/joda-time/joda-time/2.8.1/joda-time-2.8.1.jar to DistributedCache through /tmp/temp321181890/tmp436826936/joda-time-2.8.1.jar
    6991 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler  - Setting up single store job
    6997 [main] DEBUG org.apache.pig.data.SchemaTupleFrontend  - Temporary directory for generated code created: C:\Users\bordi\AppData\Local\Temp\1434888731852-0
    6997 [main] INFO  org.apache.pig.data.SchemaTupleFrontend  - Key [pig.schematuple] is false, will not generate code.
    6997 [main] INFO  org.apache.pig.data.SchemaTupleFrontend  - Starting process to move generated code to distributed cacche
    6997 [main] INFO  org.apache.pig.data.SchemaTupleFrontend  - Setting key [pig.schematuple.classes] with classes to deserialize []
    7021 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - 1 map-reduce job(s) waiting for submission.
    7023 [JobControl] DEBUG org.apache.pig.backend.hadoop23.PigJobControl  - Checking state of job job name:	
    job id:	job_pigexec_0
    job state:	WAITING
    job mapred id:	null
    job message:	just initialized
    job has no depending job:	
    
    7277 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil  - Total input paths to process : 1
    7291 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil  - Total input paths (combined) to process : 1
    7698 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - HadoopJobId: job_1434886974841_0006
    7698 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - Processing aliases A,B
    7698 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - detailed locations: M: A[1,4],B[1,57] C:  R: 
    7704 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - 0% complete
    7705 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - Running jobs are [job_1434886974841_0006]
    12698 [JobControl] DEBUG org.apache.pig.backend.hadoop23.PigJobControl  - Checking state of job job name:	N/A
    job id:	job_pigexec_0
    job state:	RUNNING
    job mapred id:	job_1434886974841_0006
    job message:	just initialized
    job has no depending job:	
    
    17597 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - 100% complete
    17602 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats  - Script Statistics: 
    
    HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
    2.6.0	0.14.0-SNAPSHOT	bordi	2015-06-21 14:12:11	2015-06-21 14:12:22	UNKNOWN
    
    Success!
    
    Job Stats (time in seconds):
    JobId	Maps	Reduces	MaxMapTime	MinMapTime	AvgMapTime	MedianMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	MedianReducetime	Alias	Feature	Outputs
    job_1434886974841_0006	1	0	2	2	2	2	0	0	0	0	A,B	MAP_ONLY	/tmp/idout,
    
    Input(s):
    Successfully read 3 records (466 bytes) from: "/tmp/consultant.txt"
    
    Output(s):
    Successfully stored 3 records (6 bytes) in: "/tmp/idout"
    
    Counters:
    Total records written : 3
    Total bytes written : 6
    Spillable Memory Manager spill count : 0
    Total bags proactively spilled: 0
    Total records proactively spilled: 0
    
    Job DAG:
    job_1434886974841_0006
    
    
    17632 [main] DEBUG org.apache.pig.backend.hadoop.executionengine.Launcher  - Error message from task (map) task_1434886974841_0006_m_000000
    17656 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - Success!
    17658 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Original macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    17658 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - macro AST after import:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    17658 [main] DEBUG org.apache.pig.parser.QueryParserDriver  - Resulting macro AST:
    (QUERY (STATEMENT A (load '/tmp/consultant.txt' (FUNC PigStorage ','))) (STATEMENT B (foreach A (FOREACH_PLAN_SIMPLE (generate $0 (FIELD_DEF id))))))
    
    17667 [main] DEBUG org.apache.pig.builtin.JsonMetadata  - Could not find schema file for /tmp/consultant.txt
    wait end job complete
    job=COMPLETED
    job=true
    traitement terminé
    fin
    47701 [Thread-0] DEBUG org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher  - Receive kill signal
    The question this raises is that the client needs the cluster's configuration files locally in order to pass them back to the cluster and run the job in its container, including the classpath of the Hadoop jars. It is a rather odd mechanism.
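
    One possible way to avoid copying the *-site.xml files to the client, sketched below under the assumption that the PigServer(ExecType, Properties) constructor is used, is to set the cluster endpoints programmatically. The host and port values are the ones echoed in the log above; the property names are the classic fs.default.name / mapred.job.tracker keys, and a pure YARN setup would likely need additional keys such as yarn.resourcemanager.address and mapreduce.framework.name.

    // Sketch (assumption, not the code used above): point PigServer at the
    // cluster without relying on local Hadoop configuration files.
    Properties props = new Properties();
    props.setProperty("fs.default.name", "hdfs://192.168.0.11:9000");   // HDFS NameNode seen in the log
    props.setProperty("mapred.job.tracker", "192.168.0.11:54311");      // job tracker address seen in the log

    PigServer pigServer = new PigServer(ExecType.MAPREDUCE, props);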

    All that is left now is HBase, and that will wrap up this topic.
    To be continued
