Hue, the web interface that makes Hadoop easier to use, has just been released in version 3.9

**Romain.net** · 24/08/2015, 20h45

An article in French details all the improvements in this new version of the open source project, in particular the Spark application and the search engine for Solr:

http://gethue.com/hue-3-9-avec-ses-a...sorti/?lang=fr

Feel free to post your feedback here!

**bordi** · 25/08/2015, 10h29

Thanks. I'm on 3.7.1, but since it integrates R, Spark and Solr,

that's good news; I'm going to install it on my Apache Hadoop cluster.

**bordi** · 05/09/2015, 16h24

Well, I installed 3.9.0, and there wasn't too much damage.

First I backed up my 3.7.1 hue.ini and my Pig/Hive scripts, etc.

As I expected, I had to put everything back in place after the installation. I used Beyond Compare
to bring the 3.9.0 hue.ini up to date with my 3.7.1 hue.ini and my Hadoop configuration.
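For anyone without a visual diff tool, the same merge check can be roughed out in a few lines of Python. This is only a sketch: it assumes hue.ini uses `key=value` lines, `#` comments and nested `[section]`, `[[subsection]]` headers, and the helper names `active_settings` and `diff_settings` are made up for the illustration. It lists every setting that was active in the old file but is missing or different in the new one.

```python
def active_settings(text):
    """Return {(section_path, key): value} for uncommented settings."""
    settings = {}
    path = []  # current nested section path, e.g. ["hadoop", "hdfs_clusters"]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and commented-out defaults
        if line.startswith("[") and line.endswith("]"):
            depth = len(line) - len(line.lstrip("["))  # [x] -> 1, [[x]] -> 2
            name = line.strip("[]").strip()
            path = path[: depth - 1] + [name]
            continue
        if "=" in line:
            key, value = line.split("=", 1)
            settings[("/".join(path), key.strip())] = value.strip()
    return settings


def diff_settings(old_text, new_text):
    """List (section, key, old_value, new_value) for settings to carry over."""
    old, new = active_settings(old_text), active_settings(new_text)
    return sorted(
        (section, key, value, new.get((section, key)))
        for (section, key), value in old.items()
        if new.get((section, key)) != value
    )


# Tiny inline sample; in practice, read the two hue.ini files from disk.
old_ini = "[search]\nsolr_url=http://stargate:8983/solr/\n"
new_ini = "[search]\n## solr_url=http://localhost:8983/solr/\n"
for section, key, old_value, new_value in diff_settings(old_ini, new_ini):
    print(section, key, old_value, "->", new_value)
```

A `new_value` of `None` means the setting is still commented out (or absent) in the new file and has to be restored by hand.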

I did run into one problem, though (grumble, grumble); I would have been surprised otherwise.

While running

PREFIX=/home/hue make install

the build could not find the "gmp.h" header.

So a library has to be installed on my Ubuntu 14.04:

sudo apt-get install libgmp3-dev

After that it works better.
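A quick pre-flight check can tell whether the header is present before launching the build. The sketch below is an assumption-laden illustration: the list of include directories is a guess for a typical 64-bit Ubuntu layout, and `find_header` is a hypothetical helper, not part of Hue.

```python
import os

# Assumed candidate directories; adjust for your distribution and architecture.
DEFAULT_INCLUDE_DIRS = (
    "/usr/include",
    "/usr/include/x86_64-linux-gnu",
    "/usr/local/include",
)


def find_header(name, include_dirs=DEFAULT_INCLUDE_DIRS):
    """Return the first existing path to header `name`, or None if absent."""
    for directory in include_dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    path = find_header("gmp.h")
    print(path or "gmp.h not found: install the GMP development package first")
```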

Then I restarted. I lost my Pig scripts, so I'll reinstall them. I also lost the Sqoop connector, which is suspicious. The rest seems to be there and communicating; I need to run a few tests. I do have the new option with Spark and its satellites. Ah, it's in beta. Hmm, we'll see; I don't know whether I'll have problems with Spark 1.4.1.

I'll come back if I hit any snags. 3.8 did not leave me with good memories: I beat a retreat and rolled back to my 3.7.1.

I need to install Solr to get a better look at all this and go through what works; that will take a little time.

My first contact with this one seems pleasant, so I'll finish configuring this version. To be continued.

**bordi** · 05/09/2015, 19h55

I installed Solr and started the cloud with 3 nodes that point to HDFS storage. From Hue
I can access the index of my Solr collection via port 8983,

but oddly I get an HTTP 404 "page not found" on the demo options
weblog, twitter and yelp-review. Huh?

Error 404 Not Found HTTP ERROR 404 Problem accessing /solr/twitter_demo/select. Reason: Not Found Powered by Jetty:// (error 404)

It's as if they were not deployed. Something smells fishy somewhere; I'll have to dig, and there isn't much information.
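One way to confirm that the demo collections really are missing, rather than Hue building a wrong URL, is to ask SolrCloud which collections exist. The sketch below assumes the Solr Collections API `LIST` action is reachable under the same base URL as in hue.ini; `missing_collections` and `check` are hypothetical helper names, and only `twitter_demo` is taken from the error above.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def collections_list_url(solr_url):
    """Build the Collections API LIST URL from a base like http://host:8983/solr/."""
    base = solr_url if solr_url.endswith("/") else solr_url + "/"
    return urljoin(base, "admin/collections?action=LIST&wt=json")


def missing_collections(list_response, wanted):
    """Return the names in `wanted` that the LIST response does not contain."""
    existing = set(json.loads(list_response).get("collections", []))
    return [name for name in wanted if name not in existing]


def check(solr_url, wanted):
    """Query a live SolrCloud (network call) and report missing collections."""
    with urlopen(collections_list_url(solr_url), timeout=10) as resp:
        return missing_collections(resp.read().decode("utf-8"), wanted)
```

Calling `check("http://stargate:8983/solr/", ["twitter_demo"])` against the cluster would return `["twitter_demo"]` if that collection was never created, which would be consistent with the 404 on `/solr/twitter_demo/select`.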

Edit:
On top of that, it smells of Cloudera. In hue.ini, the indexer section calls a deployment tool, solrctl (CDH 5), to manage the applications, and it appears to belong to Cloudera. I hope it isn't proprietary like Impala, since I'm on Apache Hadoop, and that I can work around it.

It's funny: Apache Solr and an open source project (Hue) tied together by a proprietary Cloudera thing ^^
In other words, I'm almost stuck. Impala I can understand; I consider it an optional extension of the Cloudera distribution. But for Apache Solr it's tough and not great, unless Hue is becoming a Cloudera-oriented GUI. This version is not 100% compatible with every Hadoop distribution.

###################################################
# Settings to configure Solr Search
###################################################
[search]

# URL of the Solr Server
solr_url=http://stargate:8983/solr/

# Requires FQDN in solr_url if enabled
security_enabled=false

## Query sent when no term is entered
empty_query=*:*

# Use latest Solr 5.2+ features.
latest=false
###################################################
# Settings to configure Solr Indexer
###################################################

[indexer]
# Location of the solrctl binary.
## solrctl_path=/usr/bin/solrctl

Well, while waiting to find a solution, tomorrow I'll tackle the Spark/R/PySpark/Python part and the Hue GUI. I hope to have better luck.

**bordi** · 06/09/2015, 12h38

OK, I'm back. I started playing with the Spark editor in Hue. Hmm, luckily it's stated that it's beta; that alone worried me in advance. I started with a few lines of Scala with Spark.

It seemed fine at first with a few simple lines, but as soon as the expressions get more complex it ends up freezing or returning nothing, especially when triggering actions on RDDs, such as a simple count() on the Spark cluster.

With PySpark in the Hue editor, if there's a mistake in the code it ends up freezing
without any error console, and I have to kill Hue. In a PySpark/IPython notebook it works fine:

lines = sc.textFile('hdfs://stargate:9000/user/hduser/MERCH_DIM.csv')
lines_nonempty = lines.filter( lambda x: len(x) > 0 )
lines_nonempty.count()

In short, it sort of works.
Given how little code I'm running, it's too unstable to be usable with Apache Hadoop 2.6.0/Spark 1.4.1.

For Scala and Spark, it's the same kind of thing:

val merch_cat_dim_RDD = sc.textFile("/user/hduser/MERCH_CAT_DIM.csv").filter(line
=> line.split(",")(0).forall(_.isDigit)).map(line => line.split(","))
val merch_dim_RDD = sc.textFile("/user/hduser/MERCH_DIM.csv").filter(line =>
line.split(",")(0).forall(_.isDigit)).map(line => line.split(","))
val merch_dim_cat_map_RDD = merch_dim_RDD.map(values => (values(3),values))
val merch_cat_map_RDD = merch_cat_dim_RDD.map( values => (values(0),values))

val joinedRDD = merch_dim_cat_map_RDD.join(merch_cat_map_RDD)
val map_cat_names_RDD = joinedRDD.map(k => (k._2._2(1),1))
map_cat_names_RDD.count()
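For readers who don't have a Spark cluster at hand, here is roughly what that pipeline computes, written as plain Python on in-memory lists. It is a sketch of the logic only, not of Spark's execution: the sample rows and column names are invented, and the only things taken from the Scala code are the numeric-first-field filter, the join of column 3 of the merchant file against column 0 of the category file, and the final count.

```python
from collections import defaultdict


def first_field_is_digits(line):
    # Mirrors line.split(",")(0).forall(_.isDigit); like forall, all() is
    # True for an empty first field.
    return all(ch.isdigit() for ch in line.split(",")[0])


def parse(lines):
    """Drop rows whose first field is not numeric (e.g. a header), then split."""
    return [line.split(",") for line in lines if first_field_is_digits(line)]


def count_joined(merch_lines, cat_lines):
    """Count the pairs an inner join on merch[3] == cat[0] would produce."""
    cats_by_id = defaultdict(list)
    for values in parse(cat_lines):
        cats_by_id[values[0]].append(values)
    # Each merchant row yields one output pair per matching category row.
    return sum(len(cats_by_id.get(values[3], [])) for values in parse(merch_lines))


# Hypothetical sample data standing in for MERCH_DIM.csv and MERCH_CAT_DIM.csv.
merch = ["MERCH_ID,NAME,CITY,CAT_ID", "1,Shop A,Paris,10", "2,Shop B,Lyon,20", "3,Shop C,Nice,99"]
cats = ["CAT_ID,CAT_NAME", "10,Food", "20,Books"]
print(count_joined(merch, cats))
```

With this sample, the header rows are filtered out and the merchant with category 99 finds no match, so only two joined pairs are counted.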

In Hue, with Scala Spark,
as soon as you call count() nobody's home. Houston, we have a problem.

Here is what the same code gives in a regular spark-shell console:

scala> val merch_cat_dim_RDD = sc.textFile("/user/hduser/MERCH_CAT_DIM.csv").filter(line
     | => line.split(",")(0).forall(_.isDigit)).map(line => line.split(","))
15/09/06 11:52:45 INFO MemoryStore: ensureFreeSpace(244456) called with curMem=0, maxMem=1111794647
15/09/06 11:52:45 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 238.7 KB, free 1060.1 MB)
15/09/06 11:52:46 INFO MemoryStore: ensureFreeSpace(20627) called with curMem=244456, maxMem=1111794647
15/09/06 11:52:46 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 20.1 KB, free 1060.0 MB)
15/09/06 11:52:46 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.0.11:57011 (size: 20.1 KB, free: 1060.3 MB)
15/09/06 11:52:46 INFO SparkContext: Created broadcast 0 from textFile at <console>:21
merch_cat_dim_RDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[3] at map at <console>:22

scala> val merch_dim_RDD = sc.textFile("/user/hduser/MERCH_DIM.csv").filter(line =>
     | line.split(",")(0).forall(_.isDigit)).map(line => line.split(","))
15/09/06 11:52:46 INFO MemoryStore: ensureFreeSpace(244496) called with curMem=265083, maxMem=1111794647
15/09/06 11:52:46 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 238.8 KB, free 1059.8 MB)
15/09/06 11:52:46 INFO MemoryStore: ensureFreeSpace(20627) called with curMem=509579, maxMem=1111794647
15/09/06 11:52:46 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 20.1 KB, free 1059.8 MB)
15/09/06 11:52:46 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.0.11:57011 (size: 20.1 KB, free: 1060.3 MB)
15/09/06 11:52:46 INFO SparkContext: Created broadcast 1 from textFile at <console>:21
merch_dim_RDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[7] at map at <console>:22

scala> val merch_dim_cat_map_RDD = merch_dim_RDD.map(values => (values(3),values))
merch_dim_cat_map_RDD: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[8] at map at <console>:23

scala> val merch_cat_map_RDD = merch_cat_dim_RDD.map( values => (values(0),values))
merch_cat_map_RDD: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[9] at map at <console>:23

scala>

scala> val joinedRDD = merch_dim_cat_map_RDD.join(merch_cat_map_RDD)
15/09/06 11:52:47 INFO FileInputFormat: Total input paths to process : 1
15/09/06 11:52:47 INFO FileInputFormat: Total input paths to process : 1
joinedRDD: org.apache.spark.rdd.RDD[(String, (Array[String], Array[String]))] = MapPartitionsRDD[12] at join at <console>:29

scala> val map_cat_names_RDD = joinedRDD.map(k => (k._2._2(1),1))
map_cat_names_RDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[13] at map at <console>:31

scala> map_cat_names_RDD.count()
15/09/06 11:52:49 INFO SparkContext: Starting job: count at <console>:34
15/09/06 11:52:49 INFO DAGScheduler: Registering RDD 8 (map at <console>:23)
15/09/06 11:52:49 INFO DAGScheduler: Registering RDD 9 (map at <console>:23)
15/09/06 11:52:49 INFO DAGScheduler: Got job 0 (count at <console>:34) with 2 output partitions (allowLocal=false)
15/09/06 11:52:49 INFO DAGScheduler: Final stage: ResultStage 2(count at <console>:34)
15/09/06 11:52:49 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0, ShuffleMapStage 1)
15/09/06 11:52:49 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0, ShuffleMapStage 1)
15/09/06 11:52:49 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[8] at map at <console>:23), which has no missing parents
15/09/06 11:52:49 INFO MemoryStore: ensureFreeSpace(3952) called with curMem=530206, maxMem=1111794647
15/09/06 11:52:49 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.9 KB, free 1059.8 MB)
15/09/06 11:52:49 INFO MemoryStore: ensureFreeSpace(2188) called with curMem=534158, maxMem=1111794647
15/09/06 11:52:49 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.1 KB, free 1059.8 MB)
15/09/06 11:52:49 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.0.11:57011 (size: 2.1 KB, free: 1060.2 MB)
15/09/06 11:52:49 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:874
15/09/06 11:52:49 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[8] at map at <console>:23)
15/09/06 11:52:49 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/09/06 11:52:49 INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[9] at map at <console>:23), which has no missing parents
15/09/06 11:52:49 INFO MemoryStore: ensureFreeSpace(3960) called with curMem=536346, maxMem=1111794647
15/09/06 11:52:49 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.9 KB, free 1059.8 MB)
15/09/06 11:52:49 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.0.11, ANY, 1412 bytes)
15/09/06 11:52:49 INFO MemoryStore: ensureFreeSpace(2194) called with curMem=540306, maxMem=1111794647
15/09/06 11:52:49 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.1 KB, free 1059.8 MB)
15/09/06 11:52:49 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 192.168.0.31, ANY, 1412 bytes)
15/09/06 11:52:49 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.0.11:57011 (size: 2.1 KB, free: 1060.2 MB)
15/09/06 11:52:49 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:874
15/09/06 11:52:49 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[9] at map at <console>:23)
15/09/06 11:52:49 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
15/09/06 11:52:49 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, 192.168.0.11, ANY, 1416 bytes)
15/09/06 11:52:49 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, 192.168.0.31, ANY, 1416 bytes)
15/09/06 11:52:50 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.0.11:55594 (size: 2.1 KB, free: 1589.8 MB)
15/09/06 11:52:50 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.0.11:55594 (size: 2.1 KB, free: 1589.8 MB)
15/09/06 11:52:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.0.11:55594 (size: 20.1 KB, free: 1589.7 MB)
15/09/06 11:52:50 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.0.31:49446 (size: 2.1 KB, free: 1589.8 MB)
15/09/06 11:52:50 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.0.11:55594 (size: 20.1 KB, free: 1589.7 MB)
15/09/06 11:52:51 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.0.31:49446 (size: 2.1 KB, free: 1589.8 MB)
15/09/06 11:52:51 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.0.31:49446 (size: 20.1 KB, free: 1589.7 MB)
15/09/06 11:52:51 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 1595 ms on 192.168.0.11 (1/2)
15/09/06 11:52:51 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1655 ms on 192.168.0.11 (1/2)
15/09/06 11:52:51 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.0.31:49446 (size: 20.1 KB, free: 1589.7 MB)
15/09/06 11:52:54 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 5098 ms on 192.168.0.31 (2/2)
15/09/06 11:52:54 INFO DAGScheduler: ShuffleMapStage 1 (map at <console>:23) finished in 5,101 s
15/09/06 11:52:54 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/09/06 11:52:54 INFO DAGScheduler: looking for newly runnable stages
15/09/06 11:52:54 INFO DAGScheduler: running: Set(ShuffleMapStage 0)
15/09/06 11:52:54 INFO DAGScheduler: waiting: Set(ResultStage 2)
15/09/06 11:52:54 INFO DAGScheduler: failed: Set()
15/09/06 11:52:54 INFO DAGScheduler: Missing parents for ResultStage 2: List(ShuffleMapStage 0)
15/09/06 11:52:55 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 5334 ms on 192.168.0.31 (2/2)
15/09/06 11:52:55 INFO DAGScheduler: ShuffleMapStage 0 (map at <console>:23) finished in 5,347 s
15/09/06 11:52:55 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/09/06 11:52:55 INFO DAGScheduler: looking for newly runnable stages
15/09/06 11:52:55 INFO DAGScheduler: running: Set()
15/09/06 11:52:55 INFO DAGScheduler: waiting: Set(ResultStage 2)
15/09/06 11:52:55 INFO DAGScheduler: failed: Set()
15/09/06 11:52:55 INFO DAGScheduler: Missing parents for ResultStage 2: List()
15/09/06 11:52:55 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[13] at map at <console>:31), which is now runnable
15/09/06 11:52:55 INFO MemoryStore: ensureFreeSpace(2856) called with curMem=542500, maxMem=1111794647
15/09/06 11:52:55 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.8 KB, free 1059.8 MB)
15/09/06 11:52:55 INFO MemoryStore: ensureFreeSpace(1551) called with curMem=545356, maxMem=1111794647
15/09/06 11:52:55 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1551.0 B, free 1059.8 MB)
15/09/06 11:52:55 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 192.168.0.11:57011 (size: 1551.0 B, free: 1060.2 MB)
15/09/06 11:52:55 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:874
15/09/06 11:52:55 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 2 (MapPartitionsRDD[13] at map at <console>:31)
15/09/06 11:52:55 INFO TaskSchedulerImpl: Adding task set 2.0 with 2 tasks
15/09/06 11:52:55 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 4, 192.168.0.31, PROCESS_LOCAL, 1238 bytes)
15/09/06 11:52:55 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 5, 192.168.0.32, PROCESS_LOCAL, 1238 bytes)
15/09/06 11:52:55 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 192.168.0.31:49446 (size: 1551.0 B, free: 1589.7 MB)
15/09/06 11:52:55 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 192.168.0.31:46073
15/09/06 11:52:55 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 163 bytes
15/09/06 11:52:55 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to 192.168.0.31:46073
15/09/06 11:52:55 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 163 bytes
15/09/06 11:52:55 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 192.168.0.32:58863 (size: 1551.0 B, free: 1589.8 MB)
15/09/06 11:52:55 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 4) in 672 ms on 192.168.0.31 (1/2)
15/09/06 11:52:56 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 192.168.0.32:43956
15/09/06 11:52:56 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to 192.168.0.32:43956
15/09/06 11:52:57 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 5) in 2735 ms on 192.168.0.32 (2/2)
15/09/06 11:52:57 INFO DAGScheduler: ResultStage 2 (count at <console>:34) finished in 2,739 s
15/09/06 11:52:57 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
15/09/06 11:52:57 INFO DAGScheduler: Job 0 finished: count at <console>:34, took 8,186017 s
res0: Long = 2000

scala>
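To compare runs (for instance spark-shell against a future Hue version), the stage and job durations can be pulled out of such a log. The sketch below assumes the log format shown above, including the comma used as decimal separator in "5,101 s"; the function names are hypothetical.

```python
import re

# "ShuffleMapStage 1 (map at <console>:23) finished in 5,101 s"
STAGE_RE = re.compile(r"(\w+Stage) (\d+) \(.*?\) finished in ([\d.,]+) s")
# "Job 0 finished: count at <console>:34, took 8,186017 s"
JOB_RE = re.compile(r"Job (\d+) finished: .*, took ([\d.,]+) s")


def seconds(text):
    """Parse a duration whose decimal separator may be ',' or '.'."""
    return float(text.replace(",", "."))


def summarize(log_lines):
    """Return ({stage label: seconds}, {job id: seconds}) found in the log."""
    stages, jobs = {}, {}
    for line in log_lines:
        m = STAGE_RE.search(line)
        if m:
            stages[f"{m.group(1)} {m.group(2)}"] = seconds(m.group(3))
            continue
        m = JOB_RE.search(line)
        if m:
            jobs[int(m.group(1))] = seconds(m.group(2))
    return stages, jobs
```

On the log above this would report the two shuffle stages at a little over 5 s each, the result stage at about 2.7 s, and job 0 at about 8.2 s. Note that `seconds` would misread a value that also contains a thousands separator.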

To be clear, from what I tested: there are good ideas in terms of features, and it clearly represents a lot of work, but it's disappointing in practice. It's too dependent on Cloudera for Apache Solr, and it's too unstable to be usable with Hue's Spark/Scala/PySpark editor; it either limps along or gets stuck. No point in my looking at SparkR. Too bad; we'll see with the next version, 3.9.1.

There is still work to do for the Hue team.

Sorry, I'm rolling back to my Hue 3.7.1.

Config: cluster with 1 master + 2 slave nodes, 16 GB RAM per slave node, 24 GB RAM on the master,
Apache Hadoop 2.6, Spark 1.4.1 in cluster mode shared with Hadoop.

That's all from me.
