Spark, Cassandra connector and Jupyter on Google Cloud

**fatbob** · 16/06/2018, 20h49

Hello,

I'm trying to get Jupyter working on a Spark/Cassandra cluster hosted on Google Cloud Platform.
The Spark cluster was installed through Google Dataproc; Cassandra and Jupyter were installed by initialization scripts.

Over ssh, no problem: I run "pyspark --packages datastax:spark-cassandra-connector:2.3.0-s_2.11" and everything seems to work correctly.

But I can't get Jupyter to start in such a way that the PySpark kernel uses the Cassandra connector.

I tried editing kernel.json:

Code:

    {
     "argv": [
        "bash",
        "-c",
        "PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS='kernel -f {connection_file}' pyspark"],
     "env": {
        "PYSPARK_SUBMIT_ARGS": "--master local[*] pyspark-shell --packages datastax:spark-cassandra-connector:2.3.0-s_2.11"
     },
     "display_name": "PySpark",
     "language": "python"
    }

But it doesn't seem to work. In Jupyter, I can't find anything related to Cassandra. I get exceptions such as:

java.lang.ClassNotFoundException: Failed to find data source: pyspark.sql.cassandra.

(I tried other things in PYSPARK_SUBMIT_ARGS, and I also tried adding --packages to PYSPARK_DRIVER_PYTHON_OPTS, but that doesn't work either.)
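
One thing worth checking in the kernel.json above: when PySpark reads PYSPARK_SUBMIT_ARGS, the usual convention is that "pyspark-shell" is the last token, because spark-submit treats anything after the primary resource as application arguments rather than as its own options. Here --packages comes after pyspark-shell, so it may simply be ignored. A minimal sketch, assuming the same kernel layout as in the post; the helper names build_submit_args and build_kernel_spec are hypothetical, and the Maven coordinate is the one quoted later in the thread:

```python
import json

# Maven coordinates quoted in the thread; adjust to your Spark/Scala versions.
CONNECTOR = "com.datastax.spark:spark-cassandra-connector_2.11:2.3.0"


def build_submit_args(packages, master="local[*]"):
    """Return a PYSPARK_SUBMIT_ARGS string with 'pyspark-shell' as the last token."""
    return "--master {} --packages {} pyspark-shell".format(master, ",".join(packages))


def build_kernel_spec(packages):
    """Return a kernel.json structure mirroring the one in the post (hypothetical helper)."""
    return {
        "argv": [
            "bash",
            "-c",
            # Left untouched: Jupyter substitutes {connection_file} at launch time.
            "PYSPARK_DRIVER_PYTHON=ipython "
            "PYSPARK_DRIVER_PYTHON_OPTS='kernel -f {connection_file}' pyspark",
        ],
        "env": {"PYSPARK_SUBMIT_ARGS": build_submit_args(packages)},
        "display_name": "PySpark",
        "language": "python",
    }


if __name__ == "__main__":
    # Print the JSON so it can be compared with the existing kernel.json.
    print(json.dumps(build_kernel_spec([CONNECTOR]), indent=1))
```

The thread does not establish whether the pyspark launcher script honors a pre-set PYSPARK_SUBMIT_ARGS at all, so passing --packages directly on the pyspark command inside argv is another variant worth trying.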

Does anyone have an idea?

**fatbob** · 17/06/2018, 13h16

Actually, when I run "pyspark --packages datastax:spark-cassandra-connector:2.3.0-s_2.11", not everything works correctly after all.
It's only when I run "spark-shell --packages datastax:spark-cassandra-connector:2.3.0-s_2.11" that everything is fine.

I tried "com.datastax.spark:spark-cassandra-connector_2.11:2.3.0" instead, but the result is exactly the same in both pyspark and Jupyter: I can't use the connector.

Apart from a few WARNs, though, everything seems to go fine.

Code:

myuserhome@spark-cluster-m:~$ pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.0
Python 2.7.9 (default, Jun 29 2016, 13:08:31) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /home/myuserhome/.ivy2/cache
The jars for the packages stored in: /home/myuserhome/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.datastax.spark#spark-cassandra-connector_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found com.datastax.spark#spark-cassandra-connector_2.11;2.3.0 in central
        found com.twitter#jsr166e;1.1.0 in central
        found commons-beanutils#commons-beanutils;1.9.3 in central
        found commons-collections#commons-collections;3.2.2 in central
        found joda-time#joda-time;2.3 in central
        found org.joda#joda-convert;1.2 in central
        found io.netty#netty-all;4.0.33.Final in central
        found org.scala-lang#scala-reflect;2.11.8 in central
:: resolution report :: resolve 2615ms :: artifacts dl 86ms
        :: modules in use:
        com.datastax.spark#spark-cassandra-connector_2.11;2.3.0 from central in [default]
        com.twitter#jsr166e;1.1.0 from central in [default]
        commons-beanutils#commons-beanutils;1.9.3 from central in [default]
        commons-collections#commons-collections;3.2.2 from central in [default]
        io.netty#netty-all;4.0.33.Final from central in [default]
        joda-time#joda-time;2.3 from central in [default]
        org.joda#joda-convert;1.2 from central in [default]
        org.scala-lang#scala-reflect;2.11.8 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   8   |   0   |   0   |   0   ||   8   |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        0 artifacts copied, 8 already retrieved (0kB/76ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/06/17 11:08:22 WARN org.apache.hadoop.hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1252)
        at java.lang.Thread.join(Thread.java:1326)
        at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:973)
        at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:624)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:801)
18/06/17 11:08:23 WARN org.apache.hadoop.hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1252)
        at java.lang.Thread.join(Thread.java:1326)
        at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:973)
        at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:624)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:801)
18/06/17 11:08:23 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:/home/myuserhome/.ivy2/jars/com.datastax.spark_spark-cassandra-connector_2.11-2.3.0.jar added multiple times to distributed cache.
18/06/17 11:08:23 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:/home/myuserhome/.ivy2/jars/com.twitter_jsr166e-1.1.0.jar added multiple times to distributed cache.
18/06/17 11:08:23 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:/home/myuserhome/.ivy2/jars/commons-beanutils_commons-beanutils-1.9.3.jar added multiple times to distributed cache.
18/06/17 11:08:23 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:/home/myuserhome/.ivy2/jars/joda-time_joda-time-2.3.jar added multiple times to distributed cache.
18/06/17 11:08:23 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:/home/myuserhome/.ivy2/jars/org.joda_joda-convert-1.2.jar added multiple times to distributed cache.
18/06/17 11:08:23 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:/home/myuserhome/.ivy2/jars/io.netty_netty-all-4.0.33.Final.jar added multiple times to distributed cache.
18/06/17 11:08:23 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:/home/myuserhome/.ivy2/jars/org.scala-lang_scala-reflect-2.11.8.jar added multiple times to distributed cache.
18/06/17 11:08:23 WARN org.apache.spark.deploy.yarn.Client: Same path resource file:/home/myuserhome/.ivy2/jars/commons-collections_commons-collections-3.2.2.jar added multiple times to distributed cache.
18/06/17 11:08:24 WARN org.apache.hadoop.hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1252)
        at java.lang.Thread.join(Thread.java:1326)
        at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:973)
        at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:624)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:801)
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/

Using Python version 2.7.9 (default, Jun 29 2016 13:08:31)
SparkSession available as 'spark'.
>>> import org.apache.spark.sql.cassandra
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named org.apache.spark.sql.cassandra
>>> import pyspark.sql.cassandra
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named cassandra

If anyone has a lead, I'd welcome it...
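
A note on the two import attempts in the session above: org.apache.spark.sql.cassandra is a JVM package provided by the connector jar, not a Python module, so a Python import of it fails even when the jars were resolved correctly. From PySpark the connector is normally reached through the DataFrame reader, with the data source name passed as a string. The earlier "Failed to find data source: pyspark.sql.cassandra" message suggests that a Python-style name was passed where the JVM name is expected. A minimal sketch, assuming the connector jars are on the classpath; read_cassandra_table is a hypothetical helper and the keyspace and table names are placeholders:

```python
# The connector is addressed by its JVM data source name, passed as a string.
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def read_cassandra_table(spark, keyspace, table):
    """Load a Cassandra table as a DataFrame through the DataFrame reader.

    `spark` is the SparkSession that the pyspark shell exposes as `spark`.
    """
    return (
        spark.read
        .format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=table)
        .load()
    )


# Hypothetical usage inside the pyspark shell, with placeholder names:
#   df = read_cassandra_table(spark, "my_keyspace", "my_table")
#   df.show()
```

The connector also needs to know where Cassandra is listening, typically through the spark.cassandra.connection.host setting; the thread does not say how that is configured on this cluster.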
