org.apache.spark.SparkException: Task not serializable

**Djam75** · 07/04/2021, 16h32

Hello,

I need your help to solve a problem that is preventing me from submitting my final-year project and completing my degree.

Code:

    val negReviews2vec = negReviews
      .filter(sentence => sentence.length >= 1)
      .map(sentence => sentence.toLowerCase.split("\\W+"))
      .map { wordSeq =>
        var vSum = Vectors.zeros(vectSize)
        var vNb = 0
        wordSeq.foreach { word =>
          if (!bStopWords.value(word) && word.length >= 2) {
            bVectors.value.get(word).foreach { v =>
              vSum = add(v, vSum)
              vNb += 1
            }
          }
        }
        if (vNb != 0) {
          vSum = scalarMultiply(1.0 / vNb, vSum)
        }
        vSum
      }
      .filter(vec => Vectors.norm(vec, 1.0) > 0.0)
      .persist()

This code fails with the following trace:

org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@13c0ccc5)

Thanks in advance for your help.
Djam

**Pyramidev** · 07/04/2021, 23h58

Hello,

In the code you posted, is any of the variables vectSize and bStopWords, or the method scalarMultiply, a member of the class? If so, Spark will try to serialize the entire class instance, and therefore more than it needs to. If that instance also holds a SparkContext, this would be consistent with the NotSerializableException in your trace.
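
The capture described here can be reproduced without Spark. The sketch below is an illustration only, assuming Scala 2.12 or later (where function literals are serializable). FakeContext, Job and ClosureDemo are hypothetical names, with FakeContext standing in for a non-serializable SparkContext. It uses plain Java serialization, which is also what Spark uses to serialize closures.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for SparkContext: deliberately not Serializable.
class FakeContext

// Hypothetical job class that holds the context next to an ordinary field.
class Job(val ctx: FakeContext) {
  val field = "Hello"

  // Reads this.field, so the closure captures the whole Job (and its ctx).
  def capturing: String => String = x => field + x

  // Copies the field into a local value first, so the closure
  // captures only a String.
  def local: String => String = {
    val field_ = field
    x => field_ + x
  }
}

object ClosureDemo {
  // Returns true when Java serialization accepts the object graph.
  def serializes(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val job = new Job(new FakeContext)
    println(serializes(job.capturing))
    println(serializes(job.local))
  }
}
```

Under these assumptions, the first closure should fail to serialize because it drags the whole Job along, while the second should succeed because it only holds a String.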

I haven't used Spark from Scala yet, but I have seen this kind of thing with PySpark (the Python API for Spark).

Here is an excerpt from the documentation:

Note that while it is also possible to pass a reference to a method in a class instance (as opposed to a singleton object), this requires sending the object that contains that class along with the method. For example, consider:

Code:

class MyClass {
  def func1(s: String): String = { ... }
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(func1) }
}

Here, if we create a new MyClass instance and call doStuff on it, the map inside there references the func1 method of that MyClass instance, so the whole object needs to be sent to the cluster. It is similar to writing rdd.map(x => this.func1(x)).

In a similar way, accessing fields of the outer object will reference the whole object:

Code:

class MyClass {
  val field = "Hello"
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(x => field + x) }
}

is equivalent to writing rdd.map(x => this.field + x), which references all of this. To avoid this issue, the simplest way is to copy field into a local variable instead of accessing it externally:

Code:

def doStuff(rdd: RDD[String]): RDD[String] = {
  val field_ = this.field
  rdd.map(x => field_ + x)
}
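
Applied to the code in the question, one way to keep the enclosing instance out of the closure is to pass everything the closure needs as method parameters and to put the helpers in a singleton object. This is only a sketch under assumptions the thread does not confirm: that bStopWords is a Broadcast[Set[String]], that bVectors is a Broadcast[Map[String, Vector]], that the vectors come from org.apache.spark.mllib.linalg, and that add and scalarMultiply are element-wise helpers. VecOps and ReviewVectors are hypothetical names.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical helpers in a singleton object: calling them from a closure
// does not capture any enclosing class instance.
object VecOps {
  // Element-wise sum of two vectors of the same size.
  def add(a: Vector, b: Vector): Vector =
    Vectors.dense(a.toArray.zip(b.toArray).map { case (x, y) => x + y })

  // Multiplies every component of v by k.
  def scalarMultiply(k: Double, v: Vector): Vector =
    Vectors.dense(v.toArray.map(_ * k))
}

object ReviewVectors {
  // All inputs are parameters, so the closures below capture only these
  // local values (assumed types), never a `this` holding a SparkContext.
  def average(
      reviews: RDD[String],
      bStopWords: Broadcast[Set[String]],
      bVectors: Broadcast[Map[String, Vector]],
      vectSize: Int): RDD[Vector] =
    reviews
      .filter(_.nonEmpty)
      .map(_.toLowerCase.split("\\W+"))
      .map { words =>
        // Keep words of length >= 2 that are not stop words.
        val kept = words.filter(w => w.length >= 2 && !bStopWords.value(w))
        // Look up each remaining word; unknown words are dropped.
        val found = kept.flatMap(w => bVectors.value.get(w))
        if (found.isEmpty) Vectors.zeros(vectSize)
        else VecOps.scalarMultiply(1.0 / found.length, found.reduce(VecOps.add(_, _)))
      }
      .filter(v => Vectors.norm(v, 1.0) > 0.0)
}
```

A caller would then write something like `ReviewVectors.average(negReviews, bStopWords, bVectors, vectSize).persist()`, keeping the SparkContext on the driver side only.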

Link to the docs: https://spark.apache.org/docs/latest...tions-to-spark
