Article - CS320043

Training job fails with error "org.apache.spark.SparkException: Job aborted due to stage failure" in ThingWorx Analytics Server

Modified: 28-Jan-2020   


Applies To

  • ThingWorx Analytics 8.3.3

Description

  • Training a Random Forest learner with 150 trees and a maximum depth of 24 fails with the following error:
Training on SparkRandomForestTrainer seems to have failed.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1473.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1473.0 (TID 4111, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 235516 ms
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$collectAsMap$1.apply(PairRDDFunctions.scala:746)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$collectAsMap$1.apply(PairRDDFunctions.scala:745)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:745)
    at org.apache.spark.ml.tree.impl.RandomForest$.findBestSplits(RandomForest.scala:563)
    at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:198)
    at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:94)
    at org.apache.spark.mllib.tree.RandomForest$.trainRegressor(RandomForest.scala:218)
    at org.apache.spark.mllib.tree.RandomForest$.trainRegressor(RandomForest.scala:258)
    at org.apache.spark.mllib.tree.RandomForest$.trainRegressor(RandomForest.scala:274)
    at org.apache.spark.mllib.tree.RandomForest.trainRegressor(RandomForest.scala)
    at com.thingworx.analytics.training.trees.SparkRandomForestTrainer.train(SparkRandomForestTrainer.java:48)
    at com.thingworx.analytics.training.trees.SparkRandomForestTrainer.train(SparkRandomForestTrainer.java:21)
    at com.thingworx.analytics.training.trees.SparkTreeTrainer.trainModel(SparkTreeTrainer.java:63)
    at com.thingworx.analytics.training.Learner.internalTrainModel(Learner.java:101)

The log also contains:

com.thingworx.analytics.training.TrainingFailedException: Error training Learner [trainer=com.thingworx.analytics.training.trees.SparkRandomForestTrainer@65df41a, transformer=DecisionTreeTransformerFactory [maxNumberOfMiningFields=25 useRedundancyFilter=false expanding=false]]
    at com.thingworx.analytics.training.Learner.internalTrainModel(Learner.java:109)
    at com.thingworx.analytics.training.ensemble.AbstractEnsembleModel.lambda$executeTrainingOnLearners$0(AbstractEnsembleModel.java:72)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:545)
    at java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260)
    at java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:438)
    at com.thingworx.analytics.training.ensemble.AbstractEnsembleModel.executeTrainingOnLearners(AbstractEnsembleModel.java:78)
    at com.thingworx.analytics.training.ensemble.AbstractEnsembleModel.trainAllTrainers(AbstractEnsembleModel.java:100)
    at com.thingworx.analytics.training.ensemble.EliteAverageEnsembleModel.trainModel(EliteAverageEnsembleModel.java:60)
    at com.thingworx.analytics.training.Learner.internalTrainModel(Learner.java:101)
    at com.thingworx.analytics.training.MultiGoalTrainer.trainMultipleModels(MultiGoalTrainer.java:53)
    at com.thingworx.analytics.training.MultiGoalTrainer.trainModel(MultiGoalTrainer.java:38)
    at com.thingworx.analytics.training.Learner.internalTrainModel(Learner.java:101)
    at com.thingworx.analytics.training.Learner.trainForGoals(Learner.java:186)
    at com.thingworx.analytics.training.Learner.trainByParameters(Learner.java:133)
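
The first error message above packs several details into one line: the failed stage, the number of attempts, and how long the executor went without a heartbeat. As a minimal sketch for triaging similar logs, the snippet below pulls those fields out with regular expressions. The helper name summarize_failure and the returned keys are hypothetical and are not part of ThingWorx Analytics or Spark. The message text is copied from the error above.

```python
import re

# Message copied from the SparkException reported above.
MESSAGE = (
    "org.apache.spark.SparkException: Job aborted due to stage failure: "
    "Task 0 in stage 1473.0 failed 1 times, most recent failure: "
    "Lost task 0.0 in stage 1473.0 (TID 4111, localhost, executor driver): "
    "ExecutorLostFailure (executor driver exited caused by one of the running tasks) "
    "Reason: Executor heartbeat timed out after 235516 ms"
)


def summarize_failure(message):
    """Hypothetical helper: extract key fields from a Spark stage-failure message."""
    stage = re.search(r"in stage (\d+(?:\.\d+)?) failed (\d+) times", message)
    timeout = re.search(r"Executor heartbeat timed out after (\d+) ms", message)
    return {
        # Stage identifier as printed in the log, e.g. "1473.0"
        "stage": stage.group(1) if stage else None,
        # Number of attempts before the job was aborted
        "attempts": int(stage.group(2)) if stage else None,
        # Time without a heartbeat, in milliseconds
        "heartbeat_timeout_ms": int(timeout.group(1)) if timeout else None,
    }


summary = summarize_failure(MESSAGE)
print(summary)
```

For the message above, the extracted timeout of 235516 ms is roughly 3.9 minutes without a heartbeat from the executor, and the stage was attempted once before the job was aborted.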