Recently, building so-called fat jars for my Spark jobs in Scala with sbt was an intolerable pain. So what was going on? (NOTE: if you don't use fat jars to submit Spark jobs, which is arguably the better way to work with Spark, this article won't be much help.)
When it comes to assembling projects into fat jars, most discussions on the Internet say it is good practice to discard the META-INF folder. But why? This folder rarely contains anything valuable for your end project and can drag in a mess of conflicting files. Some of them are outright dangerous, especially the *.DSA or *.SF signature files that come from Bouncy Castle dependencies: once the jar is repackaged, those signatures no longer match its contents. But it all changes dramatically when you use Apache Spark Structured Streaming with Kafka.
Let’s take a closer look:
val dfDrools = spark
  .readStream
  .format("kafka") // ???
  .option("kafka.bootstrap.servers", kafkaConfig.getString("bootstrap.server"))
  .option("subscribe", kafkaConfig.getString("topicEventsIn"))
  .option("includeTimestamp", value = true)
  .load()
When you specify kafka as the input format, Spark resolves that short name through a DataSourceRegister service file, which lists the implementation classes Spark can look up by format short name. Each dependency that provides a data source ships this file at META-INF/services/org.apache.spark.sql.sources.DataSourceRegister (a file, not a folder). That means you have to provide a valid assemblyMergeStrategy in your build.sbt file that keeps and merges these files instead of discarding them. Otherwise, you will get plenty of errors, typically along the lines of "Failed to find data source: kafka", when you submit such a jar to the Spark cluster.
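Before submitting, it can help to check what actually ended up in the assembled jar. The following is a minimal sketch, not part of the original build: the object name ServiceFileCheck and the helper registeredSources are hypothetical, and the provider class name org.apache.spark.sql.kafka010.KafkaSourceProvider is an assumption that should be checked against the spark-sql-kafka-0-10 version you depend on. It uses only the JDK and the Scala standard library.

```scala
import java.util.jar.JarFile
import scala.io.Source

// Hypothetical helper: inspects a fat jar for the DataSourceRegister service file.
object ServiceFileCheck {
  // Path of the service file inside the jar, as discussed above.
  val ServiceEntry = "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister"

  // Assumption: provider class shipped by spark-sql-kafka-0-10; verify for your version.
  val KafkaProvider = "org.apache.spark.sql.kafka010.KafkaSourceProvider"

  /** Non-empty, non-comment lines of the service file, or Nil if the entry is missing. */
  def registeredSources(jarPath: String): List[String] = {
    val jar = new JarFile(jarPath)
    try {
      Option(jar.getEntry(ServiceEntry)) match {
        case None => Nil
        case Some(entry) =>
          val src = Source.fromInputStream(jar.getInputStream(entry), "UTF-8")
          // toList forces the lazy iterator before the stream is closed.
          try src.getLines().map(_.trim).filter(l => l.nonEmpty && !l.startsWith("#")).toList
          finally src.close()
      }
    } finally jar.close()
  }

  def main(args: Array[String]): Unit = {
    val jarPath = args.headOption.getOrElse(sys.error("usage: ServiceFileCheck <path-to-fat-jar>"))
    val sources = registeredSources(jarPath)
    sources.foreach(println)
    if (sources.contains(KafkaProvider)) println("Kafka source is registered")
    else println("Kafka source is NOT registered; check your merge strategy")
  }
}
```

If the list is empty or the Kafka provider is missing, the merge strategy most likely discarded the service file or kept only the first copy of it.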
After hours of searching, I found a “silver bullet” for the merging issues:
assemblyMergeStrategy in assembly := {
  case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" => MergeStrategy.concat
  case PathList("META-INF", xs @ _*) =>
    xs map {_.toLowerCase} match {
      case ("manifest.mf" :: Nil) | ("index.list" :: Nil) | ("dependencies" :: Nil) =>
        MergeStrategy.discard
      case "services" :: _ =>
        MergeStrategy.filterDistinctLines
      case ps @ (x :: xs) if ps.last.endsWith(".sf") || ps.last.endsWith(".dsa") =>
        MergeStrategy.discard
      case _ => MergeStrategy.first
    }
  case _ => MergeStrategy.first
}
This merge strategy concatenates the DataSourceRegister files from all dependencies, merges any other META-INF/services files while dropping duplicate lines, and discards the manifest.mf, index.list and dependencies files from META-INF, along with any .dsa and .sf files. For all other files, MergeStrategy.first is used.
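To see why the choice of strategy matters for service files, here is a simplified model of three of the strategies named above. It is only a sketch of their effect on file contents, not the sbt-assembly implementation: the object MergeStrategySketch and its functions are hypothetical, and the class names in the sample data are illustrative assumptions about what two dependencies might register.

```scala
// Simplified model: each inner list holds the lines of one dependency's
// copy of the same service file. Not the real sbt-assembly code.
object MergeStrategySketch {
  type FileLines = List[String]

  // Models MergeStrategy.first: keep the first copy, drop the rest.
  def first(files: List[FileLines]): FileLines = files.headOption.getOrElse(Nil)

  // Models MergeStrategy.concat: append all copies in order.
  def concat(files: List[FileLines]): FileLines = files.flatten

  // Models MergeStrategy.filterDistinctLines: append all copies, drop duplicate lines.
  def filterDistinctLines(files: List[FileLines]): FileLines = files.flatten.distinct

  def main(args: Array[String]): Unit = {
    // Illustrative contents (assumed class names) of two DataSourceRegister files.
    val sparkSql = List(
      "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat",
      "org.apache.spark.sql.execution.datasources.json.JsonFileFormat"
    )
    val sparkSqlKafka = List("org.apache.spark.sql.kafka010.KafkaSourceProvider")
    val copies = List(sparkSql, sparkSqlKafka)

    println(s"first keeps kafka:  ${first(copies).exists(_.contains("kafka010"))}")
    println(s"concat keeps kafka: ${concat(copies).exists(_.contains("kafka010"))}")
  }
}
```

In this model, first silently drops every registration except those in the first copy it sees, which is how the kafka format can go missing, while concat and filterDistinctLines preserve all of them.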
Conclusion
Assembling a fat jar can turn into a huge problem once dependency conflicts have to be resolved during merging. It pays to study the merge strategies that the sbt-assembly plugin provides, and don't hesitate to look up best practices.