Literally a silver bullet for sbt merge strategies in projects using Spark Structured Streaming and Kafka

Recently, building so-called fat jars for my Scala Spark jobs with sbt was an intolerable pain. So what was going on? (NOTE: if you don’t use fat jars to submit Spark jobs, which is a better way to deal with Spark anyway, then this article won’t be helpful at all.)

When it comes to assembling projects into fat jars, most discussions on the Internet say it is good practice to discard the META-INF folder. But why? This folder rarely contains anything valuable for your end project and can bring in a huge mess of conflicting files. Some of them are outright dangerous: signature files such as *.DSA or *.SF, which typically arrive with signed dependencies like Bouncy Castle, no longer match the repackaged jar and can make the JVM refuse to load it. But it all changes dramatically when you use Apache Spark Structured Streaming with Kafka.

Let’s take a closer look:

val dfDrools = spark
      .readStream
      .format("kafka") // ???
      .option("kafka.bootstrap.servers", kafkaConfig.getString("bootstrap.server"))
      .option("subscribe", kafkaConfig.getString("topicEventsIn"))
      .option("includeTimestamp", value = true)
      .load()

When you specify kafka as the input format, your application relies on DataSourceRegister service files, which list the implementation classes that Spark uses to map format short names to actual data sources. Each data source jar ships its own copy of this file at META-INF/services/org.apache.spark.sql.sources.DataSourceRegister (it is a file, not a folder), so the fat jar has to keep the entries from all of them. That means you have to provide a valid assemblyMergeStrategy in your build.sbt file instead of discarding META-INF wholesale. Otherwise, you will get plenty of errors when you submit the jar to the Spark cluster.
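To see why the choice of strategy matters for this file, here is a minimal sketch in plain Scala. The functions are simplified, hypothetical models of what the first, concat and filterDistinctLines strategies do to same-named files (they are not the sbt-assembly implementation), and the file contents are illustrative examples of what the Spark SQL and Kafka connector jars might register.

```scala
// Simplified, hypothetical models of three merge strategies, applied to the
// text of same-named files that come from different dependency jars.
object MergeSketch {
  // Keeps only the first file seen; entries from later jars are lost.
  def first(files: Seq[String]): String = files.head

  // Appends the lines of every file in order.
  def concat(files: Seq[String]): String =
    files.flatMap(_.split("\n").toSeq).mkString("\n")

  // Appends the lines of every file, dropping repeated lines.
  def filterDistinctLines(files: Seq[String]): String =
    files.flatMap(_.split("\n").toSeq).distinct.mkString("\n")

  // Illustrative contents of two DataSourceRegister files (assumed examples).
  val fromSparkSql: String =
    "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat\n" +
      "org.apache.spark.sql.execution.datasources.json.JsonFileFormat"
  val fromKafkaConnector: String =
    "org.apache.spark.sql.kafka010.KafkaSourceProvider"

  def main(args: Array[String]): Unit = {
    val files = Seq(fromSparkSql, fromKafkaConnector)
    println("first:\n" + first(files))
    println("concat:\n" + concat(files))
  }
}
```

With first (or discard), the Kafka provider line never reaches the merged file, which is consistent with the kafka format failing to resolve at submit time. With concat or filterDistinctLines, every registration survives.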

After hours of searching, I found a “silver bullet” for these merging issues:

assemblyMergeStrategy in assembly := {
  case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" => MergeStrategy.concat
  case PathList("META-INF", xs @ _*) =>
    xs map {_.toLowerCase} match {
      case ("manifest.mf" :: Nil) | ("index.list" :: Nil) | ("dependencies" :: Nil) =>
        MergeStrategy.discard
      case "services" :: _ => MergeStrategy.filterDistinctLines
      case ps @ (x :: xs) if ps.last.endsWith(".sf") || ps.last.endsWith(".dsa") =>
        MergeStrategy.discard
      case _ => MergeStrategy.first
    }
  case _ => MergeStrategy.first
}

This merge strategy concatenates the DataSourceRegister files from all dependencies, merges the other files under META-INF/services while dropping duplicate lines, and discards any manifest.mf, index.list or dependencies files in META-INF along with any .dsa and .sf files. For all other files, MergeStrategy.first is used.
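One way to check the result before submitting is to read the merged service file back out of the assembled jar. The sketch below uses only the JDK's java.util.jar API. The object and method names (ServiceFileCheck, registeredProviders) are hypothetical, and the jar path is whatever your build produces.

```scala
import java.util.jar.JarFile
import scala.io.Source

object ServiceFileCheck {
  // Path of the service file inside the jar, as discussed above.
  val entryName = "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister"

  // Returns the provider class names listed in the jar's service file,
  // or Nil if the entry is missing (for example, because it was discarded).
  def registeredProviders(jarPath: String): List[String] = {
    val jar = new JarFile(jarPath)
    try {
      Option(jar.getEntry(entryName)) match {
        case Some(entry) =>
          val src = Source.fromInputStream(jar.getInputStream(entry), "UTF-8")
          try {
            src.getLines()
              .map(_.trim)
              .filter(line => line.nonEmpty && !line.startsWith("#"))
              .toList
          } finally src.close()
        case None => Nil
      }
    } finally jar.close()
  }

  // Usage: pass the path to the assembled jar as the first argument.
  def main(args: Array[String]): Unit =
    args.headOption.foreach(path => registeredProviders(path).foreach(println))
}
```

If the printed list is empty or lacks the Kafka provider class, the merge strategy is probably still dropping or overwriting the file.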

Conclusion

Assembling a fat jar can turn into a huge problem once dependency conflicts surface during merging. It is better to pay attention to the merge strategies available in sbt, and don’t hesitate to look up the best practices.