Change output filename prefix for DataFrame.write()

You cannot change the “part” prefix while using any of the standard output formats (like Parquet); the prefix is hard-coded. See this snippet from the ParquetRelation source code:

private val recordWriter: RecordWriter[Void, InternalRow] = {
  val outputFormat = {
    new ParquetOutputFormat[InternalRow]() {
      // ...
      override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
        // ...
        // prefix is hard-coded here:
        new Path(path, f"part-r-$split%05d-$uniqueWriteJobId$bucketString$extension")
      }
    }
  }
  // ...
}

If you really must control the part file names, you’ll probably have to implement a custom FileOutputFormat and write through one of Spark’s save methods that accepts an output format class, e.g. saveAsHadoopFile. Note that saveAsHadoopFile is defined on pair RDDs, so you’d first convert the DataFrame to an RDD of key/value pairs.
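A minimal sketch of that approach, using the old mapred API: subclass TextOutputFormat (a FileOutputFormat), override getRecordWriter to rewrite the task-generated file name, and pass the class to saveAsHadoopFile. The class name CustomPrefixOutputFormat, the “mydata” prefix, the output path, the placeholder DataFrame df, and the comma-joined text serialization are all illustrative choices, not anything prescribed by Spark.

import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapred.{JobConf, RecordWriter, TextOutputFormat}
import org.apache.hadoop.util.Progressable

// Hypothetical output format: rewrites the task-generated file name
// ("part-00000", "part-00001", ...) before delegating to TextOutputFormat.
class CustomPrefixOutputFormat extends TextOutputFormat[NullWritable, Text] {
  override def getRecordWriter(
      ignored: FileSystem,
      job: JobConf,
      name: String,
      progress: Progressable): RecordWriter[NullWritable, Text] = {
    // Swap the hard-coded "part" prefix for a custom one.
    super.getRecordWriter(ignored, job, name.replaceFirst("^part", "mydata"), progress)
  }
}

// Usage sketch: `df` stands for an existing DataFrame. Drop down to the RDD
// API, pair up the records, and save with the custom format.
df.rdd
  .map(row => (NullWritable.get(), new Text(row.mkString(","))))
  .saveAsHadoopFile(
    "/tmp/custom-prefix-output",
    classOf[NullWritable],
    classOf[Text],
    classOf[CustomPrefixOutputFormat])

The trade-off is that you leave the DataFrame writer entirely, so you also take over serialization of each row yourself (here just a naive comma join) instead of getting Parquet’s columnar encoding.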
