How to skip CSV header in Hive External Table?

As of Hive v0.13.0, you can use the skip.header.line.count table property: create external table testtable (name string, message string) row format delimited fields terminated by '\t' lines terminated by '\n' location '/testtable' TBLPROPERTIES ("skip.header.line.count"="1"); Use ALTER TABLE for an existing table: ALTER TABLE tablename SET TBLPROPERTIES ("skip.header.line.count"="1"); Please note that while it works, it comes with … Read more
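The effect of the property is simply that Hive drops the first N lines of each file in the table's location before parsing. A plain-Python sketch of that behaviour (not Hive itself), using hypothetical tab-delimited data matching the testtable layout:

```python
import csv
import io

# Hypothetical tab-delimited file with a header row,
# mirroring the testtable columns (name, message).
raw = "name\tmessage\nalice\thello\nbob\thi\n"

reader = csv.reader(io.StringIO(raw), delimiter="\t")
next(reader)  # skip.header.line.count=1: drop the first line of the file
rows = list(reader)
print(rows)  # [['alice', 'hello'], ['bob', 'hi']]
```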

Job queue for Hive action in oozie

A. Oozie specifics. Oozie propagates the "regular" Hadoop properties to a "regular" MapReduce action. But for other action types (Shell, Hive, Java, etc.), where Oozie runs a single mapper task in YARN, it does not consider the job a real MapReduce job. Hence it uses a different set of undocumented properties, always prefixed with … Read more
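A sketch of what this looks like in a workflow definition, assuming the commonly cited oozie.launcher. prefix for launcher-scoped properties and a hypothetical queue named myqueue (property names here are an assumption, not taken from the original answer):

```xml
<hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
        <!-- Queue for the single-mapper launcher job Oozie starts
             (assumed property name, launcher-scoped prefix). -->
        <property>
            <name>oozie.launcher.mapred.job.queue.name</name>
            <value>myqueue</value>
        </property>
        <!-- Queue for the MapReduce jobs the Hive query itself spawns. -->
        <property>
            <name>mapred.job.queue.name</name>
            <value>myqueue</value>
        </property>
    </configuration>
    <script>myscript.sql</script>
</hive>
```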

Create HIVE Table with multi character delimiter

FIELDS TERMINATED BY does not support multi-character delimiters. The easiest way to do this is to use RegexSerDe: CREATE EXTERNAL TABLE tableex(id INT, name STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" = "^(\\d+)~\\*(.*)$") STORED AS TEXTFILE LOCATION '/user/myusername';
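The capture groups in input.regex are what become the table's columns, so the pattern can be sanity-checked outside Hive before creating the table. A quick sketch with hypothetical sample rows, using Python's re (the ~* delimiter rows are made up for illustration):

```python
import re

# Same pattern as input.regex: digits, the literal "~*", then the rest.
# Group 1 maps to the id column, group 2 to the name column.
pattern = re.compile(r"^(\d+)~\*(.*)$")

for line in ["101~*alice", "102~*hello world"]:
    m = pattern.match(line)
    print(m.groups())  # e.g. ('101', 'alice')
```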

Export as csv in beeline hive

With Hive 0.11.0 or later, you can execute: INSERT OVERWRITE LOCAL DIRECTORY '/tmp/directoryWhereToStoreData' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' SELECT * FROM yourTable; from hive/beeline to store the table in a directory on the local filesystem. Alternatively, with beeline, save your SELECT query in yourSQLFile.sql and run: beeline … Read more
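A sketch of the beeline variant, assuming a HiveServer2 at the default local address; the connection URL and output file name here are placeholders, and the exact flags should be checked against your beeline version:

```
beeline -u jdbc:hive2://localhost:10000 \
        --outputformat=csv2 \
        -f yourSQLFile.sql > result.csv
```

The --outputformat=csv2 option makes beeline print query results as comma-separated values, which the shell redirection then captures into a local file.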

Overwrite only some partitions in a partitioned spark Dataset

Since Spark 2.3.0, this is an option when overwriting a table. To use it, set the new spark.sql.sources.partitionOverwriteMode setting to dynamic; the dataset must be partitioned, and the write mode must be overwrite. Example in Scala: spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") data.write.mode("overwrite").insertInto("partitioned_table") I recommend doing a repartition based on your partition column before writing, … Read more
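The difference between the default static mode and dynamic mode can be illustrated without Spark. A plain-Python sketch modelling a partitioned table as a dict from partition value to rows (the partition values and rows are made up for illustration):

```python
# Existing table with two partitions, and new data touching only one of them.
table = {"2021-01-01": ["a", "b"], "2021-01-02": ["c"]}
new_data = {"2021-01-02": ["d"]}

# mode("overwrite") with partitionOverwriteMode=static (the default):
# the whole table is replaced by the new data.
static_result = dict(new_data)

# partitionOverwriteMode=dynamic: only the partitions present in the
# new data are replaced; the rest are left untouched.
dynamic_result = {**table, **new_data}

print(static_result)   # {'2021-01-02': ['d']}
print(dynamic_result)  # {'2021-01-01': ['a', 'b'], '2021-01-02': ['d']}
```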