Parse a CSV using awk, ignoring commas inside a field

gawk -vFPAT='[^,]*|"[^"]*"' '{print $1 "," $3}' | sort | uniq This is an awesome GNU Awk 4 extension, where you define a field pattern (FPAT) instead of a field-separator pattern. Does wonders for CSV. (docs) ETA (thanks mitchus): To remove the surrounding quotes, gsub("^\"|\"$","",$3); if there are more fields than just $3 to process that way, just …
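For comparison (an addition, not part of the original answer), Python's csv module applies the same quote-aware splitting natively; a minimal sketch of the equivalent "print fields 1 and 3, deduplicated and sorted" pipeline, assuming a hypothetical input file data.csv:

import csv

with open("data.csv", newline="") as f:
    seen = set()
    for row in csv.reader(f):        # commas inside quoted fields do not split
        seen.add((row[0], row[2]))   # $1 and $3 in awk terms
for first, third in sorted(seen):    # the sort | uniq step
    print(f"{first},{third}")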

Reading CSV files with quoted fields containing embedded commas

I noticed that your problematic line has escaping that uses double quotes themselves: "32 XIY ""W"" JK, RE LK" which should be interpreted just as 32 XIY "W" JK, RE LK As described in RFC 4180, page 2: If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped …
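To see the RFC 4180 doubling rule in action, here is a small Python illustration (not from the original answer) using the problematic line from the question; the second column is made up:

import csv
import io

line = '"32 XIY ""W"" JK, RE LK",next field\r\n'
row = next(csv.reader(io.StringIO(line)))
print(row[0])   # 32 XIY "W" JK, RE LK -- the doubled quotes collapse to single quotes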

How to export a Hive table into a CSV file?

or use this hive -e 'select * from your_Table' | sed 's/[\t]/,/g' > /home/yourfile.csv You can also specify the property set hive.cli.print.header=true before the SELECT to ensure that a header is created and copied to the file along with the data. For example: hive -e 'set hive.cli.print.header=true; select * from your_Table' | sed 's/[\t]/,/g' > /home/yourfile.csv If you don't …
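One caveat worth adding: the sed substitution turns every tab into a comma but never quotes anything, so fields that themselves contain commas, tabs, or newlines yield malformed CSV. A sketch of a more robust converter (an illustration, not from the answer; tsv2csv.py is a hypothetical script name) that pipes Hive's tab-separated output through Python's csv module, which quotes fields as needed:

import csv
import sys

# Usage: hive -e 'select * from your_Table' | python tsv2csv.py > /home/yourfile.csv
reader = csv.reader(sys.stdin, delimiter="\t")   # Hive CLI emits tab-separated rows
writer = csv.writer(sys.stdout)                  # csv.writer adds quoting when required
for row in reader:
    writer.writerow(row)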

Save a Spark dataframe as a single file on an HDFS location [duplicate]

It's not possible using the standard Spark library, but you can use the Hadoop API for managing the filesystem: save the output in a temporary directory and then move the file to the requested path. For example (in pyspark):

df.coalesce(1) \
    .write.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .save("mydata.csv-temp")

from py4j.java_gateway import java_import
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
file = fs.globStatus(spark._jvm.Path('mydata.csv-temp/part*'))[0].getPath().getName()
fs.rename(spark._jvm.Path('mydata.csv-temp/' …
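The excerpt cuts off mid-rename; a minimal end-to-end sketch of the same temp-directory-and-rename pattern (a reconstruction under assumptions: spark is an active SparkSession and mydata.csv is an illustrative final path):

from py4j.java_gateway import java_import

# 1. Collapse to one partition and write to a temporary directory
df.coalesce(1) \
    .write.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .save("mydata.csv-temp")

# 2. Find the single part file and move it to the requested path
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
part = fs.globStatus(spark._jvm.Path('mydata.csv-temp/part*'))[0].getPath().getName()
fs.rename(spark._jvm.Path('mydata.csv-temp/' + part), spark._jvm.Path('mydata.csv'))
fs.delete(spark._jvm.Path('mydata.csv-temp'), True)   # remove the leftover temp directory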

Can you encode CR/LF into CSV files?

Yes, you need to wrap in quotes:

"some value
over two lines",some other value

From this document, which is the generally accepted CSV standard: Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes
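A quick Python illustration (an addition, not from the answer) showing that the standard csv module applies exactly this quoting when a field contains a line break:

import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(["some value\nover two lines", "some other value"])
print(buf.getvalue())
# "some value
# over two lines",some other value

row = next(csv.reader(io.StringIO(buf.getvalue())))
print(row)   # ['some value\nover two lines', 'some other value']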

csv to array in d3.js

d3.csv is an asynchronous method. This means that code inside the callback function is run when the data is loaded, but code after and outside the callback function will be run immediately after the request is made, when the data is not yet available. In other words:

first();
d3.csv("path/to/file.csv", function(rows) {
  third();
});
second();

If …

How to load jar dependencies in IPython Notebook

You can simply pass it in the PYSPARK_SUBMIT_ARGS variable. For example:

export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"

These properties can also be set dynamically in your code before the SparkContext / SparkSession and the corresponding JVM have been started:

import os

packages = "com.databricks:spark-csv_2.11:1.3.0"
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages {0} pyspark-shell".format(packages)
)
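Putting it together, a minimal notebook-cell sketch (an illustration; it assumes pyspark is importable and must run before any SparkContext/SparkSession exists, since the JVM reads PYSPARK_SUBMIT_ARGS only at startup; the input path is hypothetical):

import os
from pyspark.sql import SparkSession

packages = "com.databricks:spark-csv_2.11:1.3.0"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages {0} pyspark-shell".format(packages)

# The JVM starts here, picking up the --packages argument
spark = SparkSession.builder.getOrCreate()
df = spark.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .load("some.csv")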