How to convert a JSON file to an SQLite database

I found the easiest way to do this is by using jq and CSV as an intermediary format.

Getting the CSV

First write your data to a file.
I will assume data.json here.

Then construct the header using jq:

% head -1 data.json | jq -r 'keys_unsorted | @csv'
"uri","user_agent"

The head -1 is because we only want one line.
jq‘s -r makes the output a plain string instead of a JSON-String wrapping the CSV.
We then call the internal function keys to get the keys of the input as an array.
This we send to the @csv formatter which outputs us a single string with the headers in quoted CSV format.

We then need to construct the data.

% jq -r 'map(tostring) | @csv' < data.json
"/","example1"
"/foobar","example1"
"/","example2"
"/foobar","example3"

We now take the whole input and deconstruct the associative array (map) using .[] and then put it back into a simple array […].
This basically converts our dictionary to an array of keys.
Sent to the @csv formatter, we again get some CSV.

Putting it all together we get a single one-liner in the form of:

% (head -1 data.json | jq -r 'keys | @csv' && jq -r 'map(tostring) | @csv' < data.json) > data.csv

If you need to convert the data on the fly, i.e. without a file, try this:

% cat data.json | (read -r first && jq -r '(keys | @csv),(map(tostring) | @csv)' <<<"${first}" && jq -r 'map(tostring) | @csv')

Loading it into SQLite

Open an SQLite database:

sqlite3 somedb.sqlite

Now in the interactive shell do the following (assuming you wrote the CSV to data.csv and want it in a table called my_table):

.mode csv
.import data.csv my_table

Now close the shell and open it again for a clean environment.
You can now easily SELECT from the database and do whatever you want to.

Putting it all together

Have an asciinema recording right there:

asciicast

Edits

Edit:
As pointed out (thanks @Leo), the original question did show newline delimited JSON objects, which each on their own conform to rfc4627, but not all together in that format.
jq can handle a single JSON array of objects much the same way though by preprocessing the file using jq '.[]' <input.json >preprocessed.json.
If you happen to be dealing with JSON text sequences (rfc7464) luckily jq has got your back too with the --seq parameter.

Edit 2:
Both the newline separated JSON and the JSON text sequences have one important advantage; they reduce memory requirements down to O(1), meaning your total memory requirement is only dependent on your longest line of input, whereas putting the entire input in a single array requires that either your parser can handle late errors (i.e. after the first 100k elements there’s a syntax error), which generally isn’t the case to my knowledge, or it will have to parse the entire file twice (first validating syntax, then parsing, in the process discarding previous elements, as is the case with jq --stream) which also happens rarely to my knowledge, or it will try to parse the whole input at once and return the result in one step (think of receiving a Python dict which contains the entirety of your say 50G input data plus overhead) which is usually memory backed, hence raising your memory footprint by just about your total data size.

Edit 3:
If you hit any obstacles, try using keys_unsorted instead of keys.
I haven’t tested that myself (I kind of assume my columns were already sorted), however @Kyle Barron reports that this was needed.

Edit 4:
As pointed out by youngminz in the comment below the original command fails when working with non-{number,string} values like nested lists.
The command has been updated (with a slightly adapted version from the comment, map() – unlike map_values() converts objects to their keys the same as [.[]], making the map more readable).
Keys remain unaffected, if you really have complex types as keys (which may not even conform to JSON, but I’m too lazy to look it up right now) you can do the same for the key-related mappings.

Leave a Comment