HTML5 File API read as text and binary

Question

2022 update: See explanation below for why the OP was seeing what they were seeing, but the code there is outdated. In modern environments, you’d use the methods on the Blob interface (which File inherits):

arrayBuffer for reading binary data (which you can then access via any of the typed arrays)
text to read textual data
stream for getting a ReadableStream for handling data via streaming (which allows you to do multiple transformations on the data without making multiple passes through it, and/or use the data without having to keep all of it in memory)

Once you have the file from the file input (const file = fileInput.files[0] or similar), it’s literally just a matter of:

await file.text(); // To read its text
// or
await file.arrayBuffer(); // To read its contents into an array buffer

(See ReadableStream for an example of streams.)
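
As a rough sketch of the streaming approach, the function below totals a blob’s size chunk by chunk without holding the whole thing in memory. The name countBytes is made up for illustration, and a Blob built from a string stands in for a File from a file input (File inherits from Blob, so the same calls should apply).

```javascript
// Hypothetical helper: counts bytes by reading the blob's stream chunk by chunk.
async function countBytes(blob) {
    const reader = blob.stream().getReader();
    let total = 0;
    for (;;) {
        const { done, value } = await reader.read();
        if (done) {
            break;
        }
        total += value.byteLength; // each chunk is a Uint8Array
    }
    return total;
}

// A Blob stands in for a File here
countBytes(new Blob(["héllo"]))
    .then(total => console.log(`Read ${total} bytes`))
    .catch(error => console.error("Error reading stream:", error));
```

In real code you would do something useful with each chunk (hashing, parsing, uploading) instead of just adding up lengths.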

You might access the array buffer via a Uint8Array (new Uint8Array(buffer)).

Here’s an example of text and arrayBuffer:

const $ = id => document.getElementById(id);

const fileInput = $("fileInput");
const btnRead = $("btnRead");
const rdoText = $("rdoText");
const contentsDiv = $("contents");

const updateButton = () => {
    btnRead.disabled = fileInput.files.length === 0;
};

const readTextFile = async (file) => {
    const text = await file.text();
    contentsDiv.textContent = text;
    contentsDiv.classList.add("text");
    contentsDiv.classList.remove("binary");
    console.log("Done reading text file");
};

const readBinaryFile = async (file) => {
    // Read the file's contents into an array buffer
    const buffer = await file.arrayBuffer();
    // Get a byte array for that buffer
    const bytes = new Uint8Array(buffer);
    // Show it as hex text
    const lines = [];
    let line = [];
    bytes.forEach((byte, index) => {
        const hex = byte.toString(16).padStart(2, "0");
        line.push(hex);
        if (index % 16 === 15) {
            lines.push(line.join(" "));
            line = [];
        }
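    });
    // Flush the final partial line (fewer than 16 bytes), if any
    if (line.length > 0) {
        lines.push(line.join(" "));
    }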
    contentsDiv.textContent = lines.join("\n");
    contentsDiv.classList.add("binary");
    contentsDiv.classList.remove("text");
    console.log(`Done reading binary file (length: ${bytes.length})`);
};

updateButton();

fileInput.addEventListener("input", updateButton);

btnRead.addEventListener("click", () => {
    const file = fileInput.files[0];
    if (!file) {
        return;
    }
    const readFile = rdoText.checked ? readTextFile : readBinaryFile;
    readFile(file)
    .catch(error => {
        console.error(`Error reading file:`, error);
    });
});

body {
    font-family: sans-serif;
}
#contents {
    font-family: monospace;
    white-space: pre;
}

<form>
    <div>
        <label>
            <span>File:</span>
            <input type="file" id="fileInput">
        </label>
    </div>
    <div>
        <label>
            <input id="rdoText" type="radio" name="format" value="text" checked>
            Text
        </label>
        <label>
            <input id="rdoBinary" type="radio" name="format" value="binary">
            Binary
        </label>
    </div>
    <div>
        <input id="btnRead" type="button" value="Read File">
    </div>
</form>
<div id="contents"></div>
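
If you want to check the hex formatting outside the browser, the same idea can be pulled into a pure function. This is only a sketch; toHexLines and its bytesPerLine parameter are names invented here, not part of any API. Slicing by offset also takes care of a final line shorter than 16 bytes.

```javascript
// Hypothetical helper: formats bytes as lines of space-separated hex pairs.
function toHexLines(bytes, bytesPerLine = 16) {
    const lines = [];
    for (let start = 0; start < bytes.length; start += bytesPerLine) {
        // Copy the chunk into a plain array so map() can produce strings
        const chunk = Array.from(bytes.subarray(start, start + bytesPerLine));
        lines.push(chunk.map(byte => byte.toString(16).padStart(2, "0")).join(" "));
    }
    return lines;
}

// 18 sample bytes: one full line of 16 plus a short second line
const sample = Uint8Array.from({ length: 18 }, (_, i) => i);
console.log(toHexLines(sample).join("\n"));
```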

Note in 2018: readAsBinaryString is outdated. For use cases where previously you’d have used it, these days you’d use readAsArrayBuffer (or in some cases, readAsDataURL) instead.

readAsBinaryString says that the data must be represented as a binary string, where:

…every byte is represented by an integer in the range [0..255].

JavaScript originally didn’t have a “binary” type (typed arrays* (details below) came later, first through the Khronos Typed Array specification written for WebGL and then as a standard part of the language in ECMAScript 2015, along with ArrayBuffer), and so they went with a String with the guarantee that no character stored in the String would be outside the range 0..255. (They could have gone with an array of Numbers instead, but they didn’t; perhaps large Strings are more memory-efficient than large arrays of Numbers, since Numbers are floating-point.)
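
To make the “binary string” idea concrete, here is a minimal sketch that converts between bytes and such a string. The function names are invented for illustration, and the loop-based conversion is just one straightforward way to do it.

```javascript
// Hypothetical helper: one string "character" per byte, each in 0..255
function bytesToBinaryString(bytes) {
    let result = "";
    for (const byte of bytes) {
        result += String.fromCharCode(byte);
    }
    return result;
}

// Hypothetical helper: the reverse, rejecting anything that isn't a binary string
function binaryStringToBytes(str) {
    const bytes = new Uint8Array(str.length);
    for (let i = 0; i < str.length; ++i) {
        const code = str.charCodeAt(i);
        if (code > 255) {
            throw new RangeError(`Not a binary string: code ${code} at index ${i}`);
        }
        bytes[i] = code;
    }
    return bytes;
}

const original = new Uint8Array([0x00, 0x7f, 0x80, 0xff]);
const binaryString = bytesToBinaryString(original);
console.log(binaryString.length);
console.log(Array.from(binaryStringToBytes(binaryString)));
```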

If you’re reading a file that’s mostly text in a western script (mostly English, for instance), then that string is going to look a lot like text. If you read a file containing characters outside the Latin-1 range, you should notice a difference: JavaScript strings are UTF-16** (details below), so some characters in a normal string will have values above 255, whereas a “binary string” according to the File API spec won’t have any values above 255 (for a UTF-16 file, you’d get two individual “characters” for the two bytes of each code unit).
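
A small sketch of that difference, using the euro sign as an arbitrary example character and assuming the file stores it as UTF-16LE (low byte first):

```javascript
const text = "€"; // U+20AC, a single UTF-16 code unit
const codeUnit = text.charCodeAt(0);

// The two bytes this character occupies in a UTF-16LE file (low byte first)
const fileBytes = [codeUnit & 0xff, codeUnit >> 8];

// What a "binary string" of those bytes looks like: one "character" per byte
const asBinaryString = String.fromCharCode(...fileBytes);

console.log(text.length, codeUnit.toString(16));
console.log(asBinaryString.length, fileBytes.map(b => b.toString(16)));
```

The normal string holds one code unit with a value well above 255; the binary string holds two, neither above 255.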

If you’re reading a file that’s not text at all (an image, perhaps), you’ll probably still get a very similar result between readAsText and readAsBinaryString, but with readAsBinaryString you know that there won’t be any attempt to interpret multi-byte sequences as characters. You don’t know that if you use readAsText, because readAsText determines an encoding for the file (from a byte order mark, the optional encoding argument, or the blob’s type, falling back to UTF-8) and then decodes the bytes into JavaScript’s UTF-16 strings.
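
Here is a hedged illustration of why decoding binary data as text is risky. It uses TextDecoder as a stand-in for the decoding step of readAsText, on the assumption that both handle invalid input the same way, and a few bytes of the kind a JPEG/JFIF file typically starts with.

```javascript
// Bytes a JPEG/JFIF file typically starts with; not valid UTF-8
const imageBytes = new Uint8Array([0xff, 0xd8, 0xff, 0xe0]);

// Decoding as text replaces the invalid sequences with U+FFFD
const asText = new TextDecoder("utf-8").decode(imageBytes);

// Encoding that text again does not give the original bytes back
const roundTripped = new TextEncoder().encode(asText);

console.log(asText.includes("\ufffd"));
console.log(roundTripped.length === imageBytes.length);
```

Once bytes have been replaced like that, the original data can’t be recovered from the string, which is why binary content should go through arrayBuffer (or, historically, readAsBinaryString) instead.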

You can see the effect if you create a file and store it in something other than ASCII or UTF-8. (In Windows you can do this via Notepad; the “Save As” dialog has an encoding drop-down with “Unicode” on it, by which, judging from the data, they seem to mean UTF-16; I’m sure Mac OS and *nix editors have a similar feature.) Here’s a page that dumps the result of reading a file both ways:

<!DOCTYPE HTML>
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
<title>Show File Data</title>
<style type="text/css">
body {
    font-family: sans-serif;
}
</style>
<script type="text/javascript">

    function loadFile() {
        var input, file, fr;

        if (typeof window.FileReader !== 'function') {
            bodyAppend("p", "The file API isn't supported on this browser yet.");
            return;
        }

        input = document.getElementById('fileinput');
        if (!input) {
            bodyAppend("p", "Um, couldn't find the fileinput element.");
        }
        else if (!input.files) {
            bodyAppend("p", "This browser doesn't seem to support the `files` property of file inputs.");
        }
        else if (!input.files[0]) {
            bodyAppend("p", "Please select a file before clicking 'Load'");
        }
        else {
            file = input.files[0];
            fr = new FileReader();
            fr.onload = receivedText;
            fr.readAsText(file);
        }

        function receivedText() {
            showResult(fr, "Text");

            fr = new FileReader();
            fr.onload = receivedBinary;
            fr.readAsBinaryString(file);
        }

        function receivedBinary() {
            showResult(fr, "Binary");
        }
    }

    function showResult(fr, label) {
        var markup, result, n, aByte, byteStr;

        markup = [];
        result = fr.result;
        for (n = 0; n < result.length; ++n) {
            aByte = result.charCodeAt(n);
            byteStr = aByte.toString(16);
            if (byteStr.length < 2) {
                byteStr = "0" + byteStr;
            }
            markup.push(byteStr);
        }
        bodyAppend("p", label + " (" + result.length + "):");
        bodyAppend("pre", markup.join(" "));
    }

    function bodyAppend(tagName, innerHTML) {
        var elm;

        elm = document.createElement(tagName);
        elm.innerHTML = innerHTML;
        document.body.appendChild(elm);
    }

</script>
</head>
<body>
<form action='#' onsubmit="return false;">
<input type="file" id='fileinput'>
<input type="button" id='btnLoad' value="Load" onclick='loadFile();'>
</form>
</body>
</html>

If I use that with a “Testing 1 2 3” file stored in UTF-16, here are the results I get:

Text (13):

54 65 73 74 69 6e 67 20 31 20 32 20 33

Binary (28):

ff fe 54 00 65 00 73 00 74 00 69 00 6e 00 67 00 20 00 31 00 20 00 32 00 20 00 33 00

As you can see, readAsText interpreted the characters and so I got 13 (the length of “Testing 1 2 3”), and readAsBinaryString didn’t, and so I got 28 (the two-byte BOM plus two bytes for each character).
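
The same numbers can be reproduced without a file input. The sketch below feeds the bytes from the binary dump through TextDecoder, on the assumption that this approximates what readAsText did after detecting the BOM, and builds a binary string from the same bytes for comparison.

```javascript
// The bytes from the "Binary (28)" dump
const hex = "ff fe 54 00 65 00 73 00 74 00 69 00 6e 00 67 00 20 00 31 00 20 00 32 00 20 00 33 00";
const bytes = Uint8Array.from(hex.split(" "), h => parseInt(h, 16));

// Roughly what readAsText did: decode as UTF-16LE, dropping the BOM
const text = new TextDecoder("utf-16le").decode(bytes);

// Roughly what readAsBinaryString did: one "character" per byte
const binaryString = String.fromCharCode(...bytes);

console.log(text, text.length);
console.log(binaryString.length);
```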

* XMLHttpRequest.response with responseType = "arraybuffer" is supported in HTML 5.

** “JavaScript strings are UTF-16” may seem like an odd statement; aren’t they just Unicode? No, a JavaScript string is a series of UTF-16 code units; you see surrogate pairs as two individual JavaScript “characters” even though, in fact, the surrogate pair as a whole is just one character. See the link for details.
