Process big files in Deno

Mayank Choubey
Published in Tech Tonic
Apr 7, 2021

Purpose

As Deno is a backend runtime, there may be situations when a big file (500M, 1G, 2G, 5G, etc.) needs to be processed. The question is:

How to go about processing a large file?

An obvious answer would be to process the file line by line. That works, but only for files that have lines. What if the large file doesn't have lines? For example, a file holding more than a gigabyte of comma-separated readings would have all the readings on a single line. Processing line by line doesn't work in such cases.

There are two ways to process a file:

The first way is:

  1. Load the entire file in memory
  2. Process it

It's easy, but it isn't suitable for large files (>500M). Also, there is a memory spike until the variable holding the file's contents is garbage collected.

The second way is:

  1. Load a part of the file in memory
  2. Process it
  3. Continue to 1 till EOF is reached

The second way is far more efficient, as it doesn't cause a memory spike. The maximum memory used stays roughly in the order of the chunk size.

In this article, we’ll see both ways to process a large file.

This is our problem statement:

Process a 1G/2G/5G file containing comma-separated readings, and return the number of readings
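
The article doesn't show how the test files were created. In case you want to reproduce the setup, a file like readings100M.txt can be generated with a small script such as the hypothetical one below; the file name, target size, and value range are assumptions, not part of the original article.

// generate_readings.ts (hypothetical helper for creating test data)
// Writes comma-separated random readings until the file reaches roughly the target size
const outFile = 'readings100M.txt';
const targetSize = 100 * 1024 * 1024; // ~100M
const out = await Deno.open(outFile, { write: true, create: true, truncate: true });
const encoder = new TextEncoder();
let written = 0;
while (written < targetSize) {
  // Build a batch of readings in memory, then write the whole batch
  const batch = Array.from({ length: 100000 }, () => (Math.random() * 100).toFixed(1)).join(",") + ",";
  const bytes = encoder.encode(batch);
  let n = 0;
  while (n < bytes.length) {
    n += await out.write(bytes.subarray(n)); // write may be partial, so loop until done
  }
  written += bytes.length;
}
out.close();

Running it with deno run --allow-write generate_readings.ts produces a single line of readings, which is exactly the case where line-by-line processing doesn't help.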

ReadAll

The first and very easy way is to load the entire file into memory and then process it. Loading a file into memory isn't a bad idea in itself; it's a decent approach when the file size is modest. If the file is huge (in the GBs), it causes a memory spike in the Deno process until the memory is garbage collected, and that's only if the runtime's internal allocators can handle data that big. In some cases, an exception is raised instead.

To read the entire file into memory, the core runtime provides Deno.readTextFile (used below), therefore there is no need to import anything.

Here is a piece of code that loads the entire file (100M in size) and counts the number of readings.

const fileName = 'readings100M.txt';

async function getReadingsA() {
  const myFileContent = await Deno.readTextFile(fileName);
  return myFileContent.split(",").length;
}

console.log('Total readings=' + await getReadingsA());
prompt(); // To keep the process running so memory usage can be checked

// Total readings=20971521

This approach works, but only for smaller files (~100M). Even when it works, there is a memory spike in the Deno process: it ended up using around 800M of memory (roughly 8x the file size) to process a 100M file.

du -kh readings100M.txt
100M readings100M.txt
816M deno run --allow-all deno_read_big_file.ts
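
The numbers above were taken from outside the process (du for the file size, and the process's resident memory for the run). Memory can also be observed from inside the script. The snippet below is a small sketch using Deno.memoryUsage(), which is available in newer Deno versions; the file name is assumed.

// Print the process's own memory statistics (values are in bytes)
// Deno.memoryUsage() returns { rss, heapTotal, heapUsed, external }
function printMemory(label: string) {
  const m = Deno.memoryUsage();
  console.log(`${label}: rss=${(m.rss / 1048576).toFixed(0)}M, heapUsed=${(m.heapUsed / 1048576).toFixed(0)}M`);
}

printMemory('before read');
const content = await Deno.readTextFile('readings100M.txt');
printMemory('after read');
console.log('Total readings=' + content.split(",").length);
printMemory('after split');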

If a file of size 500M or 1G is given as input, the Deno process raises an exception:

error: Uncaught (in promise) RangeError: string too long
const myFileContent = await Deno.readTextFile(fileName);

The approach of reading the entire file into memory clearly doesn't work for huge files. Even where it works, the file size can't go much beyond 500M because of the runtime's maximum string length. That's a big limitation!

Fortunately, there is another easy way! In the next section, let’s see how to process big files (like 5G) efficiently.

Read in chunks

The second, efficient, and working way is to read the file in smaller chunks and process each chunk. This sounds tedious compared to reading the entire file, but in reality it isn't tedious at all.

The rough algorithm is:

  1. Open the file
  2. Read a chunk from the file (the read call advances the file's read pointer by the number of bytes read)
  3. Process the chunk (in the example, count the readings in it)
  4. Continue from step 2 till the end of the file is reached

Yes, there are more steps, but this approach is far more flexible. It can process big files like 5G, while the earlier approach couldn't go beyond a few hundred MBs.

Here is the code that implements the above algorithm:

async function getReadingsB() {
  const file = await Deno.open(fileName, { read: true });
  const readBlockSize = 100000;
  let totalReadings = 0;
  while (true) {
    const buf = new Uint8Array(readBlockSize);
    // Deno.read advances the file's read pointer automatically,
    // so no explicit Deno.seek call is needed between reads
    const numberOfBytesRead = await Deno.read(file.rid, buf);
    if (!numberOfBytesRead) break; // null (or 0) means the end of the file
    // Decode only the bytes that were actually read
    const text = new TextDecoder().decode(buf.subarray(0, numberOfBytesRead));
    totalReadings += text.split(",").length;
  }
  file.close();
  return totalReadings;
}

console.log('Total readings=' + await getReadingsB());
prompt(); // To keep the process running

Let’s go over how it works:

  • Decide on a block size or chunk size (100000 bytes here)
  • Open the file
  • Read up to block-size bytes from the file into a buffer (Deno.read); the read pointer advances automatically, so no Deno.seek call is needed
  • If the read call returns null, the end of the file has been reached, so break out of the loop
  • Otherwise, decode the bytes that were read, count the readings in the chunk, and continue looping
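
One caveat with this per-chunk counting: a reading that straddles a chunk boundary gets split and counted in both chunks, so the total can be slightly off. If an exact count matters, the partial reading at the end of each chunk can be carried over into the next chunk. Here is a rough sketch of that idea, assuming the readings are plain ASCII; the function and variable names are illustrative, not from the original code.

async function getReadingsExact(fileName: string) {
  const file = await Deno.open(fileName, { read: true });
  const readBlockSize = 100000;
  const decoder = new TextDecoder();
  let totalReadings = 0;
  let carry = ''; // unfinished reading left over from the previous chunk
  while (true) {
    const buf = new Uint8Array(readBlockSize);
    const numberOfBytesRead = await Deno.read(file.rid, buf);
    if (!numberOfBytesRead) break;
    const parts = (carry + decoder.decode(buf.subarray(0, numberOfBytesRead))).split(",");
    carry = parts.pop() ?? ''; // keep the possibly partial last piece for the next chunk
    totalReadings += parts.length; // count only complete readings
  }
  if (carry.length > 0) totalReadings++; // the final reading has no trailing comma
  file.close();
  return totalReadings;
}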

This method is very efficient. Here are some details of the run:

du -kh readings1G.txt 
955M readings1G.txt
deno run --allow-all deno_read_big_file.ts
Total readings=28166135
61M deno run --allow-all deno_read_big_file.ts

There was never a memory spike. The process reached a max of 61M in memory usage.

Let’s see the same run for a file of size 5G:

du -kh readings5G.txt 
4.7G readings5G.txt
deno run --allow-all deno_read_big_file.ts
Total readings=140820843
61M deno run --allow-all deno_read_big_file.ts

The memory usage is the same (61M) even though we processed a file five times the size.

This is amazing!
