⎕CSV unreasonably slow (16.0.30863)

AntonioDiNarzo · Post by **AntonioDiNarzo** » Mon Nov 06, 2017 12:09 pm

In a nutshell, ⎕CSV pins a CPU at 100% for more than 8 seconds to load a single line from a (very) long CSV file.

I'm not sure whether I'm doing something silly or whether there really is a glitch in ⎕CSV.

Steps to reproduce:
1. Generate a file with 1E8 rows (~850 MB) in the current working directory.
Under Linux I did this from the shell prompt, for example:

$ seq 1e8 > test.txt

You APL wizards surely know how to do it without leaving the IDE :)

2. Fire up Dyalog APL (16.0.30863) and type:

)copy dfns
f←{(⎕CSV⍠'Records'⍵)'test.txt'}
cmpx 'f 1' 'f 100' 'f 10000'

On my Linux box I get:

  f 1     → 9.3E0  |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* f 100   → 8.5E0  |  -9% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕   
* f 10000 → 8.4E0  | -10% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

As the CPU is pegged at 100% the entire time, I don't believe the OS or file system matters much, but please let me know if you need further details on my setup.

Am I doing this the wrong way? How do you guys deal with processing large csv files?

Richard|Dyalog · Post by **Richard|Dyalog** » Mon Nov 06, 2017 2:55 pm

Many thanks for your detailed analysis - I can explain the behaviour and offer a faster alternative.

Firstly - although not explicitly noted, your timings suggest a performance improvement as more data is processed. I see that behaviour too:

      cmpx 'f 1' 'f 100' 'f 10000'
  f 1     → 6.6E0  |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* f 100   → 6.1E0  |  -9% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* f 10000 → 5.9E0  | -12% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

I believe this is simply the effect of (a) the workspace resizing itself and (b) Linux caching the file in memory. If I perform the experiment a second time immediately afterwards I get near-identical times for each case:

      cmpx 'f 1' 'f 100' 'f 10000'
  f 1     → 5.8E0  |  0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* f 100   → 5.8E0  | -1% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* f 10000 → 5.7E0  | -1% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

Secondly, when you specify the file by name, the entire file is read into the workspace using ⎕NGET before any CSV processing is done. Because the file contains so many more lines than are ultimately processed, this is very wasteful.

If you specify the file by tie number then the file is read directly by ⎕CSV in smaller "chunks", and this gives a significant performance improvement.

      ⍝ capture the result in r so that ⎕NUNTIE runs before the dfn returns
      g←{t←'test.txt' ⎕NTIE 0 ⋄ r←(⎕CSV⍠'Records'⍵) t ⋄ r⊣⎕NUNTIE t}

      cmpx'f 1' 'f 100' 'f 10000' 'g 1' 'g 100' 'g 10000'
  f 1     → 5.8E0  |    0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* f 100   → 6.0E0  |   +4% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* f 10000 → 5.9E0  |   +1% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  g 1     → 0.0E0  | -100%
* g 100   → 0.0E0  | -100%
* g 10000 → 1.0E¯3 | -100%

The fact that named files are read in their entirety is not obvious from the documentation, and I will report an issue against it to make it clearer. The 'Records' option exists to allow a very large file to be read and processed in sections. In this context it can only be used meaningfully with a tied file, because that leaves the file offset in the correct position for subsequent processing to continue from where it left off. Naming the file, by contrast, is there to allow ⎕CSV to be used very simply.
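
As a minimal sketch of reading in sections (the section size of 1000 and the names tn, a and b are only illustrative), two successive calls on the same tie number should pick up where the previous one stopped:

      tn←'test.txt' ⎕NTIE 0            ⍝ tie the file; 0 asks for the next free tie number
      a←(⎕CSV⍠'Records' 1000) tn       ⍝ first section: up to 1000 records
      b←(⎕CSV⍠'Records' 1000) tn       ⍝ next section: continues from the current file offset
      ⎕NUNTIE tn                       ⍝ release the tie when done

Each section can be processed and discarded before the next one is read, so the whole file never has to be in the workspace at once.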

One of the reasons ⎕NGET does not read files in part is that it can automatically deduce the file's text encoding, and it may need to scan the entire file to do so. For the same reason, the file encoding is not automatically deduced when ⎕CSV reads from a tied file.
