Reading UTF 8 Files

paulmansour
Posts: 431
Joined: Fri Oct 03, 2008 4:14 pm

Reading UTF 8 Files

Post by paulmansour »

Dyalog provides a function in the documentation to read a UTF-8 file, where the heart of the code is:

      'UTF-8' ⎕UCS {⍵+256×⍵<0} i


where i is the vector of integers read from the file using type 83 (8-bit signed integers).

However, you can also, I think, do:

      'UTF-8'⎕UCS ⎕UCS c


where c are the chars read from the file using type 80.
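For anyone wanting to try both forms side by side, here is a minimal sketch. It assumes the Unicode Edition, and the file name 'myfile.txt' and the names tn, n, a and b are made up for illustration; it is not the function from the documentation.

```apl
⍝ Sketch only: 'myfile.txt' is a hypothetical file name
tn ← 'myfile.txt' ⎕NTIE 0          ⍝ tie the native file, 0 = next free tie number
n  ← ⎕NSIZE tn                     ⍝ file size in bytes
i  ← ⎕NREAD tn 83 n 0              ⍝ bytes as signed 8-bit integers (¯128..127)
c  ← ⎕NREAD tn 80 n 0              ⍝ the same bytes as 8-bit characters
⎕NUNTIE tn                         ⍝ release the tie
a ← 'UTF-8' ⎕UCS {⍵+256×⍵<0} i     ⍝ integer-based form
b ← 'UTF-8' ⎕UCS ⎕UCS c            ⍝ character-based form
a ≡ b                              ⍝ compares the two decoded results
```

Both reads start at offset 0 and cover the whole file, so i and c describe the same bytes and only the representation differs.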

My question is, are these always equivalent? If so, any reason to prefer one over the other? I have not taken the time to compare the speeds yet.
Phil Last
Posts: 628
Joined: Thu Jun 18, 2009 6:29 pm
Location: Wessex

Re: Reading UTF 8 Files

Post by Phil Last »

I tried this on a 4.4MB UTF-8 file (the Project Gutenberg King James' Bible with a "⍠" added to line[0]). Once you have i and c in the workspace, running the expressions as posted is perhaps 2% quicker for c than for i. But isolating the difference, (⎕UCS c) vs ({⍵+256×⍵<0}i), makes working with c about 400% faster.

This huge difference presumably shrinks to 2% overall because the UTF-8 conversion is about 98% of the entire process.

I found no difference in the results.
Morten|Dyalog
Posts: 460
Joined: Tue Sep 09, 2008 3:52 pm

Re: Reading UTF 8 Files

Post by Morten|Dyalog »

paulmansour wrote: My question is, are these always equivalent? If so, any reason to prefer one over the other?


I believe they are equivalent. I would suggest:
1) Always prefer the simpler expression: It is easier to read, more likely to run fast, and more likely to be recognized as an idiom or by other future optimization mechanisms that Dyalog may introduce.
2) If simpler expressions turn out to run slower than equivalent complicated ones, complain to Dyalog and help increase the likelihood of the benefits mentioned above :-).
PMH
Posts: 7
Joined: Fri Nov 27, 2009 8:48 am

Re: Reading UTF 8 Files

Post by PMH »

You can replace

      {⍵+256×⍵<0}

with

      256|
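
A small worked example of why the two agree, as a sketch: the vector below is made up for illustration and holds the signed bytes of the UTF-8 encoding of '⍠' (hex E2 8D A0) followed by 'A'. Residue with a positive left argument always gives a non-negative result, so 256| maps ¯128..¯1 to 128..255 and leaves 0..127 alone, which is exactly what the dfn does.

```apl
i ← ¯30 ¯115 ¯96 65          ⍝ signed bytes E2 8D A0 41, i.e. UTF-8 for '⍠A'
{⍵+256×⍵<0} i                ⍝ 226 141 160 65
256|i                        ⍝ 226 141 160 65
({⍵+256×⍵<0} i) ≡ 256|i      ⍝ 1
'UTF-8' ⎕UCS 256|i           ⍝ decodes the bytes to the characters ⍠A
```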
DanB|Dyalog

Re: Reading UTF 8 Files

Post by DanB|Dyalog »

Just a note: you can do any of

      'UTF-8'⎕ucs {⍵+256×⍵<0}i
      'UTF-8'⎕ucs 256|i
      'UTF-8'⎕ucs ⎕ucs 80 ⎕dr i
      'UTF-8'⎕ucs ⎕ucs c

Timings reveal that 256| is better:

      ]cpu 256|v {⍵+256×⍵<0}v "⎕ucs 80 ⎕dr v" "⎕ucs c" -compare

  256|v         → 1.9E¯5 |    0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  {⍵+256×⍵<0}v  → 4.6E¯5 | +149% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  ⎕ucs 80 ⎕dr v → 4.0E¯5 | +114% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  ⎕ucs c        → 3.8E¯5 | +106% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

But this is only true as of 13.2, now that Roger has sped up some functions. Before 13.2, 256| is slower.
paulmansour
Posts: 431
Joined: Fri Oct 03, 2008 4:14 pm

Re: Reading UTF 8 Files

Post by paulmansour »

Thanks all for responses.

PMH, you humble me. I think about how much code I could throw away if I knew what I was doing!

Dan, thanks for the research. I'm going to 13.2 as we speak and looking forward to performance improvements.