Reading UTF 8 Files

Posted: Wed Feb 13, 2013 2:27 pm
by paulmansour
Dyalog provides a function in the documentation to read a UTF-8 file, where the heart of the code is:

      'UTF-8' ⎕UCS {⍵+256×⍵<0} i


where i are ints read from a file using type 83.

However, you can also, I think, do:

      'UTF-8'⎕UCS ⎕UCS c


where c are the chars read from the file using type 80.

My question is, are these always equivalent? If so, any reason to prefer one over the other? I have not taken the time to compare the speeds yet.
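
A minimal sketch of how the two expressions might be compared without a file. It simulates both reads from a short sample string; the names sample, b, i and c are illustrative only. It assumes that type 83 yields signed bytes (¯128 to 127) and that type 80 yields chars with code points 0 to 255, as in the Unicode edition. Agreement on one sample does not prove the expressions are always equivalent.

```apl
sample←'Aé⍠'                         ⍝ 1-byte, 2-byte and 3-byte characters
b←'UTF-8' ⎕UCS sample                 ⍝ UTF-8 bytes as ints 0..255
i←b-256×b>127                         ⍝ simulated type 83 read (signed bytes)
c←⎕UCS b                              ⍝ simulated type 80 read (chars 0..255)
viaInts←'UTF-8' ⎕UCS {⍵+256×⍵<0} i    ⍝ route 1: fix the sign, then decode
viaChars←'UTF-8' ⎕UCS ⎕UCS c          ⍝ route 2: chars to ints, then decode
viaInts≡viaChars                      ⍝ 1 if the two routes agree
```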

Re: Reading UTF 8 Files

Posted: Wed Feb 13, 2013 5:13 pm
by Phil Last
With a 4.4MB UTF-8 file (the Project Gutenberg King James Bible with a "⍠" added to line[0]), once you have i and c in the ws, running the expressions as posted is perhaps 2% quicker for c than for i. But isolating the difference, (⎕UCS c) vs ({⍵+256×⍵<0}i), makes working with c about 400% faster.

This huge difference all but disappears in the full expressions, presumably because the UTF-8 conversion is about 98% of the entire process.

I found no difference in the results.

Re: Reading UTF 8 Files

Posted: Thu Feb 14, 2013 8:56 am
by Morten|Dyalog
paulmansour wrote: My question is, are these always equivalent? If so, any reason to prefer one over the other?


I believe they are equivalent. I would suggest:
1) Always prefer the simpler expression: It is easier to read, more likely to run fast, and more likely to be recognized as an idiom or by other future optimization mechanisms that Dyalog may introduce.
2) If simpler expressions turn out to run slower than equivalent complicated ones, complain to Dyalog and help increase the likelihood of the benefits mentioned above :-).

Re: Reading UTF 8 Files

Posted: Thu Feb 14, 2013 5:38 pm
by PMH
You can replace

{⍵+256×⍵<0}

with

256|
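
A quick way to convince yourself of this, sketched under the assumption that type 83 only ever yields values from ¯128 to 127: build every possible value and compare the two functions. The name all is illustrative.

```apl
all←(⍳256)-128+⎕IO               ⍝ ¯128..127 whatever ⎕IO is
(256|all)≡{⍵+256×⍵<0}all         ⍝ 1: same result for every signed byte
```

Residue adds 256 to the negative values and leaves the others alone, which is what the dfn spells out by hand.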

Re: Reading UTF 8 Files

Posted: Thu Feb 14, 2013 8:47 pm
by DanB|Dyalog
Just a note:
you can do any of
'UTF-8'⎕ucs {⍵+256×⍵<0}i
'UTF-8'⎕ucs 256|i
'UTF-8'⎕ucs ⎕ucs 80 ⎕dr i
'UTF-8'⎕ucs ⎕ucs c

Timings reveal that 256| is better

]cpu 256|v {⍵+256×⍵<0}v "⎕ucs 80 ⎕dr v" "⎕ucs c" -compare

256|v → 1.9E¯5 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
{⍵+256×⍵<0}v → 4.6E¯5 | +149% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
⎕ucs 80 ⎕dr v → 4.0E¯5 | +114% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
⎕ucs c → 3.8E¯5 | +106% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

But this is only true in 13.2, now that Roger has sped up some functions. Before 13.2, 256| is slower.
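
Since the ranking depends on the version, it may be worth re-running the comparison on your own interpreter. A rough sketch, assuming the cmpx function from the dfns workspace is available; v and c here are random stand-ins for the file data, not a real read:

```apl
v←(?100000⍴256)-128+⎕IO          ⍝ random signed bytes ¯128..127 (stand-in for type 83)
c←⎕UCS 256|v                     ⍝ the same bytes as chars (stand-in for type 80)
'cmpx' ⎕CY 'dfns'                ⍝ copy the timing utility
cmpx '256|v' '{⍵+256×⍵<0}v' '⎕UCS c'
```

The absolute numbers will differ from those above; only the relative order on your version matters.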

Re: Reading UTF 8 Files

Posted: Sat Feb 16, 2013 3:51 pm
by paulmansour
Thanks all for responses.

PMH, you humble me. I think about how much code I could throw away if I knew what I was doing!

Dan, thanks for the research. I'm going to 13.2 as we speak and looking forward to performance improvements.