Reading UTF 8 Files

paulmansour · Post by **paulmansour** » Wed Feb 13, 2013 2:27 pm

Dyalog provides a function in the documentation to read a UTF-8 file, where the heart of the code is:

      'UTF-8' ⎕UCS {⍵+256×⍵<0} i

where i are ints read from a file using type 83.

However, you can also, I think, do:

      'UTF-8'⎕UCS ⎕UCS c

where c are the chars read from the file using type 80.
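For context, here is a minimal sketch of the two reads being compared, assuming a Unicode edition of Dyalog. The names `ReadUTF8c` and `ReadUTF8i` are made up for illustration, and neither helper deals with a byte-order mark or with file errors.

```apl
⍝ Hypothetical helpers; ⍵ is a file name.
ReadUTF8c←{
    tn←⍵ ⎕NTIE 0                    ⍝ tie the file, letting the system pick a tie number
    c←⎕NREAD tn 80(⎕NSIZE tn)0      ⍝ whole file as 8-bit characters (type 80)
    _←⎕NUNTIE tn
    'UTF-8'⎕UCS ⎕UCS c              ⍝ chars → code points 0..255 → decoded text
}
ReadUTF8i←{
    tn←⍵ ⎕NTIE 0
    i←⎕NREAD tn 83(⎕NSIZE tn)0      ⍝ whole file as signed 8-bit integers (type 83)
    _←⎕NUNTIE tn
    'UTF-8'⎕UCS{⍵+256×⍵<0}i         ⍝ map ¯128..¯1 to 128..255, then decode
}
```

Both are called with a file name as the right argument, for example `ReadUTF8c 'some.txt'`, where the file name is a placeholder.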

My question is, are these always equivalent? If so, any reason to prefer one over the other? I have not taken the time to compare the speeds yet.

Phil Last · Post by **Phil Last** » Wed Feb 13, 2013 5:13 pm

I read a 4.4MB UTF-8 file (the Project Gutenberg King James Bible with a "⍠" added to line[0]). Once you have i and c in the workspace, running the expressions as posted is perhaps 2% quicker for c than for i. But isolating the part that differs, (⎕UCS c) vs ({⍵+256×⍵<0}i), shows the c version to be about 400% faster.

That huge isolated difference shrinks to about 2% overall, presumably because the UTF-8 conversion is about 98% of the entire process.

I found no difference in the results.

Morten|Dyalog · Post by **Morten|Dyalog** » Thu Feb 14, 2013 8:56 am

paulmansour wrote: My question is, are these always equivalent? If so, any reason to prefer one over the other?

I believe they are equivalent. I would suggest:
1) Always prefer the simpler expression: It is easier to read, more likely to run fast, and more likely to be recognized as an idiom or by other future optimization mechanisms that Dyalog may introduce.
2) If simpler expressions turn out to run slower than equivalent complicated ones, complain to Dyalog and help increase the likelihood of the benefits mentioned above :-).

PMH · Post by **PMH** » Thu Feb 14, 2013 5:38 pm

You can replace

{⍵+256×⍵<0}

with

256|
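The substitution works because residue with a positive left argument always returns a non-negative result, so `256|` maps ¯128..¯1 to 128..255 and leaves 0..127 alone. A quick check over every possible signed byte value (the variable name `i` is just for illustration):

```apl
i←¯128+(⍳256)-⎕IO            ⍝ every signed byte value, ¯128..127, whatever ⎕IO is
(256|i)≡{⍵+256×⍵<0}i          ⍝ 1 if the two forms agree on all 256 values
256|¯30 ¯115 ¯96 65           ⍝ 226 141 160 65
```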

DanB|Dyalog · Post by **DanB|Dyalog** » Thu Feb 14, 2013 8:47 pm

Just a note: you can do any of

      'UTF-8'⎕ucs {⍵+256×⍵<0}i
      'UTF-8'⎕ucs 256|i
      'UTF-8'⎕ucs ⎕ucs 80 ⎕dr i
      'UTF-8'⎕ucs ⎕ucs c
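As a small worked example of these four forms, the sketch below uses the three UTF-8 bytes of "⍠" (E2 8D A0) followed by "A" instead of data read from a file. The names `r1` to `r4` are made up. The `80 ⎕DR` form reinterprets the stored bytes rather than converting values, so it relies on `i` being held as 8-bit integers, as it is when read with type 83; that is assumed here for the literal as well.

```apl
i←¯30 ¯115 ¯96 65             ⍝ bytes E2 8D A0 41 as signed integers: UTF-8 for '⍠A'
c←⎕UCS 226 141 160 65         ⍝ the same four bytes as characters
r1←'UTF-8'⎕UCS{⍵+256×⍵<0}i
r2←'UTF-8'⎕UCS 256|i
r3←'UTF-8'⎕UCS ⎕UCS 80 ⎕DR i  ⍝ assumes 83=⎕DR i (8-bit integer storage)
r4←'UTF-8'⎕UCS ⎕UCS c
(r1≡r2)∧(r2≡r3)∧(r3≡r4)       ⍝ 1 when all four forms decode to the same text
```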

Timings reveal that 256| is better:

      ]cpu 256|v {⍵+256×⍵<0}v "⎕ucs 80 ⎕dr v" "⎕ucs c" -compare

256|v → 1.9E¯5 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
{⍵+256×⍵<0}v → 4.6E¯5 | +149% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
⎕ucs 80 ⎕dr v → 4.0E¯5 | +114% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
⎕ucs c → 3.8E¯5 | +106% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

But this is only true in 13.2, now that Roger has sped up some functions. Before 13.2, 256| was slower.
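Since the ranking depends on the interpreter version, it may be worth repeating the comparison locally. One way, sketched here as an alternative to the `]cpu` user command, is the `cmpx` function from the `dfns` workspace supplied with Dyalog. The test data below is made up, and its internal type may differ from data read from a file with type 83, so the figures will not necessarily match those above.

```apl
'cmpx'⎕CY'dfns'                  ⍝ copy the timing utility from the dfns workspace
v←¯128+(?100000⍴256)-⎕IO         ⍝ made-up test data: random values in ¯128..127
c←⎕UCS 256|v                     ⍝ the same bytes as characters
cmpx'256|v' '{⍵+256×⍵<0}v' '⎕UCS c'
```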

paulmansour · Post by **paulmansour** » Sat Feb 16, 2013 3:51 pm

Thanks all for responses.

PMH, you humble me. I think about how much code I could throw away if I knew what I was doing!

Dan, thanks for the research. I'm moving to 13.2 as we speak and am looking forward to the performance improvements.

Dyalog Forums
