* versus ⍟ performance

APL-related discussions - a stream of APL consciousness.
Not sure where to start a discussion ? Here's the place to be
Forum rules
This forum is for discussing APL-related issues. If you think that the subject is off-topic, then the Chat forum is probably a better place for your thoughts !
Post Reply
Roger|Dyalog
Posts: 238
Joined: Thu Jul 28, 2011 10:53 am

* versus ⍟ performance

Post by Roger|Dyalog »

0. A benchmark was posted in the APL Orchard on 2021-02-27:

      b←1+?⍨1e5  ⍝ with ⎕io←0; ⎕io delenda est!

cmpx '0.1*b' '5⍟b' '10⍟b' '17⍟b'
0.1*b → 9.0E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* 5⍟b → 1.4E¯3 | -84% ⎕⎕⎕⎕⎕⎕
* 10⍟b → 1.2E¯3 | -87% ⎕⎕⎕⎕⎕
* 17⍟b → 1.4E¯3 | -84% ⎕⎕⎕⎕⎕⎕

The questions are, why is ⍺*⍵ so much slower than ⍺⍟⍵? And shouldn't power be faster than logarithm?

Explaining the relative speeds of * and ⍟ (both monadic and dyadic) is complicated.

1. *⍵ is computed by exp() and ⍟⍵ is computed by log() or log2() or log10(), C library functions all.

2. Mathematically and in the Dyalog implementation, ⍺⍟⍵ ←→ (⍟⍵)÷⍟⍺.

Mathematically, ⍺*⍵ ←→ *⍵×⍟⍺. In the Dyalog implementation, when ⍺≥0, ⍺*⍵ is computed by pow(), a C library function.

3. scalar⍟⍵ invokes different C library functions when scalar is 2 (log2), *1 (log), or 10 (log10). These do not make a big difference in speed.

      e←*1
cmpx '2⍟b' 'e⍟b' '10⍟b' '5⍟b' '17⍟b'
2⍟b → 1.70E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* e⍟b → 1.09E¯3 | -37% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* 10⍟b → 1.18E¯3 | -31% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* 5⍟b → 1.39E¯3 | -19% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* 17⍟b → 1.38E¯3 | -19% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

In fact:

      cmpx '2⍟b' '(⍟b)÷⍟2'
2⍟b → 1.70E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
(⍟b)÷⍟2 → 8.94E¯4 | -48% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

cmpx '10⍟b' '(⍟b)÷⍟10'
10⍟b → 1.18E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
(⍟b)÷⍟10 → 8.96E¯4 | -25% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
x←(⍴b)⍴10
cmpx 'x⍟b' '(⍟b)÷⍟x'
x⍟b → 2.82E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
(⍟b)÷⍟x → 1.56E¯3 | -45% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

I suspect the verboser APL expressions are faster because the divides are done using vector instructions, whereas the Dyalog implementation of ⍺⍟⍵ does scalar divides in a C loop, log(⍵[j])/log(⍺[j]) or log(⍵[j])/d, where d←log(scalar).

4. As stated above, ⍺*⍵ when ⍺≥0 is done using the C library function pow(). Again, using a mathematically equivalent expression, *⍵×⍟⍺, is faster, I suspect for the same reason why (⍟⍵)÷⍟⍺ is faster than ⍺⍟⍵--the divides are done using vector instructions.

      cmpx '0.1*b' '*b×⍟0.1'
0.1*b → 7.79E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* *b×⍟0.1 → 4.19E¯3 | -47% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

reldiff←{⌈/,(|⍺-⍵)÷(|⍺)⌈(|⍵)} ⍝ maximum relative difference
(0.1*b) reldiff *b×⍟0.1
8.72981E¯14

(The last expression indicates that 0.1*b and *b×⍟0.1 are "equal", despite the asterisk in the first column of the cmpx result.)

In addition, I suspect 0.1*b and *b×⍟0.1 suffer in speed from hitting lots of underflows. (⍺*⍵ for finite arguments should not be 0, other than underflows. Likewise *⍵.)

      0.1*12345
0
+/ 0 = 0.1*b
99677

What if there are very few or no underflows?

      b1←b÷1e3
+/ 0 = 0.1*b1
0
cmpx '0.1*b1' '*b1×⍟0.1'
0.1*b1 → 5.91E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* *b1×⍟0.1 → 1.14E¯3 | -81% ⎕⎕⎕⎕⎕⎕
(0.1*b1) reldiff *b1×⍟0.1
3.12521E¯14

5. In a hypothetical situation where I get to write the C library functions, I believe log() can be faster than exp():

You can have a precomputed table so that ⍟⍵ only need to be computed in detail in the range ÷√2 and √2 (about 0.7 and 1.4). For ⍵ in that range, equation 4.1.27 in Abramowitz and Stegun

      ⍟⍵ ←→ 2 × +/ j ÷⍨ ((⍵-1)÷(⍵+1)) * j←1+2×⍳n

provides a fast converging series for t←(⍵-1)÷(⍵+1), or 0.171573 or less. The terms of the series are odd powers of t, which means each term is smaller than the previous by a factor of t*2, or 0.0294373, which declines to 0 rapidly. See the J Wiki essay, Extended Precision Functions.

On the other hand, *⍵ would be computed by +/(⍵*j)÷!j←⍳n. You can have a precomputed table here too so that *⍵ only need to use the series for small |⍵ (<1), but you are still operating on both even and odd powers.

6. Let us do one big happy benchmark:

      x←?1e5⍴0
y←?1e5⍴0
cmpx '⍟x' 'x⍟y' '(⍟y)÷⍟x' '*x' 'x*y' '*y×⍟x'
⍟x → 6.43E¯4 | 0% ⎕⎕⎕
* x⍟y → 2.81E¯3 | +337% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* (⍟y)÷⍟x → 1.59E¯3 | +148% ⎕⎕⎕⎕⎕⎕⎕⎕
* *x → 1.11E¯3 | +72% ⎕⎕⎕⎕⎕
* x*y → 6.21E¯3 | +865% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* *y×⍟x → 1.82E¯3 | +182% ⎕⎕⎕⎕⎕⎕⎕⎕⎕

If you compare ⍟x to *x (6.43e¯4÷⍨1.11e¯3 or 1.73), and (⍟y)÷⍟x to *y×⍟x (1.59e¯3÷⍨1.82e¯3 or 1.14), the differences aren't so large, certainly not large enough (no greater than a factor of 2) to sweat over.

7. Some morals of this story:

  • There are lies, damned lies, and benchmarks.
  • Is it really worth it to have extra code for 2⍟⍵ and (*1)⍟⍵ and 10⍟⍵, when the benefits are not that significant?
  • Speed improvements made in one area (e.g. divide using vector instructions) create opportunities in other areas.
Finally, I don't recommend that y'all start coding (⍟⍵)÷(⍟⍺) instead of ⍺⍟⍵, or *⍵×⍟⍺ instead of ⍺*⍵. Personally, I expect Dyalog will improve * and ⍟ sometime, although not in the immediate future, and certainly not to the extent of writing replacements for C library functions.
Post Reply