Combining Take and Drop on a matrix for efficiency?

General APL language issues
petermsiegel
Posts: 159
Joined: Thu Nov 11, 2010 11:04 pm

Re: Combining Take and Drop on a matrix for efficiency?

Post by petermsiegel »

Regarding ⎕NA, I would naively pass an input (read-only) pointer to the input array, avoiding decoding the (undocumented) APL header.

Two questions arise for others, because the documentation doesn't say.
  • - Does Dyalog's ⎕NA copy an input array before passing to the associated function its pointer (i.e. a pointer to a copy of the user's array)? Faster, but less safe, would be to pass a pointer to the object in the workspace.
    - If the APL array that is being passed in is of rank 2 or higher, will ⎕NA accept it as is and point to its first element? Or are you required to ravel it first?
In any case, I've always ravelled arrays in advance-- you have to do the array index arithmetic yourself anyway on the C-side (not as nice as the seaside)-- but I never had an application where an extra array copy was time-critical, as it might be here.

Regarding decoding the header of an APL array in external code: it's easy to discover, but possibly unstable across operating systems and major releases, so I've only done it in the days when auxiliary processors roamed the earth. If read-only pointers to input arrays are efficiently handled, it shouldn't be necessary.
User avatar
StefanoLanzavecchia
Posts: 113
Joined: Fri Oct 03, 2008 9:37 am

Re: Combining Take and Drop on a matrix for efficiency?

Post by StefanoLanzavecchia »

The read-only pointer has a few shortcomings:
1) you don't know the type of the array, which means you don't know the size in bytes (or bits...) of each element;
2) you don't know its rank and shape, which you'd have to pass as extra arguments;
3) you don't have a way to build a result of the proper shape and type and fill it with the section of the original array you want to extract.

If all you have are pointers to a read-only buffer, you will never be able to create the result. If, instead, you use standard []NA kinds for the result, then you pay the price of the extra copy going out of []NA.
petermsiegel
Posts: 159
Joined: Thu Nov 11, 2010 11:04 pm

Re: Combining Take and Drop on a matrix for efficiency?

Post by petermsiegel »

Very interesting. Indeed, an APL cover function is needed to validate and ravel the input array, set aside a properly-sized output (return) array, then pass the arrays and their sizes to the C (or equiv.) routine, along with the take and drop parameters; the APL routine then would reshape the return array (which has to be a vector) into the expected matrix. Some, but not all, of this is needed for an APL-only solution.

The question is: With the copying that ⎕NA requires, will it nonetheless be faster than Take-Drop currently is (with its extra copying, for large arrays? And does it matter?

Cheers!
User avatar
StefanoLanzavecchia
Posts: 113
Joined: Fri Oct 03, 2008 9:37 am

Re: Combining Take and Drop on a matrix for efficiency?

Post by StefanoLanzavecchia »

The ravel and the reshape are one copy each. Plus one copy from C memory space to the workspace memory space for the returned array. Which is 3 copies, instead of just the two required by a take and a drop. I don't think this is going to fly. In particular the first copy (the ravel) acts on all the data, which means that it's worse than either the take or the drop (depending on which one is applied first) because it works on more data. And the data copied by the reshape is the same as the data copied by the second of the two primitives applied, so no loss there. But there's the copy out of []NA. So, all in all, given that this is pretty much an algorithm that is bound by RAM access and CPU caches, I think drop + take will win. Also, Dyalog has 30 years of experience dealing with fast slicing of arrays of all types. I would not want to compete with whomever wrote the code to slice out of 1bit per element multidimensional arrays. I never bothered to study SSE...

There's is actually a way to avoid the extra copies: what was possible in the 00's with auxiliary processors is still possible in 2024 but it's not longer publicly documented. If you ask Dyalog they'll tell you how to acquire the secret knowledge.

Still, there remains the issue of implementing a fast slicing algorithm which is beyond my capabilities.
petermsiegel
Posts: 159
Joined: Thu Nov 11, 2010 11:04 pm

Re: Combining Take and Drop on a matrix for efficiency?

Post by petermsiegel »

Understood re copies galore. Didn't realize Dyalog (unlike some older [IBM] systems) doesn't alias the ravel of a (large) object, i.e. generate a new header that points to the same payload. I'd have thought that with objects over a smallish threshold, it would be worthwhile.

I tested it and timings suggest it indeed does not. Live and learn.

In any case, new idioms and/or extensions seem the best approach to Rav's problem.
petermsiegel
Posts: 159
Joined: Thu Nov 11, 2010 11:04 pm

Re: Combining Take and Drop on a matrix for efficiency?

Post by petermsiegel »

Just as a mental exercise, I wrote a ⎕NA-ready routine in C to do the TakeDrop efficiently (the take-drop loop is tiny and fast). The routine uses addresses (A) in an attempt to minimize the overhead of any copying (and it requires no array flattening or numeric conversions), but to no avail:
  • the "declaration" is: 'I4 =A <A I4 I4 I4 I4 I4 I4',
    signifying: offset outArr inArr nRows nCols takeRows takeCols dropRows dropCols
It's slower than indexing or the use of and , perhaps due to the required duplication of the large APL array into the C environment. That's not unexpected, but the magnitude of slowdown suggests either the copying via ⎕NA is slower than within APL (but why?) or some other overhead in ⎕NA is required.

It appears the copying of the input array <A overwhelms any advantage of really fast C processing. (The output array uses =A to copy in a "prototype" replaced by the output, but in my tests it was too small to greatly impact the results.)

Entertaining!

Code: Select all

⍝ Let ¨in¨ be an 32-bit integer array of shape 3001 by 4002 
⍝ tR tC← 1000 999                   ⍝ Take shape 
⍝ dR dC← 998 997                    ⍝ Drop shape 
⍝ cmpx results:
  in[dR+⍳tR;dC+⍳tC]       → 7.2E¯4 |     0% ⎕⎕                                      
  (dR+⍳tR)(dC+⍳tC)⌷in     → 7.5E¯4 |    +4% ⎕⎕                                      
  tR tC↑dR dC↓in          → 3.7E¯3 |  +409% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕                             
* C Fn doing no work      → 1.3E¯2 | +1703% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕  
  C Fn doing ↑...↓        → 1.3E¯2 | +1744% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ 
Post Reply