Finding duplicates in vector

alexeyv · Post by **alexeyv** » Sat Mar 26, 2016 3:53 pm

Thanks, this is interesting. With the nubsieve the solution for the original task would be rather trivial.

Another question crossing my mind again (after my previous post about sort function) is what all above constructions will fail as soon as we require small change in the task: find duplicates in vector of strings case-insensitive. In this case neither index-of nor unique will help: there is no way to parametrize their behavior with the custom equality function.

In functional programming (for example in Common Lisp) there is always a possibility to parametrize the behavior of algorithm by specifying some predicate/equality functions. This holds true for generic programming in C++ as well (STL algorithms designed to be parametrized). However it seems there is no way to do it in APL without reimplementing the basic functions by the user.

Maybe it could be possible in a future to provide an option (via lexical closure or special variables like ⎕ML) to define temporarily the behavior/predicates for the family of functions which selects items like dyadic ⍳,∊, ∪, ⍋/⍒ etc?

Roger|Dyalog · Post by **Roger|Dyalog** » Sat Mar 26, 2016 11:24 pm

The specific example you described, finding duplicated case-insensitive strings, is very easy:

      dup tolower¨x

If you want an arbitrary equality function, that's not difficult either (as was shown for sorting, in one rather short line), but you pay the price in time because the primitives are highly optimized for the usual equality.

Roger|Dyalog · Post by **Roger|Dyalog** » Sat Mar 26, 2016 11:32 pm

For completeness, timings on the various function variants on larger arguments:

      y←?1e4⍴9e3
      cmpx 'd y' 'd1 y' 'f y' 'c y' 'c1 y'
  d y  → 4.88E¯5 |  0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  d1 y → 4.93E¯5 | +1% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  f y  → 4.89E¯5 |  0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  c y  → 4.89E¯5 |  0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  c1 y → 4.88E¯5 |  0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

That is, for larger arguments the fixed cost of interpretation (parsing, etc.) is amortized over a larger number of arguments, and all the variants currently take the same time. This will remain the case until such time when the compiler is bolstered with more sophisticated techniques such as loop fusion, algebraic simplification, etc.

Roger|Dyalog · Post by **Roger|Dyalog** » Sun Mar 27, 2016 5:54 am

I thought I'd compare the times for the variants of {⍵⌿⍨(⍳≢⍵)≠⍳⍨⍵} to that for a hand-coded C solution ("foo"). I expected the C solution to be no more than 2 times faster, and then say something about APL being worthwhile despite being slower because it's so much faster to write than C. However:

      cmpx 'd y' 'd1 y' 'f y' 'c y' 'c1 y' 'foo y'
  d y   → 5.06E¯5 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕    
  d1 y  → 5.01E¯5 |  -1% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕     
  f y   → 4.96E¯5 |  -2% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕     
  c y   → 5.00E¯5 |  -2% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕     
  c1 y  → 5.00E¯5 |  -2% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕     
  foo y → 5.90E¯5 | +16% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

I believe the explanation is the same as that given in the Magic Function blog post: The APL primitives involved (here mainly ⍳⍨ and ⌿) are faster than a casually-written C program.

      d  ← {((⍳≢⍵)≠⍵⍳⍵)/⍵}
      f  ← ⊢ (⌿⍨) ⍳∘≢ ≠ ⍳⍨ 
      d1 ← {⍵⌿⍨(⍳≢⍵)≠⍳⍨⍵}

      c ←d  ⋄ 2(400⌶)'c' 
      c1←d1 ⋄ 2(400⌶)'c1'

      y←?1e4⍴9e3

alexeyv · Post by **alexeyv** » Sun Mar 27, 2016 8:01 am

Thanks, that is interesting. I was thinking about more general approach, then I would like to reuse the same algorithm for more complex cases - case-insensitive strings is just an example, it could potentially be say instances of classes or arrays of instances. However I guess the simplest approach is not to reuse functional programming practices, but rather extract necessary data to the form acceptable by APL (arrays of strings or numbers) and perform manipulations on these forms.

Interesting comparison to trivial C implementation - this basically means what going the C route should be necessary only if really justified with heavy experimentation for array-manipulating tasks.

Dyalog Forums

Finding duplicates in vector

Re: Finding duplicates in vector

Re: Finding duplicates in vector

Re: Finding duplicates in vector

Re: Finding duplicates in vector

Re: Finding duplicates in vector