Unique Values and Associated Indices

paulmansour · Post by **paulmansour** » Wed Jul 27, 2011 3:22 pm

Browsing the J archives I came across an interesting thread on aggregation:

http://jsoftware.com/pipermail/programming/2011-May/022868.html

The J solution seems to be an operator of some sort, named "key", which Roger Hui helpfully gives a dyalog implmentation of:

Code: Select all

Key←{⍺⍺¨(↓(∪⍺)∘.=⍺)/¨⊂⍵}

Roger also gives a better one:

Code: Select all

Key←{⎕ML←3 ⋄ ⍺⍺¨i[j]⊂⍵[j←⍋i←1+⍺⍳⍺]}

You can see that the operator applies its operand function to the right argument partitioned by the unique values of the left argument:

Code: Select all

      1 2 3 1 2 2 3 +/ Key 100 200 50 50 70 50 75
150 320 125

It was noted in the J thread that this solution does not fully answer the orginal question, which was to produce the unique values of the partition vector as well as the aggregated values. The J solution seems to solve this problem by using the Key operator as is, and then laminating on the uniques, though I could be wrong as I can't read the J that well. I don't know how you can separate the determination of the uniques and the partition by unique vales and still get good performance.

In my work, a "key" operator is not that useful (or rather computationally wasteful), as there are going to be repeated aggregations and repeated partitioning of different data all based on the same partition vector. In other words, the operand and the right argument to key will vary but the left argument will stay the same over many calls. So I have a function that looks something like:

Code: Select all

 
UniquesAndIndices←{
     g←⍋⍵
     s←⍵[g]
     f←s≠¯1⌽s
     f[⍳×⍴f]←1
     p←f{⎕ML←0 ⋄ ⍺⊂⍵}g
     u←f/s
     u p
 }

That produces uniques and indices, that can then be saved and used repeatedly:

Code: Select all

]disp   UniquesAndIndices   1 2 3 1 2 2 3 
┌→────────────────────────────────┐
│ ┌→────┐ ┌→────────────────────┐ │
│ │1 2 3│ │ ┌→──┐ ┌→────┐ ┌→──┐ │ │
│ └~────┘ │ │0 3│ │1 4 5│ │2 6│ │ │
│         │ └~──┘ └~────┘ └~──┘ │ │
│         └∊────────────────────┘ │
└∊────────────────────────────────┘

This function uses the sort-shift-and compare technique I got many years ago from Gary Berguist's still indispensible book "APL- Advanced techniques and Utilities".

When I saw Roger's version of "key" using dyadic iota I thought maybe after all these years there is a faster way to do this, so I tried this:

Code: Select all

UniquesAndIndices1←{        
     i←1+⍵⍳⍵
     j←⍋i
     k←i[j]{⎕ML←3 ⋄ ⍺⊂⍵}j
     u←⍵[↑¨k]
     u k
 }

But it turns out old shift and compare technique appears to be almost twice as fast as the second version using dyadic iota.

I'm sure Roger didn't spend a lot of time optimizing his Dyalog solution to Key (I'd be very interested to see what he would come up with if he did.), but my general question to the forum is, are there any faster techniques for UniquesAndIndices now with all the new stuff over the years, with idiom recognition and sored hash tables and what not? Or is the pre-nested array APL gem still supreme?

This general problem would seem to be at the core of many commercial APL systems. I'm curious as to how other people solve it.

Roger|Dyalog · Post by **Roger|Dyalog** » Thu Jul 28, 2011 6:12 pm

"Key" is a vast topic but I will try to do it justice here. (Fools rush in, etc.)

a. Concerning efficiency, you have to take into account whether key is a primitive operator and whether the interpreter recognizes common phrases involving key and implement them with special code. In J, key (/.) is primitive, and the following phrases are recognized: http://www.jsoftware.com/help/dictionary/special.htm

Code: Select all

#/.        # of occurrences
f//.       f is plus, minus, times, max, min,  etc.
</. i.@#   indices
({.,#)/.   uniques and # of occurrences
(+/%#)/.   mean

If key is not primitive, it'd be too much to expect the interpreter to recognize a Dop which implements +/Key, for example.

b. If the keys (the argument x) are to be used repeatedly and are "difficult", then you'd want to do a dyadic iota x1←x⍳x once, and thereafter use x1 instead of x as the key. x1, being small range integers, is "easy". An example of a "difficult" x would be a matrix of names, or a vector of floating point numbers with non-zero comparison tolerance. The essay http://www.jsoftware.com/jwiki/Essays/Index_in_Nub discusses this topic further.

In fact, if the keys are to be used repeatedly, I would precompute and save two values from x:

Code: Select all

p←⍋x
b←(1↓t,¯1)≠t←x[p]

Then x +/Key y (for example) can be computed as (p b) keysum y where keysum←{p b←⍺ ⋄ s-¯1↓0,s←b/+\⍵[p]} . This is the "sort-shift-compare" technique that you mentioned, but I save the result of sort (p) and the compare (b).

This assumes that you are OK with keysum giving the answers in a different order than +/Key .

I did a benchmark in J comparing x +//.y and the p-b technique, and:

Code: Select all

   x=: ?1e6 $ 1e4
   y=: 0.01 * _4000 + ?1e6$10000
   p=: /: x
   b=: (}.t,_1)~:t=: p{x
   
   10 timer 'x+//.y'
0.0170098
   10 timer 's-}:0,s=: b#+/\p{y'
0.049308

That is, the J primitive "key" with special code for +//., beats the best that you can do with "sort-shift-compare" with preprocessing.

In fact:

Code: Select all

   10 timer 'x i. x'
0.0165311
   10 timer 'p{y'
0.033441

So the dyadic iota on x, which internally x +//. y does every time, beats p{y. (x in this case are small range
integers and are "easy".)

c. A couple difference between J and Dyalog that are relevant here: x i. y in J works with any arrays x and y, whereas the Dyalg x⍳y requires that x be a vector; /:x in J can grade anything http://www.jsoftware.com/jwiki/Essays/The_TAO_of_J but different expressions are required for to grade x in Dyalog (e.g. if x is an enclosed array).

d. Additional discussion on Key: http://www.jsoftware.com/jwiki/Essays/Key

paulmansour · Post by **paulmansour** » Fri Jul 29, 2011 7:34 pm

Roger,

Thanks for you extensive reply. I'm trying to digest it all. The key operator certainly bears considerable study!

Paul

Roger|Dyalog · Post by **Roger|Dyalog** » Fri Jul 29, 2011 10:07 pm

While waiting for the BAA meeting to start I had an idea, viz., there is an alternative method in Dyalog APL for doing the key sum, using the +← construct:

Code: Select all

 keysum←{        
     z←(⍴∪⍺)⍴0   
     z[(∪⍺)⍳⍺]+←⍵
     z           
 }    

      'mississippi' ,[¯0.5] ⍳11
m i s s i s s i p p  i
0 1 2 3 4 5 6 7 8 9 10

      'mississippi' keysum ⍳11
0 22 16 17

A benchmark comparing this method with the sort-shift-compare method previously discussed (with preprocessing to compute P and B):

Code: Select all

      x←?1e6⍴1e4
      y←0.01× ¯4000+?1e6⍴10000

      P←⍋x
      B←(1↓t,¯1)≠t←x[P]

      timer 's-¯1↓0,s←B/+\y[P]'
0.125
      timer 'x keysum y'
0.078

(Note: SSC and keysum produce results in different order.)

For a fairer comparison, keysum too should be allowed to do preprocessing, to compute the number of unique keys and the indices in the unique keys.

Code: Select all

      N←⍴∪x
      NI←(∪x)⍳x

 keysum1←{   
     z←N⍴0   
     z[NI]+←⍵
     z       
 }      

      timer 'keysum1 y'
0.078

In this case it made no difference in time, because x, being small-range integers, is "easy". Note also that for "difficult" x the phrase (∪x)⍳x should be replaced by the equivalent (∪i)⍳i←x⍳x for reasons explained in http://www.jsoftware.com/jwiki/Essays/Index_in_Nub.

Dyalog Forums

Unique Values and Associated Indices

Unique Values and Associated Indices

Re: Unique Values and Associated Indices

Re: Unique Values and Associated Indices

Re: Unique Values and Associated Indices