Unique Values and Associated Indices
Posted: Wed Jul 27, 2011 3:22 pm
Browsing the J archives I came across an interesting thread on aggregation:
http://jsoftware.com/pipermail/programming/2011-May/022868.html
The J solution seems to be an operator of some sort, named "key", which Roger Hui helpfully gives a dyalog implmentation of:
Roger also gives a better one:
You can see that the operator applies its operand function to the right argument partitioned by the unique values of the left argument:
It was noted in the J thread that this solution does not fully answer the orginal question, which was to produce the unique values of the partition vector as well as the aggregated values. The J solution seems to solve this problem by using the Key operator as is, and then laminating on the uniques, though I could be wrong as I can't read the J that well. I don't know how you can separate the determination of the uniques and the partition by unique vales and still get good performance.
In my work, a "key" operator is not that useful (or rather computationally wasteful), as there are going to be repeated aggregations and repeated partitioning of different data all based on the same partition vector. In other words, the operand and the right argument to key will vary but the left argument will stay the same over many calls. So I have a function that looks something like:
That produces uniques and indices, that can then be saved and used repeatedly:
This function uses the sort-shift-and compare technique I got many years ago from Gary Berguist's still indispensible book "APL- Advanced techniques and Utilities".
When I saw Roger's version of "key" using dyadic iota I thought maybe after all these years there is a faster way to do this, so I tried this:
But it turns out old shift and compare technique appears to be almost twice as fast as the second version using dyadic iota.
I'm sure Roger didn't spend a lot of time optimizing his Dyalog solution to Key (I'd be very interested to see what he would come up with if he did.), but my general question to the forum is, are there any faster techniques for UniquesAndIndices now with all the new stuff over the years, with idiom recognition and sored hash tables and what not? Or is the pre-nested array APL gem still supreme?
This general problem would seem to be at the core of many commercial APL systems. I'm curious as to how other people solve it.
http://jsoftware.com/pipermail/programming/2011-May/022868.html
The J solution seems to be an operator of some sort, named "key", which Roger Hui helpfully gives a dyalog implmentation of:
Code: Select all
Key←{⍺⍺¨(↓(∪⍺)∘.=⍺)/¨⊂⍵}
Roger also gives a better one:
Code: Select all
Key←{⎕ML←3 ⋄ ⍺⍺¨i[j]⊂⍵[j←⍋i←1+⍺⍳⍺]}
You can see that the operator applies its operand function to the right argument partitioned by the unique values of the left argument:
Code: Select all
1 2 3 1 2 2 3 +/ Key 100 200 50 50 70 50 75
150 320 125
It was noted in the J thread that this solution does not fully answer the orginal question, which was to produce the unique values of the partition vector as well as the aggregated values. The J solution seems to solve this problem by using the Key operator as is, and then laminating on the uniques, though I could be wrong as I can't read the J that well. I don't know how you can separate the determination of the uniques and the partition by unique vales and still get good performance.
In my work, a "key" operator is not that useful (or rather computationally wasteful), as there are going to be repeated aggregations and repeated partitioning of different data all based on the same partition vector. In other words, the operand and the right argument to key will vary but the left argument will stay the same over many calls. So I have a function that looks something like:
Code: Select all
UniquesAndIndices←{
g←⍋⍵
s←⍵[g]
f←s≠¯1⌽s
f[⍳×⍴f]←1
p←f{⎕ML←0 ⋄ ⍺⊂⍵}g
u←f/s
u p
}
That produces uniques and indices, that can then be saved and used repeatedly:
Code: Select all
]disp UniquesAndIndices 1 2 3 1 2 2 3
┌→────────────────────────────────┐
│ ┌→────┐ ┌→────────────────────┐ │
│ │1 2 3│ │ ┌→──┐ ┌→────┐ ┌→──┐ │ │
│ └~────┘ │ │0 3│ │1 4 5│ │2 6│ │ │
│ │ └~──┘ └~────┘ └~──┘ │ │
│ └∊────────────────────┘ │
└∊────────────────────────────────┘
This function uses the sort-shift-and compare technique I got many years ago from Gary Berguist's still indispensible book "APL- Advanced techniques and Utilities".
When I saw Roger's version of "key" using dyadic iota I thought maybe after all these years there is a faster way to do this, so I tried this:
Code: Select all
UniquesAndIndices1←{
i←1+⍵⍳⍵
j←⍋i
k←i[j]{⎕ML←3 ⋄ ⍺⊂⍵}j
u←⍵[↑¨k]
u k
}
But it turns out old shift and compare technique appears to be almost twice as fast as the second version using dyadic iota.
I'm sure Roger didn't spend a lot of time optimizing his Dyalog solution to Key (I'd be very interested to see what he would come up with if he did.), but my general question to the forum is, are there any faster techniques for UniquesAndIndices now with all the new stuff over the years, with idiom recognition and sored hash tables and what not? Or is the pre-nested array APL gem still supreme?
This general problem would seem to be at the core of many commercial APL systems. I'm curious as to how other people solve it.