Handling Missing Data Values-- potential for using NaN
Posted: Mon Jul 02, 2012 3:14 am
by petermsiegel
Has there been any thought to adding a mechanism for dealing with missing values in numeric arrays? I've been using R recently and it has some elegant moments in handling missing values as an "expected" occurrence. While inserting ⎕NULL values in the middle of APL arrays can work to flag missing values, other languages (SAS) encode NaN directly, e.g. with a special token "." (dot) for a general missing value, .A for one specific missing value type, .B, etc.
NaN is a valid code (actually, a family of codes) within the IEEE 754 floating point standard designed for this purpose. The nice thing about NaN is that you can create both quiet NaNs, which propagate (if a scalar s has value NaN, then s+1 is NaN, and so is (s+1)×2), and signaling NaNs, which in principle signal an immediate error.
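The propagation described above can be checked outside APL. The following is a minimal C sketch, assuming an IEEE 754 platform where `<math.h>` defines `NAN`; the helper names are hypothetical and only illustrate the s+1 and (s+1)×2 case from the post.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: the expression (s + 1) * 2 from the post. */
double add_one_then_double(double s)
{
    return (s + 1.0) * 2.0;
}

/* True when a quiet NaN fed through both expressions stays NaN. */
bool quiet_nan_propagates(void)
{
    double s = NAN;                  /* quiet NaN constant from <math.h> */
    return isnan(s + 1.0) && isnan(add_one_then_double(s));
}
```

Signaling NaNs are left out of the sketch because their trapping behaviour depends on the compiler and the floating point environment.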
The disadvantage of course is that NaN can not be encoded in integers, so values like -MAXINT are used (in APL, this would typically (by convention) force integers with missing values to be 4 bytes and make for occasionally odd behaviour).
As a related point, what happens now when INF or -INF are generated? If one were to encode a NaN value within an APL floating array (by building such a value using ⎕DR), what would happen?
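As an illustration of what "building such a value" from raw bits looks like at the IEEE 754 level, here is a small C sketch. It assumes `double` is the 64-bit binary64 format, and the function names are hypothetical. It says nothing about how an APL interpreter would treat such a value, which is the open question in the post.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double (assumed to be binary64). */
double double_from_bits(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);     /* well-defined way to copy the bits */
    return d;
}

/* Exponent all ones, top fraction bit set: a common quiet NaN pattern. */
double make_quiet_nan(void)
{
    return double_from_bits(UINT64_C(0x7FF8000000000000));
}

/* Exponent all ones, fraction zero: positive infinity. */
double make_pos_inf(void)
{
    return double_from_bits(UINT64_C(0x7FF0000000000000));
}
```

Under IEEE 754 arithmetic, infinity minus infinity is itself a NaN, so NaNs can appear even when none were stored deliberately.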
Re: Handling Missing Data Values-- potential for using NaN
Posted: Mon Jul 02, 2012 2:24 pm
by PMH
petermsiegel wrote: The disadvantage of course is that NaN can not be encoded in integers, so values like -MAXINT are used (in APL, this would typically (by convention) force integers with missing values to be 4 bytes and make for occasionally odd behaviour).
Not necessarily, based on these definitions...
#define MAXUINT8 ((UINT8)~((UINT8)0))
#define MAXINT8 ((INT8)(MAXUINT8 >> 1))
#define MININT8 ((INT8)~MAXINT8)
#define MAXUINT16 ((UINT16)~((UINT16)0))
#define MAXINT16 ((INT16)(MAXUINT16 >> 1))
#define MININT16 ((INT16)~MAXINT16)
#define MAXUINT32 ((UINT32)~((UINT32)0))
#define MAXINT32 ((INT32)(MAXUINT32 >> 1))
#define MININT32 ((INT32)~MAXINT32)
#define MAXUINT64 ((UINT64)~((UINT64)0))
#define MAXINT64 ((INT64)(MAXUINT64 >> 1))
#define MININT64 ((INT64)~MAXINT64)
... you could instead define the min values one bigger...
#define MININT8 ((INT8)-MAXINT8)
#define MININT16 ((INT16)-MAXINT16)
#define MININT32 ((INT32)-MAXINT32)
#define MININT64 ((INT64)-MAXINT64)
...to get the 0x80/0x8000/0x80000000 representations available for the NaNs:
#define NaNINT8 ((INT8)~MAXINT8)
#define NaNINT16 ((INT16)~MAXINT16)
#define NaNINT32 ((INT32)~MAXINT32)
#define NaNINT64 ((INT64)~MAXINT64)
However, I think it is much more practical to limit NaNs to the real type only.
If any numerical array might contain a NaN, it will internally get converted to real.
That avoids a limitation of the 8/16/32 bit integer space.
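To make the integer variant concrete, here is a rough C sketch of 8-bit addition with the 0x80 pattern reserved as the NaN marker. The names `NAN_INT8`, `is_nan_int8` and `add_int8_with_nan` are hypothetical, and mapping out-of-range results to the marker is an assumption of the sketch, not something proposed in the thread.

```c
#include <stdint.h>
#include <stdbool.h>

#define NAN_INT8 INT8_MIN            /* bit pattern 0x80, reserved as the marker */

bool is_nan_int8(int8_t v) { return v == NAN_INT8; }

/* Add two 8-bit values. The marker propagates, and results outside
   -127..127 also become the marker (a policy assumed for this sketch). */
int8_t add_int8_with_nan(int8_t a, int8_t b)
{
    if (is_nan_int8(a) || is_nan_int8(b))
        return NAN_INT8;
    int sum = (int)a + (int)b;       /* widen so the addition cannot overflow */
    if (sum > INT8_MAX || sum < -INT8_MAX)
        return NAN_INT8;
    return (int8_t)sum;
}
```

Every arithmetic primitive would need the same extra checks, which is one practical argument for confining NaNs to the real type.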
BTW
Doesn't there exist a single Unicode character for []NULL and for NaN, and if not, why not? These []NULLs are definitely not lookers.
Re: Handling Missing Data Values-- potential for using NaN
Posted: Mon Jul 02, 2012 7:30 pm
by Morten|Dyalog
petermsiegel wrote: Has there been any thought to adding a mechanism for dealing with missing values in numeric arrays?
Yes, this is a topic that surfaces from time to time. We end up with the same answer each time round the loop, and our answer is that there is no general "missing value" strategy that works well for APL arrays. R and SAS are a bit closer to being "applications"; they focus on statistical processing and have a bunch of special functions with meaningful NaN handling built in. APL primitives are much more general and we are unable to come up with useful definitions for the behaviour of APL primitives when faced with arrays containing NaNs.
We can't even get really simple examples like avg←{+/⍵÷⍴⍵} to work when ⍵ contains NaNs.
All the APL application code I ever saw would identify missing values, typically turn them into 0's so that +/ works (it is not very useful if +/x returns NaN if any element of x is NaN), and then have very application-specific strategies for dealing with them.
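The averaging problem can be shown in a few lines of C, assuming IEEE 754 doubles. Both function names are hypothetical. The second function is only one application-specific policy (skip the missing entries), which is the kind of choice the post argues cannot be built into general primitives.

```c
#include <math.h>
#include <stddef.h>

/* Plain mean: a single NaN anywhere in x makes the result NaN. */
double mean_plain(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum / (double)n;
}

/* Mean over the non-NaN elements only; NaN when there are none. */
double mean_skip_nan(const double *x, size_t n)
{
    double sum = 0.0;
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (!isnan(x[i])) {
            sum += x[i];
            count++;
        }
    }
    return count ? sum / (double)count : NAN;
}
```

Replacing missing values with 0 would keep the sum usable but still divide by the full length, so the denominator has to be adjusted as well.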
Re: Handling Missing Data Values-- potential for using NaN
Posted: Tue Jul 03, 2012 12:27 am
by petermsiegel
Thanks for the two very thoughtful responses.
Re: integer encodings of missing values,
Not necessarily, based on these definitions...
my thinking was that if one were to build an integer array, one wouldn't choose for the missing value 1-MAXINT8 when it is 8-bits, then 1-MAXINT16 when it is 16-bits, etc. Not only would it create several missing integer values to contend with, but it likely would cause missing and actual values to collide. If you choose -2*31 as the conventional (one and only) missing value, then you can use anything but -2*31 as actual values, though it forces a more compact array to 4-bytes when a missing value is seen. Your approach works if you adopt a convention for each integer size, but you then need to be sure nothing conspires to change the integer sizes (because the 16-bit missing value is not the same as the 32-bit value). Of course, one could use Decimals as if integers and avoid the problem, but that is just out of the frying pan into the fire.
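The width-change hazard can be sketched in C as follows. The per-width markers and function names are hypothetical. The point is only that a plain widening conversion preserves the numeric value, so the 8-bit marker turns into an ordinary 16-bit number unless every conversion remaps it.

```c
#include <stdint.h>
#include <stdbool.h>

#define MISSING_INT8  INT8_MIN       /* -128, marker for 8-bit data */
#define MISSING_INT16 INT16_MIN      /* -32768, marker for 16-bit data */

bool is_missing_int16(int16_t v) { return v == MISSING_INT16; }

/* Naive widening: copies the value, so the 8-bit marker becomes the
   ordinary 16-bit value -128 and is no longer recognised as missing. */
int16_t widen_naive(int8_t v) { return (int16_t)v; }

/* Marker-aware widening: remaps the 8-bit marker to the 16-bit one. */
int16_t widen_checked(int8_t v)
{
    return v == MISSING_INT8 ? MISSING_INT16 : (int16_t)v;
}
```

A single fixed marker such as -2*31 avoids the remapping, at the cost of forcing 4-byte storage as described above.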
Re: Handling Missing Data Values-- potential for using NaN
Posted: Tue Jul 03, 2012 11:46 am
by PMH
What do you think about an extension of []DIV?
Currently,
[]DIV = 0 returns for 1÷0 a Domain Error,
[]DIV = 1 returns for 1÷0 a 0,
and
[]DIV = 2 could return for 1÷0 a NaN.
A NaN could be displayed as Unicode U+26A0 (⚠)
(see http://unicode.org/review/resolved-pri.html#pri74)
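The three modes listed above can be modelled in C roughly as follows. This is only a sketch of the proposal: the enum, the result struct and `divide_with_mode` are hypothetical names, mode 2 does not exist in any interpreter as far as this thread states, and the sketch treats every zero divisor like the 1÷0 case without modelling 0÷0 separately.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical modes mirroring the post: 0 = error, 1 = zero, 2 = NaN. */
typedef enum { DIV_ERROR = 0, DIV_ZERO = 1, DIV_NAN = 2 } div_mode;

/* ok is false when the division would raise a domain error. */
typedef struct { bool ok; double value; } div_result;

/* Divide a by b; a zero divisor is handled according to mode. */
div_result divide_with_mode(double a, double b, div_mode mode)
{
    div_result r = { true, 0.0 };
    if (b != 0.0) {
        r.value = a / b;
        return r;
    }
    switch (mode) {
    case DIV_ZERO: r.value = 0.0; break;   /* like []DIV = 1 */
    case DIV_NAN:  r.value = NAN; break;   /* the proposed []DIV = 2 */
    default:       r.ok = false;  break;   /* like []DIV = 0 */
    }
    return r;
}
```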
Re: Handling Missing Data Values-- potential for using NaN
Posted: Tue Jul 03, 2012 2:16 pm
by paulmansour
I think it is a huge mistake to bring in NaNs or nulls. It may solve some problems, but will only introduce many more subtle and intractable problems.
My first question would be, what does NaN=NaN return, and why?
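For reference, IEEE 754 itself answers the question one way: a NaN compares unequal to everything, including itself. The C sketch below shows that rule next to one possible "missing-aware" alternative. Both function names are hypothetical, and neither is claimed to be what an APL primitive would or should do.

```c
#include <math.h>
#include <stdbool.h>

/* IEEE 754 equality: false whenever either operand is NaN. */
bool ieee_equal(double a, double b)
{
    return a == b;
}

/* One possible alternative: two NaNs are treated as matching. */
bool missing_equal(double a, double b)
{
    return (isnan(a) && isnan(b)) || a == b;
}
```

Whichever rule is chosen, anything that relies on equality to find or match values inherits the consequences, which is the kind of subtle problem the post warns about.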
Re: Handling Missing Data Values-- potential for using NaN
Posted: Tue Jul 03, 2012 9:32 pm
by petermsiegel
Nice conversation.
All approaches have tradeoffs, including complexity. To be sure, having NaNs as a default creates headaches, just as having complex numbers does (e.g. where taking the square root of a negative number is a programming mistake). I think we all know that there could be advantages for some community members and of course costs to the vendor to support. The fact that ⎕NULL is becoming a de facto placeholder for imported missing values (for Excel missing values and for APL-to-R interfaces) makes the question more interesting. ⎕VFI, by the way, provides some useful support for importing aberrant values, which I use now in importing missing values (coded as letters) and replacing them by NULLs.
What about leaving the semantics mostly as they are now, but allowing a special value, say ⎕NAN, to be encoded within a floating point value, but otherwise stored as a "pointer," much as ⎕NULL is?
For example, let I be an integer array, where I[5]←⎕NAN, and R a real array, where R[5]←⎕NAN. R[5] would be a floating point number per se with NaN encoding. For the integer case, the integer at I[5] would be "replaced" (as now) by a pointer object, with (pointers to) integers in all other positions (say). R would be a floating array (perhaps there can be some indicator that there is at least one NaN value), but I would be a pointer object (as now).
Not perfect, but it would make floating arrays useful for propagating NaNs, without necessarily having the full semantics. (We can speculate whether one could be given a choice of ignoring NaNs or causing TRAPs. By default, ⎕NAN would have the semantics of ⎕NULL.)
Re: Handling Missing Data Values-- potential for using NaN
Posted: Thu Jul 05, 2012 12:27 pm
by DanB|Dyalog
I don't think NaNs are necessary nor are they a good idea.
In all the places I have seen them in use (typically missing values in timeseries) it was easy to produce code to handle NaNs by faking them with special values not seen in the data.
In one place I worked we had special user defined operators to apply fns to deal with NaNs (they were 1E18 in the data) so we could have first, last, +/, average, etc. taking into account the presence of those 1E18s.
Easy.
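The sentinel approach described above might look like this in C. It is a minimal sketch, not the original APL operators: `MISSING` uses the 1E18 value named in the post, while the `reducer` type and all function names are hypothetical.

```c
#include <stddef.h>

#define MISSING 1e18                 /* sentinel value named in the post */

typedef double (*reducer)(double acc, double x);

double add_values(double acc, double x) { return acc + x; }
double keep_last(double acc, double x)  { (void)acc; return x; }

/* Hypothetical "operator": folds f over x, skipping sentinel entries. */
double reduce_present(reducer f, double init, const double *x, size_t n)
{
    double acc = init;
    for (size_t i = 0; i < n; i++)
        if (x[i] != MISSING)
            acc = f(acc, x[i]);
    return acc;
}

/* Mean of the entries that are present; MISSING when there are none. */
double mean_present(const double *x, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (x[i] != MISSING)
            count++;
    return count ? reduce_present(add_values, 0.0, x, n) / (double)count : MISSING;
}
```

The approach relies on the sentinel never occurring as a real value in the data, as the post notes.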
Re: Handling Missing Data Values-- potential for using NaN
Posted: Fri Jul 06, 2012 12:26 pm
by giangiquario
DanB|Dyalog wrote: I don't think NaNs are necessary nor are they a good idea.
In all the places I have seen them in use (typically missing values in timeseries) it was easy to produce code to handle NaNs by faking them with special values not seen in the data.
In one place I worked we had special user defined operators to apply fns to deal with NaNs (they were 1E18 in the data) so we could have first, last, +/, average, etc. taking into account the presence of those 1E18s.
Easy.
I do agree.
Twenty years ago I was developing collection and processing of statistical data (VSAPL and APL PLUS PC). There was a lot of missing data, but I could easily deal with it (the data were numerical matrices plus something similar to 1E8). Afterwards, with APL2, I used defined operators to apply arithmetical functions.