I'm having a hard time debugging a problem in which a float nan in a list and a nan in a numpy.array are handled differently when passed to itertools.groupby.
Given the following list and array:
from itertools import groupby
import numpy as np

lst = [np.nan, np.nan, np.nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, np.nan, 0.16]
arr = np.array(lst)
When I iterate over the list, the contiguous nans are grouped:
>>> for key, group in groupby(lst):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan, nan, nan] <class 'float'>
nan [nan] <class 'float'>
However, if I use the array, it puts the contiguous nans in different groups:
>>> for key, group in groupby(arr):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
The same happens even if I convert the array back to a list:
>>> for key, group in groupby(arr.tolist()):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan] <class 'float'>
nan [nan] <class 'float'>
nan [nan] <class 'float'>
nan [nan] <class 'float'>
I'm using:
numpy 1.11.3
python 3.5
I know that, in general, nan != nan, so why do these operations give different results? And how is it possible that groupby can group nans at all?
Python lists are just arrays of pointers to objects in memory. In particular, lst holds pointers to the object np.nan:
>>> [id(x) for x in lst]
[139832272211880, # nan
 139832272211880, # nan
 139832272211880, # nan
 139832133974296,
 139832270325408,
 139832133974296,
 139832133974464,
 139832133974320,
 139832133974296,
 139832133974440,
 139832272211880, # nan
 139832133974296]
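(np.nan is at 139832272211880 on my computer.)
NumPy arrays, on the other hand, are just contiguous regions of memory: regions of bits and bytes that NumPy interprets as a sequence of values (floats, ints, and so on).
The trouble is that when you ask Python to iterate over a NumPy array holding floating-point values (in a for-loop or at the groupby level), Python needs to box these bytes into a proper Python object. It creates a brand new Python object in memory for each single value in the array as it iterates.
For example, you can see that a distinct object is created for each nan value when .tolist() is called: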
>>> [id(x) for x in arr.tolist()]
[4355054616, # nan
 4355054640, # nan
 4355054664, # nan
 4355054688,
 4355054712,
 4355054736,
 4355054760,
 4355054784,
 4355054808,
 4355054832,
 4355054856, # nan
 4355054880]
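The same boxing effect can be seen without NumPy. The following is only a sketch, assuming CPython and using the standard-library array module as a stand-in for a NumPy array (it also stores raw C doubles rather than Python objects); the variable names are illustrative.

```python
from array import array

nan = float("nan")
# Stand-in for a NumPy array: stores raw C doubles, not Python objects.
packed = array("d", [nan, nan, nan])

# Converting to a list boxes every stored double into a new float object.
first_pass = list(packed)

# All boxed objects are alive at once, so their ids are pairwise distinct.
distinct_within_pass = len({id(x) for x in first_pass}) == len(first_pass)
# None of them is the original nan object that was used to fill the array.
shares_original = any(x is nan for x in first_pass)

print(distinct_within_pass, shares_original)  # True False
```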
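itertools.groupby is able to group on np.nan for the Python list because it checks for identity first when it compares Python objects. Because these pointers to nan all point at the same np.nan object, grouping is possible.
However, iteration over the NumPy array does not allow this initial identity check to succeed, so Python falls back to checking for equality, and nan != nan, as you say.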
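The difference between a shared nan object and distinct nan objects can be reproduced with plain Python floats. This is a minimal sketch, assuming CPython; group_sizes is a hypothetical helper introduced only for this illustration.

```python
from itertools import groupby

def group_sizes(values):
    """Return the length of each consecutive group produced by groupby."""
    return [len(list(group)) for _, group in groupby(values)]

shared = float("nan")
# Three references to the *same* nan object, like lst in the question.
same_object = [shared, shared, shared, 1.0]
# Three *different* nan objects, like the values boxed from the array.
distinct = [float("nan"), float("nan"), float("nan"), 1.0]

print(group_sizes(same_object))  # [3, 1]
print(group_sizes(distinct))     # [1, 1, 1, 1]
```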
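The answers of tobias_k and ajcr are correct: the nans in the list all have the same id, while they have different ids when they are "iterated over" in the numpy array.
This answer is meant as a supplement to those answers.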
>>> from itertools import groupby
>>> import numpy as np
>>> lst = [np.nan, np.nan, np.nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, np.nan, 0.16]
>>> arr = np.array(lst)

>>> for key, group in groupby(lst):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274500321192 [1274500321192, 1274500321192, 1274500321192]
nan 1274500321192 [1274500321192]

>>> for key, group in groupby(arr):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274537130480 [1274537130480]
nan 1274537130504 [1274537130504]
nan 1274537130480 [1274537130480]
nan 1274537130480 [1274537130480]  # same id as before but these are not consecutive

>>> for key, group in groupby(arr.tolist()):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274537130336 [1274537130336]
nan 1274537130408 [1274537130408]
nan 1274500320904 [1274500320904]
nan 1274537130168 [1274537130168]
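The problem is that Python uses the PyObject_RichCompare operation when comparing values, which only tests for object identity if == fails because it is not implemented. itertools.groupby, on the other hand, uses PyObject_RichCompareBool (see source: 1, 2), which tests for object identity first, before == is tested.
This can be verified with a small cython snippet: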
%load_ext cython

%%cython
from cpython.object cimport PyObject_RichCompareBool, PyObject_RichCompare, Py_EQ

def compare(a, b):
    return PyObject_RichCompare(a, b, Py_EQ), PyObject_RichCompareBool(a, b, Py_EQ)

>>> compare(np.nan, np.nan)
(False, True)
The source code of PyObject_RichCompareBool reads like this:
/* Perform a rich comparison with object result.  This wraps do_richcompare()
   with a check for NULL arguments and a recursion check. */

/* Perform a rich comparison with integer result.  This wraps
   PyObject_RichCompare(), returning -1 for error, 0 for false, 1 for true. */
int
PyObject_RichCompareBool(PyObject *v, PyObject *w, int op)
{
    PyObject *res;
    int ok;

    /* Quick result when objects are the same.
       Guarantees that identity implies equality. */
    /**********************That's the difference!****************/
    if (v == w) {
        if (op == Py_EQ)
            return 1;
        else if (op == Py_NE)
            return 0;
    }

    res = PyObject_RichCompare(v, w, op);
    if (res == NULL)
        return -1;
    if (PyBool_Check(res))
        ok = (res == Py_True);
    else
        ok = PyObject_IsTrue(res);
    Py_DECREF(res);
    return ok;
}
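The object identity test (if (v == w)) is indeed done before the normal Python comparison PyObject_RichCompare(v, w, op); is used, and this is mentioned in its documentation:
Note:
If o1 and o2 are the same object, PyObject_RichCompareBool() will always return 1 for Py_EQ and 0 for Py_NE.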
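The identity shortcut is also visible from pure Python, without Cython, because CPython's built-in containers compare their elements in the same identity-first way. A minimal sketch, assuming CPython; the variable names are illustrative only.

```python
nan = float("nan")
other = float("nan")  # a second, distinct nan object

# The plain == operator has no identity shortcut: nan is never equal to itself.
plain_equal = nan == nan              # False

# List equality and membership tests check identity before equality.
same_obj_lists = [nan] == [nan]       # True: both lists hold the same object
same_obj_in = nan in [nan]            # True

# With two different nan objects the shortcut does not apply.
diff_obj_lists = [nan] == [other]     # False
diff_obj_in = nan in [other]          # False

print(plain_equal, same_obj_lists, same_obj_in, diff_obj_lists, diff_obj_in)
# False True True False False
```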
I am not sure whether this is the reason, but I just noticed this about the nan in lst and arr:
>>> lst[0] == lst[1], arr[0] == arr[1]
(False, False)
>>> lst[0] is lst[1], arr[0] is arr[1]
(True, False)
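That is, while all nan are unequal, the regular np.nan (of type float) are all the same instance, whereas the nan in arr are different instances of type numpy.float64. So my guess is that if no key function is given, groupby tests for identity before doing the more expensive equality check.
This is also consistent with the observation that arr.tolist() is not grouped either: even though those nan are float again, they are no longer the same instance.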
>>> atl = arr.tolist()
>>> atl[0] is atl[1]
False