
Why can itertools.groupby group NaNs in a list but not in a numpy array?

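I'm having a hard time debugging a problem in which a float nan in a list and a nan in a numpy.array are handled differently when passed to itertools.groupby.

Given the following list and array: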

from itertools import groupby
import numpy as np

lst = [np.nan, np.nan, np.nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, np.nan, 0.16]
arr = np.array(lst)

When I iterate over the list, the consecutive nans are grouped together:

>>> for key, group in groupby(lst):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan, nan, nan] <class 'float'>
nan [nan] <class 'float'>

However, if I use the array, the consecutive nans are put in separate groups:

>>> for key, group in groupby(arr):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>

The same happens even if I convert the array back to a list:

>>> for key, group in groupby(arr.tolist()):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan] <class 'float'>
nan [nan] <class 'float'>
nan [nan] <class 'float'>
nan [nan] <class 'float'>

I am using:

numpy 1.11.3
python 3.5

I know that nan != nan, so why do these operations give different results? And how is it possible for groupby to group nans at all?



1> Alex Riley..:

Python lists are just arrays of pointers to objects in memory. In particular, lst holds pointers to the single object np.nan:

>>> [id(x) for x in lst]
[139832272211880, # nan
 139832272211880, # nan
 139832272211880, # nan
 139832133974296,
 139832270325408,
 139832133974296,
 139832133974464,
 139832133974320,
 139832133974296,
 139832133974440,
 139832272211880, # nan
 139832133974296]

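(np.nan is at 139832272211880 on my computer.)

NumPy arrays, on the other hand, are just contiguous regions of memory: regions of bits and bytes that NumPy interprets as a sequence of values (floats, ints, etc.).

The trouble is that when you ask Python to iterate over a NumPy array holding floating-point values (in a for loop or at the groupby level), Python needs to box those bytes into proper Python objects. It creates a brand new Python object in memory for each individual value in the array as it iterates.

For example, you can see that a distinct object is created for each nan value when .tolist() is called: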

>>> [id(x) for x in arr.tolist()]
[4355054616, # nan
 4355054640, # nan
 4355054664, # nan
 4355054688,
 4355054712,
 4355054736,
 4355054760,
 4355054784,
 4355054808,
 4355054832,
 4355054856, # nan
 4355054880]

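itertools.groupby is able to group the np.nan values in the Python list because it checks for identity first when it compares Python objects. Because these pointers to nan all point to the same np.nan object, grouping is possible.

Iterating over the NumPy array, however, does not allow this initial identity check to succeed, so Python falls back to checking for equality, and nan != nan, as you say.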
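The identity-first behaviour described above can be reproduced without NumPy. The following is a minimal sketch, assuming CPython, where each call to float("nan") returns a new object; the helper name group_sizes is made up for this illustration.

```python
from itertools import groupby

nan = float("nan")

# Three references to one and the same nan object, like lst in the question.
shared = [nan, nan, nan, 1.0]

# Three separately created nan objects, like the values boxed from an array.
distinct = [float("nan") for _ in range(3)] + [1.0]


def group_sizes(values):
    """Return the length of every consecutive group that groupby produces."""
    return [len(list(group)) for _, group in groupby(values)]


print(group_sizes(shared))    # [3, 1]
print(group_sizes(distinct))  # [1, 1, 1, 1]
```

The shared list groups its nans because the identity check succeeds; the distinct list falls through to ==, which is False for any pair of nans.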



2> MSeifert..:

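The answers by tobias_k and ajcr are correct: the nans in the list all have the same id, while they have different ids when they are "iterated over" in the numpy array.

This answer is meant as a supplement to those answers.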

>>> from itertools import groupby
>>> import numpy as np

>>> lst = [np.nan, np.nan, np.nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, np.nan, 0.16]
>>> arr = np.array(lst)

>>> for key, group in groupby(lst):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274500321192 [1274500321192, 1274500321192, 1274500321192]
nan 1274500321192 [1274500321192]

>>> for key, group in groupby(arr):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274537130480 [1274537130480]
nan 1274537130504 [1274537130504]
nan 1274537130480 [1274537130480]
nan 1274537130480 [1274537130480]  # same id as before but these are not consecutive

>>> for key, group in groupby(arr.tolist()):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274537130336 [1274537130336]
nan 1274537130408 [1274537130408]
nan 1274500320904 [1274500320904]
nan 1274537130168 [1274537130168]

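The point is that Python uses the PyObject_RichCompare operation when comparing values with ==, and that operation only falls back to object identity if == fails because it is not implemented. itertools.groupby, on the other hand, uses PyObject_RichCompareBool (see its source), which tests for object identity first, before == is tried.

This can be verified with a small cython snippet: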

%load_ext cython
%%cython

from cpython.object cimport PyObject_RichCompareBool, PyObject_RichCompare, Py_EQ

def compare(a, b):
    return PyObject_RichCompare(a, b, Py_EQ), PyObject_RichCompareBool(a, b, Py_EQ)

>>> compare(np.nan, np.nan)
(False, True)

The source code of PyObject_RichCompareBool reads as follows:

/* Perform a rich comparison with object result.  This wraps do_richcompare()
   with a check for NULL arguments and a recursion check. */

/* Perform a rich comparison with integer result.  This wraps
   PyObject_RichCompare(), returning -1 for error, 0 for false, 1 for true. */
int
PyObject_RichCompareBool(PyObject *v, PyObject *w, int op)
{
    PyObject *res;
    int ok;

    /* Quick result when objects are the same.
       Guarantees that identity implies equality. */
    /**********************That's the difference!****************/
    if (v == w) {
        if (op == Py_EQ)
            return 1;
        else if (op == Py_NE)
            return 0;
    }

    res = PyObject_RichCompare(v, w, op);
    if (res == NULL)
        return -1;
    if (PyBool_Check(res))
        ok = (res == Py_True);
    else
        ok = PyObject_IsTrue(res);
    Py_DECREF(res);
    return ok;
}

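The object identity test (if (v == w)) is indeed done before the normal Python comparison PyObject_RichCompare(v, w, op); is used, and this is mentioned in its documentation:

Note:

If o1 and o2 are the same object, PyObject_RichCompareBool() will always return 1 for Py_EQ and 0 for Py_NE.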
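For readers without a Cython setup, the difference between the two comparison routines can also be observed from pure Python. This is only a sketch: rich_compare_bool_eq is a hypothetical Python emulation of the Py_EQ branch shown above, not a real API, and the list examples rely on CPython's list comparison and membership test taking the same identity shortcut.

```python
nan = float("nan")


def rich_compare_bool_eq(a, b):
    """Hypothetical emulation of PyObject_RichCompareBool(a, b, Py_EQ)."""
    # Identity is checked first; only then is the ordinary == consulted.
    return a is b or bool(a == b)


print(nan == nan)                        # False: plain rich comparison
print(rich_compare_bool_eq(nan, nan))    # True: identity shortcut
print([nan] == [nan])                    # True: same object in both lists
print(nan in [nan])                      # True: membership also checks identity
print(float("nan") in [nan])             # False: a different nan object
```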



3> tobias_k..:

I'm not sure whether this is the reason, but I just noticed this about the nan in lst and arr:

>>> lst[0] == lst[1], arr[0] == arr[1]
(False, False)
>>> lst[0] is lst[1], arr[0] is arr[1]
(True, False)

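That is, while all the nans are unequal, the regular np.nan values (of type float) are all the same instance, whereas the nans in arr are different instances (of type numpy.float64). So my guess is that if no key function is given, groupby tests for identity before doing the more expensive equality check.

This is also consistent with the observation that nothing is grouped with arr.tolist() either: even though those nans are float again, they are no longer the same instance.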

>>> atl = arr.tolist()
>>> atl[0] is atl[1]
False
