My data looks like this:
                  Close  a    b    c    d    e      Time
    2015-12-03  2051.25  5    4    3    1    1  05:00:00
    2015-12-04  2088.25  5    4    3    1  NaN  06:00:00
    2015-12-07  2081.50  5    4    3  NaN  NaN  07:00:00
    2015-12-08  2058.25  5    4  NaN  NaN  NaN  08:00:00
    2015-12-09  2042.25  5  NaN  NaN  NaN  NaN  09:00:00
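I need to count "horizontally", that is, per row, the values in columns ['a'] through ['e'] that are not NaN. The result would look like this: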
    df['Count'] = .....
    df

                  Close  a    b    c    d    e      Time  Count
    2015-12-03  2051.25  5    4    3    1    1  05:00:00      5
    2015-12-04  2088.25  5    4    3    1  NaN  06:00:00      4
    2015-12-07  2081.50  5    4    3  NaN  NaN  07:00:00      3
    2015-12-08  2058.25  5    4  NaN  NaN  NaN  08:00:00      2
    2015-12-09  2042.25  5  NaN  NaN  NaN  NaN  09:00:00      1
Thanks
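You can select the columns of interest from your df and call count, passing axis=1 so that non-NaN values are counted across each row: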
    In [24]: df['count'] = df[list('abcde')].count(axis=1)
             df

    Out[24]:
                  Close  a    b    c    d    e      Time  count
    2015-12-03  2051.25  5    4    3    1    1  05:00:00      5
    2015-12-04  2088.25  5    4    3    1  NaN  06:00:00      4
    2015-12-07  2081.50  5    4    3  NaN  NaN  07:00:00      3
    2015-12-08  2058.25  5    4  NaN  NaN  NaN  08:00:00      2
    2015-12-09  2042.25  5  NaN  NaN  NaN  NaN  09:00:00      1
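For anyone who wants to reproduce this outside an IPython session, here is a minimal self-contained sketch. The way the frame is built (a plain dict, dates parsed with pd.to_datetime, Time kept as strings) is my assumption about the question's data, not something the question states.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question (construction details are assumed).
df = pd.DataFrame(
    {
        "Close": [2051.25, 2088.25, 2081.50, 2058.25, 2042.25],
        "a": [5, 5, 5, 5, 5],
        "b": [4, 4, 4, 4, np.nan],
        "c": [3, 3, 3, np.nan, np.nan],
        "d": [1, 1, np.nan, np.nan, np.nan],
        "e": [1, np.nan, np.nan, np.nan, np.nan],
        "Time": ["05:00:00", "06:00:00", "07:00:00", "08:00:00", "09:00:00"],
    },
    index=pd.to_datetime(
        ["2015-12-03", "2015-12-04", "2015-12-07", "2015-12-08", "2015-12-09"]
    ),
)

# Select only the columns of interest, then count non-NaN values per row.
df["count"] = df[list("abcde")].count(axis=1)

print(df["count"].tolist())  # [5, 4, 3, 2, 1]
```

Selecting the columns first keeps Close and Time out of the count, which would otherwise add two to every row.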
Timings
    In [25]: %timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
             %timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
             %timeit df[list('abcde')].count(axis=1)

    100 loops, best of 3: 3.28 ms per loop
    100 loops, best of 3: 2.76 ms per loop
    100 loops, best of 3: 2.98 ms per loop
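apply is the slowest, which is no surprise. The drop version is marginally faster, but semantically I prefer passing the list of columns of interest and calling count, for readability.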
Hmm, the timings keep varying between runs:
    In [27]: %timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
             %timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
             %timeit df[list('abcde')].count(axis=1)
             %timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)

    100 loops, best of 3: 3.33 ms per loop
    100 loops, best of 3: 2.7 ms per loop
    100 loops, best of 3: 2.7 ms per loop
    100 loops, best of 3: 2.57 ms per loop
More timings
    In [160]: %timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
              %timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
              %timeit df[list('abcde')].count(axis=1)
              %timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)
              %timeit df[list('abcde')].notnull().sum(axis=1)

    1000 loops, best of 3: 1.4 ms per loop
    1000 loops, best of 3: 1.14 ms per loop
    1000 loops, best of 3: 1.11 ms per loop
    1000 loops, best of 3: 1.11 ms per loop
    1000 loops, best of 3: 1.05 ms per loop
It seems that testing with notnull and then summing (notnull produces a boolean mask, and True sums as 1) is slightly faster on this dataset.
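To make the mask-and-sum idea concrete, here is a small sketch on a cut-down, three-column frame. The frame is hypothetical and only there to illustrate the intermediate mask.

```python
import numpy as np
import pandas as pd

# Hypothetical three-column frame, just to show the intermediate mask.
df = pd.DataFrame(
    {
        "a": [5, 5, 5],
        "b": [4, 4, np.nan],
        "c": [3, np.nan, np.nan],
    }
)

# notnull() gives a boolean frame: True where a value is present.
mask = df[list("abc")].notnull()

# Summing booleans along axis=1 treats True as 1, giving a per-row count.
counts = mask.sum(axis=1)

print(counts.tolist())  # [3, 2, 1]
```

Both routes should give the same numbers as count(axis=1); the only difference is whether the mask is materialised as an intermediate result.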
On a 50k-row df, the last method is slightly faster:
    In [172]: %timeit df[['a', 'b', 'c', 'd', 'e']].apply(lambda x: sum(x.notnull()), axis=1)
              %timeit df.drop(['Close', 'Time'], axis=1).count(axis=1)
              %timeit df[list('abcde')].count(axis=1)
              %timeit df[['a', 'b', 'c', 'd', 'e']].count(axis=1)
              %timeit df[list('abcde')].notnull().sum(axis=1)

    1 loops, best of 3: 5.83 s per loop
    100 loops, best of 3: 6.15 ms per loop
    100 loops, best of 3: 6.49 ms per loop
    100 loops, best of 3: 6.04 ms per loop
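If you want to rerun the comparison without IPython's %timeit, something like the following sketch should work with the standard timeit module. The 50k-row frame here is random data I made up (roughly 30% NaN), not the frame used above, so the absolute numbers will differ by machine and pandas version. The row-wise apply variant is left out because it takes seconds at this size.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical 50k-row frame with about 30% NaN (not the original data).
rng = np.random.default_rng(0)
values = rng.random((50_000, 5))
values[values < 0.3] = np.nan
big = pd.DataFrame(values, columns=list("abcde"))

cols = list("abcde")
by_count = big[cols].count(axis=1)
by_notnull = big[cols].notnull().sum(axis=1)

# Time each variant over a handful of runs and report the best one.
t_count = min(timeit.repeat(lambda: big[cols].count(axis=1), number=5, repeat=3))
t_notnull = min(
    timeit.repeat(lambda: big[cols].notnull().sum(axis=1), number=5, repeat=3)
)

print(f"count(axis=1):         {t_count / 5 * 1e3:.2f} ms per call")
print(f"notnull().sum(axis=1): {t_notnull / 5 * 1e3:.2f} ms per call")
```

Whichever comes out ahead on your setup, both give identical counts, so readability is a reasonable tiebreaker.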