Suppose I compute a column in pandas, e.g. C, by multiplying A with B and a float when a certain condition is met, using apply with a lambda function:
    import pandas as pd

    df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                       'B': [9, 8, 7, 6, 5, 4, 3, 2, 1]})

    df['C'] = df.apply(lambda x: x.A if x.B > 5 else 0.1*x.A*x.B, axis=1)
The expected result would be:

       A  B    C
    0  1  9  1.0
    1  2  8  2.0
    2  3  7  3.0
    3  4  6  4.0
    4  5  5  2.5
    5  6  4  2.4
    6  7  3  2.1
    7  8  2  1.6
    8  9  1  0.9
The problem is that this code is slow, and I need to run this operation on a dataframe with around 56 million rows.
The %timeit result of the lambda operation above is:

    1000 loops, best of 3: 1.63 ms per loop
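Judging from the computation time and the memory usage on my large dataframe, I presume this operation builds intermediate series while doing the calculation.

I tried formulating it in different ways, including using temporary columns, but every alternative I came up with was even slower.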
Is there a different, faster way to get the result I need, e.g. by using numpy?
For performance, you are probably better off working with a NumPy array and using np.where:
    import numpy as np

    a = df.values  # Assuming you have two columns A and B
    df['C'] = np.where(a[:,1]>5, a[:,0], 0.1*a[:,0]*a[:,1])
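To check that the vectorized form reproduces the row-wise lambda, here is a minimal sketch on the nine-row frame from the question. The names `expected`, `result` and `matches` are illustrative only, and `to_numpy()` is used in place of `.values` purely as an assumption about a reasonably recent pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [9, 8, 7, 6, 5, 4, 3, 2, 1]})

# Row-wise version from the question (slow, used here only as a reference)
expected = df.apply(lambda x: x.A if x.B > 5 else 0.1 * x.A * x.B, axis=1)

# Vectorized version: column 0 is A, column 1 is B
a = df[['A', 'B']].to_numpy()
result = np.where(a[:, 1] > 5, a[:, 0], 0.1 * a[:, 0] * a[:, 1])

# Compare with a tolerance, since both sides are floating point
matches = np.allclose(result, expected.to_numpy())
print(matches)
```

Note that np.where evaluates both branches for every row and then selects between them, so the product 0.1*A*B is also computed for rows where B > 5.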
Runtime test
    def numpy_based(df):
        a = df.values  # Assuming you have two columns A and B
        df['C'] = np.where(a[:,1]>5, a[:,0], 0.1*a[:,0]*a[:,1])
Timings:
    In [271]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])

    In [272]: %timeit numpy_based(df)
    1000 loops, best of 3: 380 µs per loop

    In [273]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])

    In [274]: %timeit df['C'] = df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))
    100 loops, best of 3: 3.39 ms per loop

    In [275]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])

    In [276]: %timeit df['C'] = np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
    1000 loops, best of 3: 1.12 ms per loop

    In [277]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])

    In [278]: %timeit df['C'] = np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
    1000 loops, best of 3: 1.19 ms per loop
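Outside IPython, a comparison like the one above can be approximated with the standard `timeit` module. This is only a sketch: the helper names `with_arrays` and `with_series` are hypothetical, the seed is arbitrary, and the absolute numbers will differ by machine and library version, so no timings are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
df = pd.DataFrame(rng.integers(0, 9, size=(10000, 2)), columns=['A', 'B'])

def with_arrays(df):
    # np.where on plain NumPy arrays (hypothetical helper name)
    a = df[['A', 'B']].to_numpy()
    return np.where(a[:, 1] > 5, a[:, 0], 0.1 * a[:, 0] * a[:, 1])

def with_series(df):
    # np.where applied directly to pandas Series (hypothetical helper name)
    return np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])

for fn in (with_arrays, with_series):
    # Best of 3 repeats, 100 calls each, reported per call
    best = min(timeit.repeat(lambda: fn(df), number=100, repeat=3)) / 100
    print(f"{fn.__name__}: {best * 1e6:.1f} µs per call")
```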
Closer look

Let's take a closer look at NumPy's number-crunching capability and compare it with pandas:
    # Extract out as array (it's a view, so not really expensive
    # .. as compared to the later computations themselves)
    In [291]: a = df.values

    In [296]: %timeit df.values
    10000 loops, best of 3: 107 µs per loop
Case #1: Working with a NumPy array and using numpy.where:
    In [292]: %timeit np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
    10000 loops, best of 3: 86.5 µs per loop
Again, assigning the result into a new column, df['C'], is not very expensive either:
    In [300]: %timeit df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
    1000 loops, best of 3: 323 µs per loop
Case #2: Working with the pandas dataframe and using its .where method (no NumPy):
    In [293]: %timeit df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))
    100 loops, best of 3: 3.4 ms per loop
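For readers less familiar with the pandas method used in this case: `Series.where(cond, other)` keeps the original value where `cond` is True and takes the value from `other` elsewhere. A small sketch on the question's nine-row frame (the names `other` and `c` are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [9, 8, 7, 6, 5, 4, 3, 2, 1]})

# Replacement values: row-wise product of A and B, scaled by 0.1
other = df[['A', 'B']].prod(axis=1).mul(0.1)

# Keep A where B > 5, otherwise take the scaled product
c = df['A'].where(df['B'].gt(5), other)
print(c.tolist())
```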
Case #3: Working with the pandas dataframe (no NumPy array), but using numpy.where:
    In [294]: %timeit np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
    1000 loops, best of 3: 764 µs per loop
Case #4: Working again with the pandas dataframe (no NumPy array), but using numpy.where with the Series .mul method:
    In [295]: %timeit np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
    1000 loops, best of 3: 830 µs per loop
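The array-based approach indexes columns by position, which relies on A and B being the first two columns. As a variation, and not something timed in the answer itself, the arrays can be pulled out by column name instead. This sketch assumes a pandas version that provides `Series.to_numpy()`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [9, 8, 7, 6, 5, 4, 3, 2, 1]})

# Extract each column as a NumPy array by name, independent of column order
A = df['A'].to_numpy()
B = df['B'].to_numpy()

# Same selection logic as before: A where B > 5, otherwise 0.1 * A * B
df['C'] = np.where(B > 5, A, 0.1 * A * B)
print(df)
```

This keeps the arithmetic on plain NumPy arrays while staying valid if the frame has extra columns or a different column order.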