I have a DataFrame that I need to filter according to the following conditions:
CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ACTION' & count_GENRE >= 1
CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ROMANCE' & count_GENRE >= 1
CITY == 'Mumbai' & LANGUAGE == 'Hindi' & count_LANGUAGE >= 1 & GENRE == 'ACTION'
When I try to do this:
df1 = df.query(condition1)
df2 = df.query(condition2)
I get a memory error (because my DataFrame is huge).
So I plan to filter on the main condition first and then on the sub-conditions, so that the load is smaller and the performance is better.
By parsing the above conditions, I somehow managed to get:
main_filter = "CITY == 'Mumbai'"
sub_cond1 = "LANGUAGE == 'English'"
sub_cond1_cond1 = "GENRE == 'ACTION' & count_GENRE >= 1"
sub_cond1_cond2 = "GENRE == 'ROMANCE' & count_GENRE >= 1"
sub_cond2 = "LANGUAGE == 'Hindi' & count_LANGUAGE >= 1"
sub_cond2_cond1 = "GENRE == 'COMEDY'"
So think of it as a tree structure (not binary, of course; in fact it is not really a tree at all).
Now I want to follow a multiprocessing approach (sub-processes nested under sub-processes), something like this:
on level 1
df = df_main.query(main_filter)

on level 2
df1 = df.query(sub_cond1)
df2 = df.query(sub_cond2)

on level 3
df11 = df1.query(sub_cond1_cond1)
df12 = df1.query(sub_cond1_cond2)
df21 = df2.query(sub_cond2_cond1)

# ... and so on
So the question is how to pass the conditions correctly to each level (I suppose I would store all the conditions in a list, though honestly I have not even thought that far).
Note: the result of each filter should be exported to its own separate CSV.
For example:
df11.to_csv("CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ACTION' & count_GENRE >= 1")
As a beginner, I do not know how to go about multiprocessing (its syntax, how it executes, and so on, especially for this problem), but I was unfortunately given this task, so I cannot post any code.
So can anyone give an example with a few lines of code to achieve this? If you have a better idea (class objects or node traversal), please suggest it.
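Taking the node-traversal part first: one way to drive the level-by-level filtering is to keep the conditions in a nested dict and walk it recursively, narrowing the DataFrame at each node and exporting every leaf to its own CSV. The sketch below is illustrative only (the condition_tree layout, the walk helper, and the filename sanitization are my own assumptions, and it still runs entirely in pandas, so it does not by itself cure the memory error):

import re
import pandas as pd

# Illustrative condition tree: each key is a query fragment, each value
# holds the sub-conditions one level down (an empty dict marks a leaf).
condition_tree = {
    "CITY == 'Mumbai'": {
        "LANGUAGE == 'English'": {
            "GENRE == 'ACTION' & count_GENRE >= 1": {},
            "GENRE == 'ROMANCE' & count_GENRE >= 1": {},
        },
        "LANGUAGE == 'Hindi' & count_LANGUAGE >= 1": {
            "GENRE == 'COMEDY'": {},
        },
    },
}

def walk(df, tree, path=""):
    # Filter once per node and reuse the narrowed frame for all children.
    for cond, subtree in tree.items():
        sub = df.query(cond)
        full = "{} & {}".format(path, cond) if path else cond
        if subtree:
            walk(sub, subtree, full)
        else:
            # Condition strings contain quotes and spaces, so sanitize
            # them before using the combined condition as a filename.
            fname = re.sub(r"\W+", "_", full).strip("_") + ".csv"
            sub.to_csv(fname, index=False)

walk(pd.read_csv('temp.csv'), condition_tree)  # temp.csv: see the data below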
For the memory error itself, this looks like a problem suited to dask, the Python module that helps you work with larger-than-memory data. I will show how to solve the problem using dask.dataframe. Let's start by creating some data:
import random
from collections import namedtuple

import pandas as pd

Record = namedtuple('Record', "CITY LANGUAGE GENRE count_GENRE count_LANGUAGE")

cities = ['Mumbai', 'Chennai', 'Bengalaru', 'Kolkata']
languages = ['English', 'Hindi', 'Spanish', 'French']
genres = ['Action', 'Romance', 'Comedy', 'Drama']

df = pd.DataFrame([Record(random.choice(cities),
                          random.choice(languages),
                          random.choice(genres),
                          random.choice([1, 2, 3]),
                          random.choice([1, 2, 3]))
                   for i in range(4000000)])

df.to_csv('temp.csv', index=False)
print(df.head())

        CITY LANGUAGE    GENRE  count_GENRE  count_LANGUAGE
0    Chennai  Spanish   Action            2               1
1  Bengalaru  English    Drama            2               3
2    Kolkata  Spanish   Action            2               1
3     Mumbai   French  Romance            1               2
4    Chennai   French   Action            2               3
The data created above has 4 million rows and takes up 107 MB. It is not larger than memory, but it is big enough to work with in this example.
Below I show a transcript of a python session in which I filter the data according to the criteria in the question:
>>> import dask.dataframe as dd
>>> dask_df = dd.read_csv('temp.csv', header=0)
>>> dask_df.npartitions
4

# We see above that dask.dataframe has decided to split the
# data into 4 partitions

# We now execute the query:
>>> result = dask_df[(dask_df['CITY'] == 'Mumbai') &
...                  (dask_df['LANGUAGE'] == 'English') &
...                  (dask_df['GENRE'] == 'Action') &
...                  (dask_df['count_GENRE'] > 1)]

# The line above takes very little time to execute. In fact, nothing has
# really been computed yet. Behind the scenes dask has created a plan to
# execute the query, but has not yet pulled the trigger.

# The result object is a dask dataframe:
>>> type(result)
<class 'dask.dataframe.core.DataFrame'>

# We now pull the trigger by calling the compute() method on the dask
# dataframe. The execution of the line below takes a few seconds:
>>> dfout = result.compute()

# The result is a regular pandas dataframe:
>>> type(dfout)
<class 'pandas.core.frame.DataFrame'>

# Of our 4 million records, only ~40k match the query:
>>> len(dfout)
41842

>>> dfout.head()
       CITY LANGUAGE   GENRE  count_GENRE  count_LANGUAGE
225  Mumbai  English  Action            2               3
237  Mumbai  English  Action            3               2
306  Mumbai  English  Action            3               3
335  Mumbai  English  Action            2               2
482  Mumbai  English  Action            2               3
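To cover all three condition branches from the question and write each result to its own CSV, the same lazy-filter pattern can simply be repeated per branch. A rough sketch, assuming each filtered result is small enough to compute() into plain pandas before writing (the branch list and output filenames are illustrative):

import dask.dataframe as dd

dask_df = dd.read_csv('temp.csv', header=0)

# One (mask builder, output file) pair per branch of the question's tree.
branches = [
    (lambda d: (d['CITY'] == 'Mumbai') & (d['LANGUAGE'] == 'English') &
               (d['GENRE'] == 'Action') & (d['count_GENRE'] >= 1),
     'mumbai_english_action.csv'),
    (lambda d: (d['CITY'] == 'Mumbai') & (d['LANGUAGE'] == 'English') &
               (d['GENRE'] == 'Romance') & (d['count_GENRE'] >= 1),
     'mumbai_english_romance.csv'),
    (lambda d: (d['CITY'] == 'Mumbai') & (d['LANGUAGE'] == 'Hindi') &
               (d['count_LANGUAGE'] >= 1) & (d['GENRE'] == 'Action'),
     'mumbai_hindi_action.csv'),
]

for make_mask, fname in branches:
    # Selecting with the mask is lazy; compute() triggers the actual work,
    # and the resulting pandas frame is written out as a normal CSV.
    dask_df[make_mask(dask_df)].compute().to_csv(fname, index=False)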
I hope this gets you started on solving your problem. For more information on dask, see the tutorial and examples.