I have a CSV in which one field is a nested JSON object, stored as a string. I would like to load the CSV into a dataframe and parse the JSON into a set of fields appended to the original dataframe; in other words, extract the contents of the JSON and make them part of the dataframe.

My CSV:

```
id|dist|json_request
1|67|{"loc":{"lat":45.7, "lon":38.9},"arrival": "Monday", "characteristics":{"body":{"color":"red", "make":"sedan"}, "manuf_year":2014}}
2|34|{"loc":{"lat":46.89, "lon":36.7},"arrival": "Tuesday", "characteristics":{"body":{"color":"blue", "make":"sedan"}, "manuf_year":2014}}
3|98|{"loc":{"lat":45.70, "lon":31.0}, "characteristics":{"body":{"color":"yellow"}, "manuf_year":2010}}
```
Note that not all rows have all the same keys. I'd like this to produce a dataframe equivalent to:
```python
data = {'id': [1, 2, 3],
        'dist': [67, 34, 98],
        'loc_lat': [45.7, 46.89, 45.70],
        'loc_lon': [38.9, 36.7, 31.0],
        'arrival': ["Monday", "Tuesday", "NA"],
        'characteristics_body_color': ["red", "blue", "yellow"],
        'characteristics_body_make': ["sedan", "sedan", "NA"],
        'characteristics_manuf_year': [2014, 2014, 2010]}
df = pd.DataFrame(data)
```
(Apologies, I couldn't get the table itself to look sensible! Please don't be mad at me, I'm a rookie :()

After much struggling, I came up with the following solution:
```python
import json
import pandas as pd
from pandas.io.json import json_normalize

# Import data
df_raw = pd.read_csv("sample.csv", delimiter="|")

# Parsing function
def parse_request(s):
    sj = json.loads(s)
    norm = json_normalize(sj)
    return norm

# Create an empty dataframe to store results
parsed = pd.DataFrame(columns=['id'])

# Loop through and parse JSON in each row
for i in df_raw.json_request:
    parsed = parsed.append(parse_request(i))

# Merge results back onto original dataframe
df_parsed = df_raw.join(parsed)
```
This is obviously inelegant and terribly inefficient (it takes multiple hours on the 300K rows I need to parse). Is there a better way?
I have been through the following related questions:

- Reading a CSV into pandas where one column is a JSON string (which only seems to work for simple, non-nested JSON)
- JSON to pandas DataFrame (I borrowed parts of my solution from this, but I could not figure out how to apply it across the dataframe without looping over rows)

I am using Python 3.3 and Pandas 0.17.
Here's an approach that gives a 10-100x speedup, and should let you read your large file in under a minute as opposed to over an hour. The idea is to only construct the dataframe once all of the data has been read, thereby reducing the number of times memory needs to be allocated, and to only call `json_normalize` once on the entire chunk of data, rather than on each row:
```python
import csv
import json
import pandas as pd
from pandas.io.json import json_normalize

with open('sample.csv') as fh:
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)

    # "Transpose" the data. `data` is now a tuple of strings
    # containing JSON, one for each row.
    idents, dists, data = zip(*rows)

data = [json.loads(row) for row in data]
df = json_normalize(data)
df['ids'] = idents
df['dists'] = dists
```
So that:
```
>>> print(df)
   arrival characteristics.body.color characteristics.body.make  \
0   Monday                        red                     sedan
1  Tuesday                       blue                     sedan
2      NaN                     yellow                       NaN

   characteristics.manuf_year  loc.lat  loc.lon ids
0                        2014    45.70     38.9   1
1                        2014    46.89     36.7   2
2                        2010    45.70     31.0   3
```
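As an aside: `json_normalize` joins nested keys with `.` by default, while the question asked for underscore-separated names like `loc_lat`. In recent pandas versions the function is exposed as `pd.json_normalize` and accepts a `sep` argument that handles this directly — a small sketch, assuming a reasonably recent pandas:

```python
import pandas as pd

records = [
    {"loc": {"lat": 45.7, "lon": 38.9}, "arrival": "Monday"},
    {"loc": {"lat": 46.89, "lon": 36.7}, "arrival": "Tuesday"},
]

# sep='_' produces loc_lat / loc_lon instead of the default loc.lat / loc.lon
df = pd.json_normalize(records, sep='_')
print(sorted(df.columns))  # ['arrival', 'loc_lat', 'loc_lon']
```

On older pandas (where only the `.`-separated names are available) you can get the same effect by renaming: `df.columns = [c.replace('.', '_') for c in df.columns]`.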
Furthermore, I looked at what pandas' `json_normalize` is doing, and it performs some deep copies that shouldn't be necessary if you are just building a dataframe from a CSV. We can implement our own `flatten` function, which takes a dictionary and "flattens" the keys, similar to what `json_normalize` does. Then we can write a generator which spits out one row of the dataframe at a time as a record. This approach is even faster:
```python
def flatten(dct, separator='_'):
    """A fast way to flatten a dictionary."""
    res = {}
    queue = [('', dct)]
    while queue:
        prefix, d = queue.pop()
        for k, v in d.items():
            key = prefix + k
            if not isinstance(v, dict):
                res[key] = v
            else:
                queue.append((key + separator, v))
    return res

def records_from_json(fh):
    """Yields the records from a file object."""
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)
    for ident, dist, data in rows:
        rec = flatten(json.loads(data))
        rec['id'] = ident
        rec['dist'] = dist
        yield rec

def from_records(path):
    with open(path) as fh:
        return pd.DataFrame.from_records(records_from_json(fh))
```
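One caveat with the `csv.reader` route: it yields every field as a string, so `id` and `dist` come back with object dtype rather than as numbers. A minimal sketch of the fix using `pd.to_numeric` (the column names match the sample data above):

```python
import pandas as pd

# Strings, as csv.reader would produce them
df = pd.DataFrame({'id': ['1', '2', '3'], 'dist': ['67', '34', '98']})

# Convert the string columns to proper numeric dtypes
df['id'] = pd.to_numeric(df['id'])
df['dist'] = pd.to_numeric(df['dist'])
print(df.dtypes)
```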
Here are the results of a timing experiment where I artificially increased the size of the sample data by repeating rows. The number of rows is denoted `n_rows`:
```
        method 1 (s)  method 2 (s)  original time (s)
n_rows
96          0.008217      0.002971           0.362257
192         0.014484      0.004720           0.678590
384         0.027308      0.008720           1.373918
768         0.055644      0.016175           2.791400
1536        0.105730      0.030914           5.727828
3072        0.209049      0.060105          11.877403
```
Extrapolating linearly, the first method should read 300k lines in about 20 seconds, while the second method should take roughly 6 seconds.
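For reference, here is how the `flatten` helper behaves on one of the sample records (the function is repeated so the snippet stands alone):

```python
def flatten(dct, separator='_'):
    """Flatten nested dicts by joining keys with `separator` (same as above)."""
    res = {}
    queue = [('', dct)]
    while queue:
        prefix, d = queue.pop()
        for k, v in d.items():
            key = prefix + k
            if not isinstance(v, dict):
                res[key] = v
            else:
                queue.append((key + separator, v))
    return res

# The first row's JSON from the sample CSV
record = {"loc": {"lat": 45.7, "lon": 38.9}, "arrival": "Monday",
          "characteristics": {"body": {"color": "red", "make": "sedan"},
                              "manuf_year": 2014}}
flat = flatten(record)
print(sorted(flat))
# ['arrival', 'characteristics_body_color', 'characteristics_body_make',
#  'characteristics_manuf_year', 'loc_lat', 'loc_lon']
```

Note that, unlike `json_normalize`, this produces the underscore-separated keys asked for in the question without any renaming step.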