I have a file.csv
with roughly 15k rows that looks like this:
    SAMPLE_TIME, POS, OFF, HISTOGRAM
    2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,
    2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,
    2015-07-15 16:43:55, 0-0-0-0-3, 1, 0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,
    2015-07-15 16:44:56, 0-0-0-0-3, 1, 2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0
I would like to import it into a pandas.DataFrame,
with arbitrary names assigned to the columns that have no header, like this:
    SAMPLE_TIME,         POS,       OFF, HISTOGRAM  1  2   3  4    5    6
    2015-07-15 16:41:56, 0-0-0-0-3, 1,   2,         0, 5, 59, 4,   0,   0,
    2015-07-15 16:42:55, 0-0-0-0-3, 1,   0,         0, 5,  0, 6,   0, nan
    2015-07-15 16:43:55, 0-0-0-0-3, 1,   0,         0, 5,  0, 7, nan  nan
    2015-07-15 16:44:56, 0-0-0-0-3, 1,   2,         0, 5,  0, 0,   2, nan
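So far this has been impossible to import. I tried different solutions, such as passing an explicit header, but still no joy. The only way I could make it work was to add a header manually in the .csv
file, which rather defeats the purpose of automation!
Then I tried this solution, doing the following: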
    lines = list(csv.reader(open('file.csv')))
    header, values = lines[0], lines[1:]
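This reads the file correctly and gives me a list of ~15k elements in values,
where each element is a list of strings and each string is a correctly parsed data field from the file. But when I try to do this: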
    data = {h:v for h,v in zip (header, zip(*values))}
    df = pd.DataFrame.from_dict(data)
or this:
    data2 = {h:v for h,v in zip (str(xrange(16)), zip(*values))}
    df2 = pd.DataFrame.from_dict(data)
then the columns without a header disappear and the column order gets completely mixed up. Any ideas for a possible solution?
You can create the column names based on the length of the first actual data row:
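    import pandas as pd

    with open("out.txt") as f:
        # Read the header line, then count the fields in the first data row.
        h, ln = next(f), len(next(f).split(","))
        header = h.strip().split(",")
        # Rewind and skip the header line so pandas only sees the data rows.
        f.seek(0)
        next(f)
        # Pad the header with integer names for the unnamed columns.
        header += range(ln)
        print(pd.read_csv(f, names=header))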
Which will give you:
               SAMPLE_TIME         POS  OFF  HISTOGRAM  0  1   2  3  \
    0  2015-07-15 16:41:56   0-0-0-0-3    1          2  0  5  59  0
    1  2015-07-15 16:42:55   0-0-0-0-3    1          0  0  5   9  0
    2  2015-07-15 16:43:55   0-0-0-0-3    1          0  0  5   5  0
    3  2015-07-15 16:44:56   0-0-0-0-3    1          2  0  5   0  0

       4  5 ...  13  14  15  16  17  18  19  20  21  22
    0  0  0 ...   0   0   0   0   0 NaN NaN NaN NaN NaN
    1  0  0 ...   0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
    2  0  0 ...   4   0   0   0 NaN NaN NaN NaN NaN NaN
    3  0  0 ...   0   0   0   0 NaN NaN NaN NaN NaN NaN

    [4 rows x 27 columns]
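The same header-padding idea can be tried without touching the disk. The sketch below is a variation rather than the exact code above: it sizes the header to the widest row instead of the first one, strips whitespace from the names, and passes skipinitialspace=True so the fields lose their leading spaces. The names SAMPLE and read_ragged_csv are made up for illustration.

```python
import io

import pandas as pd

# The four sample rows from the question, held in memory.
SAMPLE = """SAMPLE_TIME, POS, OFF, HISTOGRAM
2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,
2015-07-15 16:42:55, 0-0-0-0-3, 1, 0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,
2015-07-15 16:43:55, 0-0-0-0-3, 1, 0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,
2015-07-15 16:44:56, 0-0-0-0-3, 1, 2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0
"""


def read_ragged_csv(text):
    """Hypothetical helper: pad the header so every field gets a column name."""
    lines = text.splitlines()
    named = [name.strip() for name in lines[0].split(",")]
    # Width of the widest data row, counting the empty field after a trailing comma.
    widest = max(len(line.split(",")) for line in lines[1:])
    # Named columns first, then integer labels for the remaining fields.
    names = named + list(range(widest - len(named)))
    return pd.read_csv(
        io.StringIO(text), names=names, skiprows=1, skipinitialspace=True
    )


df = read_ragged_csv(SAMPLE)
print(df.shape)
```

On these four rows this should produce 23 columns, the last of which is entirely NaN because of the trailing comma on the first data row.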
Or you could clean up the file before passing it to pandas:
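    import pandas as pd
    from tempfile import TemporaryFile

    with open("in.csv") as f, TemporaryFile("w+") as t:
        # Copy the file with every space removed
        # (this also removes the space inside the timestamps).
        for line in f:
            t.write(line.replace(" ", ""))
        t.seek(0)
        # Field count of the last line read from the input file.
        ln = len(line.strip().split(","))
        header = t.readline().strip().split(",")
        # Pad the header with integer names for the unnamed columns.
        header += range(ln)
        print(pd.read_csv(t,names=header))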
Which gives you:
              SAMPLE_TIME        POS  OFF  HISTOGRAM  0  1   2  3  4  5 ...  11  \
    0  2015-07-1516:41:56  0-0-0-0-3    1          2  0  5  59  0  0  0 ...   0
    1  2015-07-1516:42:55  0-0-0-0-3    1          0  0  5   9  0  0  0 ...   0
    2  2015-07-1516:43:55  0-0-0-0-3    1          0  0  5   5  0  0  0 ...   0
    3  2015-07-1516:44:56  0-0-0-0-3    1          2  0  5   0  0  0  0 ...   0

       12  13  14  15  16  17  18  19  20
    0   0   0   0   0   0   0 NaN NaN NaN
    1  50   0 NaN NaN NaN NaN NaN NaN NaN
    2   0   4   0   0   0 NaN NaN NaN NaN
    3   6   0   0   0   0 NaN NaN NaN NaN

    [4 rows x 25 columns]
Or drop the columns that would be all NaN:
print(pd.read_csv(f, names=header).dropna(axis=1,how="all"))
Which gives you:
               SAMPLE_TIME         POS  OFF  HISTOGRAM  0  1   2  3  \
    0  2015-07-15 16:41:56   0-0-0-0-3    1          2  0  5  59  0
    1  2015-07-15 16:42:55   0-0-0-0-3    1          0  0  5   9  0
    2  2015-07-15 16:43:55   0-0-0-0-3    1          0  0  5   5  0
    3  2015-07-15 16:44:56   0-0-0-0-3    1          2  0  5   0  0

       4  5 ...  8  9  10  11  12  13  14  15  16  17
    0  0  0 ...  2  0   0   0   0   0   0   0   0   0
    1  0  0 ...  2  0   0   0  50   0 NaN NaN NaN NaN
    2  0  0 ...  2  0   0   0   0   4   0   0   0 NaN
    3  0  0 ...  2  0   0   0   6   0   0   0   0 NaN

    [4 rows x 22 columns]
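To see what dropna(axis=1, how="all") keeps and discards, here is a minimal sketch on a made-up three-column frame (the column names a, b and c are illustrative only). A column is removed only when every value in it is NaN, so columns that are merely shorter for some rows survive.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "b" is partly NaN, "c" is entirely NaN.
df = pd.DataFrame({
    "a": [1, 2],
    "b": [np.nan, 3.0],
    "c": [np.nan, np.nan],
})

# axis=1 works on columns; how="all" drops a column only if all its values are NaN.
trimmed = df.dropna(axis=1, how="all")
print(list(trimmed.columns))
```

With how="any" instead, column b would be dropped as well, which is usually not what you want for rows of uneven length.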