In my opinion, this version is more readable and clear, but it may be slightly slower and it assumes the input file is well formed (e.g. that empty lines are truly empty, whereas your code would still work if an "empty" line contained some stray whitespace). It relies on regex groups: they do all the work of parsing each line, and we only convert start and end to integers.
import re

line_regex = re.compile(r'^\((\d+), (\d+), (.+)\)$', re.MULTILINE)

sents_with_positions = []
sents_words = []
for section in _input.split('\n\n'):
    words_with_positions = [
        (int(start), int(end), text)
        for start, end, text in line_regex.findall(section)
    ]
    words = tuple(t[2] for t in words_with_positions)
    sents_with_positions.append(words_with_positions)
    sents_words.append(words)
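As a quick illustration of how the regex groups do all the parsing, here is a minimal sketch run on a hand-made sample section (the `section` string is an assumption for illustration, not taken from the original data):

```python
import re

# Same pattern as above: three groups capture start, end, and the token text.
line_regex = re.compile(r'^\((\d+), (\d+), (.+)\)$', re.MULTILINE)

# A tiny hand-made section in the assumed "(start, end, text)" line format.
section = "(0, 5, Hello)\n(6, 11, world)"

# findall returns one (start, end, text) tuple of strings per line;
# only start and end still need converting to int.
matches = line_regex.findall(section)
print(matches)  # → [('0', '5', 'Hello'), ('6', '11', 'world')]
```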
Parsing a text file in chunks separated by some delimiter is a common problem. It helps to have a utility function, such as open_chunk below, which can "chunkify" a text file given a regex delimiter. The open_chunk function yields the chunks one at a time, without reading the whole file in at once, so it can be used on files of any size. Once the chunks have been identified, processing each one is relatively easy:
import re

def open_chunk(readfunc, delimiter, chunksize=1024):
    """
    readfunc(chunksize) should return a string.
    http://stackoverflow.com/a/17508761/190597 (unutbu)
    """
    remainder = ''
    for chunk in iter(lambda: readfunc(chunksize), ''):
        pieces = re.split(delimiter, remainder + chunk)
        for piece in pieces[:-1]:
            yield piece
        remainder = pieces[-1]
    if remainder:
        yield remainder

sents_with_positions = []
sents_words = []
with open('data') as infile:
    for chunk in open_chunk(infile.read, r'\n\n'):
        row = []
        words = []
        # Taken from LeartS's answer: http://stackoverflow.com/a/34416814/190597
        for start, end, word in re.findall(
                r'\((\d+),\s*(\d+),\s*(.*)\)', chunk, re.MULTILINE):
            start, end = int(start), int(end)
            row.append((start, end, word))
            words.append(word)
        sents_with_positions.append(row)
        sents_words.append(words)

print(sents_words)
print(sents_with_positions)
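To see that open_chunk really streams, here is a small sketch that runs it over an in-memory file with a deliberately tiny chunksize (the sample data and the io.StringIO stand-in are assumptions for illustration; any object with a read(size) method works):

```python
import io
import re

def open_chunk(readfunc, delimiter, chunksize=1024):
    """readfunc(chunksize) should return a string."""
    remainder = ''
    for chunk in iter(lambda: readfunc(chunksize), ''):
        pieces = re.split(delimiter, remainder + chunk)
        for piece in pieces[:-1]:
            yield piece
        remainder = pieces[-1]
    if remainder:
        yield remainder

# In-memory stand-in for a file; chunksize=8 forces the '\n\n'
# delimiter to be handled even when it straddles a read boundary,
# because the unsplit tail is carried over in `remainder`.
infile = io.StringIO("(0, 5, Hello)\n(6, 11, world)\n\n(12, 15, Bye)")
chunks = list(open_chunk(infile.read, r'\n\n', chunksize=8))
print(chunks)  # → ['(0, 5, Hello)\n(6, 11, world)', '(12, 15, Bye)']
```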
The output includes
(86, 87, ')'), (87, 88, ','), (96, 97, '(')
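Note that the greedy (.*) group is what lets tokens that are themselves punctuation, such as the ')' above, parse correctly: the group grabs everything up to the last closing parenthesis on the line. A minimal check (the sample line is assumed for illustration):

```python
import re

# A token that is itself a closing parenthesis, as in the output above:
# the greedy (.*) backtracks just enough to leave one ')' for the final \).
line = "(86, 87, ))"
result = re.findall(r'\((\d+),\s*(\d+),\s*(.*)\)', line)
print(result)  # → [('86', '87', ')')]
```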