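I need to parse a transcript of a live chat conversation. My first thought on seeing the file was to throw regular expressions at the problem, but I was wondering what other approaches people have used.
I put "elegant" in the title because I've previously found that this type of task risks becoming hard to maintain when it relies on regexes alone.
The transcripts are generated by www.providesupport.com and emailed to an account; I then extract the plain-text transcript attachment from the email.
The reason for parsing the file is to extract the conversation text for later use, and also to identify the visitor and operator names so that the information can be made available via a CRM.
Here is an example of a transcript file: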
    Chat Transcript

    Visitor: Random Website Visitor
    Operator: Milton
    Company: Initech
    Started: 16 Oct 2008 9:13:58
    Finished: 16 Oct 2008 9:45:44

    Random Website Visitor: Where do i get the cover sheet for the TPS report?
    * There are no operators available at the moment. If you would like to leave a message, please type it in the input field below and click "Send" button
    * Call accepted by operator Milton. Currently in room: Milton, Random Website Visitor.
    Milton: Y-- Excuse me. You-- I believe you have my stapler?
    Random Website Visitor: I really just need the cover sheet, okay?
    Milton: it's not okay because if they take my stapler then I'll, I'll, I'll set the building on fire...
    Random Website Visitor: oh i found it, thanks anyway.
    * Random Website Visitor is now off-line and may not reply. Currently in room: Milton.
    Milton: Well, Ok. But… that's the last straw.
    * Milton has left the conversation. Currently in room: room is empty.

    Visitor Details
    ---------------
    Your Name: Random Website Visitor
    Your Question: Where do i get the cover sheet for the TPS report?
    IP Address: 255.255.255.255
    Host Name: 255.255.255.255
    Referrer: Unknown
    Browser/OS: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)
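No, and in fact, for the specific type of task you describe, I doubt there is a "cleaner" way to do it than regular expressions. It looks like your files have embedded line breaks, so what we typically do here is make the line the unit of decomposition and apply per-line regexes. Meanwhile, you create a small state machine and use regex matches to trigger transitions in that state machine. This way you know where you are in the file and what kind of character data to expect. Also, consider using named capture groups and loading the regexes from an external file. That way, if the format of your transcript changes, it is a simple matter of tweaking the regexes rather than writing new parse-specific code.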
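The line-by-line state machine described above can be sketched roughly as follows. This is only an illustration, not a tested parser for every providesupport.com transcript: the state names, the `PATTERN_SOURCES` table, the `parse_transcript` function and the shortened `SAMPLE` text are all made up for the example, and it assumes that blank lines separate the header from the conversation and that every message fits on one line.

```python
import re

# Hypothetical pattern table. These strings could just as well be read from
# an external file, so a format change only means editing the patterns.
PATTERN_SOURCES = {
    "title": r"^Chat Transcript\s*$",
    "footer": r"^Visitor Details\s*$",
    "rule": r"^-+\s*$",
    "server": r"^\* (?P<text>.*)$",
    "field": r"^(?P<name>[^:]+):\s*(?P<value>.*?)\s*$",
}
PATTERNS = {key: re.compile(src) for key, src in PATTERN_SOURCES.items()}


def parse_transcript(text):
    """Return (header dict, list of (speaker, text) tuples, footer dict)."""
    state = "start"
    header, messages, footer = {}, [], {}
    for line in text.splitlines():
        if state == "start":
            # Skip everything until the title line is seen.
            if PATTERNS["title"].match(line):
                state = "header"
            continue
        if not line.strip():
            # Assumption: a blank line after the header fields ends the header.
            if state == "header" and header:
                state = "conversation"
            continue
        if state == "header":
            m = PATTERNS["field"].match(line)
            if m:
                header[m.group("name")] = m.group("value")
        elif state == "conversation":
            if PATTERNS["footer"].match(line):
                state = "footer"
                continue
            # Server lines are tried first because they may contain colons.
            m = PATTERNS["server"].match(line)
            if m:
                messages.append(("Server", m.group("text")))
                continue
            m = PATTERNS["field"].match(line)
            if m:
                messages.append((m.group("name"), m.group("value")))
        elif state == "footer":
            if PATTERNS["rule"].match(line):
                continue  # the dashed underline carries no data
            m = PATTERNS["field"].match(line)
            if m:
                footer[m.group("name")] = m.group("value")
    return header, messages, footer


# Shortened sample in the same layout as the transcript in the question.
SAMPLE = """Chat Transcript

Visitor: Random Website Visitor
Operator: Milton

Random Website Visitor: Where do i get the cover sheet for the TPS report?
* Call accepted by operator Milton. Currently in room: Milton, Random Website Visitor.
Milton: Y-- Excuse me. You-- I believe you have my stapler?

Visitor Details
---------------
Your Name: Random Website Visitor
IP Address: 255.255.255.255
"""

if __name__ == "__main__":
    from pprint import pprint
    pprint(parse_transcript(SAMPLE))
```

The same `name: value` regex is reused in three places; it is the current state, not the regex, that decides whether a match is a header field, a chat message or a footer field.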
With Perl, you can use Parse::RecDescent.
It is simple, and your grammar will stay maintainable later on.
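You might want to consider a full parser generator.
Regular expressions are good for searching text for small substrings, but they are woefully under-powered if you are really interested in parsing the entire file into meaningful data.
They are especially insufficient if the context of the substring matters.
Most people throw regexes at everything because that is what they know. They have never learned any parser-generating tools, and they end up hand-coding a lot of the production-rule composition and semantic-action handling that a parser generator gives you for free.
Regexes are great and all, but if you need a parser, they are no substitute.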
Here are two parsers based on the lepl parser generator library. They both produce the same result.
    from pprint import pprint
    from lepl import AnyBut, Drop, Eos, Newline, Separator, SkipTo, Space

    # field = name , ":" , value
    name, value = AnyBut(':\n')[1:,...], AnyBut('\n')[::'n',...]
    with Separator(~Space()[:]):
        field = name & Drop(':') & value & ~(Newline() | Eos()) > tuple

    header_start   = SkipTo('Chat Transcript' & Newline()[2])
    header         = ~header_start & field[1:] > dict
    server_message = Drop('* ') & AnyBut('\n')[:,...] & ~Newline() > 'Server'
    conversation   = (server_message | field)[1:] > list
    footer_start   = 'Visitor Details' & Newline() & '-'*15 & Newline()
    footer         = ~footer_start & field[1:] > dict
    chat_log       = header & ~Newline() & conversation & ~Newline() & footer

    pprint(chat_log.parse_file(open('chat.log')))
    # reduce is a builtin only on Python 2; this import works on 2.6+ and 3
    from functools import reduce
    from pprint import pprint
    from lepl import And, Drop, Newline, Or, Regexp, SkipTo

    def Field(name, value=Regexp(r'\s*(.*?)\s*?\n')):
        """'name , ":" , value' matcher"""
        return name & Drop(':') & value > tuple

    Fields = lambda names: reduce(And, map(Field, names))

    header_start   = SkipTo(Regexp(r'^Chat Transcript$') & Newline()[2])
    header_fields  = Fields("Visitor Operator Company Started Finished".split())
    server_message = Regexp(r'^\* (.*?)\n') > 'Server'
    footer_fields  = Fields(("Your Name, Your Question, IP Address, "
                             "Host Name, Referrer, Browser/OS").split(', '))

    with open('chat.log') as f:
        # parse header to find Visitor and Operator's names
        headers, = (~header_start & header_fields > dict).parse_file(f)
        # only Visitor, Operator and Server may take part in the conversation
        message = reduce(Or, [Field(headers[name])
                              for name in "Visitor Operator".split()])
        conversation = (message | server_message)[1:]
        messages, footers = ((conversation > list)
                             & Drop('\nVisitor Details\n---------------\n')
                             & (footer_fields > dict)).parse_file(f)

    pprint((headers, messages, footers))
Output:
    ({'Company': 'Initech',
      'Finished': '16 Oct 2008 9:45:44',
      'Operator': 'Milton',
      'Started': '16 Oct 2008 9:13:58',
      'Visitor': 'Random Website Visitor'},
     [('Random Website Visitor',
       'Where do i get the cover sheet for the TPS report?'),
      ('Server',
       'There are no operators available at the moment. If you would like to leave a message, please type it in the input field below and click "Send" button'),
      ('Server',
       'Call accepted by operator Milton. Currently in room: Milton, Random Website Visitor.'),
      ('Milton', 'Y-- Excuse me. You-- I believe you have my stapler?'),
      ('Random Website Visitor', 'I really just need the cover sheet, okay?'),
      ('Milton',
       "it's not okay because if they take my stapler then I'll, I'll, I'll set the building on fire..."),
      ('Random Website Visitor', 'oh i found it, thanks anyway.'),
      ('Server',
       'Random Website Visitor is now off-line and may not reply. Currently in room: Milton.'),
      ('Milton', "Well, Ok. But… that's the last straw."),
      ('Server',
       'Milton has left the conversation. Currently in room: room is empty.')],
     {'Browser/OS': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)',
      'Host Name': '255.255.255.255',
      'IP Address': '255.255.255.255',
      'Referrer': 'Unknown',
      'Your Name': 'Random Website Visitor',
      'Your Question': 'Where do i get the cover sheet for the TPS report?'})
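Since the stated goal is to pull out the conversation text plus the visitor and operator names for a CRM, the structure printed above can be post-processed in a few lines. A minimal sketch, assuming the `(headers, messages, footers)` shape shown in the output; `summarize` is a hypothetical helper name for illustration and is not part of lepl:

```python
def summarize(headers, messages):
    """Return (visitor, operator, conversation text without server notices)."""
    visitor = headers.get("Visitor")
    operator = headers.get("Operator")
    # Keep only lines typed by a person; 'Server' marks system notices.
    lines = ["%s: %s" % (speaker, text)
             for speaker, text in messages
             if speaker != "Server"]
    return visitor, operator, "\n".join(lines)


if __name__ == "__main__":
    # Tiny hand-made input in the same shape as the parser output.
    demo_headers = {"Visitor": "Random Website Visitor", "Operator": "Milton"}
    demo_messages = [
        ("Random Website Visitor", "I really just need the cover sheet, okay?"),
        ("Server", "Milton has left the conversation. Currently in room: room is empty."),
    ]
    print(summarize(demo_headers, demo_messages))
```

Filtering on the `'Server'` tag only works if no real participant is named "Server"; a sturdier version would compare speakers against the names taken from the header.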
Build a parser? I can't tell whether your data is regular enough for that, but it might be worth investigating.