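This is my requirement. I want to tokenize and tag a paragraph in a way that lets me achieve the following.
Dates and times in the paragraph should be identified and tagged as DATE and TIME.
Known phrases in the paragraph should be identified and tagged as CUSTOM.
The rest of the content should be tokenized and tagged by NLTK's default word_tokenize and pos_tag functions.
For example, take the following sentence: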
"They all like to go there on 5th November 2010, but I am not interested."
If the custom phrase is "I am not interested", it should be tokenized and tagged as follows.
[('They', 'PRP'), ('all', 'VBP'), ('like', 'IN'), ('to', 'TO'), ('go', 'VB'), ('there', 'RB'), ('on', 'IN'), ('5th November 2010', 'DATE'), (',', ','), ('but', 'CC'), ('I am not interested', 'CUSTOM'), ('.', '.')]
Any suggestions would be useful.
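The proper answer is to compile a large dataset tagged the way you want, then train a machine-learned chunker on it. If that is too time-consuming, the easy way is to run the POS tagger and post-process its output with regular expressions. Getting the longest match is the hardest part here: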
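    import re
    from nltk import pos_tag, word_tokenize

    s = "They all like to go there on 5th November 2010, but I am not interested."

    # `...` stands for the remaining month names, which are left out here;
    # fill them in before running. The month and year are both optional so
    # that every prefix of a date ("5th", "5th November") also matches; the
    # loop below relies on that to grow a match one token at a time.
    DATE = re.compile(r'^[1-9][0-9]?(st|nd|rd|th)?( (January|...))?( [12][0-9][0-9][0-9])?$')

    def custom_tagger(sentence):
        tagged = pos_tag(word_tokenize(sentence))
        phrase = []          # tokens of the date candidate collected so far
        date_found = False

        i = 0
        while i < len(tagged):
            (w, t) = tagged[i]
            phrase.append(w)
            in_date = DATE.match(' '.join(phrase))
            date_found |= bool(in_date)
            if date_found and not in_date:
                # The current token ended the date: emit everything before it
                # and process this token again on the next iteration.
                yield (' '.join(phrase[:-1]), 'DATE')
                phrase = []
                date_found = False
            elif date_found and i == len(tagged) - 1:
                # The date runs to the end of the sentence.
                yield (' '.join(phrase), 'DATE')
                return
            else:
                i += 1
                if not in_date:
                    # Ordinary token: pass the POS tagger's output through.
                    yield (w, t)
                    phrase = []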
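Todo: expand the DATE re, insert code to search for CUSTOM phrases, make this more sophisticated by matching POS tags as well as tokens, and decide whether 5th on its own should be treated as a date. (Probably not, so filter out dates of length one that only contain an ordinal number.)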
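As a rough illustration of those to-do items, here is a minimal sketch of one way to do them. It is not the code above: it works on an already tagged list of (word, tag) pairs, such as the output of pos_tag(word_tokenize(sentence)), so it has no NLTK dependency of its own. The names tag_spans, MONTHS and custom_phrases are made up for this example. Instead of growing a match token by token, it tries the longest candidate span first at each position, and it requires a month name, so a lone ordinal like 5th is never tagged as a date. It assumes each custom phrase tokenizes the same way as a whitespace split; phrases with contractions or punctuation would need to go through word_tokenize instead. The POS tags in the sample input are hand-written stand-ins, not real tagger output.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Day, optional ordinal suffix, month name, optional four-digit year.
# The month is required, so a lone "5th" is never tagged as a date.
DATE = re.compile(r"[1-9][0-9]?(st|nd|rd|th)? (%s)( [12][0-9]{3})?" % MONTHS)


def tag_spans(tagged, custom_phrases=()):
    """Merge dates and known phrases in a list of (word, tag) pairs."""
    words = [w for w, _ in tagged]
    # Whitespace-split each phrase (an assumption) and try longer ones first.
    phrases = sorted((tuple(p.split()) for p in custom_phrases if p.split()),
                     key=len, reverse=True)
    i = 0
    while i < len(tagged):
        span, label = 0, None
        # A date covers at most three tokens here; try the longest window first.
        for n in (3, 2):
            if i + n <= len(words) and DATE.fullmatch(" ".join(words[i:i + n])):
                span, label = n, "DATE"
                break
        if not span:
            for p in phrases:
                if tuple(words[i:i + len(p)]) == p:
                    span, label = len(p), "CUSTOM"
                    break
        if span:
            yield (" ".join(words[i:i + span]), label)
            i += span
        else:
            # Ordinary token: keep whatever the POS tagger assigned.
            yield tagged[i]
            i += 1


# Hand-written stand-in for pos_tag output; the tags are illustrative only.
tagged = [("They", "PRP"), ("all", "DT"), ("like", "VBP"), ("to", "TO"),
          ("go", "VB"), ("there", "RB"), ("on", "IN"), ("5th", "CD"),
          ("November", "NNP"), ("2010", "CD"), (",", ","), ("but", "CC"),
          ("I", "PRP"), ("am", "VBP"), ("not", "RB"), ("interested", "JJ"),
          (".", ".")]
result = list(tag_spans(tagged, ["I am not interested"]))
print(result)
```

Matching on POS tags as well as tokens, and handling times, are still left open in this sketch.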