I am trying to extract keywords from a piece of text using StanfordNERTagger and nltk.
docText="John Donk works for POI. Brian Jones wants to meet with Xyz Corp. for measuring POI's Short Term performance Metrics." words = re.split("\W+",docText) stops = set(stopwords.words("english")) #remove stop words from the list words = [w for w in words if w not in stops and len(w) > 2] str = " ".join(words) print str stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz') stp = StanfordPOSTagger('english-bidirectional-distsim.tagger') stanfordPosTagList=[word for word,pos in stp.tag(str.split()) if pos == 'NNP'] print "Stanford POS Tagged" print stanfordPosTagList tagged = stn.tag(stanfordPosTagList) print tagged
This gives me
John Donk works POI Brian Jones wants meet Xyz Corp measuring POI Short Term performance Metrics
Stanford POS Tagged
[u'John', u'Donk', u'POI', u'Brian', u'Jones', u'Xyz', u'Corp', u'POI', u'Short', u'Term']
[(u'John', u'PERSON'), (u'Donk', u'PERSON'), (u'POI', u'ORGANIZATION'), (u'Brian', u'ORGANIZATION'), (u'Jones', u'ORGANIZATION'), (u'Xyz', u'ORGANIZATION'), (u'Corp', u'ORGANIZATION'), (u'POI', u'O'), (u'Short', u'O'), (u'Term', u'O')]
Clearly, words like Short and Term are being tagged as NNP. The data I have contains many instances where non-NNP words are capitalized. This may be due to typos, or perhaps they are titles. I don't have much control over this.
How can I parse or clean up the data so that I can detect a non-NNP term even though it may be capitalized? I don't want terms like Short and Term to be classified as NNP.
Also, I am not sure why John Donk was captured as a person but Brian Jones was not. Could it be due to the other capitalized non-NNP words in my data? Could that have an effect on how the StanfordNERTagger treats everything else?
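One thing worth sanity-checking (my own suggestion, not part of the original setup): stn.tag() above is run on stanfordPosTagList, i.e. a filtered list of proper nouns with no sentence context, so the NER model sees an unnatural "sentence". A quick sketch that tags the full cleaned token sequence instead:

# tag the full token sequence so the NER model keeps some sentence context
# (str and stn are the variables from the code above)
tagged_full = stn.tag(str.split())
print tagged_full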
Update: a possible solution
Here is what I plan to do:
Take each word and convert it to lowercase
Tag the lowercased word
If the tag is NNP, then we know the original word must also have been an NNP
If not, then the original word was miscapitalized
Here is what I tried:
str = " ".join(words) print str stp = StanfordPOSTagger('english-bidirectional-distsim.tagger') for word in str.split(): wl = word.lower() print wl w,pos = stp.tag(wl) print pos if pos=="NNP": print "Got NNP" print w
But this gives me an error:
John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics
john
Traceback (most recent call last):
  File "X:\crp.py", line 37, in <module>
    w,pos = stp.tag(wl)
ValueError: too many values to unpack
I have tried multiple variations, but there is always some error. How do I tag a single word?
I don't want to convert the whole string to lowercase and then tag it. If I do that, the StanfordPOSTagger returns an empty string.
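For reference, a likely fix (an assumption based on nltk's tag() signature, not confirmed by the original post): tag() expects a list of tokens and returns a list of (token, tag) tuples, so passing a bare string makes it tag each character individually, which is why the unpacking fails. A minimal sketch of tagging one word at a time:

for word in str.split():
    wl = word.lower()
    # tag() takes a list of tokens and returns a list of (token, tag) pairs
    w, pos = stp.tag([wl])[0]
    if pos == "NNP":
        print "Got NNP"
        print w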
First, take a look at your other question on setting up Stanford CoreNLP to be called from the command line or from Python: nltk: how to prevent stemming of proper nouns.
For the properly-cased sentence, we see that NER works correctly:
>>> from corenlp import StanfordCoreNLP
>>> nlp = StanfordCoreNLP('http://localhost:9000')
>>> text = ('John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics. '
...         'john donk works poi jones wants meet xyz corp measuring poi short term performance metrics')
>>> output = nlp.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,ner', 'outputFormat': 'json'})
>>> annotated_sent0 = output['sentences'][0]
>>> annotated_sent1 = output['sentences'][1]
>>> for token in annotated_sent0['tokens']:
...     print token['word'], token['lemma'], token['pos'], token['ner']
... 
John John NNP PERSON
Donk Donk NNP PERSON
works work VBZ O
POI POI NNP ORGANIZATION
Jones Jones NNP ORGANIZATION
wants want VBZ O
meet meet VB O
Xyz Xyz NNP ORGANIZATION
Corp Corp NNP ORGANIZATION
measuring measure VBG O
POI poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O
. . . O
For the lowercased sentence, you will not get the NNP POS tags nor any NER tags:
>>> for token in annotated_sent1['tokens']:
...     print token['word'], token['lemma'], token['pos'], token['ner']
... 
john john NN O
donk donk JJ O
works work NNS O
poi poi VBP O
jones jone NNS O
wants want VBZ O
meet meet VB O
xyz xyz NN O
corp corp NN O
measuring measure VBG O
poi poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O
So your questions should be:
What is the end goal of your NLP application?
Why is your input lowercased? Is that something you did, or is that how the data was provided to you?
After answering those questions, you can move on to deciding what you really want from the NER tags, i.e.:
If the input was lowercased because of the way you built your NLP tool chain, then:
Don't do that!!! Perform NER on the normal text, without the distortions you created. NER was trained on normally-cased text, so it won't really work well outside the context of normal text.
Also, try not to mix NLP tools from different suites; they usually don't play well together, especially at the end of an NLP tool chain.
If the input was lowercased because that is simply how the raw data comes, then:
Annotate a small portion of the data, or find annotated data that is lowercased, and retrain the model.
Work around it: train a truecasing model on normally-cased text and then apply it to the lowercased text (see the sketch after this list). See https://www.cs.cmu.edu/~llita/papers/lita.truecasing-acl2003.pdf
If the input has erratic casing, e.g. some words are capitalized and some are not, but not all of the capitalized ones are proper nouns, then:
Try the truecasing solution.
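As a rough illustration of the truecasing idea (a minimal frequency-based sketch of my own, much simpler than the model in the paper above): learn the most frequent surface form of every word from normally-cased text, then restore those forms in the lowercased input.

from collections import Counter, defaultdict

def train_truecaser(cased_tokens):
    # count every surface form observed for each lowercased word
    counts = defaultdict(Counter)
    for tok in cased_tokens:
        counts[tok.lower()][tok] += 1
    # keep the most frequent form, e.g. 'john' -> 'John'
    return dict((low, forms.most_common(1)[0][0]) for low, forms in counts.items())

def truecase(tokens, model):
    # unknown words are left as-is
    return [model.get(tok.lower(), tok) for tok in tokens]

truecaser = train_truecaser('John Donk works for POI . John met Brian Jones .'.split())
print truecase('john donk works for poi .'.split(), truecaser)
# ['John', 'Donk', 'works', 'for', 'POI', '.']

This only recovers casing for words seen in the training text, but it shows why a truecaser in front of the POS/NER taggers can rescue lowercased input.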