I wrote the following regex to tag certain phrase patterns:
pattern = """
    P2: {+ ? * + * *}
    P1: { ? + ? * ? * +}
    P3: { }
    P4: { }
"""
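This pattern correctly tags a phrase such as:
a = 'The pizza was good but pasta was bad'
and gives the desired output with 2 phrases:
'pizza was good'
'pasta was bad'
However, if my sentence is something like:
a = 'The pizza was awesome and brilliant'
it matches only the phrase:
'pizza was awesome'
instead of the desired:
'pizza was awesome and brilliant'
How do I extend the regex pattern so that it also covers my second example?
First, let's look at the POS tags that NLTK gives: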
>>> from nltk import pos_tag
>>> sent = 'The pizza was awesome and brilliant'.split()
>>> pos_tag(sent)
[('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]
>>> sent = 'The pizza was good but pasta was bad'.split()
>>> pos_tag(sent)
[('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ'), ('but', 'CC'), ('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')]
(Note: the above is the output of pos_tag in NLTK v3.1; older versions might give different tags.)
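To make the tag sequences easier to compare, the tags can be pulled out of the tagged output on their own. This is only a minimal sketch: the tagged sentences are hard-coded from the output above (so it does not depend on which tagger model is installed), and `tag_sequence` is a hypothetical helper name, not part of NLTK.

```python
# Tagged sentences copied from the pos_tag output shown above (NLTK v3.1).
tagged1 = [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'),
           ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]
tagged2 = [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ'),
           ('but', 'CC'), ('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')]


def tag_sequence(tagged):
    """Return only the POS tags of a tagged sentence, space-separated."""
    return ' '.join(tag for _, tag in tagged)


print(tag_sequence(tagged1))  # DT NN VBD JJ CC JJ
print(tag_sequence(tagged2))  # DT NN VBD JJ CC NN VBD JJ
```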
What you want to capture is essentially:
NN VBD JJ CC JJ
NN VBD JJ
So let's capture them with these patterns:
>>> from nltk import RegexpParser
>>> sent1 = ['The', 'pizza', 'was', 'awesome', 'and', 'brilliant']
>>> sent2 = ['The', 'pizza', 'was', 'good', 'but', 'pasta', 'was', 'bad']
>>> patterns = """
... P: {<NN><VBD><JJ><CC><JJ>}
... {<NN><VBD><JJ>}
... """
>>> PChunker = RegexpParser(patterns)
>>> PChunker.parse(pos_tag(sent1))
Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')])])
>>> PChunker.parse(pos_tag(sent2))
Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ')]), ('but', 'CC'), Tree('P', [('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')])])
So that is "cheating" by hard-coding!
Let's go back to the POS patterns:
NN VBD JJ CC JJ
NN VBD JJ
These can be simplified to:
NN VBD JJ (CC JJ)?
So you can use the optional operator in the regex, e.g.:
>>> patterns = """
... P: {<NN><VBD><JJ>(<CC><JJ>)?}
... """
>>> PChunker = RegexpParser(patterns)
>>> PChunker.parse(pos_tag(sent1))
Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')])])
>>> PChunker.parse(pos_tag(sent2))
Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ')]), ('but', 'CC'), Tree('P', [('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')])])
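Since the question asks for the phrases as plain strings, the chunks can be read back out of the parse tree. The following is a rough sketch rather than the only way to do it: the tagged sentence is hard-coded from the pos_tag output above, the grammar assumes the tags NN, VBD, JJ and CC shown there, and `chunk_phrases` is a hypothetical helper name.

```python
from nltk import RegexpParser

# Tagged sentence copied from the pos_tag output above (NLTK v3.1);
# a different tagger version may assign different tags.
tagged = [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'),
          ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]

chunker = RegexpParser("P: {<NN><VBD><JJ>(<CC><JJ>)?}")
tree = chunker.parse(tagged)


def chunk_phrases(tree, label='P'):
    """Join the words of every subtree carrying the given chunk label."""
    return [' '.join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == label)]


print(chunk_phrases(tree))  # ['pizza was awesome and brilliant']
```

The root of the tree is labelled S, so filtering on the label P leaves only the chunks.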
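Most probably you are using an older tagger, which is why your patterns differ, but I hope the example above shows how to capture the phrases you need.
The steps are:
First, check which POS patterns pos_tag produces for your sentences
Then generalize the patterns and simplify them
Then put them into the RegexpParser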
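One way to sanity-check the generalizing step is to confirm that the simplified grammar yields the same chunks as the hard-coded one on the sentences you care about. The sketch below assumes the tags from the pos_tag output above (hard-coded here, so no tagger model is needed); `chunk_spans` and the two grammar constants are illustrative names only.

```python
from nltk import RegexpParser

# Step 1: the POS tags as reported by pos_tag above (NLTK v3.1).
tagged_sents = [
    [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'),
     ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')],
    [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ'),
     ('but', 'CC'), ('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')],
]

# Step 2: the hard-coded rules and their generalized form.
HARDCODED = """
P: {<NN><VBD><JJ><CC><JJ>}
   {<NN><VBD><JJ>}
"""
GENERALIZED = "P: {<NN><VBD><JJ>(<CC><JJ>)?}"


def chunk_spans(grammar, tagged):
    """Step 3: parse with RegexpParser and return the leaves of each P chunk."""
    tree = RegexpParser(grammar).parse(tagged)
    return [subtree.leaves() for subtree in tree.subtrees()
            if subtree.label() == 'P']


for tagged in tagged_sents:
    # Both grammars should agree on every sentence in the sample.
    assert chunk_spans(HARDCODED, tagged) == chunk_spans(GENERALIZED, tagged)
    print([' '.join(word for word, _ in span)
           for span in chunk_spans(GENERALIZED, tagged)])
```

If your tagger produces different tags, rerun step 1 on your own sentences and adjust the grammar before comparing.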