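I'm trying to put together a parser for football play-by-play data. I use the term "natural language" very loosely here, so please bear with me, as I know next to nothing about this field.
Here are some examples of what I'm working with (format: TIME|DOWN&DIST|OFF_TEAM|DESCRIPTION):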
04:39|4th and 20@NYJ46|Dal|Mat McBriar punts for 32 yards to NYJ14. Jeremy Kerley - no return. FUMBLE, recovered by NYJ.|
04:31|1st and 10@NYJ16|NYJ|Shonn Greene rush up the middle for 5 yards to the NYJ21. Tackled by Keith Brooking.|
03:53|2nd and 5@NYJ21|NYJ|Mark Sanchez rush to the right for 3 yards to the NYJ24. Tackled by Anthony Spencer. FUMBLE, recovered by NYJ (Matthew Mulligan).|
03:20|1st and 10@NYJ33|NYJ|Shonn Greene rush to the left for 4 yards to the NYJ37. Tackled by Jason Hatcher.|
02:43|2nd and 6@NYJ37|NYJ|Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins.|
02:02|1st and 10@NYJ44|NYJ|Shonn Greene rush to the right for 1 yard to the NYJ45. Tackled by Anthony Spencer.|
01:23|2nd and 9@NYJ45|NYJ|Mark Sanchez pass to the left to LaDainian Tomlinson for 5 yards to the 50. Tackled by Sean Lee.|
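So far I've written a dumb parser that handles all the easy stuff (playID, quarter, time, down & distance, offensive team), along with some scripts that fetch this data and sanitize it into the format shown above. Each line gets turned into a "Play" object to be stored in a database.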
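For reference, the easy part can be sketched roughly as follows. This is only a minimal sketch, not my actual parser: the `Play` field names and the `parse_line` helper are hypothetical, and it assumes every line has the `Nth and D@YARDLINE` shape seen in the samples (anything else raises an error).

```python
import re
from dataclasses import dataclass


@dataclass
class Play:
    # Hypothetical field names, chosen for illustration only.
    time: str
    down: int
    distance: int
    yardline: str
    off_team: str
    description: str


# Matches the down & distance field, e.g. "2nd and 6@NYJ37".
DOWN_DIST = re.compile(r"(\d)(?:st|nd|rd|th) and (\d+)@(\w+)")


def parse_line(line: str) -> Play:
    # Drop the trailing "|" and split into exactly four fields.
    time, down_dist, off_team, description = (
        line.strip().rstrip("|").split("|", 3)
    )
    m = DOWN_DIST.fullmatch(down_dist.strip())
    if m is None:
        raise ValueError(f"unrecognized down & distance: {down_dist!r}")
    return Play(
        time=time.strip(),
        down=int(m.group(1)),
        distance=int(m.group(2)),
        yardline=m.group(3),
        off_team=off_team.strip(),
        description=description.strip(),
    )


play = parse_line(
    "02:43|2nd and 6@NYJ37|NYJ|Mark Sanchez pass to the left to "
    "Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins.|"
)
print(play.down, play.distance, play.yardline)
```

The description is kept as one raw string at this stage; everything below is about pulling structure out of it.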
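The hard part here (at least for me) is parsing the description of the play. Here is some of the information I'd like to extract from that string:
Example string: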
"Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins."
Result:
turnover = False
interception = False
fumble = False
to_on_downs = False
passing = True
rushing = False
direction = 'left'
loss = False
penalty = False
scored = False
TD = False
PA = False
FG = False
TPC = False
SFTY = False
punt = False
kickoff = False
ret_yardage = 0
yardage_diff = 7
playmakers = ['Mark Sanchez', 'Shonn Greene', 'Mike Jenkins']
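To make the target concrete, here is a minimal regex sketch that recovers a few of those fields from the example string. The `parse_pass` helper and both patterns are hypothetical; they are fitted only to the phrasing in the pass plays shown above and will likely miss other wordings (for example, names containing a period would cut the tackler match short).

```python
import re

# "<passer> pass to the <direction> to <receiver> for <N> yard(s)"
PASS_RE = re.compile(
    r"(?P<passer>.+?) pass to the (?P<direction>\w+) "
    r"to (?P<receiver>.+?) for (?P<yards>\d+) yards?\b"
)
# "Tackled by <name>."
TACKLE_RE = re.compile(r"Tackled by (?P<tackler>.+?)\.")


def parse_pass(description):
    """Return a dict of extracted fields, or None if this is not a pass."""
    m = PASS_RE.match(description)
    if m is None:
        return None
    playmakers = [m.group("passer"), m.group("receiver")]
    t = TACKLE_RE.search(description)
    if t is not None:
        playmakers.append(t.group("tackler"))
    return {
        "passing": True,
        "rushing": False,
        "direction": m.group("direction"),
        "yardage_diff": int(m.group("yards")),
        "playmakers": playmakers,
    }


example = (
    "Mark Sanchez pass to the left to Shonn Greene for 7 yards "
    "to the NYJ44. Tackled by Mike Jenkins."
)
print(parse_pass(example))
```

This works for the simple case, but it is exactly the kind of one-pattern-per-play-type approach that I suspect will not scale to the messier descriptions.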
The logic I had in mind for my initial parser went something like this:
# pass, rush or kick
# gain or loss of yards
# scoring play
    # Who scored? off or def?
    # TD, PA, FG, TPC, SFTY?
# first down gained
# punt?
# kick?
# return yards?
# penalty?
    # def or off?
# turnover?
    # INT, fumble, to on downs?
# off play makers
# def play makers
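The first few checks in that outline could be sketched as a keyword pass like the one below. It is a rough illustration under stated assumptions: the `classify` helper is hypothetical, and it only looks for tokens that actually appear in the sample lines ("pass", "rush", "punts", "FUMBLE", "to the left/right", "up the middle"). Scoring, penalties, and turnovers are left out because I have no sample wording for them here.

```python
import re

# Direction phrases seen in the samples: "to the left", "to the right",
# "up the middle".
DIRECTION_RE = re.compile(r"\b(?:to the (left|right)|up the (middle))\b")
# First "for N yard(s)" figure; for a punt this is the punt distance.
YARDS_RE = re.compile(r"\bfor (\d+) yards?\b")


def classify(description):
    """Set a few boolean flags from keywords in the description."""
    d = DIRECTION_RE.search(description)
    y = YARDS_RE.search(description)
    return {
        "passing": re.search(r"\bpass\b", description) is not None,
        "rushing": re.search(r"\brush\b", description) is not None,
        "punt": re.search(r"\bpunts\b", description) is not None,
        "fumble": "FUMBLE" in description,
        "direction": (d.group(1) or d.group(2)) if d else None,
        "yards": int(y.group(1)) if y else None,
    }


print(classify(
    "Shonn Greene rush up the middle for 5 yards to the NYJ21. "
    "Tackled by Keith Brooking."
))
print(classify(
    "Mat McBriar punts for 32 yards to NYJ14. Jeremy Kerley - no return. "
    "FUMBLE, recovered by NYJ."
))
```

Note that independent flags like these say nothing about the order of events within a play (who fumbled, who recovered, what happened next), which is where a plain keyword scan starts to fall apart.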
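The descriptions can get pretty hairy (multiple fumbles and recoveries combined with penalties, etc.), and I was wondering whether I could take advantage of some of the NLP modules out there. Chances are I'll end up spending a few days on a dumb, static, state-machine-like parser, but if anyone has suggestions on how to approach this with NLP techniques, I'd like to hear them.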