Natural language parser for parsing sports play-by-play data

Time: 2011-11-20 02:05:57

Tags: python parsing nlp

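I'm trying to come up with a parser for football plays. I use the term "natural language" very loosely here, so please bear with me, as I know next to nothing about this field.

Here are some examples of what I'm working with (format: TIME | DOWN&DIST | OFF_TEAM | DESCRIPTION):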

04:39|4th and 20@NYJ46|Dal|Mat McBriar punts for 32 yards to NYJ14. Jeremy Kerley - no return. FUMBLE, recovered by NYJ.|
04:31|1st and 10@NYJ16|NYJ|Shonn Greene rush up the middle for 5 yards to the NYJ21. Tackled by Keith Brooking.|
03:53|2nd and 5@NYJ21|NYJ|Mark Sanchez rush to the right for 3 yards to the NYJ24. Tackled by Anthony Spencer. FUMBLE, recovered by NYJ (Matthew Mulligan).|
03:20|1st and 10@NYJ33|NYJ|Shonn Greene rush to the left for 4 yards to the NYJ37. Tackled by Jason Hatcher.|
02:43|2nd and 6@NYJ37|NYJ|Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins.|
02:02|1st and 10@NYJ44|NYJ|Shonn Greene rush to the right for 1 yard to the NYJ45. Tackled by Anthony Spencer.|
01:23|2nd and 9@NYJ45|NYJ|Mark Sanchez pass to the left to LaDainian Tomlinson for 5 yards to the 50. Tackled by Sean Lee.|

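So far I have written a dumb parser that handles all the easy stuff (playID, quarter, time, down and distance, offensive team), along with some scripts that fetch this data and sanitize it into the format shown above. Each line becomes a "Play" object to be stored in a database.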
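For reference, the "easy" part can be sketched in a few lines. This is only an illustration: the `RawPlay` type, its field names, and the `parse_line` helper are hypothetical and not the asker's actual code, and the regex assumes every down/distance field looks like `4th and 20@NYJ46`.

```python
import re
from dataclasses import dataclass

# Hypothetical record type; field names are illustrative only.
@dataclass
class RawPlay:
    clock: str        # e.g. "04:39"
    down: int         # 1 to 4
    distance: int     # yards to go
    spot: str         # e.g. "NYJ46"
    offense: str      # e.g. "Dal"
    description: str  # free text that still needs parsing

# Assumption: the field always has the shape "<N>th and <yards>@<spot>".
# Variants not shown in the samples would raise ValueError below.
DOWN_DIST = re.compile(r"(\d)(?:st|nd|rd|th) and (\d+)@(\w+)")

def parse_line(line: str) -> RawPlay:
    # Drop the trailing "|" so the split yields exactly four fields.
    clock, down_dist, offense, description = line.rstrip().rstrip("|").split("|", 3)
    m = DOWN_DIST.fullmatch(down_dist)
    if m is None:
        raise ValueError(f"unrecognized down/distance: {down_dist!r}")
    return RawPlay(clock, int(m.group(1)), int(m.group(2)), m.group(3),
                   offense, description)

play = parse_line("04:31|1st and 10@NYJ16|NYJ|Shonn Greene rush up the middle "
                  "for 5 yards to the NYJ21. Tackled by Keith Brooking.|")
print(play.down, play.distance, play.spot, play.offense)
```

Only `description` is left as raw text, which is the part the rest of the question is about.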

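The hard part here (at least for me) is parsing the description of the play. Here is some of the information I would like to extract from that string:

Example string: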

"Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins."

Result:

turnover = False
interception = False
fumble = False
to_on_downs = False
passing = True
rushing = False
direction = 'left'
loss = False
penalty = False
scored = False
TD = False
PA = False
FG = False
TPC = False
SFTY = False
punt = False
kickoff = False
ret_yardage = 0
yardage_diff = 7
playmakers = ['Mark Sanchez', 'Shonn Greene', 'Mike Jenkins']
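One way to hold this result is a small dataclass whose flags default to `False`, so a parser only has to set what it actually finds. The class name `PlayResult` is made up for illustration; the field names are copied from the list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PlayResult:
    # Turnover-related flags
    turnover: bool = False
    interception: bool = False
    fumble: bool = False
    to_on_downs: bool = False
    # Play type
    passing: bool = False
    rushing: bool = False
    punt: bool = False
    kickoff: bool = False
    direction: Optional[str] = None  # 'left', 'right', ... or None if absent
    # Outcome
    loss: bool = False
    penalty: bool = False
    scored: bool = False
    TD: bool = False
    PA: bool = False
    FG: bool = False
    TPC: bool = False
    SFTY: bool = False
    ret_yardage: int = 0
    yardage_diff: int = 0
    # default_factory gives every instance its own list
    playmakers: List[str] = field(default_factory=list)

example = PlayResult(
    passing=True,
    direction="left",
    yardage_diff=7,
    playmakers=["Mark Sanchez", "Shonn Greene", "Mike Jenkins"],
)
```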

My logic for the initial parser goes something like this:

# pass, rush or kick
# gain or loss of yards
# scoring play
    # Who scored? off or def?
    # TD, PA, FG, TPC, SFTY?
# first down gained
# punt?
# kick?
    # return yards?
# penalty?
    # def or off?
# turnover?
    # INT, fumble, to on downs?
# off play makers
# def play makers
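A rough sketch of the first few checks in that outline, using only the standard `re` module. It covers just the wording that appears in the sample lines above (pass, rush, punts, FUMBLE, "to the left/right", "up the middle"). Scoring, penalties, interceptions, and losses are left out because the samples do not show how they are phrased. The `classify` function and the name regex are assumptions for illustration.

```python
import re

# Assumed name shape: two capitalized words, allowing inner capitals
# such as "McBriar" or "LaDainian". All-caps tokens (FUMBLE, NYJ) do not match.
NAME = re.compile(r"\b[A-Z][a-z]+(?:[A-Z][a-z]+)* [A-Z][a-z]+(?:[A-Z][a-z]+)*\b")
DIRECTION = re.compile(r"\b(?:to the (left|right)|up the (middle))\b")
YARDS = re.compile(r"\bfor (\d+) yards?\b")
RECOVERED = re.compile(r"FUMBLE, recovered by (\w+)")

def classify(description):
    direction = DIRECTION.search(description)
    yards = YARDS.search(description)
    recovered = RECOVERED.search(description)
    return {
        # pass, rush or kick
        "passing": bool(re.search(r"\bpass\b", description)),
        "rushing": bool(re.search(r"\brush\b", description)),
        "punt": bool(re.search(r"\bpunts?\b", description)),
        "direction": (direction.group(1) or direction.group(2)) if direction else None,
        # yardage of the first "for N yards" phrase (for a punt, the kick distance)
        "yardage_diff": int(yards.group(1)) if yards else 0,
        # turnover-related: only fumbles are recognizable from the samples
        "fumble": "FUMBLE" in description,
        "recovered_by": recovered.group(1) if recovered else None,
        # offensive and defensive playmakers, not yet told apart
        "playmakers": NAME.findall(description),
    }

print(classify("Mark Sanchez pass to the left to Shonn Greene for 7 yards "
               "to the NYJ44. Tackled by Mike Jenkins."))
```

Deciding whether a fumble is an actual turnover would also need the offensive team from the line's third field, which this sketch does not take as input.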

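The descriptions can get quite hairy (multiple fumbles and recoveries with penalties, etc.), and I was wondering whether I could take advantage of some NLP modules out there. Chances are I'm going to spend a few days on a dumb/static state-machine-like parser, but if anyone has suggestions on how to approach it using NLP techniques, I'd like to hear about them.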

2 Answers:

Answer 0: (score: 4)

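I think pyparsing would be very useful here.

Your input text looks very regular (unlike real natural language), and pyparsing is great at this kind of thing. You should have a look at it.

For example, to parse the following sentences: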

Mat McBriar punts for 32 yards to NYJ14.
Mark Sanchez rush to the right for 3 yards to the NYJ24.

You could define a parse pattern with something like this (look up the exact syntax in the docs):

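from pyparsing import Group, Literal, Optional, Word, alphas, alphanums, nums

# Player name: two alphabetic words, grouped so they come back together
name = Group(Word(alphas) + Word(alphas)).setResultsName('name')

# "punts" has no direction; "rush" may be followed by "to the left/right"
action = (Literal("punts") | Literal("rush")).setResultsName('action') + Optional(Literal("to") + Literal("the") + (Literal("left") | Literal("right")))

distance = Word(nums).setResultsName("distance") + Literal("yards")

# The spot is "NYJ14" in one sentence and "the NYJ24" in the other
pattern = name + action + Literal("for") + distance + Literal("to") + Optional(Literal("the")) + Word(alphanums)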

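pyparsing will break up the strings using this pattern. It also returns a dictionary-like result holding the named items (name, action, and distance) extracted from the sentence.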
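To make the idea concrete, here is a self-contained sketch that can be run end to end, assuming pyparsing is installed. It is one possible grammar for the sample sentences, not the definitive one: the `parse_play` helper and the `direction` and `spot` result names are additions for illustration, and the grammar only knows the "punts" and "rush" wording plus the "up the middle" variant from the sample data.

```python
from pyparsing import Group, Literal, Optional, Suppress, Word, alphas, alphanums, nums

# Player name: two alphabetic words, grouped so they come back together.
name = Group(Word(alphas) + Word(alphas))("name")
action = (Literal("punts") | Literal("rush"))("action")

# Optional direction: "to the left", "to the right" or "up the middle".
# Suppress() matches the filler words but keeps them out of the result.
direction = (
    Suppress("to") + Suppress("the") + (Literal("left") | Literal("right"))("direction")
    | Suppress("up") + Suppress("the") + Literal("middle")("direction")
)

distance = Word(nums)("distance") + Suppress(Literal("yards") | Literal("yard"))
spot = Suppress("to") + Optional(Suppress("the")) + Word(alphanums)("spot")

pattern = name + action + Optional(direction) + Suppress("for") + distance + spot

def parse_play(sentence):
    # parseString raises ParseException if the sentence does not fit the grammar.
    r = pattern.parseString(sentence)
    return {
        "name": " ".join(r["name"]),
        "action": r["action"],
        "direction": r.get("direction"),  # None when no direction was given
        "distance": int(r["distance"]),
        "spot": r["spot"],
    }

for s in ("Mat McBriar punts for 32 yards to NYJ14.",
          "Mark Sanchez rush to the right for 3 yards to the NYJ24."):
    print(parse_play(s))
```

Anything after the matched prefix (for example "Tackled by ...") is ignored here, since `parseString` does not require the whole input to be consumed by default. Each further clause would need its own sub-pattern.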

Answer 1: (score: 0)

I think pyparsing would work well, but rule-based systems are quite brittle. So if you go beyond football, you may run into trouble.

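I think a better solution for this case would be a part-of-speech tagger plus a lexicon (read: dictionary) of player names, positions, and other sports terminology. Dump that into your favorite machine learning tool, figure out good features, and I think it would do pretty well.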
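As a very rough illustration of the lexicon half of that idea, the sketch below turns each token into a feature dictionary. Everything in it is assumed for the example: the tiny `PLAYERS` and `ACTIONS` sets stand in for real rosters and a glossary, and the feature names are arbitrary. A part-of-speech tag from a tagger would simply be one more entry in each dictionary.

```python
import re

# Hypothetical lexicons; real ones would be loaded from rosters and a glossary.
PLAYERS = {"Mark Sanchez", "Shonn Greene", "Mike Jenkins"}
ACTIONS = {"pass", "rush", "punts"}

def tokenize(text):
    # Words and single punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def token_features(tokens, i):
    tok = tokens[i]
    bigram = " ".join(tokens[i:i + 2])
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),      # "Mark", "Tackled"
        "is_upper": tok.isupper(),      # "FUMBLE", "NYJ44"
        "is_digit": tok.isdigit(),      # yardage candidates
        "starts_player": bigram in PLAYERS,  # lexicon lookup on this + next token
        "is_action": tok in ACTIONS,
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
    }

def featurize(text):
    tokens = tokenize(text)
    return [token_features(tokens, i) for i in range(len(tokens))]

for feats in featurize("Mark Sanchez pass to the left to Shonn Greene for 7 yards."):
    print(feats)
```

Feature dictionaries like these are a common input shape for classifiers. NLTK's classifiers, for instance, train on pairs of a feature dictionary and a label. The labels themselves would have to come from hand-annotated plays.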

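NLTK is a good place to start with NLP. Unfortunately, the field isn't very developed, and there isn't a tool out there that just goes bam, problem solved, easy peasy.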