Question

Suppose I have the following string:

thestring = "1) My Favorite Pokemon Charizard *22.00 MP* [Pre-Avatar Mode Cost: 15.75 MP] [Post-Avatar Mode Cost: 6.250 MP]"

Some other samples might be:

thestring = "1) My Favorite Pokemon Mew *1 MP* [Pre-Avatar Mode Cost: 0.5 MP] [Post-Avatar Mode Cost: 0.5 MP]"

thestring = "1) My Favorite Pokemon Pikachu *6.25 MP* [Pre-Avatar Mode Cost: 5 MP]; [Post-Avatar Mode Cost: 1.25 MP]"

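(The semicolon in the third sample is intentional.)

What is the best way to extract the values of the "Pre-Cast Cost" and the "Post-Avatar Mode Cost"? I have heard of regular expressions, but there is also the string.find method, and I am not sure which is the best way to achieve this. Note that while the "Pre-Avatar Mode Cost" may be 15.75 MP, it can vary by Pokemon and could also be 15.752 or contain more decimal places. Sample syntax would be appreciated.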

Update:

I am using Python 2.7. The closest answer so far is the following:

m = re.match('\[Pre-Avatar Mode Cost: (?P<precost>\d(\.\d*){0,1}) MP\] \[Post-Avatar Mode Cost: (?P<postcost>\d(\.\d*){0,1}) MP\]', '1) My Favorite Pokemon Mew *1 MP* [Pre-Avatar Mode Cost: 0.5 MP] [Post-Avatar Mode Cost: 0.5 MP]')

However, it does not actually seem to match: because there is no match, m ends up being None (a "NoneType").

I changed it slightly to the following:

m = re.match('(.*)\[.*(?P<precost>\d+(\.\d*){0,1}).*\].*\[.*(?P<postcost>\d+(\.\d*){0,1}).*\]', '1) My Favorite Pokemon Mew *1 MP* [Pre-Avatar Mode Cost: 0.5 MP] [Post-Avatar Mode Cost: 0.5 MP]')

However, precost and postcost both seem to equal "5". Any idea what might be wrong with the regular expression?

Answer 1

http://docs.python.org/2/howto/regex.html

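Here is the grouping you need. Note that re.search is used rather than re.match, because re.match only matches at the very start of the string, and that \d+ allows more than one digit before the decimal point:

import re
m = re.search(r'\[Pre-Avatar Mode Cost: (?P<precost>\d+(?:\.\d*)?) MP\];?\s*\[Post-Avatar Mode Cost: (?P<postcost>\d+(?:\.\d*)?) MP\]', '1) My Favorite Pokemon Mew *1 MP* [Pre-Avatar Mode Cost: 0.5 MP] [Post-Avatar Mode Cost: 0.5 MP]')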

Here is how you access the groups:

m.group('precost')
m.group('postcost')
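
Since the question ran into a "NoneType" error, it may help to check for a failed match before calling group(). The following is only a minimal sketch, assuming the labels are spelled exactly as in the samples; the helper name parse_costs and the tolerance for an optional semicolon between the brackets are illustrative assumptions, not part of the answer above:

```python
import re

# Hypothetical helper pattern; the name and the ";?" tolerance are assumptions.
COST_RE = re.compile(
    r"\[Pre-Avatar Mode Cost: (?P<precost>\d+(?:\.\d+)?) MP\];?\s*"
    r"\[Post-Avatar Mode Cost: (?P<postcost>\d+(?:\.\d+)?) MP\]"
)

def parse_costs(text):
    """Return (precost, postcost) as strings, or None if text does not match."""
    # search() scans the whole string; match() would only try position 0.
    m = COST_RE.search(text)
    if m is None:
        return None
    return m.group('precost'), m.group('postcost')
```

Returning None explicitly lets the caller decide how to handle lines that do not contain both costs.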

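If you do not care about the rest of the string and know that the values are inside the two pairs of square brackets, you can do the following (using re.search, since the string does not start with a bracket):

m = re.search(r'\[.*?(?P<precost>\d+(?:\.\d*)?).*?\].*?\[.*?(?P<postcost>\d+(?:\.\d*)?).*\]', 'your long string')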
m.group('precost')
m.group('postcost')
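
The difference between the greedy `.*` in the question's second attempt and the lazy `.*?` used here can be seen side by side. This is a small sketch using the Mew sample from the question; the variable names are made up for illustration:

```python
import re

sample = ("1) My Favorite Pokemon Mew *1 MP* "
          "[Pre-Avatar Mode Cost: 0.5 MP] [Post-Avatar Mode Cost: 0.5 MP]")

# Greedy ".*" (the question's second attempt) consumes as much as it can,
# so only the last digit inside each bracket is left for the named group.
greedy = re.match(
    r"(.*)\[.*(?P<precost>\d+(\.\d*){0,1}).*\].*\[.*(?P<postcost>\d+(\.\d*){0,1}).*\]",
    sample)

# Lazy ".*?" stops at the first digit, so each group captures the whole number.
# re.search is used because the sample does not start with "[".
lazy = re.search(
    r"\[.*?(?P<precost>\d+(?:\.\d*)?).*?\].*?\[.*?(?P<postcost>\d+(?:\.\d*)?).*\]",
    sample)

greedy_costs = (greedy.group('precost'), greedy.group('postcost'))  # ('5', '5')
lazy_costs = (lazy.group('precost'), lazy.group('postcost'))        # ('0.5', '0.5')
```

This is why both groups came back as "5" in the question: the greedy `.*` before each group swallowed "0." and left only the final digit.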

Answer 2

I think a regular expression is the best choice:

pattern = re.compile(r"\[.*?([0-9]+(?:\.[0-9]+)?).*?\]")
pre, post = [float(x) for x in re.findall(pattern, thestring)]

This should work regardless of the number of decimal places (or the lack of them).
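
As a rough check, the same pattern can be run over all three sample strings from the question. This is only a sketch; the function name extract_costs is hypothetical, and it assumes each string contains exactly two bracketed numbers:

```python
import re

pattern = re.compile(r"\[.*?([0-9]+(?:\.[0-9]+)?).*?\]")

def extract_costs(text):
    """Return the first number in each bracketed group as a float, in order."""
    return [float(x) for x in pattern.findall(text)]

samples = [
    "1) My Favorite Pokemon Charizard *22.00 MP* [Pre-Avatar Mode Cost: 15.75 MP] [Post-Avatar Mode Cost: 6.250 MP]",
    "1) My Favorite Pokemon Mew *1 MP* [Pre-Avatar Mode Cost: 0.5 MP] [Post-Avatar Mode Cost: 0.5 MP]",
    "1) My Favorite Pokemon Pikachu *6.25 MP* [Pre-Avatar Mode Cost: 5 MP]; [Post-Avatar Mode Cost: 1.25 MP]",
]

for s in samples:
    # Unpacking raises ValueError unless exactly two bracketed numbers are found.
    pre, post = extract_costs(s)
```

If the exact decimal digits matter (for example 15.752), decimal.Decimal could be used in place of float.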

Answer 3

This may assume too much about what is not in the text you are searching, but it is certainly shorter and possibly faster:

re.findall(r'\[Pre[^:]+:\s+(?P<precost>\S+)[^[]+\[Post[^:]+:\s+(?P<postcost>\S+)',
    thestring)
# with the third sample string, this returns:
[('5', '1.25')]

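These are the assumptions, any of which may turn out to be wrong:

There is always whitespace after the cost and before "MP".
The colon inside the square brackets occurs only once, and always comes right after "Cost".
No other bracketed group begins with the sequence "Pre" or "Post".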
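
To illustrate the first assumption, here is a small sketch of what the pattern captures when the whitespace before "MP" is missing. The no_space input is made up for the example and does not come from the question:

```python
import re

pattern = re.compile(
    r'\[Pre[^:]+:\s+(?P<precost>\S+)[^[]+\[Post[^:]+:\s+(?P<postcost>\S+)')

# Same shape as the third sample in the question.
ok = "[Pre-Avatar Mode Cost: 5 MP]; [Post-Avatar Mode Cost: 1.25 MP]"
# Hypothetical input with no space between the cost and "MP".
no_space = "[Pre-Avatar Mode Cost: 5MP] [Post-Avatar Mode Cost: 1.25 MP]"

good = pattern.findall(ok)       # [('5', '1.25')]
bad = pattern.findall(no_space)  # [('5MP]', '1.25')]
```

Because \S+ takes every non-whitespace character, the unit and the closing bracket end up in the captured value when the space is missing.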

Answer 4

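Definitely RegEx, since it is very precise. I do not see the "Pre-Cast Cost" part you are talking about. Maybe you mean "Pre-Avatar Mode"?

For the Post-Avatar Mode Cost, though, you have to rely on some of the text being consistent. If you know that "Post-Avatar Mode Cost:" is always a consistent delimiter, you can do a simple match.

Assuming you want float values, you could do the following: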

import re
# re.search scans the whole string; re.match would only try position 0.
post_avatar_cost = re.search(r"\[Post-Avatar Mode Cost: (?P<PostCost>[0-9]*\.[0-9]*) MP\]", thestring)
post_avatar_cost = post_avatar_cost.group('PostCost')

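This gives you the float (as a string). I am making a lot of assumptions here, and I wrote this quickly just to give you the idea. But you could put it in a loop to find all of these values.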
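
The loop idea might look something like the sketch below. It assumes every cost has the form "[<label>: <number> MP]"; the pattern, the label capture, and the name all_costs are illustrative assumptions, not something taken from the question:

```python
import re

# Hypothetical pattern: any "[<label>: <number> MP]" group.
LABELED_COST_RE = re.compile(
    r"\[(?P<label>[^:\]]+): (?P<cost>[0-9]+(?:\.[0-9]+)?) MP\]")

def all_costs(text):
    """Map each bracketed label to its cost as a float."""
    costs = {}
    for m in LABELED_COST_RE.finditer(text):
        costs[m.group('label')] = float(m.group('cost'))
    return costs

thestring = ("1) My Favorite Pokemon Pikachu *6.25 MP* "
             "[Pre-Avatar Mode Cost: 5 MP]; [Post-Avatar Mode Cost: 1.25 MP]")
costs = all_costs(thestring)
```

Unlike the single match above, this version also accepts whole numbers such as "5", since the decimal part is optional.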

This page will be your best friend: http://docs.python.org/2/library/re.html

Best way to extract values from the following string in Python?
