Advanced string parsing in Python

Time: 2015-05-06 22:33:45

Tags: python regex

I am having trouble parsing a complex string. The string is very long and full of patterns, but we can focus on the only part I need to extract.

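A substring of the huge string is:

... [span class=\"review-title\"]Wont open[/span] I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server. [div class=\"review-link\" ...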

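Now I want to extract the review text, that is, the part after [/span] and before [div. The pattern is [span class=..]*[/span] required text [div ...], and it repeats throughout the whole string.

How exactly do I get this specific text out of the whole string and write it out line by line?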

3 Answers:

Answer 0: (score: 2)

This pattern should get the string; just take the Group 1 value:

r'\[span\b[^]]*class=[\\"\']*review-title\b[^]]*][^[]*\[/span\]\s*([^[]*)\[div\b'

Or a more generic one that does not check for class="review-title":

r'\[span\b[^]]*][^[]*\[/span\]\s*([^[]*)\[div\b'

Sample code on IDEONE (written here for Python 3):

import re

# Group 1 captures the text between [/span] and the next [div
p = re.compile(r'\[span\b[^]]*][^[]*\[/span\]\s*([^[]*)\[div\b')
test_str = "[span class=\"review-title\"]Wont open[/span] I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server. [div class=\"review-link\" "
print(p.search(test_str).group(1))

Output:

I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server.

EDIT: Since the [] are actually <>s, here is the updated code:

import re

# Same idea with angle brackets; finditer returns every match, not just the first
p = re.compile(r'<span\b[^>]*>[^<]*</span>\s*([^<]*)<div\b')
test_str = "<span class=\"review-title\">Wont open</span> I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server. <div class=\"review-link\" "
print([m.group(1) for m in p.finditer(test_str)])

And here is an updated regex that accounts for the class attribute:

p = re.compile(r'<span\b[^>]*class\s*=\s*[\\\'"]*review-title[^>]*>[^<]*</span>\s*([^<]*)<div\b')
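
To cover the "write it line by line" part of the question, the class-aware pattern can be run over a string that holds several reviews. This is only a minimal sketch, assuming Python 3; the helper name extract_reviews and the two-review sample string are made up for illustration and are not part of the original answer.

```python
import re

# Class-aware pattern from the answer, written as a Python 3 raw string.
REVIEW_RE = re.compile(
    r'<span\b[^>]*class\s*=\s*[\\\'"]*review-title[^>]*>[^<]*</span>\s*([^<]*)<div\b'
)


def extract_reviews(html):
    """Return the text between each review-title span and the following div."""
    return [m.group(1).strip() for m in REVIEW_RE.finditer(html)]


# Hypothetical input: two reviews using the backslash-escaped quotes from the question.
sample = (
    '<span class=\\"review-title\\">Wont open</span> The game wont open. '
    '<div class=\\"review-link\\"></div>'
    '<span class=\\"review-title\\">Great</span> Works fine on my phone. '
    '<div class=\\"review-link\\"></div>'
)

reviews = extract_reviews(sample)
# One review per line, as the question asks.
print("\n".join(reviews))
```

The same list could be written to a file with "\n".join(reviews) instead of printed.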

Answer 1: (score: 1)

Based on your comment ("I'm having trouble with this; the [] were originally <>"), it is clear that what you have is HTML.

Do not try to parse HTML with regex

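What you want here is an HTML parser. For example:

from bs4 import BeautifulSoup

# huge_string is the HTML text from the question
soup = BeautifulSoup(huge_string, 'html.parser')
# class is a reserved word in Python, so bs4 spells the keyword class_
for span in soup.find_all('span', class_='review-title'):
    # the review text is the node right after the span
    text = span.next_sibling
    print(text)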

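Even if what you have is HTML that has been escaped somehow (backslash-escaped quotes, angle brackets turned into square brackets, and so on), you still don't want to parse it with a regex. In that case, at most you might use a regex as a preprocessor to convert it back into HTML that you can then feed to an HTML parser.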
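
That preprocessing idea might look roughly like the sketch below. It assumes the escaping is exactly what the question shows (square brackets standing in for angle brackets, plus backslash-escaped quotes), and it swaps in the standard library's html.parser for BeautifulSoup so that it runs without third-party packages. The names unescape_markup and ReviewTextParser are hypothetical.

```python
import re
from html.parser import HTMLParser


def unescape_markup(text):
    """Restore angle brackets and plain quotes in bracket-escaped markup."""
    # Only rewrite brackets that look like tags, so ordinary [text] survives.
    text = re.sub(r'\[(/?[A-Za-z][^\[\]]*)\]', r'<\1>', text)
    return text.replace('\\"', '"')


class ReviewTextParser(HTMLParser):
    """Collect the text that follows each span with class review-title."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._in_title = False  # True while inside the review-title span
        self._buffer = None     # list of text chunks seen after the span closed

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "review-title" in classes:
            self._flush()
            self._in_title = True
        elif tag == "div" and self._buffer is not None:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "span" and self._in_title:
            self._in_title = False
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        # Store the collected text (if any) and stop collecting.
        if self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.reviews.append(text)
            self._buffer = None


# Hypothetical input in the escaped form shown in the question.
escaped = (
    '[span class=\\"review-title\\"]Wont open[/span] The game wont open. '
    '[div class=\\"review-link\\"][/div]'
)
parser = ReviewTextParser()
parser.feed(unescape_markup(escaped))
parser.close()
print(parser.reviews)
```

The regex here only undoes the escaping; deciding what counts as a review is left to the parser, which is the division of labour the answer recommends.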

Answer 2: (score: 0)

It looks like you just need this regex:

(?<=\[/span\])[\s\S]*?(?=\[div)
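
A possible way to apply this lookaround pattern, assuming Python 3 and a made-up two-review sample in the bracketed form from the question. Note that the pattern does not check the class attribute, so it matches the text between any [/span] and the next [div, and the matches keep their surrounding whitespace until stripped.

```python
import re

# Lookaround pattern from the answer: lazily match everything
# between a closing [/span] and the next [div.
BETWEEN_RE = re.compile(r'(?<=\[/span\])[\s\S]*?(?=\[div)')

# Hypothetical input with two reviews.
sample = (
    '[span class=\\"review-title\\"]Wont open[/span] The game wont open. '
    '[div class=\\"review-link\\" '
    '[span class=\\"review-title\\"]Great[/span] Works fine. '
    '[div class=\\"review-link\\" '
)

# findall returns whole matches because the pattern has no capture groups.
matches = [m.strip() for m in BETWEEN_RE.findall(sample)]
print("\n".join(matches))
```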