Using a regular expression in Python to pick text out of HTML

Date: 2015-03-23 08:33:46

Tags: python regex

Part of the original text is shown below; it is stored in a txt file. It resembles HTML source code but is incomplete.

<span style="cursor:pointer" onmousedown="HI466('1056').click()">Steffen Eddine (PhD) (SEED)</span></span></div><script>HI466("100256").checked=T</script><div id=“k62” style="left:95px;top:15px;width:32;height:25;"><span id="321" name="021"><span style="cursor:pointer" onmousedown="HI466('2321').click()">Petra Schmidt (PESC)</span></span></div><script>HI466("239021").checked=T</script><div id=“k62” style="left:65px;top:15px;width:32;height:25;"><span id="306" name="366"><span style="cursor:pointer" onmousedown="HI466('2366').click()">Peter Kumar (PEKU)</span></span></div><script>HI466("230866").checked=T</script><div id=“k62” style="left:25px;top:35px;width:32;height:25;"><span id="425" name="511"><span style="cursor:pointer" onmousedown="HI466('2421').click()">Raksha Khaldoun (RAKH)</span></span></div><script>HI466("242511").checked=T</script><div id=“k62” style="left:95px;top:35px;width:32;height:25;"><span id="176" name="146"><span style="cursor:pointer" onmousedown="HI466('2176').click()">Yash Chevalier (YACH)</span>

What I want is to pick out the names from it, such as "Steffen Eddine (PhD) (SEED)".

Apparently they all start with '<span style="cursor:pointer" onmousedown="':

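import re

# Read the saved HTML fragment into a single string
with open("original_text.txt", "r") as myfile:
    data = myfile.read()

# Matches only the opening-tag prefix, not the name that follows it
aa = re.search('<span style="cursor:pointer" onmousedown="', data)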

How can I pick them out? (I also tried BeautifulSoup, without much success.)


User Aaron submitted the code below. I found it very close to what I need.

However, it only returns 5 copies of '<span style="cursor:pointer" onmousedown="'. What else do I need to do?

for m in re.finditer('<span style="cursor:pointer" onmousedown="', data, re.IGNORECASE | re.MULTILINE):
    print(m.group(0))

3 answers:

Answer 0 (score: 1)

Never use regex to parse HTML or XML files; use a dedicated module such as lxml or BeautifulSoup instead:

>>> from lxml.html import fromstring
>>> s="""<span style="cursor:pointer" onmousedown="HI466('1056').click()">Steffen Eddine (PhD) (SEED)</span></span></div><script>HI466("100256").checked=T</script><div id=“k62” style="left:95px;top:15px;width:32;height:25;"><span id="321" name="021"><span style="cursor:pointer" onmousedown="HI466('2321').click()">Petra Schmidt (PESC)</span></span></div><script>HI466("239021").checked=T</script><div id=“k62” style="left:65px;top:15px;width:32;height:25;"><span id="306" name="366"><span style="cursor:pointer" onmousedown="HI466('2366').click()">Peter Kumar (PEKU)</span></span></div><script>HI466("230866").checked=T</script><div id=“k62” style="left:25px;top:35px;width:32;height:25;"><span id="425" name="511"><span style="cursor:pointer" onmousedown="HI466('2421').click()">Raksha Khaldoun (RAKH)</span></span></div><script>HI466("242511").checked=T</script><div id=“k62” style="left:95px;top:35px;width:32;height:25;"><span id="176" name="146"><span style="cursor:pointer" onmousedown="HI466('2176').click()">Yash Chevalier (YACH)</span>"""
>>> st=fromstring(s)
>>> [c.text for c in st.getchildren() if c.text]
['Steffen Eddine (PhD) (SEED)', 'HI466("100256").checked=T', 'HI466("239021").checked=T', 'HI466("230866").checked=T', 'HI466("242511").checked=T']

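You can extract the text with lxml and then filter the result as needed (the list above still contains the <script> contents, for example).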
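The list above mixes in the <script> contents and misses most of the names, because it only reads the text of the top-level children. A minimal sketch of one way to narrow the selection, assuming the names always sit in a <span> that carries an onmousedown attribute (an assumption based on the sample in the question; the shortened `sample` string here is made up for illustration):

```python
from lxml.html import fromstring

# Shortened, hypothetical stand-in for the text stored in original_text.txt
sample = """
<div><span id="321" name="021"><span style="cursor:pointer" onmousedown="HI466('2321').click()">Petra Schmidt (PESC)</span></span></div>
<script>HI466("239021").checked=T</script>
<div><span id="306" name="366"><span style="cursor:pointer" onmousedown="HI466('2366').click()">Peter Kumar (PEKU)</span></span></div>
"""

root = fromstring(sample)

# Select every <span> that has an onmousedown attribute, at any depth,
# and read its text; <script> elements are never selected.
names = [span.text_content().strip() for span in root.xpath('//span[@onmousedown]')]
print(names)
```

Selecting by attribute rather than by position should keep working if the nesting depth of the spans changes, as long as the onmousedown assumption holds.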

Answer 1 (score: 1)

See the demo here: https://regex101.com/r/gE8rD2/1

import re
# Capture the text that follows a closing '">' up to the next '<'
p = re.compile(r'">([^<]+)', re.MULTILINE)
test_str = "your string"

re.findall(p, test_str)
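The attempt in the question printed only the tag prefix because m.group(0) is the whole match and the pattern stopped at onmousedown=". A capture group returns the text after the tag instead. The sketch below runs this answer's pattern on a shortened, made-up `sample` string and also shows a stricter variant tied to the onmousedown span; the stricter pattern is an illustration, not part of the original answer, and it assumes the attribute order and quoting seen in the question.

```python
import re

# Shortened, hypothetical stand-in for the text stored in original_text.txt
sample = """
<div><span id="321" name="021"><span style="cursor:pointer" onmousedown="HI466('2321').click()">Petra Schmidt (PESC)</span></span></div>
<script>HI466("239021").checked=T</script>
<div><span id="306" name="366"><span style="cursor:pointer" onmousedown="HI466('2366').click()">Peter Kumar (PEKU)</span></span></div>
"""

# Pattern from the answer: any text that directly follows '">'
loose = re.compile(r'">([^<]+)')

# Stricter variant: only text inside the clickable span
anchored = re.compile(
    r'<span style="cursor:pointer" onmousedown="[^"]*">([^<]+)</span>'
)

loose_names = loose.findall(sample)
names = anchored.findall(sample)
print(loose_names)
print(names)  # ['Petra Schmidt (PESC)', 'Peter Kumar (PEKU)']
```

With a single capture group, findall returns the captured text rather than the full match, which is why no prefix appears in the result.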

Answer 2 (score: 1)

The same with BeautifulSoup:

from bs4 import BeautifulSoup  # Beautiful Soup 4 (pip install beautifulsoup4)
data = '''<span style="cursor:pointer" onmousedown="HI466('1056').click()">Steffen Eddine (PhD) (SEED)</span></span></div><script>HI466("100256").checked=T</script><div id=“k62” style="left:95px;top:15px;width:32;height:25;"><span id="321" name="021"><span style="cursor:pointer" onmousedown="HI466('2321').click()">Petra Schmidt (PESC)</span></span></div><script>HI466("239021").checked=T</script><div id=“k62” style="left:65px;top:15px;width:32;height:25;"><span id="306" name="366"><span style="cursor:pointer" onmousedown="HI466('2366').click()">Peter Kumar (PEKU)</span></span></div><script>HI466("230866").checked=T</script><div id=“k62” style="left:25px;top:35px;width:32;height:25;"><span id="425" name="511"><span style="cursor:pointer" onmousedown="HI466('2421').click()">Raksha Khaldoun (RAKH)</span></span></div><script>HI466("242511").checked=T</script><div id=“k62” style="left:95px;top:35px;width:32;height:25;"><span id="176" name="146"><span style="cursor:pointer" onmousedown="HI466('2176').click()">Yash Chevalier (YACH)</span>'''
soup = BeautifulSoup(data)
print([s.string for s in soup.findAll('span') if s.string])
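Because the names sit in nested spans, looping over every <span> can visit both the outer wrapper and the inner clickable span. A minimal sketch of a narrower query, assuming Beautiful Soup 4 (the bs4 package) is installed and that only the name spans carry an onmousedown attribute; the shortened `sample` string is made up for illustration:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the text stored in original_text.txt
sample = """
<div><span id="321" name="021"><span style="cursor:pointer" onmousedown="HI466('2321').click()">Petra Schmidt (PESC)</span></span></div>
<script>HI466("239021").checked=T</script>
<div><span id="306" name="366"><span style="cursor:pointer" onmousedown="HI466('2366').click()">Peter Kumar (PEKU)</span></span></div>
"""

# Name the parser explicitly so the result does not depend on which
# optional parsers happen to be installed.
soup = BeautifulSoup(sample, "html.parser")

# attrs={"onmousedown": True} keeps only spans that have that attribute,
# which skips the outer wrapper spans.
names = [
    span.get_text(strip=True)
    for span in soup.find_all("span", attrs={"onmousedown": True})
]
print(names)
```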