Question

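So I've had a lot of success as long as the data I'm matching doesn't span more than one line; as soon as it spans more than one line, I get heartburn (apparently)... Here is a snippet of the HTML data I get: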

<tr>
<td width=20%>3 month
<td width=1% class=bar>
&nbsp;
<td width=1% nowrap class="value chg">+10.03%
<td width=54% class=bar>
<table width=100% cellpadding=0 cellspacing=0 class=barChart>
<tr>

I'm interested in the "+10.03%" number, and

<td width=20%>3 month

is the pattern that tells me "+10.03%" is the value I want.

So far I've got this in Python:

percent = re.search('<td width=20%>3 month\r\n<td width=1% class=bar>\r\n&nbsp;\r\n<td width=1% nowrap class="value chg">(.*?)', content)

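where the variable content contains all the HTML I'm searching. This doesn't seem to work for me... any suggestions would be greatly appreciated! I've read several posts about re.compile() and re.multiline(), but I haven't had any luck with them, mostly because I don't understand how they work, I guess...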

Answer 1

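Thanks everyone for the help! You pointed me in the right direction, and this is how I got my code working with BeautifulSoup. I noticed that all the data I wanted sits in cells with the class "value chg", and that my data is always the 3rd and 5th element of that search, so this is what I did: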

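# Python 2 with BeautifulSoup 3 (the old `BeautifulSoup` package).
from BeautifulSoup import BeautifulSoup
import urllib

# `url` is the address of the page being scraped (not shown here).
content = urllib.urlopen(url).read()
soup = BeautifulSoup(content)

# Every cell of interest carries the class "value chg".
td_list = soup.findAll('td', {'class':'value chg'} )

# The wanted values (mon3, yr1) are the 3rd and 5th matches.
mon3 = td_list[2].text.encode('ascii','ignore')
yr1 = td_list[4].text.encode('ascii','ignore')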

Again, "content" is the HTML I downloaded.
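For readers who don't have BeautifulSoup available, the same idea (collect the text of every td whose class is "value chg") can be roughly sketched with the standard library's html.parser. This is a swapped-in technique, not the code from the answer above; ValueChgCollector and extract_values are hypothetical names, and the sample is just the snippet from the question.

```python
from html.parser import HTMLParser


class ValueChgCollector(HTMLParser):
    """Collects the text that follows each <td class="value chg"> tag."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        # Any new tag ends the previous cell (the sample HTML has no </td>).
        classes = (dict(attrs).get("class") or "").split()
        self._capturing = tag == "td" and {"value", "chg"} <= set(classes)
        if self._capturing:
            self.values.append("")

    def handle_endtag(self, tag):
        self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self.values[-1] += data


def extract_values(html):
    """Return the stripped text of every 'value chg' cell, in document order."""
    parser = ValueChgCollector()
    parser.feed(html)
    parser.close()
    return [value.strip() for value in parser.values]


# Sample input: the snippet from the question.
sample = (
    '<tr>\n<td width=20%>3 month\n<td width=1% class=bar>\n&nbsp;\n'
    '<td width=1% nowrap class="value chg">+10.03%\n'
    '<td width=54% class=bar>\n'
    '<table width=100% cellpadding=0 cellspacing=0 class=barChart>\n<tr>'
)
print(extract_values(sample))
```

On the full page, the 3rd and 5th entries of the returned list would play the role of td_list[2] and td_list[4], assuming the page layout is as the answer describes.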

Answer 2

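You can add the "multiline" regex switch (?m), but note that it only changes how ^ and $ match; what lets the pattern span several lines is matching the line breaks explicitly. You can use findall to extract the target content directly, and get the first match with findall(regex, content)[0]: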

percent = re.findall(r'(?m)<td width=20%>3 month\s*<td width=1% class=bar>\s*&nbsp;\s*<td width=1% nowrap class="value chg">(\S+)', content)[0]

By using \s* to match the line breaks, the regex is compatible with both Unix and Windows style line terminators.
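A small sketch of that claim, assuming the same four lines of HTML joined once with "\n" and once with "\r\n". The (?m) flag is left out here because it only affects ^ and $, which this pattern does not use; PATTERN and the variable names are made up for the illustration.

```python
import re

# Same pattern as in the answer, split over several lines for readability.
PATTERN = re.compile(
    r'<td width=20%>3 month\s*'
    r'<td width=1% class=bar>\s*'
    r'&nbsp;\s*'
    r'<td width=1% nowrap class="value chg">(\S+)'
)

lines = [
    '<td width=20%>3 month',
    '<td width=1% class=bar>',
    '&nbsp;',
    '<td width=1% nowrap class="value chg">+10.03%',
]

# The same markup with Unix and with Windows line terminators.
unix_content = '\n'.join(lines)
windows_content = '\r\n'.join(lines)

unix_match = PATTERN.search(unix_content).group(1)
windows_match = PATTERN.search(windows_content).group(1)
print(unix_match, windows_match)
```

Because \s matches "\r" as well as "\n", both inputs yield the same captured value.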

See the live demo of the following test code:

import re
content = '<tr>\n<td width=20%>3 month\n<td width=1% class=bar>\n&nbsp;\n<td width=1% nowrap class="value chg">+10.03%\n<td width=54% class=bar>\n<table width=100% cellpadding=0 cellspacing=0 class=barChart>\n<tr>'        
percent = re.findall(r'(?m)<td width=20%>3 month\s*<td width=1% class=bar>\s*&nbsp;\s*<td width=1% nowrap class="value chg">(\S+)', content)[0]
print(percent)

Output:

+10.03%
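One caveat with findall(...)[0] is that it raises IndexError when the page does not contain the pattern. A possible variant, sketched here with re.search and a hypothetical helper named find_percent, returns None in that case instead:

```python
import re

# Same pattern as above, compiled once.
PATTERN = re.compile(
    r'<td width=20%>3 month\s*<td width=1% class=bar>\s*&nbsp;\s*'
    r'<td width=1% nowrap class="value chg">(\S+)'
)


def find_percent(content):
    """Return the captured value, or None if the pattern is not found."""
    match = PATTERN.search(content)
    return match.group(1) if match else None


content = (
    '<tr>\n<td width=20%>3 month\n<td width=1% class=bar>\n&nbsp;\n'
    '<td width=1% nowrap class="value chg">+10.03%\n<td width=54% class=bar>'
)
print(find_percent(content))
print(find_percent('<tr><td>no such row</td></tr>'))
```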

Extracting multi-line data from an HTML site with Python
