Question

I have the following (repeating) HTML text, and I need to extract some values from it using Python and regular expressions.

<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>

I can get the first value with:

match_det = re.compile(r'<td width="35.+?">(.+?)</td>').findall(html_source_det)

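But that is all on one line. I also need the second value, which is on the line after the first one, and I can't get that to work. I tried the following, but I get no matches: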

match_det = re.compile('<td width="35.+?">(.+?)</td>\n'
                       '<td width="65.+?value="(.+?)"></td>').findall(html_source_det)

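Maybe I can't get it to work because the text spans multiple lines, but I added "\n" at the end of the first line, so I thought that would take care of it. It did not.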

What am I doing wrong?

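The html_source is retrieved by downloading it (it is not a static HTML file as shown above; I only pasted it here so you can see the text). Maybe this is not the best way to get the source.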

I am getting the html_source like this:

new_url = "https://webaccess.site.int/curracc/" + url_details #not a real url
myresponse_det = urllib2.urlopen(new_url)
html_source_det = myresponse_det.read()

Answer 1

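Please don't try to parse HTML with regular expressions, because HTML is not a regular language. Use an HTML parsing library such as BeautifulSoup instead. It will make your life much easier! Here is an example with BeautifulSoup: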

from bs4 import BeautifulSoup

html = '''<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>'''

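# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')
# Find the 65%-wide cell, then read the value of the input that follows it
print(soup.find('td', attrs={'width': '65%'}).find_next('input')['value'])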

Or, more simply:

print(soup.find('input', attrs={'name': 'T1'})['value'])
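
That said, if you really have to stay with a regex for this one fixed snippet, my guess is that the whitespace between the two cells is the problem: a literal "\n" only matches when the cells are separated by exactly one newline character, so "\r\n" line endings or any indentation in the downloaded page would make the pattern fail. I can't see the real page, so treat this as an assumption. The sample string below (with "\r\n" and indentation) is made up to illustrate that case. A minimal sketch that tolerates any whitespace between the cells:

```python
import re

# Sample input. The \r\n line endings and the indentation are assumptions
# about what the real server might send; they are not shown in the question.
html_source_det = (
    '<tr>\r\n'
    '  <td width="35%">Demand No</td>\r\n'
    '  <td width="65%"><input type="text" name="T1" size="12" '
    'onFocus="this.blur()" value="876716001"></td>\r\n'
    '</tr>\r\n'
)

# \s* matches any run of whitespace (spaces, tabs, \r, \n) between the cells.
# [^>]* keeps each attribute match inside a single tag, unlike .+? which can
# run past the end of the tag.
pattern = re.compile(
    r'<td width="35[^>]*>(.+?)</td>\s*'
    r'<td width="65[^>]*>\s*<input[^>]*?value="([^"]*)"'
)

# findall returns one (label, value) tuple per repeated row
matches = pattern.findall(html_source_det)
print(matches)  # [('Demand No', '876716001')]
```

This still breaks as soon as the markup changes (attribute order, extra cells, different quoting), which is why I would go with the parser.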
