I have a page.htm file:
</td></tr>
<tr>
<td height="120" class="box_pic">
<a href="view.php?item=1322679" target="_blank"><img src="http://s.fdert.com/pics.php?q=4iGjVtivCiBKELV%2BVUi27TIgo9KhXusVoizsXDI8FN1HTLACXmZddfsdsgsdghqJXZkz5vSkYq6xISbd2zaUA%3D%3D" alt="[без описания]" width="140" height="105"></a>
</td>
</tr>
<tr align="center" valign="middle">
<td valign="top">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td class="box_prc"><span class="nwr"><img src="/map/gender_pair.gif" width="11" height="11" alt="Сова" border=0> <a class="usernick" href="/index.php?action=user&id=79159" target="_blank">ABird</a></span></td>
</td></tr>
<tr>
<td height="120" class="box_pic">
<a href="view.php?item=1546679" target="_blank"><img src="http://s.fdert.com/pics.php?q=4iGjVtivCiBKELV%2BVUi27TIgo9KhXusVoizsXDI8FN1HTLACXmZddfsdsgsdghqJXZkz5vSkYq6xISbd2zaUA%3D%3D" alt="[без описания]" width="140" height="105"></a>
</td>
</tr>
<tr align="center" valign="middle">
<td valign="top">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td class="box_prc"><span class="nwr"><img src="/map/gender_pair.gif" width="11" height="11" alt="Сова" border=0> <a class="usernick" href="/index.php?action=user&id=78759" target="_blank">ADog</a></span></td>
</td></tr>
<tr>
<td height="120" class="box_pic">
<a href="view.php?item=5622679" target="_blank"><img src="http://s.fdert.com/pics.php?q=4iGjVtivCiBKELV%2BVUi27TIgo9KhXusVoizsXDI8FN1HTLACXmZddfsdsgsdghqJXfdgfdgZkz5vSkYq6xISbd2zaUA%3D%3D" alt="[без описания]" width="140" height="105"></a>
</td>
</tr>
<tr align="center" valign="middle">
<td valign="top">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td class="box_prc"><span class="nwr"><img src="/map/gender_pair.gif" width="11" height="11" alt="Сова" border=0> <a class="usernick" href="/index.php?action=user&id=87159" target="_blank">ACat56</a></span></td>
I need 3 sets of data:
1) 1322679 79159 ABird
2) 1546679 78759 ADog
3) 5622679 87159 ACat56
I have 3 regex queries that pull these elements out of the page:
import re
with open('page.htm', 'r') as our_file:
    page = our_file.read()
result = re.findall(r'view\.php\?item=(\d+)', page)
result2 = re.findall(r'user&id=(\d+)', page)
result3 = re.findall(r'user&id=.*>(\w+)', page)
print (result, len(result))
print (result2, len(result2))
print (result3, len(result3))
The results I get:
['1322679', '1546679', '5622679'] 3
['79159', '78759', '87159'] 3
['ABird', 'ADog', 'ACat56'] 3
Do you know a way to combine these 3 queries into ONE, so that:
1) file would be analyzed 1 time instead of 3 times
2) only ONE re.findall() would be used
3) data would be joined in the way I need
a) 1322679 79159 ABird
b) 1546679 78759 ADog
c) 5622679 87159 ACat56
The resulting query should look something like this:
result = re.findall(r'view\.php\?item=(\d+) SOMETHING_HERE user&id=(\d+) SOMETHING_HERE .*>(\w+)', page)
Answer 0 (score: 1)
Here is how to do it properly with an HTML parser in Python 2:
from urlparse import parse_qs, urlparse

from bs4 import BeautifulSoup

def only(x):
    # Return the single element of x, failing loudly if there is not exactly one
    x = list(x)
    assert len(x) == 1
    return x[0]

def url_params(a):
    # Parse the query string of a link's href into a dict of lists
    return parse_qs(urlparse(a['href']).query)

def main():
    with open('page.htm') as f:
        soup = BeautifulSoup(f, 'html.parser')
    rows = soup.find_all('tr', recursive=False)
    # Data is in alternating rows, so take pairs of rows at a time
    for row1, row2 in zip(rows[::2], rows[1::2]):
        a = only(row1.select('td.box_pic a'))
        item_id = only(url_params(a)['item'])
        a = only(row2.select('a.usernick'))
        user_id = only(url_params(a)['id'])
        nick = a.text
        print item_id, user_id, nick

main()
Output:
1322679 79159 ABird
1546679 78759 ADog
5622679 87159 ACat56
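Now, this may not be as concise as the re approach, but this code knows how the input is structured, and that makes it robust. If the structure of the input changes, such as the format of the URLs or the shape of the HTML, this code will either keep working correctly or raise an error telling you that things are not as expected. The re approach could easily keep running while giving you incorrect results, which is not a situation you want to be in. If you want to extract more information in the future, you can add the necessary lines very easily without disturbing the existing code.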
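For readers on Python 3 without BeautifulSoup installed, the same "fail loudly when the structure is unexpected" idea can be sketched with the standard library alone. This is only an illustration, not the answer's code: the class name `PairParser`, the function `extract_records`, and the trimmed-down `SAMPLE` markup are made up for the example, and it assumes every item link is followed by exactly one `usernick` link.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


class PairParser(HTMLParser):
    """Collect (item_id, user_id, nick) triples; hypothetical helper."""

    def __init__(self):
        super().__init__()
        self.records = []       # finished (item_id, user_id, nick) tuples
        self.item_id = None     # item link seen, user link still pending
        self.user_id = None     # set only while inside <a class="usernick">
        self.nick_parts = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        query = parse_qs(urlparse(attrs.get('href') or '').query)
        if 'item' in query:
            if self.item_id is not None:
                raise ValueError('item %s has no user link' % self.item_id)
            self.item_id = query['item'][0]
        elif attrs.get('class') == 'usernick':
            if self.item_id is None:
                raise ValueError('user link without a preceding item link')
            self.user_id = query['id'][0]
            self.nick_parts = []

    def handle_data(self, data):
        # Link text may arrive in several chunks, so accumulate it
        if self.user_id is not None:
            self.nick_parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.user_id is not None:
            nick = ''.join(self.nick_parts).strip()
            self.records.append((self.item_id, self.user_id, nick))
            self.item_id = self.user_id = None


def extract_records(html):
    parser = PairParser()
    parser.feed(html)
    parser.close()
    if parser.item_id is not None:
        raise ValueError('item %s has no user link' % parser.item_id)
    return parser.records


# Simplified stand-in for page.htm (assumption: same link shapes as the question)
SAMPLE = '''
<tr><td class="box_pic">
<a href="view.php?item=1322679" target="_blank"><img src="x.gif" alt=""></a>
</td></tr>
<tr><td class="box_prc"><a class="usernick" href="/index.php?action=user&id=79159" target="_blank">ABird</a></td></tr>
<tr><td class="box_pic">
<a href="view.php?item=1546679" target="_blank"><img src="x.gif" alt=""></a>
</td></tr>
<tr><td class="box_prc"><a class="usernick" href="/index.php?action=user&id=78759" target="_blank">ADog</a></td></tr>
'''

for item_id, user_id, nick in extract_records(SAMPLE):
    print(item_id, user_id, nick)
# 1322679 79159 ABird
# 1546679 78759 ADog
```

If an item link turns up with no user link after it, `extract_records` raises `ValueError` instead of quietly pairing the item with the next record's user.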
Answer 1 (score: -1)
Finally, I found a solution.
Here is the answer, which meets all the requirements:
import re
with open('page.htm', 'r') as our_file:
    page = our_file.read()
# Strip all whitespace, including newlines, so that "." can match across what used to be separate lines
page = re.sub(r'\s', '', page)
result = re.findall(r'view\.php\?item=(\d+).*?user&id=(\d+).*?>(\w+)', page)
print (result, len(result))
This way:
1) the results are in the needed order
2) only 1 query is used
Result:
[('1322679', '79159', 'ABird'), ('1546679', '78759', 'ADog'), ('5622679', '87159', 'ACat56')] 3
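A possible variation on the pattern above, as a sketch rather than a drop-in replacement: passing `re.DOTALL` as a flag lets `.` cross line breaks, so the page does not have to be stripped of whitespace first, and capturing the nickname with `[^<]+` instead of `\w+` keeps nicknames that contain spaces or hyphens intact. The names `RECORD`, `extract_with_regex`, and `PAGE` are made up for this example, and `PAGE` is a shortened stand-in for page.htm.

```python
import re

# One compiled pattern with three named groups.
# re.DOTALL: "." also matches newlines. re.VERBOSE: layout and comments allowed.
RECORD = re.compile(r'''
    view\.php\?item=(?P<item>\d+)   # item id from the picture link
    .*?                             # lazily skip ahead to the nearest user link
    user&id=(?P<user>\d+)           # user id
    [^>]*>                          # remainder of the opening tag
    (?P<nick>[^<]+)                 # link text, up to the closing tag
''', re.DOTALL | re.VERBOSE)


def extract_with_regex(page):
    """Return a list of (item, user, nick) tuples found in page."""
    return [m.group('item', 'user', 'nick') for m in RECORD.finditer(page)]


# Shortened stand-in for page.htm (assumption: same link shapes as the question)
PAGE = '''
<a href="view.php?item=1322679" target="_blank"><img src="x.gif"></a>
</td>
</tr>
<td class="box_prc"><a class="usernick" href="/index.php?action=user&id=79159" target="_blank">ABird</a></td>
<a href="view.php?item=1546679" target="_blank"><img src="x.gif"></a>
</td>
</tr>
<td class="box_prc"><a class="usernick" href="/index.php?action=user&id=78759" target="_blank">ADog</a></td>
'''

print(extract_with_regex(PAGE))
# [('1322679', '79159', 'ABird'), ('1546679', '78759', 'ADog')]
```

The caveat from the first answer still applies: if one item has no user link, the lazy `.*?` runs on into the next record and pairs that item with the wrong user, without raising any error.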