I have an HTML string that looks like this:
<p>
Type: <a href="wee.html">Tough</a><br />
Main Type:
<a href='abnormal.html'>Abnormal</a> <br />
Wheel:
<a href='none.html'>None</a>,<a href='squared.html'>Squared</a>,<a href='triangle.html'>Triangled</a> <br />
Movement type: <a href=forward.html">Forward</a><br />
Level: <a href="beginner.html">Beginner</a><br />
Sport: <a href="no.html">No</a><br/>Force: <a href="pull.html">Pull</a><br/> <span style="float:left;">Your Rating: </span> <div id="headersmallrating" style="float:left; line-height:20px;"><a href="rate.html">Login to rate</a></div><br />
</p>
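In other words, it is fairly unstructured. I'd like to start by detecting the strings Type and Main Type along with their links (and the link text). I tried using regular expressions to detect the words, but that got me nowhere. How should I deal with messy data like this?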
Answer 0 (score: 3)
If you know the categories (Type, Force, and so on) in advance, it is probably easiest to prepare a list of them up front.
Code:
from bs4 import BeautifulSoup as bsoup
import re

# Parse the saved HTML; naming the parser explicitly avoids bs4's "no parser specified" warning.
with open("test.html", "rb") as ofile:
    soup = bsoup(ofile, "html.parser")

categories = ["Type:", "Main Type:", "Wheel:", "Movement type:", "Level:", "Sport:", "Force:"]
for category in categories:
    # Each label is a bare text node; the <a> we want is its next sibling.
    f = soup.find(string=re.compile(category)).next_sibling
    string = f.get_text()
    ref = f.get("href")
    print("%s %s (%s)" % (category, string, ref))
Result:
Type: Tough (wee.html)
Main Type: Abnormal (abnormal.html)
Wheel: None (none.html)
Movement type: Forward (forward.html)
Level: Beginner (beginner.html)
Sport: No (no.html)
Force: Pull (pull.html)
Let me know if this helps.
Edit:
This version handles Wheel correctly when it is followed by more than one element.
Code:
from bs4 import BeautifulSoup as bsoup
import re

with open("unstructured.html", "rb") as ofile:
    soup = bsoup(ofile, "html.parser")

categories = ["Type:", "Main Type:", "Wheel:", "Movement type:", "Level:", "Sport:", "Force:"]
for category in categories:
    f = soup.find(string=re.compile(category)).next_sibling
    if category != "Wheel:":
        string = f.get_text()
        ref = f.get("href")
        print("%s %s (%s)" % (category, string, ref))
    else:
        # Wheel: is followed by several <a> tags; collect them until
        # the next tag is something else (here, the closing <br />).
        wheel_list = []
        while f is not None and f.name == "a":
            content = f.contents[0]
            res = f.get("href")
            wheel_list.append("%s (%s)" % (content, res))
            f = f.find_next()
        ref = ", ".join(wheel_list)
        print("%s %s" % (category, ref))
Result:
Type: Tough (wee.html)
Main Type: Abnormal (abnormal.html)
Wheel: None (none.html), Squared (squared.html), Triangled (triangle.html)
Movement type: Forward (forward.html)
Level: Beginner (beginner.html)
Sport: No (no.html)
Force: Pull (pull.html)
Let me know if this helps.
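If you would rather not special-case Wheel, one possible generalization is to collect every link that follows a label up to the next line break. The following is only a rough sketch under a few assumptions: the helper name `links_after_label` and the `HTML` sample are made up for illustration (the sample is a trimmed, well-formed copy of the markup in the question), and it assumes each category ends at a `<br>` tag. Anchoring the pattern at the start of the text node keeps "Type:" from also matching "Main Type:", and a missing label simply yields an empty list instead of raising.

```python
import re
from bs4 import BeautifulSoup, Tag

# Illustrative sample only: a shortened copy of the markup from the question.
HTML = """<p>
Type: <a href="wee.html">Tough</a><br />
Main Type:
<a href='abnormal.html'>Abnormal</a> <br />
Wheel:
<a href='none.html'>None</a>,<a href='squared.html'>Squared</a>,<a href='triangle.html'>Triangled</a> <br />
Level: <a href="beginner.html">Beginner</a><br />
</p>"""


def links_after_label(soup, label):
    """Return [(text, href), ...] for the <a> tags between `label` and the next <br>."""
    # Anchor at the start of the text node so "Type:" does not match "Main Type:".
    node = soup.find(string=re.compile(r"^\s*" + re.escape(label)))
    if node is None:
        return []  # label not present in the document
    links = []
    for sibling in node.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip bare text such as commas and whitespace
        if sibling.name == "br":
            break  # assumed end of this category
        if sibling.name == "a":
            links.append((sibling.get_text(strip=True), sibling.get("href")))
    return links


soup = BeautifulSoup(HTML, "html.parser")
for label in ["Type:", "Main Type:", "Wheel:", "Level:"]:
    pairs = links_after_label(soup, label)
    print(label, ", ".join("%s (%s)" % pair for pair in pairs))
```

Because the helper returns a list of pairs rather than printing, the same call works for single-link and multi-link categories alike.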
Answer 1 (score: 0)
You can do it like this:
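from bs4 import BeautifulSoup
import re

# html is assumed to hold the HTML string from the question.
soup = BeautifulSoup(html, "html.parser")
# Calling soup(...) is shorthand for soup.find_all(...); this matches both "Type:" and "Main Type:".
for elem in soup(string=re.compile(r'Type:')):
    print(elem.next_sibling.text, elem.next_sibling.get('href'))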
and likewise for Size: