Parsing links and strings from unstructured HTML data

Time: 2014-04-05 18:54:28

Tags: python beautifulsoup

I have an HTML string that looks like this:

        <p>
                                Type: <a href="wee.html">Tough</a><br />

                                Main Type:
                <a href='abnormal.html'>Abnormal</a>                    <br />


                                Wheel:
                <a href='none.html'>None</a>,<a href='squared.html'>Squared</a>,<a href='triangle.html'>Triangled</a>                    <br />

                                Movement type: <a href="forward.html">Forward</a><br />

                                Level: <a href="beginner.html">Beginner</a><br />
            Sport: <a href="no.html">No</a><br/>Force: <a href="pull.html">Pull</a><br/>              <span style="float:left;">Your Rating:&nbsp;</span> <div id="headersmallrating" style="float:left; line-height:20px;"><a href="rate.html">Login to rate</a></div><br />

        </p>

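In other words, it is fairly unstructured. I would like to first detect the strings Type and Main Type, together with their links (and the link text). I tried detecting the words with regular expressions, but that got me nowhere. How can I handle messy data like this?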

2 answers:

Answer 0: (score: 3)

If the categories (Type, Force, and so on) are known in advance, it may be easier to prepare a list of them up front.

Code:

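from bs4 import BeautifulSoup as bsoup
import re

# Parse the saved page; naming the parser explicitly avoids a bs4 warning.
with open("test.html", "rb") as ofile:
    soup = bsoup(ofile, "html.parser")

# Labels known in advance.
categories = ["Type:","Main Type:","Wheel:","Movement type:","Level:","Sport:","Force:"]
for category in categories:
    # Find the first text node containing the label; its next sibling is the <a> tag.
    # (string= needs bs4 4.4 or later; older releases call this argument text=.)
    f = soup.find(string=re.compile(category)).next_sibling
    string = f.get_text()
    ref = f.get("href")
    print("%s %s (%s)" % (category, string, ref))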

Result:

Type: Tough (wee.html)
Main Type: Abnormal (abnormal.html)
Wheel: None (none.html)
Movement type: Forward (forward.html)
Level: Beginner (beginner.html)
Sport: No (no.html)
Force: Pull (pull.html)
[Finished in 0.2s]

Let me know if this helps.

Edit:

This version handles Wheel correctly when it is followed by more than one element.

Code:

from bs4 import BeautifulSoup as bsoup
import re

with open("unstructured.html", "rb") as ofile:
    soup = bsoup(ofile, "html.parser")

categories = ["Type:","Main Type:","Wheel:","Movement type:","Level:","Sport:","Force:"]
for category in categories:
    # The tag right after the label text should be the first <a> link.
    f = soup.find(string=re.compile(category)).next_sibling
    if category != "Wheel:":
        # Single link: print its text and href.
        string = f.get_text()
        ref = f.get("href")
        print("%s %s (%s)" % (category, string, ref))
    else:
        # Wheel: is followed by several links; collect consecutive <a> tags
        # and stop at the first tag that is not a link (the <br />).
        wheel_list = []
        while f is not None and f.name == "a":
            content = f.contents[0]
            res = f.get("href")
            wheel_list.append("%s (%s)" % (content, res))
            f = f.find_next()  # advance to the next tag
        print("%s %s" % (category, ", ".join(wheel_list)))

Result:

Type: Tough (wee.html)
Main Type: Abnormal (abnormal.html)
Wheel: None (none.html), Squared (squared.html), Triangled (triangle.html)
Movement type: Forward (forward.html)
Level: Beginner (beginner.html)
Sport: No (no.html)
Force: Pull (pull.html)
[Finished in 0.3s]

Let us know if this helps.
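The Wheel special case could probably be generalized so that every category is treated the same way. Below is a minimal sketch, not part of the original answer: it assumes each label is a text node whose links are the sibling <a> tags up to the next <br>. The helper name links_after and the shortened SAMPLE markup are made up for illustration.

```python
import re
from bs4 import BeautifulSoup, Tag

# Shortened sample, assumed for illustration; not the full markup from the question.
SAMPLE = """<p>
Type: <a href="wee.html">Tough</a><br />
Wheel:
<a href='none.html'>None</a>,<a href='squared.html'>Squared</a>,<a href='triangle.html'>Triangled</a> <br />
Sport: <a href="no.html">No</a><br/>Force: <a href="pull.html">Pull</a><br/>
</p>"""


def links_after(soup, label):
    """Return (text, href) pairs for the <a> siblings following `label`.

    Hypothetical helper: collection stops at the first <br> after the label.
    """
    node = soup.find(string=re.compile(re.escape(label)))
    if node is None:
        return []  # label not present in the document
    pairs = []
    for sibling in node.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip text such as the "," between links
        if sibling.name == "br":
            break  # end of this label's line
        if sibling.name == "a":
            pairs.append((sibling.get_text(strip=True), sibling.get("href")))
    return pairs


soup = BeautifulSoup(SAMPLE, "html.parser")
for label in ["Type:", "Wheel:", "Sport:", "Force:"]:
    formatted = ", ".join("%s (%s)" % pair for pair in links_after(soup, label))
    print("%s %s" % (label, formatted))
```

Because the loop walks siblings instead of assuming exactly one link, no category needs its own branch, and a missing label yields an empty list rather than an exception.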

Answer 1: (score: 0)

You can do it like this:

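from bs4 import BeautifulSoup
import re

# `html` is the HTML string from the question.
soup = BeautifulSoup(html, "html.parser")
# Calling soup(...) is shorthand for soup.find_all(...).
for elem in soup(string=re.compile(r'Type:')):
    print("%s %s" % (elem.next_sibling.text, elem.next_sibling.get('href')))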

Then do the same for the Size: label.
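Rather than repeating that loop once per label, the labels could be discovered from the markup itself. This is only a sketch under a stated assumption: every label is a text node of the form "Some Name:" immediately followed by an <a> tag. The LABEL pattern, the found dictionary, and the shortened html sample are illustrative and not taken from the question.

```python
import re
from bs4 import BeautifulSoup, Tag

# Shortened sample, assumed for illustration.
html = """<p>
Type: <a href="wee.html">Tough</a><br />
Main Type:
<a href='abnormal.html'>Abnormal</a> <br />
Level: <a href="beginner.html">Beginner</a><br />
<span>Your Rating:&nbsp;</span>
</p>"""

# A label is a text node made of words and spaces that ends with a colon.
LABEL = re.compile(r"^\s*([\w ]+):\s*$")

soup = BeautifulSoup(html, "html.parser")
found = {}
for node in soup.find_all(string=LABEL):
    link = node.next_sibling
    # Keep only labels that are directly followed by a link.
    if isinstance(link, Tag) and link.name == "a":
        label = LABEL.match(node).group(1)
        found[label] = (link.get_text(strip=True), link.get("href"))

for label, (text, href) in found.items():
    print("%s: %s (%s)" % (label, text, href))
```

Text such as "Your Rating:" looks like a label but has no link after it, so the isinstance check skips it. Only the first link after each label is recorded here.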