Question

I have an HTML string that looks like this:

        <p>
                                Type: <a href="wee.html">Tough</a><br />

                                Main Type:
                <a href='abnormal.html'>Abnormal</a>                    <br />


                                Wheel:
                <a href='none.html'>None</a>,<a href='squared.html'>Squared</a>,<a href='triangle.html'>Triangled</a>                    <br />

                                Movement type: <a href=forward.html">Forward</a><br />

                                Level: <a href="beginner.html">Beginner</a><br />
            Sport: <a href="no.html">No</a><br/>Force: <a href="pull.html">Pull</a><br/>              <span style="float:left;">Your Rating:&nbsp;</span> <div id="headersmallrating" style="float:left; line-height:20px;"><a href="rate.html">Login to rate</a></div><br />

        </p>

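In other words, it is somewhat unstructured. I would like to start by detecting the strings Type and Main Type together with their links (and link texts). I tried detecting the words with a regular expression, but that got me nowhere. How should I handle messy data like this?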

Answer 1

If the categories (Type, Force, and so on) are known in advance, it may be easier to prepare a list of them up front.

Code:

from bs4 import BeautifulSoup as bsoup
import re

# Parse the saved page; the file is closed once the soup has been built.
with open("test.html", "rb") as ofile:
    soup = bsoup(ofile)

categories = ["Type:","Main Type:","Wheel:","Movement type:","Level:","Sport:","Force:"]
for category in categories:
    # Each label is a bare text node; the <a> tag right after it is its next sibling.
    f = soup.find(string=re.compile(category)).next_sibling
    string = f.get_text()
    ref = f.get("href")
    print("%s %s (%s)" % (category, string, ref))

Result:

Type: Tough (wee.html)
Main Type: Abnormal (abnormal.html)
Wheel: None (none.html)
Movement type: Forward (forward.html)
Level: Beginner (beginner.html)
Sport: No (no.html)
Force: Pull (pull.html)
[Finished in 0.2s]

Let me know if this helps.

Edit

The following version handles Wheel correctly when it is followed by more than one element.

Code:

from bs4 import BeautifulSoup as bsoup
import re

with open("unstructured.html", "rb") as ofile:
    soup = bsoup(ofile)

categories = ["Type:","Main Type:","Wheel:","Movement type:","Level:","Sport:","Force:"]
for category in categories:
    wheel_list = []
    f = soup.find(string=re.compile(category)).next_sibling
    if category != "Wheel:":
        string = f.get_text()
        ref = f.get("href")
        print("%s %s (%s)" % (category, string, ref))
    else:
        # Collect consecutive <a> tags; find_next() skips the "," text between them
        # and the loop stops at the first tag that is not a link (here the <br>).
        while f.name == "a":
            content = f.contents[0]
            res = f.get("href")
            wheel_list.append("%s (%s)" % (content, res))
            f = f.find_next()
        ref = ", ".join(wheel_list)
        print("%s %s" % (category, ref))

Result:

Type: Tough (wee.html)
Main Type: Abnormal (abnormal.html)
Wheel: None (none.html), Squared (squared.html), Triangled (triangle.html)
Movement type: Forward (forward.html)
Level: Beginner (beginner.html)
Sport: No (no.html)
Force: Pull (pull.html)
[Finished in 0.3s]

Let me know if this helps.
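The Wheel special case could probably be generalized so that every category is treated the same way. The following is only a sketch, not part of the code above: links_after is a made-up helper name, the HTML is a trimmed copy of the question's snippet (the malformed Movement type line is left out because its handling depends on the parser), and "html.parser" is chosen just to keep the example dependency-light. It walks the siblings after a label and collects every link until the next br tag.

```python
import re
from bs4 import BeautifulSoup, Tag

# Trimmed copy of the question's markup (assumption: the malformed line is omitted).
HTML = """<p>
Type: <a href="wee.html">Tough</a><br />
Main Type:
<a href='abnormal.html'>Abnormal</a> <br />
Wheel:
<a href='none.html'>None</a>,<a href='squared.html'>Squared</a>,<a href='triangle.html'>Triangled</a> <br />
Level: <a href="beginner.html">Beginner</a><br />
</p>"""


def links_after(soup, label):
    """Return (text, href) pairs for the <a> tags between label and the next <br>.

    links_after is a hypothetical helper written for this illustration.
    """
    # Anchor the pattern so that "Type:" does not also match "Main Type:".
    pattern = re.compile(r"^\s*" + re.escape(label))
    node = soup.find(string=pattern)
    found = []
    if node is None:
        return found
    for sibling in node.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip the "," and whitespace text nodes
        if sibling.name == "br":
            break  # the line for this label ends here
        if sibling.name == "a":
            found.append((sibling.get_text(strip=True), sibling.get("href")))
    return found


soup = BeautifulSoup(HTML, "html.parser")
for label in ["Type:", "Main Type:", "Wheel:", "Level:"]:
    links = links_after(soup, label)
    print(label, ", ".join("%s (%s)" % pair for pair in links))
```

Stopping at the br tag rather than at the first non-link tag means a label followed by a single link and a label followed by several links go through the same code path.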

Answer 2

You can do it like this:

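from bs4 import BeautifulSoup
import re

# html is assumed to hold the HTML string from the question
soup = BeautifulSoup(html)
# The pattern 'Type:' matches both the "Type:" and the "Main Type:" text nodes.
for elem in soup(string=re.compile(r'Type:')):
    print(elem.next_sibling.text, elem.next_sibling.get('href'))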

Then do the same for Size:
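For anyone who wants to try this without a separate file, here is a small self-contained sketch of the same idea. The html string is a shortened copy of the question's snippet, the result dictionary and the label clean-up are my own additions for illustration, and "html.parser" is just one possible parser choice.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (assumption: only three lines are kept).
html = """<p>
Type: <a href="wee.html">Tough</a><br />
Main Type:
<a href='abnormal.html'>Abnormal</a><br />
Level: <a href="beginner.html">Beginner</a><br />
</p>"""

soup = BeautifulSoup(html, "html.parser")

result = {}
# 'Type:' is case-sensitive, so it matches "Type:" and "Main Type:" but not "Level:".
for node in soup.find_all(string=re.compile(r"Type:")):
    label = node.strip().rstrip(":")      # "\nMain Type:\n" becomes "Main Type"
    link = node.find_next_sibling("a")    # first <a> sibling after the label
    if link is not None:
        result[label] = (link.get_text(strip=True), link.get("href"))

for label, (text, href) in result.items():
    print(label, text, href)
```

Keying the results by label makes it easy to look up a single category afterwards, for example result["Main Type"].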

Parsing links and strings from unstructured HTML data
