Question

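I have an excerpt from a local web page. I want to extract the people's names under the "Name" heading and all the office names under the "Offices" heading. The problem is that I don't know how to get this level of structured information out of the page. All I have managed is the code below, which prints the <b> of every <ul><li> on the page rather than only the specific ones I want (that is, only the names and offices).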

soup = BeautifulSoup(urllib2.urlopen('http://localhost/xyz.html'))
for x in soup.findAll('ul'):
    print x.li.b

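Here is an excerpt of the page. There is a lot of content before and after the "People and Offices" heading. I don't want to print any of that; I only care about this part: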

    <h3>...</h3>
    <p>....</p>
    <p>....</p>

    <h2>
       <span class="mw-headline" id="People and Offices">People and Offices</span>
       </h2>
    <p>..</p>
    <p>..</p>
    <ul>..</ul>
    <h3>
       <span class="mw-headline" id="Name">Name</span>
       </h3>
    <ul>
       <li>
           <b>Justin</b>
           "Description about justin........
       </li>
    </ul>
    <ul>
       <li>
           <b>Steve</b>
           "Description about steve........
       </li>
    </ul>

    <h3>
       <span class="mw-headline" id="Offices">Offices</span>
       </h3>
    <ul>
       <li>
           <b>Google HQ</b>
           "Address of google........
       </li>
    </ul>
    <ul>
       <li>
           <b>Microsoft HQ</b>
           "Address of microsoft........
       </li>
    </ul>

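I want to go only into the "People and Offices" section and print what is under "Name" and "Offices". The code above, however, prints everything on the page and does not restrict itself to "People and Offices". Expected output: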

Name,
Justin,
Steve,

==

Offices,
Google HQ,
Microsoft HQ,

Please help.

Answer 1

The string attribute can be used to extract the content of a given tag. For example, to get the content of <b>:

print x.li.b.string

So the code can be modified to:

soup = BeautifulSoup(urllib2.urlopen('http://localhost/xyz.html'))

text = soup.find(text='Name')

for x in text.findAllNext('ul'):
    print x.li.b.string,","

text = soup.find(text='Offices')

for x in text.findAllNext('ul'):
    print x.li.b.string,","

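soup.find(text='Name') returns the text node that matches the text Name.

text.findAllNext('ul') then finds every ul that occurs after that text, in document order. Note that this includes the lists under later headings as well, so the first loop also prints the offices.

The same applies to Offices.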
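Because findAllNext keeps going to the end of the document, one way to keep each heading's items separate is to ask for headings as well and stop at the first one. The following is only a sketch under some assumptions: it uses Python 3 and the bs4 package (find_all_next is the bs4 spelling of findAllNext), the HTML string is a trimmed copy of the excerpt from the question, and names_after is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the excerpt from the question (assumption: the real page has this shape).
HTML = """
<h2><span class="mw-headline" id="People and Offices">People and Offices</span></h2>
<p>..</p>
<ul>..</ul>
<h3><span class="mw-headline" id="Name">Name</span></h3>
<ul><li><b>Justin</b> Description about justin</li></ul>
<ul><li><b>Steve</b> Description about steve</li></ul>
<h3><span class="mw-headline" id="Offices">Offices</span></h3>
<ul><li><b>Google HQ</b> Address of google</li></ul>
<ul><li><b>Microsoft HQ</b> Address of microsoft</li></ul>
"""


def names_after(soup, heading_text):
    """Return the <b> text of each <ul> after the heading, up to the next heading."""
    start = soup.find(string=heading_text)  # exact match on the text node
    if start is None:
        return []
    found = []
    # Ask for headings as well as lists, so the loop can stop
    # where the next section begins.
    for tag in start.find_all_next(['h2', 'h3', 'ul']):
        if tag.name != 'ul':
            break
        bold = tag.find('b')
        if bold is not None:  # skip lists without a <b> entry
            found.append(bold.get_text(strip=True))
    return found


soup = BeautifulSoup(HTML, 'html.parser')
for heading in ('Name', 'Offices'):
    print(heading + ',')
    for item in names_after(soup, heading):
        print(item + ',')
```

The exact string match means 'Offices' does not match the "People and Offices" heading, so each call starts at the intended h3.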

Answer 2

id="([^"]*)"(?:(?!<span[^>]*>).)*?<b>(.*?)<\/b>(?:(?!<span[^>]*>).)*?<b>(.*?)<\/b>

Although you should not parse HTML with regular expressions, you can use this as a hack or quick fix.

Use it like this:

import re

print re.findall(r'id="([^"]*)"(?:(?!<span[^>]*>).)*?<b>(.*?)<\/b>(?:(?!<span[^>]*>).)*?<b>(.*?)<\/b>', str, re.DOTALL)

Here str is your HTML excerpt.

See the demo.

http://regex101.com/r/vR4fY4/4
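To make the behaviour of this pattern concrete, here is a small Python 3 sketch that runs it against a trimmed copy of the excerpt from the question. The HTML string and the variable names are assumptions for illustration (a name other than str is used so the built-in is not shadowed). Note that the pattern captures exactly two <b> entries per heading, so it would need changing for sections with more or fewer items.

```python
import re

# Trimmed copy of the excerpt from the question (assumption: the real page has this shape).
HTML = """
<h2><span class="mw-headline" id="People and Offices">People and Offices</span></h2>
<p>..</p>
<ul>..</ul>
<h3><span class="mw-headline" id="Name">Name</span></h3>
<ul><li><b>Justin</b> Description about justin</li></ul>
<ul><li><b>Steve</b> Description about steve</li></ul>
<h3><span class="mw-headline" id="Offices">Offices</span></h3>
<ul><li><b>Google HQ</b> Address of google</li></ul>
<ul><li><b>Microsoft HQ</b> Address of microsoft</li></ul>
"""

# Same pattern as in the answer. The (?!<span[^>]*>) lookahead stops a match
# from running past the next <span>, so a heading with no <b> entries of its
# own (here "People and Offices") produces no match.
PATTERN = re.compile(
    r'id="([^"]*)"(?:(?!<span[^>]*>).)*?<b>(.*?)<\/b>'
    r'(?:(?!<span[^>]*>).)*?<b>(.*?)<\/b>',
    re.DOTALL,
)

matches = PATTERN.findall(HTML)
for heading, first, second in matches:
    print(heading + ',')
    print(first + ',')
    print(second + ',')
```

Each match is a tuple of the heading id followed by the two captured <b> texts.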

Answer 3

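Your HTML structure is very linear. There is no boundary between "Name" and "Offices". If you have no control over the structure of this HTML file, you have to get the element with id="Name" and search element by element until you hit "Offices"; that gives you all the names. Similarly, start from id="Offices" and go to the end to get all the offices.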

You can get the element with

soup.find(id="Name")

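If you do have control over the structure of the HTML, make sure the names and the offices each sit under one group (a div or span) so that you can extract them easily.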
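The "start at the id and walk until the next heading" idea can be sketched as follows. This is a minimal illustration rather than a definitive implementation: it assumes Python 3 and the bs4 package, the HTML string is a trimmed copy of the excerpt from the question, and items_under is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the excerpt from the question (assumption: the real page has this shape).
HTML = """
<h2><span class="mw-headline" id="People and Offices">People and Offices</span></h2>
<p>..</p>
<ul>..</ul>
<h3><span class="mw-headline" id="Name">Name</span></h3>
<ul><li><b>Justin</b> Description about justin</li></ul>
<ul><li><b>Steve</b> Description about steve</li></ul>
<h3><span class="mw-headline" id="Offices">Offices</span></h3>
<ul><li><b>Google HQ</b> Address of google</li></ul>
<ul><li><b>Microsoft HQ</b> Address of microsoft</li></ul>
"""


def items_under(soup, heading_id):
    """Walk the siblings after the heading that holds heading_id until the next heading."""
    span = soup.find(id=heading_id)
    if span is None:
        return []
    heading = span.find_parent(['h2', 'h3'])  # the span sits inside the heading tag
    if heading is None:
        return []
    items = []
    # True restricts the walk to tags, skipping whitespace text nodes.
    for sibling in heading.find_next_siblings(True):
        if sibling.name in ('h2', 'h3'):
            break  # the next section starts here
        if sibling.name == 'ul':
            bold = sibling.find('b')
            if bold is not None:
                items.append(bold.get_text(strip=True))
    return items


soup = BeautifulSoup(HTML, 'html.parser')
print(items_under(soup, 'Name'))
print(items_under(soup, 'Offices'))
```

Walking siblings rather than all following elements keeps the loop at the same nesting level as the headings, which fits the flat structure described above.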

Update:

I hope this solution does what you need. Change the code to suit your requirements!

from BeautifulSoup import BeautifulSoup
import urllib2


def scrap(url):

    soup = BeautifulSoup(urllib2.urlopen(url))

    break_tags = ['h2', 'h3']

    h2s = soup.findAll("h2")

    do_tag = None

    for tag in h2s:
        if tag.find(id="Do"):
            do_tag = tag
            break

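    if do_tag is None:
        # No <h2> contains the id we are looking for; nothing to print.
        return

    next_tag = do_tag.findNextSibling()
    # Stop when the siblings run out instead of failing on None.
    while next_tag is not None: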
        if next_tag.name in ['div', 'p']:
            next_tag = next_tag.findNextSibling()
            continue
        if next_tag.name in break_tags:
            break

        #if next_tag.name in ['ul']:
        #    print next_tag.name

        #
        # The structure of the HTML is very bad.. no consistency !
        if next_tag.find("span"):
            print next_tag.find("span").span.text
        else:
            print next_tag.li.b.text

        next_tag = next_tag.findNextSibling()

Answer 4

From the HTML you have given, I think you can do it like this:

for h3 in soup.find_all('h3'):
    print h3.text.strip()
    next_tag = h3.findNextSibling()
    # Check for None first so a heading with no following sibling does not raise.
    while next_tag is not None and next_tag.name == 'ul':
        print next_tag.li.b.text
        next_tag = next_tag.findNextSibling()
    print '====='

This is a generic approach. You may need to modify it to fit your HTML.

Scrape data only after a certain text starts