Question

I am using BeautifulSoup to find the table defined under "Content Logical Definition" on each of the following pages:

1) https://www.hl7.org/fhir/valueset-account-status.html
2) https://www.hl7.org/fhir/valueset-activity-reason.html
3) https://www.hl7.org/fhir/valueset-age-units.html

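A page can define several tables. The one I want sits under the <h2> tag with the text "Content Logical Definition". Some pages have no table in the "Content Logical Definition" section, and for those I want the result to be null. I have tried several solutions so far, but each of them returns the wrong table for some pages.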

The latest solution, provided by alecxe, is:

import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.hl7.org/fhir/valueset-activity-reason.html',
    'https://www.hl7.org/fhir/valueset-age-units.html'
]

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')

    h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
    table = None
    for sibling in h2.find_next_siblings():
        if sibling.name == "table":
            table = sibling
            break
        if sibling.name == "h2":
            break
    print(table)

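This solution returns null when the "Content Logical Definition" section has no table, but for the second URL, which does have a table in that section, it returns the wrong one: the table at the end of the page. How can I edit this code so that it returns the table defined right after the tag with the text "Content Logical Definition", and null if the section contains no table?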

Answer 1

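It looks like the problem with alecxe's code is that it only returns a table that is a direct sibling of the h2, whereas the one you want is actually inside a div (which is itself a sibling of the h2). This worked for me: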

import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.hl7.org/fhir/valueset-account-status.html',
    'https://www.hl7.org/fhir/valueset-activity-reason.html',
    'https://www.hl7.org/fhir/valueset-age-units.html'
]


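def extract_table(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')

    # Locate the <h2> heading that opens the "Content Logical Definition" section.
    h2 = soup.find(lambda elm: elm.name == 'h2' and 'Content Logical Definition' in elm.text)
    if h2 is None:
        return None
    # The table is nested inside the <div> that follows the heading, not a direct sibling.
    div = h2.find_next_sibling('div')
    return div.find('table') if div is not None else None


for url in urls:
    print(extract_table(url))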
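
If some pages have no table in that section, a variant that stops at the next h2 may be safer, since it never picks up a table from a later section. The following is only a minimal sketch: the helper name find_section_table and the two sample HTML strings are made up for illustration, and it assumes the section's content consists of siblings of the h2, which I have not verified for every HL7 page. It uses the built-in 'html.parser' so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up sample pages (assumption: the real pages follow a similar layout).
SAMPLE_WITH_TABLE = """
<h2>Content Logical Definition</h2>
<div><table id="wanted"><tr><td>code</td></tr></table></div>
<h2>Expansion</h2>
<table id="other"><tr><td>x</td></tr></table>
"""

SAMPLE_WITHOUT_TABLE = """
<h2>Content Logical Definition</h2>
<p>No table in this section.</p>
<h2>Expansion</h2>
<table id="other"><tr><td>x</td></tr></table>
"""


def find_section_table(html, heading_text="Content Logical Definition"):
    """Return the first table between the matching <h2> and the next <h2>, else None."""
    soup = BeautifulSoup(html, "html.parser")
    h2 = soup.find(lambda el: el.name == "h2" and heading_text in el.get_text())
    if h2 is None:
        return None
    for sibling in h2.find_next_siblings():
        if sibling.name == "h2":
            # The next section starts here, so stop looking.
            break
        if sibling.name == "table":
            return sibling
        # The table may be nested inside a wrapper such as a <div>.
        nested = sibling.find("table")
        if nested is not None:
            return nested
    return None


if __name__ == "__main__":
    print(find_section_table(SAMPLE_WITH_TABLE))
    print(find_section_table(SAMPLE_WITHOUT_TABLE))
```

For the real pages you would pass r.content from requests.get(url) instead of the sample strings.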
