Question

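I need to get the data that follows a link whose text is "map", but it doesn't work when the data after the link is colored. How can I get it?

Currently I'm using next_sibling, but it only picks up the data points that aren't red.

The HTML looks like this. I can read the number from here: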


    <a href="http://scedc.caltech.edu/recent/Maps/118-36.html" class="link2">map</a>
    " 2.8 "

but not from here:


    <a href="http://scedc.caltech.edu/recent/Maps/118-36.html" class="link2">map</a>
    <font color="red">3.1</font>


    soup=BeautifulSoup(page.content, 'html.parser')
    tags = soup.find_all("a",{'class': 'link2'})
    output=open("file.txt","w")

    for i in tags:

        if i.get_text()=="map":
            # prints each next_sibling
            print(i.next_sibling)
            # Extracts text if needed.
            try:
                output.write(i.next_sibling.get_text().strip()+"\n")
            except AttributeError:
                output.write(i.next_sibling.strip()+"\n")
    output.close()

The program writes every number that isn't red and leaves a blank line wherever there is a red number. I'd like it to output all of them.

Answer 1

There may be a better approach if we could see more of the HTML tree, but given the snippets of HTML you've shown us, here is one approach that works.

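    from bs4 import BeautifulSoup
    from bs4.element import Tag

    html = """<a href="http://scedc.caltech.edu/recent/Maps/118-36.html" class="link2">map</a>2.8
        <a href="http://scedc.caltech.edu/recent/Maps/118-37.html" class="link2">map</a>
        <font color="red">3.1</font>"""

    soup = BeautifulSoup(html, 'html.parser')
    tags = soup.find_all("a", {'class': 'link2'})

    # The with block closes the file even if an exception is raised.
    with open("file.txt", "w") as output:
        for i in tags:
            if i.get_text() == "map":
                siblings = list(i.next_siblings)
                # Normally the first sibling is the text node right after the link.
                first = siblings[0] if siblings else ""
                if isinstance(first, Tag):
                    map_sibling_text = first.get_text().strip()
                else:
                    map_sibling_text = first.strip()
                # A red value sits in a <font> tag that follows a whitespace-only text node.
                if map_sibling_text == '' and len(siblings) > 1:
                    if isinstance(siblings[1], Tag) and siblings[1].name == 'font':
                        map_sibling_text = siblings[1].get_text().strip()
                output.write("{0}\n".format(map_sibling_text))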
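The same idea can be sketched as a small helper that does not depend on the exact position of the `<font>` tag. This is only an illustration under stated assumptions: `text_after` is a hypothetical name, not part of BeautifulSoup, and it assumes that each value appears before the next `<a>` tag among the link's siblings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

def text_after(tag):
    """Return the first non-blank text among the siblings that follow `tag`.

    Hypothetical helper: stops at the next <a> so one link never picks up
    a value that belongs to the following link.
    """
    for sib in tag.next_siblings:
        if isinstance(sib, Tag) and sib.name == "a":
            break
        # Tags (such as <font>) expose their text via get_text();
        # plain text nodes are strings already.
        text = sib.get_text() if isinstance(sib, Tag) else str(sib)
        text = text.strip()
        if text:
            return text
    return ""

html = """<a href="http://scedc.caltech.edu/recent/Maps/118-36.html" class="link2">map</a> 2.8
<a href="http://scedc.caltech.edu/recent/Maps/118-36.html" class="link2">map</a>
<font color="red">3.1</font>"""

soup = BeautifulSoup(html, "html.parser")
values = [text_after(a) for a in soup.find_all("a", class_="link2")
          if a.get_text() == "map"]
print(values)
```

Because whitespace-only text nodes are skipped, the red and non-red cases go through the same code path.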

Answer 2

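It depends on what your HTML looks like as a whole. For example, is that class name always associated with an a tag? You may be able to do something like the following. Requires bs4 4.7.1 or later.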

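    from bs4 import BeautifulSoup as bs

    html = '''

    <a href="http://scedc.caltech.edu/recent/Maps/118-36.html" class="link2">map</a>
        " 2.8 "
    <a href="http://scedc.caltech.edu/recent/Maps/118-36.html" class="link2">map</a>
    <font color="red">3.1</font>

    '''
    # The 'lxml' parser needs the lxml package; 'html.parser' also works here.
    soup = bs(html, 'lxml')
    # First selector: links NOT immediately followed by <font> (value is the next text node).
    # Second selector: the <font> tags that immediately follow a link (value is the tag's text).
    data = [item.next_sibling.strip() if item.name == 'a' else item.text.strip()
            for item in soup.select('.link2:not(:has(+font)), .link2 + font')]
    print(data)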
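To make the selector easier to follow, here is a sketch that runs its two halves separately and also shows why next_sibling alone comes back blank for the red value. It assumes bs4 4.7.1 or later (where select() supports :has and :not) and uses 'html.parser' so that lxml is not required; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = (
    '<a href="http://scedc.caltech.edu/recent/Maps/118-36.html" class="link2">map</a> 2.8\n'
    '<a href="http://scedc.caltech.edu/recent/Maps/118-36.html" class="link2">map</a>\n'
    '<font color="red">3.1</font>'
)
soup = BeautifulSoup(html, "html.parser")

# The text node between the second link and <font> is only a newline,
# which is why next_sibling yields an empty string after strip().
second_link = soup.find_all("a")[1]
print(repr(second_link.next_sibling))

# Half 1: links whose next element sibling is not <font>.
plain_links = soup.select(".link2:not(:has(+ font))")
plain_texts = [a.next_sibling.strip() for a in plain_links]

# Half 2: <font> tags that directly follow a link.
red_tags = soup.select(".link2 + font")
red_texts = [f.get_text().strip() for f in red_tags]

print(plain_texts, red_texts)
```

Combining both halves in one select() call, as in the answer above, returns the matches in document order, so the values stay in the order they appear on the page.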

Using next_sibling with font color in BS4
