Question

I am currently trying to read the text between two tags on a web page.

Here is my code so far:

soup = BeautifulSoup(r.text, 'lxml')

text = soup.text

tag_one = soup.select_one('div.first-header')
tag_two = soup.select_one('div.second-header')

# Attempt: split the page text on the two tags
text = text.split(tag_one)[1]
text = text.split(tag_two)[0]

print(text)

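Basically, I am trying to get the text between the first header and the second header by identifying their tags. I plan to do this by splitting on the first tag and then on the second tag. Is this possible? Is there a smarter way?

Example: if you look at https://en.wikipedia.org/wiki/Python_(programming_language)

I would like to find a way to extract the text under "History" by identifying the tags for "History" and "Features and philosophy" and splitting on those tags.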

Answer 1

You can't do this the way you want, because BS4 works on a DOM (a tree structure), not on a linear object.
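To illustrate why the split in the question cannot work as written, here is a minimal sketch. The HTML snippet and the helper name try_split_on_tag are made up for the illustration, and html.parser is used only to avoid the lxml dependency. select_one() returns a tree node (a Tag), while str.split() only accepts a string separator.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, standing in for the page in the question
html = """
<div class="first-header">History</div>
<p>Python was conceived in the late 1980s.</p>
<div class="second-header">Features</div>
"""
soup = BeautifulSoup(html, "html.parser")
tag_one = soup.select_one("div.first-header")  # a Tag node, not a string


def try_split_on_tag(text, tag):
    """Return the name of the exception raised by text.split(tag), or None."""
    try:
        text.split(tag)
    except TypeError as exc:
        return type(exc).__name__
    return None


split_error = try_split_on_tag(soup.get_text(), tag_one)
print(split_error)
```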

Using your Wikipedia example, what you are really looking for is:

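Find the element with id="History" (it is a span).
Navigate up to its H2 element; remember this as the starting point.
Find the element with id="Features_and_philosophy" (this is another span).
Navigate up to the nearest H2 element; remember this as the end point.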

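Now, note that the two H2 elements are siblings (they have the same parent). So what you need to do is take every sibling between the start H2 and the end H2, and get the full text of each one.

This isn't hard, but it is a loop in which you check each sibling until you reach the end. It's not as simple as you hoped.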
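The steps above could be sketched roughly like this. The HTML is a made-up stand-in that assumes the span-inside-H2 layout described in this answer, and text_between_headings is a hypothetical helper name, not something BS4 provides.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up markup assuming the layout described above: each H2 wraps a span with an id
html = """
<div>
  <h2><span id="History">History</span></h2>
  <p>Python was conceived in the late 1980s.</p>
  <p>Python 2.0 was released in 2000.</p>
  <h2><span id="Features_and_philosophy">Features and philosophy</span></h2>
  <p>Python is a multi-paradigm language.</p>
</div>
"""


def text_between_headings(soup, start_id, end_id):
    """Collect the text of every sibling element between two H2 headings."""
    start = soup.find(id=start_id).find_parent("h2")
    end = soup.find(id=end_id).find_parent("h2")
    parts = []
    for sibling in start.next_siblings:
        if sibling is end:  # identity check: stop at the end heading itself
            break
        if isinstance(sibling, Tag):  # skip whitespace-only text nodes
            parts.append(sibling.get_text(strip=True))
    return parts


soup = BeautifulSoup(html, "html.parser")
history = text_between_headings(soup, "History", "Features_and_philosophy")
print("\n".join(history))
```

The loop compares with "is" rather than "==" because BS4 compares tags by content, so two identical-looking headings would otherwise be treated as the same stop point.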

The general case is much harder (or at least tedious), because you may have to move up and down the DOM tree to find the matching elements.
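One possible way to handle that general case, swapped in here as an alternative to walking siblings, is to traverse the document in order with next_elements and collect text nodes until the end tag is reached. This is only a sketch: the markup is invented so that the two headers sit under different parents, and text_between is a hypothetical helper.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Invented markup: the two headers do NOT share a parent
html = """
<section><div class="first-header">History</div><p>First paragraph.</p></section>
<section><p>Second paragraph.</p><div class="second-header">Features</div><p>Not included.</p></section>
"""


def text_between(start, end):
    """Join all text found after `start` and before `end`, in document order."""
    parts = []
    for element in start.next_elements:
        if element is end:
            break
        if any(parent is start for parent in element.parents):
            continue  # still inside the start tag itself, e.g. its own label
        if isinstance(element, NavigableString):
            text = element.strip()
            if text:
                parts.append(text)
    return " ".join(parts)


soup = BeautifulSoup(html, "html.parser")
start = soup.select_one("div.first-header")
end = soup.select_one("div.second-header")
between_text = text_between(start, end)
print(between_text)
```

Because it ignores nesting, this version does not need the start and end tags to be siblings, at the cost of flattening the structure into plain text.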

Answer 2

BeautifulSoup 4.7+ has greatly improved CSS selector support. This task can be done with the CSS4 :has() selector, which BeautifulSoup now supports:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://en.wikipedia.org/wiki/Python_(programming_language)").text
soup = BeautifulSoup(html, "lxml")
# Every sibling after the "History" h2 that is still followed by the "Features and philosophy" h2
els = soup.select('h2:has(span#History) ~ *:has(~ h2:has(span#Features_and_philosophy))')
with open('text.txt', 'w', encoding='utf-8') as f:
    for el in els:
        text = el.get_text()
        print(text)
        f.write(text + '\n')  # also save the text to the opened file

Output:

 Guido van Rossum at OSCON 2006.Main article: History of PythonPython was conceived in the late 1980s[31] by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands as a successor to the ABC language (itself inspired by SETL)[32], capable of exception handling and interfacing with the Amoeba operating system.[7] Its implementation began in December 1989.[33] Van Rossum's long influence on Python is reflected in the title given to him by the Python community: Benevolent Dictator For Life (BDFL) –  a post from which he gave himself permanent vacation on July 12, 2018.[34]
Python 2.0 was released on 16 October 2000 with many major new features, including a cycle-detecting garbage collector and support for Unicode.[35]
Python 3.0 was released on 3 December 2008. It was a major revision of the language that is not completely backward-compatible.[36] Many of its major features were backported to Python 2.6.x[37] and 2.7.x version series.  Releases of Python 3 include the 2to3 utility, which automates (at least partially) the translation of Python 2 code to Python 3.[38]
Python 2.7's end-of-life date was initially set at 2015 then postponed to 2020 out of concern that a large body of existing code could not easily be forward-ported to Python 3.[39][40] In January 2017, Google announced work on a Python 2.7 to Go transcompiler to improve performance under concurrent workloads.[41]
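To try the selector without a network request, here is a small offline sketch. The HTML is made up and assumes the same span-inside-h2 layout that the selector above targets; the live page's markup may differ, in which case the ids and element names would need adjusting. html.parser is used only to keep the example free of the lxml dependency.

```python
from bs4 import BeautifulSoup

# Made-up markup mirroring the layout the selector expects
html = """
<div>
  <h2><span id="Intro">Intro</span></h2>
  <p>Intro text.</p>
  <h2><span id="History">History</span></h2>
  <p>First history paragraph.</p>
  <p>Second history paragraph.</p>
  <h2><span id="Features_and_philosophy">Features and philosophy</span></h2>
  <p>Features text.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Left part: any later sibling of the "History" h2.
# Right part: only keep those that still have the "Features" h2 after them.
selector = "h2:has(span#History) ~ *:has(~ h2:has(span#Features_and_philosophy))"
between = [el.get_text() for el in soup.select(selector)]
print(between)
```

Note that this only works when the elements you want are siblings of both headings; content nested under different parents would not be matched.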

Split by bs4 tags / get the text between two tags
