Question

I wrote a script to scrape the independence dates of several countries from Wikipedia.

For example, for Kazakhstan:

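import re

import requests
from bs4 import BeautifulSoup

URL_QS = 'https://en.wikipedia.org/wiki/Kazakhstan'
r = requests.get(URL_QS)
soup = BeautifulSoup(r.text, 'lxml')
independ_date = None  # stays None if nothing is found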

# Only keep the infobox (top right)
infobox = soup.find("table", class_="infobox geography vcard")

if infobox:
    formation = infobox.find_next(text = re.compile("Formation"))

    if formation: 
        independence = formation.find_next(text = re.compile("independence")) 

        if independence:
            independ_date = independence.find_next("td").text
        else:
            independence = formation.find_next(text = re.compile("Independence"))

            if independence:
                independ_date = independence.find_next("td").text


print(independ_date)

I get the following output:

Almaty

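This value is not located in the infobox but in the text that follows it. That is because formation.find_next(text = re.compile("independence")) found something outside the infobox, but I don't understand why: shouldn't the search be limited to the infobox? How can I search only within that area?

Thanks in advance for your help!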

Answer 1

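"This is because formation.find_next(text = re.compile("independence")) found something outside the infobox"

find_next() walks forward through everything that comes after the starting node in the document, so it does not stop at the end of the table. Add .extract() to the soup.find() call: this detaches the infobox geography vcard element from the rest of the page, so the search can only run inside that element.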

infobox = soup.find("table", class_="infobox geography vcard").extract()
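To make the difference visible without hitting Wikipedia, here is a minimal sketch on a made-up page. SAMPLE_HTML and the lookup() helper are illustrative assumptions, not part of the original script, and the sketch uses the built-in "html.parser" and the newer string= keyword instead of 'lxml' and text=. It repeats the two-step find_next lookup from the question, once on the table still attached to the page and once on the extracted table.

```python
import re

from bs4 import BeautifulSoup

# Made-up stand-in for a country page: an infobox followed by body text.
SAMPLE_HTML = """
<html><body>
<table class="infobox geography vcard">
  <tr><th>Formation</th><td></td></tr>
  <tr><th>Independence from the Soviet Union</th><td>16 December 1991</td></tr>
</table>
<p>After independence the capital was moved.</p>
<table><tr><td>Almaty</td></tr></table>
</body></html>
"""


def lookup(infobox):
    """Repeat the question's two-step find_next lookup; return the <td> text or None."""
    formation = infobox.find_next(string=re.compile("Formation"))
    if formation is None:
        return None
    # Lowercase pattern, as in the question: it does not match "Independence".
    match = formation.find_next(string=re.compile("independence"))
    if match is None:
        return None
    cell = match.find_next("td")
    return cell.get_text(strip=True) if cell else None


# Table still attached to the page: find_next keeps walking past </table>.
attached = BeautifulSoup(SAMPLE_HTML, "html.parser").find(
    "table", class_="infobox geography vcard")

# Table detached with extract(): there is nothing after it to walk into.
detached = BeautifulSoup(SAMPLE_HTML, "html.parser").find(
    "table", class_="infobox geography vcard").extract()

leaked = lookup(attached)
scoped = lookup(detached)
print(leaked)  # Almaty
print(scoped)  # None
```

On the detached table the lowercase pattern simply finds nothing, which is the situation the question's else branch (searching for "Independence") was written for.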

Answer 2

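Your code looks for the value after the first occurrence of the word "independence", when it should be the second one. Also, the "Formation" string does not generalize, as I found when testing a few countries, so I think you can search for "Independence" from the start:

independ_date = None  # stays None if nothing is found
infobox = soup.find("table", class_="infobox geography vcard")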

if infobox:
    formation = infobox.find_next(text = re.compile("Independence"))

    if formation: 
        independence = formation.find_next(text = re.compile("independence")) 

        if independence:
            independence = infobox.find_next(text = re.compile("Independence"))
            independ_date = independence.find_next("td").text

print(independ_date)

For any country that has an independence date, this returns the first date in the independence section of its Wikipedia page.
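A possible variation, offered only as a sketch: find() searches just the descendants of the tag it is called on, so calling it on the infobox keeps the label search inside the table without extract(). The helper name independence_date_in and the sample HTML below are made up for illustration, and the check on cell.parents is an added safeguard for the case where the next td lies outside the infobox.

```python
import re

from bs4 import BeautifulSoup

# Made-up stand-in for a country page: an infobox followed by body text.
SAMPLE_HTML = """
<html><body>
<table class="infobox geography vcard">
  <tr><th>Independence from the Soviet Union</th><td>16 December 1991</td></tr>
</table>
<p>After independence the capital was moved.</p>
<table><tr><td>Almaty</td></tr></table>
</body></html>
"""


def independence_date_in(infobox):
    """Return the text of the first <td> after an 'independence' label, or None.

    Hypothetical helper: both the label and the cell must be inside `infobox`.
    """
    # find() only looks at descendants of `infobox`; IGNORECASE covers both spellings.
    label = infobox.find(string=re.compile("independence", re.IGNORECASE))
    if label is None:
        return None
    # find_next() may still leave the table, so check where the cell lives.
    cell = label.find_next("td")
    if cell is None or not any(parent is infobox for parent in cell.parents):
        return None
    return cell.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
infobox = soup.find("table", class_="infobox geography vcard")
independ_date = independence_date_in(infobox) if infobox else None
print(independ_date)  # 16 December 1991
```

On a real Wikipedia page the infobox layout varies by country, so the first td after the label is not guaranteed to hold the date you want.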

Python & Beautiful Soup: search only within a certain class
