Question

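I am trying to scrape the content of a cell that sits beside another cell whose name I know, e.g. "Staatsform", "Amtssprache", "Postleitzahl", etc. In the picture, the desired content is always in the right-hand cell.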

The basic code is below, but I am stuck:

cell.getValue = { (xValue) in
    // do what you want to do
}

Many thanks in advance!

Answer 1

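I wanted to be careful to limit the search to what the English Wikipedia calls the "infobox". So I first search for the heading 'Basisdaten' and require that it sits in a th element. That is perhaps not completely definitive, but it makes a correct match more likely. Having found it, I walk through the tr elements that follow 'Basisdaten' until I reach another tr that contains a (presumably different) heading. In this case I search for 'Postleitzahlen:', but the same approach can find any or all of the items between 'Basisdaten' and the next heading.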

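PS: I should also mention the reason for `if not current.name`. I noticed that some of the siblings consist only of newlines, which BeautifulSoup treats as strings. These have no name, so the code has to treat them specially.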

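import requests
import bs4

page = requests.get('https://de.wikipedia.org/wiki/Hamburg').text
soup = bs4.BeautifulSoup(page, 'lxml')


def getInfoBoxBasisDaten(s):
    # Match the string 'Basisdaten' only when it sits directly inside a th element.
    return str(s) == 'Basisdaten' and s.parent.name == 'th'


basisdaten = soup.find_all(string=getInfoBoxBasisDaten)[0]

wanted = 'Postleitzahlen:'
# Start at the sibling that follows the tr holding the 'Basisdaten' heading.
current = basisdaten.parent.parent.next_sibling
while current is not None:
    # Bare newlines between rows are strings without a name; skip them.
    if not current.name:
        current = current.next_sibling
        continue
    if wanted in current.text:
        items = current.find_all('td')
        print(items[0])
        print(items[1])
    # A row containing a th with attributes marks the next heading, so stop there.
    if '<th ' in str(current):
        break
    current = current.next_sibling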

The result looks like this: two separate td elements, as requested.

<td><a href="/wiki/Postleitzahl_(Deutschland)" title="Postleitzahl (Deutschland)">Postleitzahlen</a>:</td>
<td>20095–21149,<br/>
22041–22769,<br/>
<a href="/wiki/Neuwerk_(Insel)" title="Neuwerk (Insel)">27499</a></td>
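
To see the whitespace issue from the PS in isolation, here is a minimal sketch on a small hand-written table. The HTML snippet is made up for illustration and is not taken from the Wikipedia page, and it uses the built-in `html.parser` so that `lxml` is not needed. It also shows `find_next_sibling`, which only returns tags and therefore skips the newline strings without a manual check:

```python
import bs4

# Hypothetical miniature infobox, only for illustration.
html = """<table>
<tr><th colspan="2">Basisdaten</th></tr>
<tr><td>Amtssprache:</td><td>Deutsch</td></tr>
<tr><td>Postleitzahlen:</td><td>20095–21149</td></tr>
<tr><th colspan="2">Politik</th></tr>
</table>"""

soup = bs4.BeautifulSoup(html, 'html.parser')
header_row = soup.find('th').parent

# The direct next sibling is the newline between the two <tr> tags.
# It is a NavigableString rather than a Tag, so it has no tag name.
first = header_row.next_sibling
is_plain_string = isinstance(first, bs4.NavigableString)

# find_next_sibling('tr') only considers tags, so the newline is skipped.
row = header_row.find_next_sibling('tr')
cells = [td.get_text() for td in row.find_all('td')]

print(repr(first), is_plain_string)
print(cells)
```

Whether to walk `next_sibling` by hand or to use `find_next_sibling` is mostly a matter of taste; the second form simply removes the need for the name check.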

Answer 2

This works most of the time:

def get_content_from_right_column_for_left_column_containing(text):
    """return the text contents of the cell adjoining a cell that contains `text`"""

    # `string=` is the current name of the older `text=` argument
    navigable_strings = soup.find_all(string=text)

    if len(navigable_strings) > 1:
        raise Exception('more than one element with that text!')

    if len(navigable_strings) == 0:

        # left-column contents that are links don't have a colon in their text content...
        if ":" in text:
            altered_text = text.replace(':', '')

        # but `td`s and `th`s do.
        else:
            altered_text = text + ":"

        navigable_strings = soup.find_all(string=altered_text)

    try:
        return navigable_strings[0].find_parent('td').find_next('td').text
    except IndexError:
        raise IndexError('there are no elements containing that text.')
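
For trying the idea out without fetching the page, here is a hedged, self-contained sketch. The function name `value_next_to` and the sample HTML are hypothetical and not part of the answer above. Instead of matching an exact string, it compares the whole text of each `td` with the label (ignoring a trailing colon), so a label wrapped in a link is handled the same way as a plain one. It assumes the label and its value are sibling `td` cells in the same row.

```python
import bs4

# Hypothetical sample rows modelled on the output shown earlier in the thread.
html = """<table>
<tr><td><a href="/wiki/Postleitzahl_(Deutschland)">Postleitzahlen</a>:</td>
<td>20095–21149,<br/>22041–22769</td></tr>
<tr><td>Amtssprache:</td><td>Deutsch</td></tr>
</table>"""


def value_next_to(soup, label):
    """Return the text of the td that follows the td whose text is `label`.

    A trailing colon on either side is ignored; returns None if no cell matches.
    """
    wanted = label.strip().rstrip(':')
    for cell in soup.find_all('td'):
        if cell.get_text().strip().rstrip(':') == wanted:
            neighbour = cell.find_next_sibling('td')
            if neighbour is not None:
                # Join the text fragments around <br/> with single spaces.
                return neighbour.get_text(' ', strip=True)
    return None


soup = bs4.BeautifulSoup(html, 'html.parser')
print(value_next_to(soup, 'Postleitzahlen'))
print(value_next_to(soup, 'Amtssprache:'))
```

Returning `None` for a missing label, rather than raising, is a design choice of this sketch; raising an exception as in the function above works just as well if a missing label should be treated as an error.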

Beautifulsoup: scrape the content of a cell beside another one
