Question

Scraping a column from a Wikipedia table with Beautifulsoup returns only the last row, but I want all the values in that column in a list:

from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "https://en.wikipedia.org/wiki/Agriculture_in_India"
html = urlopen(site)
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {'class': 'wikitable sortable'})

for row in table.find_all("tr")[1:]:
    col = row.find_all("td")
    if len(col) > 0:
        com = str(col[1].string.strip("\n"))

        list(com)
com

Out: 'ZTS'

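So it only shows the string from the last row. I would like to get a list in which each row's string is a separate string value, so that I can assign the list to a new variable.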

"Rice", "Buffalo milk", "Cow milk", "Wheat"

Can anyone help me?

Answer 1

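Your approach doesn't work because you never append anything to com. Each pass through the loop overwrites it, so only the value from the last row survives.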
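To make the difference concrete, here is a minimal sketch that needs no network access. The `cells` list is a made-up stand-in for the scraped values, and the names `last_only` and `collected` are only illustrative.

```python
# Hypothetical stand-in for the values scraped from the table column.
cells = ["Rice", "Buffalo milk", "Cow milk", "Wheat"]

# Reassigning a variable inside the loop keeps only the last value.
for cell in cells:
    last_only = cell

# Appending to a list created before the loop keeps every value.
collected = []
for cell in cells:
    collected.append(cell)

print(last_only)   # Wheat
print(collected)   # ['Rice', 'Buffalo milk', 'Cow milk', 'Wheat']
```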

One way to do what you want is:

from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "https://en.wikipedia.org/wiki/Agriculture_in_India"
html = urlopen(site)
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {'class': 'wikitable sortable'})
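com = []  # collect every value here instead of overwriting one variable
for row in table.find_all("tr")[1:]:  # skip the header row
    col = row.find_all("td")
    if len(col) > 1:  # make sure the second cell exists
        temp = col[1].contents[0]  # first child of the second cell
        try:
            # If the child is a tag (e.g. a link), take its first child.
            to_append = temp.contents[0]
        except Exception:
            # Plain text has no .contents, so use it as-is.
            to_append = temp
        com.append(to_append)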

print(com)

This will give you the information you need.

Explanation

col[1].contents[0] gives the first child of the tag. .contents gives you a list of the tag's children. Here, we take the child at index 0.
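As a small illustration of what .contents returns, the sketch below parses a made-up two-row table (not the real Wikipedia page) in which one cell holds plain text and the other wraps its text in a link. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Made-up rows: one cell holds plain text, the other wraps it in a link.
html = """
<table>
  <tr><td>1</td><td>Rice</td></tr>
  <tr><td>2</td><td><a href="/wiki/Wheat">Wheat</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

plain_cell = rows[0].find_all("td")[1]
link_cell = rows[1].find_all("td")[1]

first_plain = plain_cell.contents[0]  # a NavigableString holding 'Rice'
first_link = link_cell.contents[0]    # a Tag: the <a> element

print(type(first_plain).__name__, str(first_plain))
print(type(first_link).__name__, first_link.name)
```

The first child is either a string or a tag, depending on how the cell is marked up.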

In some cases, the content inside the <td> tag is an <a href> tag, so I apply another .contents[0] to get the text.

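In other cases, it is not a link. For those I used an exception handler: if the child has no children of its own to extract, we get an exception and fall back to the child itself.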
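A possible alternative, which is not what the answer's code uses, is Beautiful Soup's get_text(), which returns the text of a cell whether or not it is wrapped in a link, so no try/except is needed. This is only a sketch run against a made-up table; the helper name `second_column` is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up table standing in for the Wikipedia one.
html = """
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Commodity</th></tr>
  <tr><td>1</td><td>Rice</td></tr>
  <tr><td>2</td><td><a href="/wiki/Wheat">Wheat</a></td></tr>
</table>
"""

def second_column(table):
    """Return the stripped text of the second cell of every data row."""
    values = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = row.find_all("td")
        if len(cells) > 1:
            # get_text() flattens nested tags such as <a> into plain text.
            values.append(cells[1].get_text(strip=True))
    return values

soup = BeautifulSoup(html, "html.parser")
commodities = second_column(soup.find("table"))
print(commodities)  # ['Rice', 'Wheat']
```

The values are plain Python strings, so the resulting list can be assigned to a new variable directly.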

For details, see the official documentation.

Converting a Beautifulsoup scraped table to a list
