Question

I wrote the following lines of code:

#!/usr/bin/python
#weather.scrapper

from bs4 import BeautifulSoup
import urllib

def main():
    """weather scrapper"""
    r = urllib.urlopen("https://www.wunderground.com/history/airport/KPHL/2016/1/1/MonthlyHistory.html?&reqdb.zip=&reqdb.magic=&reqdb.wmo=&MR=1").read()
    soup = BeautifulSoup(r, "html.parser")
    table = soup.find_all("table", class_="responsive airport-history-summary-table")
    tr = soup.find_all("tr")
    td = soup.find_all("td")
    print table


if __name__ == "__main__":
    main()

When I print the table, I also get all of the HTML (td, tr, span, etc.). How can I print the contents of the table (tr, td) without the HTML? Thanks!

Answer 1

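When you want only the content, you have to use the .getText() method. Since find_all returns a list of elements, you have to pick one of them (for example td[0]) before calling it.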
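As a small illustration of that point, here is a sketch that uses a made-up HTML snippet (the markup and values are placeholders, not the real Weather Underground page) and Python 3 print syntax. In bs4, get_text() is the same method as getText().

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = "<table><tr><td><span>Max</span></td><td>41</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

cells = soup.find_all("td")   # a list-like ResultSet of Tag objects
first = cells[0]              # pick one element out of the list

print(first)                  # the tag, markup included
print(first.get_text())       # only the text inside the tag

# To get the text of every cell, call get_text() on each element in turn.
texts = [cell.get_text() for cell in cells]
print(texts)
```

Calling get_text() on the result of find_all itself does not work, because the result is a list of tags rather than a single tag.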

Or you can do this:

for tr in soup.find_all("tr"):
    print '>>>> NEW row <<<<'
    print '|'.join([x.getText() for x in tr.find_all('td')])

The loop above prints each row with its cells side by side, separated by |.

Note that you are finding all td and all tr elements in the whole document, but you probably want only the ones that are inside the table.

If you want to look for elements inside the table, you have to do this:

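Use table.find('tr') instead of soup.find('tr'), so that BeautifulSoup looks for tr elements inside the table rather than in the whole HTML document. Here table has to be a single element, not the list returned by find_all.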
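To see the difference that scoping makes, consider this sketch with two made-up tables. The class names summary and other and the cell values are assumptions for illustration only, and the print calls use Python 3 syntax.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables; only the first one is of interest.
html = """
<table class="summary"><tr><td>Max</td><td>41</td></tr></table>
<table class="other"><tr><td>ignore me</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag (or None), unlike find_all().
table = soup.find("table", class_="summary")

all_rows = soup.find_all("tr")      # rows from every table on the page
table_rows = table.find_all("tr")   # rows from the chosen table only

print(len(all_rows), len(table_rows))
for tr in table_rows:
    print("|".join(td.get_text() for td in tr.find_all("td")))
```

Searching from soup picks up the row of the second table as well, while searching from table does not.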

Your modified code (based on your comment that there are more tables):

#!/usr/bin/python
#weather.scraper

from bs4 import BeautifulSoup
import urllib

def main():
    """weather scraper"""
    r = urllib.urlopen("https://www.wunderground.com/history/airport/KPHL/2016/1/1/MonthlyHistory.html?&reqdb.zip=&reqdb.magic=&reqdb.wmo=&MR=1").read()
    soup = BeautifulSoup(r, "html.parser")
    tables = soup.find_all("table")

    for table in tables:
        print '>>>>>>> NEW TABLE <<<<<<<<<'

        trs = table.find_all("tr")

        for tr in trs:
            # for each row of current table, write it using | between cells
            print '|'.join([x.get_text().replace('\n','') for x in tr.find_all('td')])



if __name__ == "__main__":
    main()
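The code above is Python 2. For Python 3, a rough equivalent might look like the sketch below. The helper name tables_as_rows and the sample markup are hypothetical, and unlike the loop above it skips rows that contain no td cells (for example header rows made only of th cells). It parses a string so it can be tried without a network connection.

```python
from bs4 import BeautifulSoup

def tables_as_rows(html):
    """Return one list per table; each holds its rows as lists of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for table in soup.find_all("table"):
        rows = []
        for tr in table.find_all("tr"):
            cells = [td.get_text().replace("\n", "") for td in tr.find_all("td")]
            if cells:  # skip rows without any <td> cells
                rows.append(cells)
        result.append(rows)
    return result

# Hypothetical sample input. On Python 3 the real page would be fetched with
# urllib.request.urlopen(url).read() instead of urllib.urlopen(url).read().
sample = "<table><tr><th>Day</th></tr><tr><td>1</td><td>41</td></tr></table>"

for rows in tables_as_rows(sample):
    print(">>>>>>> NEW TABLE <<<<<<<<<")
    for cells in rows:
        print("|".join(cells))
```

Returning the rows as lists instead of printing them directly keeps the parsing separate from the output, which makes it easier to write the data to a file later.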
