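Scraping every row from an HTML table

Time: 2019-06-11 12:56:01

Tags: python web-scraping html-table beautifulsoup python-requests

I am scraping an HTML table from a web page, but my code pulls the contents of the first row over and over instead of the unique values from each row. It seems the positional indexes (tds[0] through tds[5]) only ever refer to the first row, and I don't know how to tell the code to move on to each subsequent row.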

import requests
from bs4 import BeautifulSoup



headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}


url = 'https://www.fdic.gov/bank/individual/failed/banklist.html'
r = requests.get(url, headers = headers)

soup = BeautifulSoup(r.text, 'html.parser')


mylist5 = []
for tr in soup.find_all('table'):
    tds = tr.findAll('td')
    for x in tds:
        output5 = ("Bank: %s, City: %s, State: %s, Closing Date: %s, Cert #: %s, Acquiring Inst: %s \r\n" % (tds[0].text, tds[1].text, tds[2].text, tds[5].text, tds[3].text, tds[4].text))
        mylist5.append(output5)
        print(output5)

3 answers:

Answer 0 (score: 1)

I modified your code slightly: it skips the first row (the header) and then iterates row by row (tr) instead of only over td:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}

url = 'https://www.fdic.gov/bank/individual/failed/banklist.html'
r = requests.get(url, headers = headers)

soup = BeautifulSoup(r.text, 'html.parser')


mylist5 = []
# The outer loop visits each <table>; the inner loop visits each <tr> inside it.
for table in soup.find_all('table'):
    rows = table.find_all('tr')[1:]  # skip the header row
    for row in rows:
        cells = row.find_all('td')  # the cells of this row only
        output5 = ("Bank: %s, City: %s, State: %s, Closing Date: %s, Cert #: %s, Acquiring Inst: %s \r\n" % (cells[0].text, cells[1].text, cells[2].text, cells[5].text, cells[3].text, cells[4].text))
        mylist5.append(output5)
        print(output5)

Prints:

... etc.
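The per-row idea can also be checked offline, without a network request. The following is only a minimal sketch: the sample table is made up for illustration (it is not FDIC data), and rows_from_html is a hypothetical helper name, not something from the question or the answer.

```python
from bs4 import BeautifulSoup

# Made-up sample table for illustration only (not real FDIC data).
SAMPLE_HTML = """
<table>
  <tr><th>Bank</th><th>City</th><th>State</th><th>Cert</th><th>Acquirer</th><th>Closing Date</th></tr>
  <tr><td>Alpha Bank</td><td>Springfield</td><td>IL</td><td>111</td><td>Gamma Bank</td><td>May 1, 2019</td></tr>
  <tr><td>Beta Bank</td><td>Dover</td><td>DE</td><td>222</td><td>Delta Bank</td><td>June 2, 2019</td></tr>
</table>
"""


def rows_from_html(html):
    """Return one list of cell texts per data row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for table in soup.find_all("table"):
        for tr in table.find_all("tr"):
            # Look up <td> cells inside this row only, not in the whole table.
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:  # rows made only of <th> cells yield an empty list
                result.append(cells)
    return result


rows = rows_from_html(SAMPLE_HTML)
for cells in rows:
    print("Bank: %s, City: %s, State: %s" % (cells[0], cells[1], cells[2]))
```

Filtering out rows with no td cells, rather than slicing with [1:], is a design choice in this sketch: it skips header rows wherever they appear, at the cost of also skipping any genuinely empty row.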

Answer 1 (score: 0)

You can use find_all with list comprehensions:

import requests
from bs4 import BeautifulSoup as soup

d = soup(requests.get('https://www.fdic.gov/bank/individual/failed/banklist.html').text, 'html.parser')
# h holds the header cell texts; data holds one list of cell texts per row after the header row.
h = [i.text for i in d.find_all('th')]
data = [[i.text for i in b.find_all('td')] for b in d.find_all('tr')[1:]]

Output (shortened due to SO's character limit):
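Once the header texts and the row lists are available, they can be paired so that each row is addressed by column name rather than by position. This is a minimal sketch under stated assumptions: the sample table is made up (not FDIC data), and the records name is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up sample table for illustration only (not real FDIC data).
SAMPLE_HTML = """
<table>
  <tr><th>Bank Name</th><th>City</th><th>ST</th></tr>
  <tr><td>Alpha Bank</td><td>Springfield</td><td>IL</td></tr>
  <tr><td>Beta Bank</td><td>Dover</td><td>DE</td></tr>
</table>
"""

d = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Same two comprehensions as in the answer, applied to the sample table.
h = [th.text for th in d.find_all("th")]
data = [[td.text for td in tr.find_all("td")] for tr in d.find_all("tr")[1:]]

# Pair each row with the header names so cells can be looked up by column.
records = [dict(zip(h, row)) for row in data]

for record in records:
    print(record["Bank Name"], record["City"], record["ST"])
```

Note that zip stops at the shorter sequence, so a row with fewer cells than there are headers would silently produce a shorter dictionary.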

Answer 2 (score: 0)

Personally, I would use pandas here.
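The answer's own code did not survive in this copy, so the following is not the answerer's code. It is only a sketch of what a pandas approach could look like, assuming pandas.read_html is the intended tool and using a made-up sample table (not FDIC data). read_html needs an HTML parser backend such as lxml to be installed.

```python
from io import StringIO

import pandas as pd

# Made-up sample table for illustration only (not real FDIC data).
SAMPLE_HTML = """
<table>
  <tr><th>Bank Name</th><th>City</th><th>ST</th></tr>
  <tr><td>Alpha Bank</td><td>Springfield</td><td>IL</td></tr>
  <tr><td>Beta Bank</td><td>Dover</td><td>DE</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(SAMPLE_HTML))
df = tables[0]

# Each DataFrame row corresponds to one <tr> of data cells.
for _, row in df.iterrows():
    print("Bank: %s, City: %s, State: %s" % (row["Bank Name"], row["City"], row["ST"]))
```

read_html also accepts a URL in place of the StringIO object, so the same two lines could in principle be pointed at the page from the question.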