How do I iterate over tags with BS4?

Time: 2017-03-29 14:20:10

Tags: python beautifulsoup

I'm having some trouble extracting specific elements with BS4. This is taken from the Texas Department of Corrections Executed Inmates page.

I've attached a screenshot for better understanding.

Each tr tag contains multiple td tags holding text such as the first name, last name, TDCJ number, age, date, and so on.

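How can I make BS4 skip the first tr tag (the first tr tag holds the column names) and, for each subsequent tr tag, extract the text from its td tags?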

from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

def main():
    gettabledata()

lstofinmates = list()

def gettabledata():
    with urlopen('https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html') as response:
        # Parse inside the with block, while the response is still open
        soup = BeautifulSoup(response, 'html.parser')

    with open('exinmates.csv', 'w', newline='') as output_file:
        inmate_file_writer = csv.DictWriter(output_file,
                                           fieldnames=['First Name', 'Last Name', 'Execution Number',
                                                       'Last Statement', 'TDCJ Number', 'Age', 'Date Executed', 'Race',
                                                       'County'],
                                           extrasaction='ignore',
                                           delimiter=',', quotechar='"')
        inmate_file_writer.writeheader()
        table = soup.find('table').find('tbody')
        print(table)

if __name__ == '__main__':
    main()

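I'm thinking of building a list-of-dicts (LOD) structure, where each dictionary corresponds to one inmate's information: the text from the td fields goes into a dictionary, and each dictionary is appended to a list. The problem is that I can't find a way to skip the first tr tag, or to iterate over the remaining tr tags and load them into dictionaries. Any suggestions/help? Thanks!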

1 Answer:

Answer 0: (score: 1)

Here's something to get you started:

from bs4 import BeautifulSoup
html = '''<h1>Executed Offenders</h1>
<table class="os" width="100%">
  <tbody>
      <tr><th scope="col">Execution</th><th scope="col">Link</th><th scope="col">Link</th><th scope="col">Last Name</th><th scope="col">First Name</th><th scope="col">TDCJ Number</th><th scope="col">Age</th><th scope="col">Date</th><th scope="col">Race</th><th scope="col">County</th></tr>
      <tr><td>542</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Bigby</td><td>James</td><td>997</td><td>61</td><td>3/14/2017</td><td>White</td><td>Tarrant</td></tr>
      <tr><td>541</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Ruiz</td><td>Rolando</td><td>999145</td><td>44</td><td>3/07/2017</td><td>Hispanic</td><td>Bexar</td></tr>
      <tr><td>540</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Edwards</td><td>Terry</td><td>999463</td><td>43</td><td>1/26/2017</td><td>Black</td><td>Dallas</td></tr>
      <tr><td>539</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Wilkins</td><td>Christopher</td><td>999533</td><td>48</td><td>01/11/2017</td><td>White</td><td>Tarrant</td></tr>
      <tr><td>538</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Fuller</td><td>Barney</td><td>999481</td><td>58</td><td>10/05/2016</td><td>White</td><td>Houston</td></tr>
  </tbody>
</table>'''

soup = BeautifulSoup(html, 'html.parser')
rows = iter(soup.find('table').find_all('tr'))

# skip first row
next(rows)

for row in rows:
    for cell in row.find_all('td'):
        print(cell)
    print()
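
To go from printing cells to the list-of-dicts structure the question describes, something along these lines might work. This is only a sketch: the `FIELDS` names, the `rows_to_dicts` helper, and the shortened `sample_html` are made up for illustration (the two data rows are copied from the sample above), and it assumes every data row has its cells in the same order as the header row. It skips the header by slicing the list of rows rather than calling `next()` on an iterator.

```python
from bs4 import BeautifulSoup

# Shortened sample in the same shape as the table above (illustrative only).
sample_html = '''<table class="os" width="100%">
  <tr><th>Execution</th><th>Link</th><th>Link</th><th>Last Name</th><th>First Name</th><th>TDCJ Number</th><th>Age</th><th>Date</th><th>Race</th><th>County</th></tr>
  <tr><td>541</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Ruiz</td><td>Rolando</td><td>999145</td><td>44</td><td>3/07/2017</td><td>Hispanic</td><td>Bexar</td></tr>
  <tr><td>540</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Edwards</td><td>Terry</td><td>999463</td><td>43</td><td>1/26/2017</td><td>Black</td><td>Dallas</td></tr>
</table>'''

# Hypothetical key names, one per column, in the order the cells appear.
FIELDS = ['Execution Number', 'Offender Information', 'Last Statement',
          'Last Name', 'First Name', 'TDCJ Number', 'Age',
          'Date Executed', 'Race', 'County']


def rows_to_dicts(html):
    """Return one dict per data row, keyed by FIELDS."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = soup.find('table').find_all('tr')
    inmates = []
    for row in rows[1:]:  # slicing from index 1 skips the header row
        # get_text(strip=True) returns the cell's text without the tags
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        inmates.append(dict(zip(FIELDS, cells)))
    return inmates


inmates = rows_to_dicts(sample_html)
for inmate in inmates:
    print(inmate['Last Name'], inmate['First Name'], inmate['Date Executed'])
```

Each dict could then be handed to the `csv.DictWriter` from the question via `writerows(inmates)`. With `extrasaction='ignore'`, keys that are not in `fieldnames` are dropped, and fields with no matching key are written as empty strings.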