I'm having some trouble extracting specific elements with BS4. This is taken from the Texas Department of Corrections Executed Inmates page.
I've attached a screenshot for better understanding.
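Each tr tag contains multiple td tags holding text such as the first name, last name, TDCJ number, age, date, and so on.
How can I make BS4 skip the first tr tag (it holds the column names) and, for each subsequent tr tag, extract the text from its td tags?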
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

def main():
    gettabledata()
    lstofinmates = list()

def gettabledata():
    # Download the page and parse it
    with urlopen('https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html') as response:
        soup = BeautifulSoup(response, 'html.parser')
        with open('exinmates.csv', 'w', newline='') as output_file:
            inmate_file_writer = csv.DictWriter(output_file,
                fieldnames=['First Name', 'Last Name', 'Execution Number',
                            'Last Statement', 'TDCJ Number', 'Age', 'Date Executed', 'Race',
                            'County'],
                extrasaction='ignore',
                delimiter=',', quotechar='"')
            inmate_file_writer.writeheader()
            # This is as far as I got: the whole table body, header row included
            table = soup.find('table').find('tbody')
            print(table)

if __name__ == '__main__':
    main()
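I'm thinking of creating an LOD (list of dictionaries) structure, where each dictionary corresponds to one inmate's information, the text from the td fields is pushed into the dictionary, and each dictionary is appended to a list. The problem is that I can't find a way to skip the first tr tag, or work out how to iterate over the remaining tr tags to put them into dictionaries. Any suggestions/help? Thanks!
Answer 0 (score: 1)
Here is something to get you started: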
from bs4 import BeautifulSoup
html = '''<h1>Executed Offenders</h1>
<table class="os" width="100%">
<tbody>
<tr><th scope="col">Execution</th><th scope="col">Link</th><th scope="col">Link</th><th scope="col">Last Name</th><th scope="col">First Name</th><th scope="col">TDCJ Number</th><th scope="col">Age</th><th scope="col">Date</th><th scope="col">Race</th><th scope="col">County</th></tr>
<tr><td>542</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Bigby</td><td>James</td><td>997</td><td>61</td><td>3/14/2017</td><td>White</td><td>Tarrant</td></tr>
<tr><td>541</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Ruiz</td><td>Rolando</td><td>999145</td><td>44</td><td>3/07/2017</td><td>Hispanic</td><td>Bexar</td></tr>
<tr><td>540</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Edwards</td><td>Terry</td><td>999463</td><td>43</td><td>1/26/2017</td><td>Black</td><td>Dallas</td></tr>
<tr><td>539</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Wilkins</td><td>Christopher</td><td>999533</td><td>48</td><td>01/11/2017</td><td>White</td><td>Tarrant</td></tr>
<tr><td>538</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Fuller</td><td>Barney</td><td>999481</td><td>58</td><td>10/05/2016</td><td>White</td><td>Houston</td></tr>
</tbody>
</table>'''
soup = BeautifulSoup(html, 'html.parser')
# Wrap the rows in an iterator so the header row can be consumed separately
rows = iter(soup.find('table').find_all('tr'))
# skip first row
next(rows)
for row in rows:
    for cell in row.find_all('td'):
        print(cell)
    print()
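To go from printing cells to the list of dictionaries described in the question, something like the following might work. It is only a sketch: the `COLUMNS` mapping and the `parse_inmates` helper are names made up for illustration, the column positions are assumed from the sample table above, and the live page may be laid out differently. It skips the header row by slicing (`[1:]`) instead of calling `next()`, and it writes to an in-memory buffer rather than to `exinmates.csv` so that it runs without touching the disk. The 'Last Statement' column is left out because that cell only holds a link.

```python
import csv
import io

from bs4 import BeautifulSoup

# Trimmed sample with the same column layout as the table above.
html = '''<table class="os">
<tr><th>Execution</th><th>Link</th><th>Link</th><th>Last Name</th><th>First Name</th><th>TDCJ Number</th><th>Age</th><th>Date</th><th>Race</th><th>County</th></tr>
<tr><td>542</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Bigby</td><td>James</td><td>997</td><td>61</td><td>3/14/2017</td><td>White</td><td>Tarrant</td></tr>
<tr><td>541</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Ruiz</td><td>Rolando</td><td>999145</td><td>44</td><td>3/07/2017</td><td>Hispanic</td><td>Bexar</td></tr>
</table>'''

# Hypothetical mapping: td position -> CSV field name (assumed from the sample).
COLUMNS = {
    0: 'Execution Number',
    3: 'Last Name',
    4: 'First Name',
    5: 'TDCJ Number',
    6: 'Age',
    7: 'Date Executed',
    8: 'Race',
    9: 'County',
}


def parse_inmates(html_text):
    """Return one dictionary per data row, skipping the header row."""
    soup = BeautifulSoup(html_text, 'html.parser')
    inmates = []
    # find_all returns a list-like object, so [1:] drops the first (header) tr.
    for row in soup.find('table').find_all('tr')[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        inmates.append({name: cells[index] for index, name in COLUMNS.items()})
    return inmates


inmates = parse_inmates(html)

# Write the dictionaries as CSV into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(COLUMNS.values()))
writer.writeheader()
writer.writerows(inmates)
print(buffer.getvalue())
```

To write a real file instead, the buffer could be swapped for `open('exinmates.csv', 'w', newline='')` as in the question.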