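I'm trying to scrape election results from the 2018 Texas general election. I have the code below, but I can't get rid of the totals rows. There is also a side effect: every race that is not a U.S. Representative race gets tagged as well.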
import requests
from bs4 import BeautifulSoup
import re, os, csv

fileDir = os.path.dirname(__file__)
csvFile = os.path.join(fileDir, 'election2018.csv')
sos_2018_site = 'https://elections.sos.state.tx.us/elchist331_state.htm'
r = requests.get(sos_2018_site)
soup = BeautifulSoup(r.text)
district_campaigns = soup.find_all(text=re.compile('^U. S. Representative District'))
districts = [district.string for district in district_campaigns]
table_rows = soup.find_all('tr')
# print(us_rep)
for district in district_campaigns:
    candidate = district.parent.parent.next_sibling.td.next_element
current_district = ''
with open(csvFile, 'w') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for tr in table_rows:
        table_data = []
        for td in tr.children:
            if td.string in districts:
                current_district = td.string
                continue
            if td.string == None:
                continue
            table_data.append(td.string)
        table_data.append(current_district)
        if any("U. S. Representative" in s for s in table_data) and any("-" not in s for s in table_data):
            writer.writerow(table_data)
Answer 0 (score: 3)
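I wish I could offer a quick fix, but this is a rewrite using iterators instead. (I don't know whether you are comfortable with them. One catch when debugging: you need to convert a generator to a list with `list(...)` before you can inspect it.)
The main idea is to use `BeautifulSoup` to extract a list of lists of strings from the HTML, much like reading a CSV, and then filter that list as needed.
It is also a good idea to keep the parsing logic (which builds the `output` variable) separate from the file-saving operation. That makes the code easier to modify and to reason about.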
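A minimal sketch of that idea, under stated assumptions: each race header is a row with a single cell, candidate rows have five cells (blank, name, party, votes, percent), and separator and totals rows leave the name cell empty. `SAMPLE_HTML` is made-up data that only imitates such a layout, and `iter_rows`, `iter_results`, and `save_csv` are hypothetical names chosen for illustration. The real page may need different checks.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up sample imitating the assumed layout; these are NOT real results.
SAMPLE_HTML = """
<table>
<tr><td colspan="2">U. S. Representative District 1 - </td></tr>
<tr><td></td><td>Candidate A</td><td>REP</td><td>100</td><td>62.50%</td></tr>
<tr><td></td><td>Candidate B</td><td>DEM</td><td>60</td><td>37.50%</td></tr>
<tr><td></td><td></td><td></td><td>-----</td><td></td></tr>
<tr><td></td><td></td><td>Race Total</td><td>160</td><td></td></tr>
<tr><td colspan="2">Governor - </td></tr>
<tr><td></td><td>Candidate C</td><td>REP</td><td>90</td><td>100.00%</td></tr>
</table>
"""


def iter_rows(html):
    """Yield every <tr> as a list of stripped cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find_all("tr"):
        yield [td.get_text(strip=True) for td in tr.find_all("td")]


def iter_results(rows, prefix="U. S. Representative"):
    """Yield [race, name, party, votes] for candidate rows of matching races."""
    race = ""
    for cells in rows:
        if len(cells) == 1:
            # A single-cell row is treated as a race header (assumption).
            race = cells[0].rstrip(" -")
            continue
        if len(cells) < 5 or not race.startswith(prefix):
            continue
        _, name, party, votes, _ = cells[:5]
        if not name or party == "Race Total" or set(votes) == {"-"}:
            continue  # separator or totals row
        yield [race, name, party, votes]


def save_csv(results, csv_file):
    """Write the filtered rows to an already opened text file."""
    csv.writer(csv_file).writerows(results)


# Parsing: generators are lazy, so list(...) materializes them for inspection.
output = list(iter_results(iter_rows(SAMPLE_HTML)))

# Saving: a StringIO stands in for a real file here; a file opened with
# open(path, "w", newline="") would be passed in the same way.
buffer = io.StringIO()
save_csv(output, buffer)
```

Because the race name is carried along only while it matches the prefix, rows from other races are never tagged, and the totals rows are dropped by an explicit check rather than by a hyphen test.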
P.S. After writing this code, I think any real democracy should report its election results in JSON, not HTML.
Answer 1 (score: 1)
import requests
from bs4 import BeautifulSoup
import re, os
import pandas as pd

fileDir = os.path.dirname(__file__)
csvFile = os.path.join(fileDir, 'election2018.csv')
sos_2018_site = 'https://elections.sos.state.tx.us/elchist331_state.htm'
r = requests.get(sos_2018_site)
soup = BeautifulSoup(r.text)
trs = soup.find_all('tr')
vote_type = ''
result_list = []
# Skip the first row, then walk the remaining rows in page order.
for tr in trs[1:]:
    tds = tr.find_all('td')
    try:
        # A first cell with colspan="2" is a race header: remember its name.
        if tds[0]['colspan'] == '2':
            vote_type = re.sub(' - $', '', tds[0].text)
    except KeyError:
        # No colspan attribute: a data row. Skip totals and dashed separator rows.
        if re.search('Race Total', tds[2].text) is None and re.search('-{2,}', tds[3].text) is None:
            result_list.append({
                'TYPE': vote_type,
                'NAME': tds[1].text,
                'PARTY': tds[2].text,
                'VOTE': int(tds[3].text.replace(',', '')),
                'PERCENT': float(tds[4].text.replace('%', '')),
            })
pdf_vote = pd.DataFrame(result_list)
pdf_vote.to_csv(csvFile, sep=';', index=False)
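I prefer to load everything into pandas and then filter for what I need. Writing to a CSV file is also easier with pandas, and not only CSV; other formats are possible too...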
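As a rough illustration of the "filter afterwards" step, here is a small sketch. The rows are made up and only mimic the shape of `result_list`; `us_rep`, `csv_buffer`, and `json_text` are names introduced here, and JSON is just one possible example of another format, since the answer does not say which formats it has in mind.

```python
import io

import pandas as pd

# Made-up rows in the shape of result_list; these are NOT real results.
result_list = [
    {"TYPE": "U. S. Representative District 1", "NAME": "Candidate A",
     "PARTY": "REP", "VOTE": 100, "PERCENT": 62.5},
    {"TYPE": "U. S. Representative District 1", "NAME": "Candidate B",
     "PARTY": "DEM", "VOTE": 60, "PERCENT": 37.5},
    {"TYPE": "Governor", "NAME": "Candidate C",
     "PARTY": "REP", "VOTE": 90, "PERCENT": 100.0},
]
pdf_vote = pd.DataFrame(result_list)

# Keep only the U. S. Representative races.
us_rep = pdf_vote[pdf_vote["TYPE"].str.startswith("U. S. Representative")]

# The same frame can be written in more than one format.
# A StringIO buffer stands in for a real file path here.
csv_buffer = io.StringIO()
us_rep.to_csv(csv_buffer, sep=";", index=False)
json_text = us_rep.to_json(orient="records")
```

Filtering on the `TYPE` column after the fact keeps the scraping loop simple: it collects every race, and the choice of which races to keep is a one-line change.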