I finally got it to create the CSV file, but for some reason it creates the header and never fills in any data.
import requests
from bs4 import BeautifulSoup
import csv

url = "http://www.scsotx.org/jail-booking"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html5lib')

inmate_data = []
table = soup.find('div', attrs={'class': 'sex-offender-info'})
for row in table.findAll('div', attrs={'class': 'jail-content'}):
    jaildata = {}
    jaildata['Name'] = row.h4.text
    jaildata['Agency'] = row.p.text
    inmate_data.append(jail-content)

with open('C:\\Users\Cale\Desktop\jail\inmate_data.csv', 'w') as f:
    w = csv.DictWriter(f, ['Name', 'Agency'])
    w.writeheader()
    for jaildata in inmate_data:
        w.writerow(jaildata)
It should parse the HTML data and then populate the CSV file.
Answer 0 (score: 2)
Here is your fixed code:
for row in table.findAll("figcaption", attrs={"class": "jail-content"}):
    jaildata = {}
    jaildata["Name"] = row.h4.text
    jaildata["Agency"] = row.p.text
    inmate_data.append(jaildata)
The data you are looking for sits inside a <figcaption>, not a <div>, and you also have a typo when appending the data: jail-content instead of jaildata.
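To see the difference, here is a minimal, self-contained sketch using a made-up HTML snippet (not the live page) that imitates the structure described above. It uses the stdlib html.parser instead of html5lib, just to avoid the extra dependency:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML imitating the page structure described in the answer
html = """
<figcaption class="jail-content">
  <h4>DOE, JOHN</h4>
  <p>Agency: Smith County</p>
</figcaption>
"""
soup = BeautifulSoup(html, "html.parser")

# Searching for a <div> with this class finds nothing...
print(soup.find("div", attrs={"class": "jail-content"}))  # None

# ...while targeting <figcaption> returns the row
row = soup.find("figcaption", attrs={"class": "jail-content"})
print(row.h4.text)  # DOE, JOHN
```

So with the original selector the loop body never runs, which is exactly why the CSV ends up with only a header.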
Answer 1 (score: 0)
Although I would strongly recommend Selenium for this kind of task, here is something you can use to improve your scraping:
import requests
import pandas as pd
from bs4 import BeautifulSoup


class ScrapJail:
    def __init__(self, url: str = "http://www.scsotx.org/jail-booking"):
        self.url = url

    def get_table(self):
        response = requests.get(self.url)
        soup = BeautifulSoup(response.content, 'html5lib')
        raw_data = soup.find('div', attrs={'class': 'sex-off-box'})
        data_contents = raw_data.findAll('div', attrs={'class': "medium-4 small-6 columns"})
        data = []
        for i, _ in enumerate(data_contents):
            person_data_tags = data_contents[i].findAll('div', attrs={'class': "sex-offender-info"})
            person_data_jail = person_data_tags[0].findAll(attrs={'class': "jail-content"})
            person_data = person_data_jail[0].findChildren()
            person_dict = {}
            for tag in person_data:
                person_text = tag.text
                try:
                    # str.index raises ValueError when there is no ':' in the text
                    points = person_text.index(':')
                    person_dict[person_text[:points]] = person_text[points + 1:]
                    data.append(person_dict)
                except ValueError:
                    pass
        return data

    def data_frame(self):
        return pd.DataFrame(self.get_table())

    def export_csv(self, file_name: str):
        df = self.data_frame()
        df.to_csv(file_name)
It's not perfect, but instead of all that you can simply do:
data = ScrapJail()
csv = data.export_csv('file_name.csv')
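The heart of get_table is the colon-splitting step that turns text like "Name: DOE, JOHN" into dict entries, skipping anything without a colon. Here is a sketch of just that logic on made-up input strings, so it runs without hitting the network:

```python
# Sample lines imitating the .text of each "jail-content" child tag (made-up data)
lines = ["Name: DOE, JOHN", "Agency: Smith County", "no colon here"]

person_dict = {}
for text in lines:
    try:
        # str.index raises ValueError when ':' is absent, so that line is skipped
        colon = text.index(':')
        person_dict[text[:colon]] = text[colon + 1:]
    except ValueError:
        pass

print(person_dict)  # {'Name': ' DOE, JOHN', 'Agency': ' Smith County'}
```

Note that the values keep the space after the colon; a .strip() on the value would clean that up.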
Answer 2 (score: -1)
for row in table.findAll('div', attrs={'class': 'jail-content'}):
    jaildata = {}
    jaildata['Name'] = row.h4.text
    jaildata['Agency'] = row.p.text
    inmate_data.append(jail-content)
If you look at this block, the variable jail-content used in the last line is never declared anywhere. I assume you meant jaildata?
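To confirm that the CSV side of the original script works once the list is actually populated, here is a sketch that writes to an in-memory buffer (io.StringIO) instead of a file on disk, with made-up inmate records standing in for the parsed rows:

```python
import csv
import io

# Made-up records; in the original script these come from the parsed page
inmate_data = [
    {"Name": "DOE, JOHN", "Agency": "Smith County"},
    {"Name": "ROE, JANE", "Agency": "Smith County"},
]

buf = io.StringIO()
w = csv.DictWriter(buf, ["Name", "Agency"])
w.writeheader()
for jaildata in inmate_data:
    w.writerow(jaildata)

# Fields containing commas are quoted automatically
print(buf.getvalue())
```

When writing to a real file, especially on Windows, open it with newline='' (per the csv module docs) to avoid blank rows between records.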