I wrote some code that looks up companies on https://violationtracker.goodjobsfirst.org/ and downloads the CSV results from each company's page. See this example for Nike: https://violationtracker.goodjobsfirst.org/prog.php?parent=&major_industry_sum=&offense_group_sum=&primary_offense_sum=&agency_sum=&agency_sum_st=&hq_id_sum=&company_op=starts&company=nike&major_industry%5B%5D=&case_category=&offense_group=&all_offense%5B%5D=&penalty_op=%3E&penalty=&govt_level=&agency_code%5B%5D=&agency_code_st%5B%5D=&pen_year%5B%5D=&pres_term=&free_text=&case_type=&ownership%5B%5D=&hq_id=&naics%5B%5D=&state=&city=
The code worked for a long time, but now, instead of downloading a CSV, it downloads a temporary file, and I am not sure why. The website itself is not the problem: when I try it manually, the CSV downloads fine.
Here is my code:
import glob
import os
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

df_all = []
supplier = ['Nike']
length = len(supplier)

## go to the website
for idx, i in enumerate(supplier):
    rem = length - idx
    print('This is index: ', idx, ', element: ', i, ', with remaining: ', rem, ' elements')
    try:
        driver = webdriver.Chrome(executable_path=r"C:\webdrivers\chromedriver.exe")
        driver.get("https://www.goodjobsfirst.org/violation-tracker")
        ## find the iframe within the browser
        driver.switch_to.frame(0)
        ## insert text via xpath
        elem = driver.find_element_by_xpath("//*[@id='edit-field-violation-company-value']")
        elem.send_keys(i)
        elem.send_keys(Keys.RETURN)
        time.sleep(10)
        try:
            ## download the information from the relevant page
            button = driver.find_element_by_xpath('//*[@id="content"]/div/div[2]/a[1]/img')
            ActionChains(driver).move_to_element(button).click(button).perform()
            ## load the latest csv from the download folder
            list_of_files = glob.glob(r'C:\Users\~\Downloads\*.csv')
            latest_file = max(list_of_files, key=os.path.getctime)
            time.sleep(3)
            df = pd.read_csv(latest_file)
            print(df)
            df_all.append(df)
            driver.close()
            if os.path.exists(latest_file):
                os.remove(latest_file)
            else:
                print("The file does not exist")
        except:
            driver.close()
    except:
        pass

violation_tracker = pd.concat(df_all)
What am I missing?
Answer 0 (score: 0)
This website looks very interesting, thank you!
You don't need Selenium at all: just append "&detail=csv_results" to the end of your second URL and the server returns the CSV directly. The following code works:
import csv

import requests as rq

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0"}
url = "https://violationtracker.goodjobsfirst.org/prog.php?parent=&major_industry_sum=&offense_group_sum=&primary_offense_sum=&agency_sum=&agency_sum_st=&hq_id_sum=&company_op=starts&company=nike&major_industry[]=&case_category=&offense_group=&all_offense[]=&penalty_op=%3E&penalty=&govt_level=&agency_code[]=&agency_code_st[]=&pen_year[]=&pres_term=&free_text=&case_type=&ownership[]=&hq_id=&naics[]=&state=&city=&detail=csv_results"
resp = rq.get(url, headers=headers)

## parse the CSV response line by line
wrapper = csv.reader(resp.text.strip().split('\n'))
for record in wrapper:
    print(record)
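As a side note, the long query string does not have to be edited by hand: it can be assembled with urllib.parse.urlencode. This is only a sketch, keeping just the non-empty parameters from the URL above, on the (untested) assumption that the server ignores the omitted empty ones:

```python
from urllib.parse import urlencode

base = "https://violationtracker.goodjobsfirst.org/prog.php"
params = {
    "company_op": "starts",      # match companies whose name starts with...
    "company": "nike",           # ...this search term
    "penalty_op": ">",           # urlencode escapes this to %3E
    "detail": "csv_results",     # the extra parameter that switches the response to CSV
}
url = base + "?" + urlencode(params)
print(url)
```

Swapping "nike" for another company name then gives you the CSV URL for that company without any browser automation.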