I've created a script in Python to read different id numbers from a csv file, use them with a link to fetch results, and write those results to a different csv file.

This is the base link https://abr.business.gov.au/ABN/View?abn= and these numbers (stored in the csv file), 78007306283, 70007746536, 95051096649, are appended to it to make a usable link. The numbers sit under the ids header in the csv file. https://abr.business.gov.au/ABN/View?abn=78007306283 is one such qualified link.

My script can read the numbers from the csv file, append them one by one to that link, fetch the results from the website, and write them to another csv file after extraction.

The only problem I'm facing is that my newly created csv file also contains the ids column, whereas I want to exclude that column from the new csv file.

How can I get rid of the column that comes from the old csv file when writing the results to a new csv file?

This is what I have tried so far:
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://abr.business.gov.au/ABN/View?abn={}"

with open("itemids.csv", "r") as f, open('information.csv', 'w', newline='') as g:
    reader = csv.DictReader(f)
    newfieldnames = reader.fieldnames + ['Name', 'Status']
    writer = csv.DictWriter(g, fieldnames=newfieldnames)
    writer.writeheader()

    for entry in reader:
        res = requests.get(URL.format(entry['ids']))
        soup = BeautifulSoup(res.text, "lxml")
        item = soup.select_one("span[itemprop='legalName']").text
        stat = soup.find("th", string="ABN status:").find_next_sibling().get_text(strip=True)
        print(item, stat)

        new_row = entry
        new_row['Name'] = item
        new_row['Status'] = stat
        writer.writerow(new_row)
Answer 0 (score: 2)
The answer below basically points out that pandas gives you some control over manipulating the table (i.e., you want to get rid of a column). You can certainly do this with csv and BeautifulSoup, but it can be done in fewer lines of code with pandas.

For example, using just your list of 3 ids, you can generate a table that is easily written to a file:
import pandas as pd
import requests

URL = "https://abr.business.gov.au/ABN/View?abn="

# Read in your csv with the ids
id_df = pd.read_csv('path/file.csv')

# Create your list of ids from that csv
id_list = list(id_df['ids'])

results = pd.DataFrame()
for entry in id_list:
    url = URL + '%s' % (str(entry))
    res = requests.get(url)
    table = pd.read_html(url)[0]

    name = table.iloc[0, 1]
    status = table.iloc[1, 1]

    temp_df = pd.DataFrame([[name, status]], columns=['Name', 'Status'])
    results = results.append(temp_df).reset_index(drop=True)

results.to_csv('path/new_file.csv', index=False)
Output:

print(results)

                                           Name                   Status
0  AUSTRALIAN NATIONAL MEMORIAL THEATRE LIMITED  Active from 30 Mar 2000
1                MCDONNELL INDUSTRIES PTY. LTD.  Active from 24 Mar 2000
2                         FERNSPOT PTY. LIMITED  Active from 01 Nov 1999
3                         FERNSPOT PTY. LIMITED  Active from 01 Nov 1999
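As an aside, if you ever end up with an unwanted column in a DataFrame, pandas can drop it explicitly just before writing. A minimal sketch (the column name 'ids' is taken from your question; the file path is hypothetical):

import pandas as pd

# Hypothetical frame that still contains the ids column alongside the scraped data
combined = pd.read_csv('path/combined.csv')

# Drop the ids column right before writing the new file
combined.drop(columns=['ids']).to_csv('path/new_file.csv', index=False)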
As for the code you are working on, I think the problem is this line:
new_row = entry
because entry refers to the rows read from file f, which includes the id column. What you can do is remove that column right before writing. Technically, I believe it's a dictionary, so you just need to delete that key:value pair:

I can't test this at the moment, but I think it should look something like this:
new_row = entry
new_row['Name'] = item
new_row['Status'] = stat

del new_row['ids']  # or whatever the key is for that id value

writer.writerow(new_row)
Edit / Addendum

The reason it still shows up is because of this line:
newfieldnames = reader.fieldnames + ['Name', 'Status']
Since you have reader = csv.DictReader(f), that includes the ids column. So in your newfieldnames = reader.fieldnames + ['Name', 'Status'] you are including the field names from the original csv. Just drop reader.fieldnames + and initialize new_row = {}, and I think that should fix it:
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://abr.business.gov.au/ABN/View?abn={}"

with open("itemids.csv", "r") as f, open('information.csv', 'w', newline='') as g:
    reader = csv.DictReader(f)
    newfieldnames = ['Name', 'Status']
    writer = csv.DictWriter(g, fieldnames=newfieldnames)
    writer.writeheader()

    for entry in reader:
        res = requests.get(URL.format(entry['ids']))
        soup = BeautifulSoup(res.text, "lxml")
        item = soup.select_one("span[itemprop='legalName']").text
        stat = soup.find("th", string="ABN status:").find_next_sibling().get_text(strip=True)
        print(item, stat)

        new_row = {}
        new_row['Name'] = item
        new_row['Status'] = stat
        writer.writerow(new_row)
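As an alternative sketch (my own suggestion, not part of the original answer): csv.DictWriter accepts extrasaction='ignore', which lets you keep new_row = entry and simply skip any keys that are not listed in fieldnames, so the ids value never gets written:

import csv
import requests
from bs4 import BeautifulSoup

URL = "https://abr.business.gov.au/ABN/View?abn={}"

with open("itemids.csv", "r") as f, open('information.csv', 'w', newline='') as g:
    reader = csv.DictReader(f)
    # Only these columns are written; extrasaction='ignore' silently drops
    # any other keys (such as 'ids') found in the row dictionaries.
    writer = csv.DictWriter(g, fieldnames=['Name', 'Status'], extrasaction='ignore')
    writer.writeheader()

    for entry in reader:
        res = requests.get(URL.format(entry['ids']))
        soup = BeautifulSoup(res.text, "lxml")
        item = soup.select_one("span[itemprop='legalName']").text
        stat = soup.find("th", string="ABN status:").find_next_sibling().get_text(strip=True)

        new_row = entry              # still contains 'ids', which is fine now
        new_row['Name'] = item
        new_row['Status'] = stat
        writer.writerow(new_row)     # 'ids' is ignored, not written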
Answer 1 (score: 1)
You can also do web scraping in Python using just the Pandas package, with less code. You get a dataframe first and can then select any columns or rows you want. Take a look at how I did it: https://medium.com/@alcarsil/python-for-cryptocurrencies-absolutely-beginners-how-to-find-penny-cryptos-and-small-caps-72de2eb6deaa
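A minimal sketch of that idea applied to this question (assuming, as in the other answer, that the first table on each ABN page holds the name and status cells):

import pandas as pd

# Read the ids column from the existing csv, scrape each page with pandas only,
# and keep just the columns you want in the output file.
ids = pd.read_csv('itemids.csv')['ids']

rows = []
for abn in ids:
    details = pd.read_html("https://abr.business.gov.au/ABN/View?abn={}".format(abn))[0]
    rows.append({'Name': details.iloc[0, 1], 'Status': details.iloc[1, 1]})

pd.DataFrame(rows).to_csv('information.csv', index=False)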