Unable to get rid of a column when writing data to a CSV file using reverse search

Time: 2019-01-24 13:38:37

Tags: python python-3.x csv web-scraping

I've created a script in Python to read different ID numbers from a CSV file, use them with a link to fetch the populated results, and write those results to a different CSV file.

The base link is https://abr.business.gov.au/ABN/View?abn=, and the numbers stored in the CSV file (78007306283, 70007746536 and 95051096649) are appended to it to make it a usable link. The numbers sit under the ids header in the CSV file. https://abr.business.gov.au/ABN/View?abn=78007306283 is one such qualified link.
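For reference, itemids.csv would look like this, assuming an ids header followed by one 11-digit ABN per line:

ids
78007306283
70007746536
95051096649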

My script can read the numbers from the CSV file, append them one by one to that link, fetch the populated result pages, and write the extracted results to another CSV file.

The only problem I'm facing is that my newly created CSV file also contains the ids column, whereas I want to exclude that column from the new file.

How can I get rid of the column carried over from the old CSV file when writing the results to the new one?

This is what I've tried so far:

import csv
import requests
from bs4 import BeautifulSoup

URL = "https://abr.business.gov.au/ABN/View?abn={}"

with open("itemids.csv", "r") as f, open('information.csv', 'w', newline='') as g:
    reader = csv.DictReader(f)
    newfieldnames = reader.fieldnames + ['Name', 'Status']
    writer = csv.DictWriter(g, fieldnames=newfieldnames)
    writer.writeheader()
    for entry in reader:
        res = requests.get(URL.format(entry['ids']))
        soup = BeautifulSoup(res.text,"lxml")
        item = soup.select_one("span[itemprop='legalName']").text
        stat = soup.find("th",string="ABN status:").find_next_sibling().get_text(strip=True)

        print(item,stat)

        new_row = entry
        new_row['Name'] = item
        new_row['Status'] = stat
        writer.writerow(new_row)

2 Answers:

Answer 0 (score: 2):

The answer below basically points out that pandas gives you some control over manipulating tables (i.e. you want to get rid of a column). You could certainly do this with csv and BeautifulSoup, but it takes fewer lines of code with pandas.
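For instance, a minimal sketch of that idea: once your results are in a DataFrame, drop removes a column by name before writing (the sample row below is hypothetical, modelled on the question's data):

import pandas as pd

# hypothetical row, modelled on the question's data
df = pd.DataFrame({'ids': [78007306283],
                   'Name': ['AUSTRALIAN NATIONAL MEMORIAL THEATRE LIMITED'],
                   'Status': ['Active from 30 Mar 2000']})

df = df.drop(columns=['ids'])           # get rid of the unwanted column
df.to_csv('new_file.csv', index=False)  # write it out without 'ids'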

For example, using just your list of 3 IDs, it can generate a table that you can easily write to a file:

import pandas as pd

URL = "https://abr.business.gov.au/ABN/View?abn="

# Read in your csv with the ids
id_df = pd.read_csv('path/file.csv')

# Create your list of ids from that csv
id_list = list(id_df['ids'])

rows = []
for entry in id_list:
    url = URL + str(entry)

    # read_html fetches the page and parses every <table> on it;
    # the details table we want is the first one
    table = pd.read_html(url)[0]

    name = table.iloc[0, 1]
    status = table.iloc[1, 1]

    rows.append({'Name': name, 'Status': status})

# build the result table in one go (DataFrame.append was removed in pandas 2.0)
results = pd.DataFrame(rows, columns=['Name', 'Status'])

results.to_csv('path/new_file.csv', index=False)

Output:

print(results)
                                           Name                   Status
0  AUSTRALIAN NATIONAL MEMORIAL THEATRE LIMITED  Active from 30 Mar 2000
1                MCDONNELL INDUSTRIES PTY. LTD.  Active from 24 Mar 2000
2                         FERNSPOT PTY. LIMITED  Active from 01 Nov 1999
3                         FERNSPOT PTY. LIMITED  Active from 01 Nov 1999

As far as the code you're working with, I think the problem is this line:

new_row = entry

because entry references a row of file f, which has that ids column. What you can do is delete that column right before writing. Technically, I believe each row is a dictionary, so you just need to delete that key:value pair:

I can't test at the moment, but I think it should look something like this:

    new_row = entry
    new_row['Name'] = item
    new_row['Status'] = stat
    del new_row['ids']  # or whatever the key is for that id value

    writer.writerow(new_row)

Edit/Additional

The reason it still shows up is because of this line:

newfieldnames = reader.fieldnames + ['Name', 'Status']

Since you have reader = csv.DictReader(f), the fieldnames include the ids column. So with newfieldnames = reader.fieldnames + ['Name', 'Status'] you're carrying over the fieldnames from the original CSV. Just drop reader.fieldnames + and initialize new_row = {} instead.

I think this should fix it:

import csv
import requests
from bs4 import BeautifulSoup

URL = "https://abr.business.gov.au/ABN/View?abn={}"

with open("itemids.csv", "r") as f, open('information.csv', 'w', newline='') as g:
    reader = csv.DictReader(f)
    newfieldnames = ['Name', 'Status']  # only the new columns; no reader.fieldnames
    writer = csv.DictWriter(g, fieldnames=newfieldnames)
    writer.writeheader()
    for entry in reader:
        res = requests.get(URL.format(entry['ids']))
        soup = BeautifulSoup(res.text, "lxml")
        item = soup.select_one("span[itemprop='legalName']").text
        stat = soup.find("th", string="ABN status:").find_next_sibling().get_text(strip=True)

        print(item, stat)

        new_row = {}  # start from an empty dict so the ids value is never carried over
        new_row['Name'] = item
        new_row['Status'] = stat
        writer.writerow(new_row)
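A further sketch of an alternative: csv.DictWriter also accepts extrasaction='ignore', which silently drops any dict keys not listed in fieldnames, so you could pass each entry row through unchanged (the scraped values below are placeholders):

import csv

with open("itemids.csv", "r") as f, open('information.csv', 'w', newline='') as g:
    reader = csv.DictReader(f)
    # extrasaction='ignore' tells DictWriter to skip keys missing from fieldnames
    writer = csv.DictWriter(g, fieldnames=['Name', 'Status'], extrasaction='ignore')
    writer.writeheader()
    for entry in reader:
        entry['Name'] = 'scraped name'      # placeholders for the scraped values
        entry['Status'] = 'scraped status'
        writer.writerow(entry)              # the leftover 'ids' key is ignored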

Answer 1 (score: 1):

You can also use the pandas package for web scraping in Python. Less code, you know. You can get a DataFrame first and then select any columns or rows you want. Take a look at how I did it: https://medium.com/@alcarsil/python-for-cryptocurrencies-absolutely-beginners-how-to-find-penny-cryptos-and-small-caps-72de2eb6deaa
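For instance, a minimal sketch of that approach against the same ABN page (reusing the URL and the column positions from the first answer):

import pandas as pd

# read_html returns a list of DataFrames, one per <table> on the page
table = pd.read_html("https://abr.business.gov.au/ABN/View?abn=78007306283")[0]

# then select whichever rows/columns you want
print(table.iloc[0, 1])  # entity name
print(table.iloc[1, 1])  # ABN status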