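I have built a script that scrapes a listing of UK courts, generates a list of links to each court's address page, and then tries to scrape the address from each of those pages.
So far it works well, but I am stuck on the "write to csv" bit. Based on a similar problem, I think it has something to do with iteritems() lacking a get method. I learned that iterators do not have the same methods as iterables (and I am using an iterator in my code), but that did not help me solve my problem.
Here is my code: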
import csv
import time
import random
import requests
from bs4 import BeautifulSoup as bs

# lambda expression to request url and parse it through bs
soup = lambda url: bs((requests.get(url)).text, "html.parser")

def crawl_court_listings(base, buff, char):
    """ """
    # common URL segment + buffer URL segment + end character -> URL
    url = base + buff + str(chr(char))
    # soup lambda expression -> grab first unordered list
    links = (soup(url)).find('div', {'class', 'content inner cf'}).find('ul')
    # empty dictionary
    results = {}
    # loop through links, get link title and href
    for item in links.find_all('a', href=True):
        court_link = item['href']
        title = item.string
        # generate full court address page url from href
        full_court_link = base + court_link
        # save title and full URL to results
        results[title] = full_court_link
    # increment char var by 1
    char += 1
    # return results dict and incremented char value
    return results, char

def get_court_address(court_name, full_court_link):
    """ """
    # get horrible chunk of poorly formatted address(es)
    address_blob = (soup(full_court_link)).find('div', {'id': 'addresses'}).text
    # clean the blob
    clean_address = ("\n".join(line.strip() for line in address_blob.split("\n")))
    # write to csv
    with open('court_addresses.csv', 'w') as csvfile:
        fieldnames = [court_name, full_court_link, clean_address]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writerow(fieldnames)

if __name__ == "__main__":
    base = 'https://courttribunalfinder.service.gov.uk/'
    buff = 'courts/'
    # 65 = "A". Starting from char "A", retrieve list of titles and links for court addresses. Return char + 1
    results, char = crawl_court_listings(base, buff, 65)
    # 90 = "Z". Until Z, pass title and link from results into get_court_address(), then wait a few seconds
    while char <= 90:
        for t, l in results.iteritems():
            get_court_address(t, l)
            time.sleep(random.randint(0,5))
When I run it, I get the following:
Traceback (most recent call last):
File ".\CourtScraper.py", line 63, in <module>
get_court_address(t, l)
File ".\CourtScraper.py", line 49, in get_court_address
writer.writerow(fieldnames)
File "c:\python27\Lib\csv.py", line 152, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
File "c:\python27\Lib\csv.py", line 149, in _dict_to_list
return [rowdict.get(key, self.restval) for key in self.fieldnames]
AttributeError: 'list' object has no attribute 'get'
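Even with the error, it still generates the csv file, with cells A1 and A2 populated with the title and full_court_link, but no address. The address (when printed) looks like this: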
Write to us:
1st Floor
Piccadilly Exchange
Piccadilly Plaza
Manchester
Greater Manchester
M1 4AH
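So my first thought was that I am trying to write multi-line text into a single cell and that this causes the error, but I am not sure how to confirm that. print(type(address)) reports unicode rather than list, so I don't think that is what causes the problem. I don't understand what this issue has to do with list, if that makes sense.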
If it is the iteritems() method that causes the problem, how do I work around it?
Can someone explain the error and point me toward a fix?
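Answer 0 (score: 4)
Your problem is here:
writer.writerow(fieldnames)
"fieldnames" is a list of field names. DictWriter.writerow() needs a dict of key/value pairs, one per field name. So it should look more like this: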
# write to csv
with open('court_addresses.csv', 'w') as csvfile:
    # note - these are strings, not variables
    fieldnames = ['court_name', 'full_court_link', 'clean_address']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writerow({"court_name": court_name,
                     "full_court_link": full_court_link,
                     "clean_address": clean_address})
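PSST: you have another problem. You are reopening the output file for every court you parse, and mode 'w' truncates the file each time, so only the last row would survive. You probably want to open the file once (under __main__) and then pass the handle to get_court_address().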
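A minimal sketch of that "open once, pass the handle" idea, written for Python 3 rather than the asker's Python 2.7. The helper names write_court_row and write_all, the FIELDNAMES constant, and the example.invalid URLs are made up for illustration, and an in-memory buffer stands in for the real file so the snippet runs without touching disk or the network.

```python
import csv
import io

# Column names are plain strings, not the per-court values.
FIELDNAMES = ["court_name", "full_court_link", "clean_address"]


def write_court_row(writer, court_name, full_court_link, clean_address):
    """Write one court as a dict keyed by the column names."""
    writer.writerow({
        "court_name": court_name,
        "full_court_link": full_court_link,
        "clean_address": clean_address,
    })


def write_all(csvfile, courts):
    """Create the DictWriter once and reuse it for every court.

    `courts` is an iterable of (name, link, address) tuples; in the real
    script the address would come from the scraping step.
    """
    writer = csv.DictWriter(csvfile, fieldnames=FIELDNAMES)
    writer.writeheader()
    for name, link, address in courts:
        write_court_row(writer, name, link, address)


# Hypothetical sample data standing in for scraped results.
sample_courts = [
    ("Example Court A", "https://example.invalid/courts/a",
     "1st Floor\nPiccadilly Exchange"),
    ("Example Court B", "https://example.invalid/courts/b",
     "2 Example Street"),
]

# In-memory stand-in for the output file.
buffer = io.StringIO(newline="")
write_all(buffer, sample_courts)
output = buffer.getvalue()
```

In the actual script the buffer would presumably be replaced by a single `with open('court_addresses.csv', 'w', newline='') as csvfile:` block under `__main__` (on Python 2.7 the csv docs call for mode 'wb' instead), with the scraping loop running inside it.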
Answer 1 (score: 2)
For each row you write, you need to pass in a dict; you are passing in the list of headers.
https://docs.python.org/2/library/csv.html#csv.DictWriter
The dict needs to look like:
{'court_name': X, 'full_court_link': Y, 'clean_address': Z}
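To check this without running the scraper, here is a small sketch (Python 3, in-memory buffer, made-up court values) that passes a list and then a dict to the same DictWriter. It also suggests that the multi-line address is not the culprit: the csv module quotes a field that contains newlines.

```python
import csv
import io

fieldnames = ["court_name", "full_court_link", "clean_address"]
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames)

# Passing a list reproduces the failure: DictWriter calls dict methods
# on the row, and a list does not have them.
try:
    writer.writerow(["Example Court", "https://example.invalid/courts/x",
                     "1st Floor"])
    list_error = None
except AttributeError as exc:
    list_error = exc

# Passing a dict keyed by the field names works, even though the
# address value spans two lines.
writer.writerow({
    "court_name": "Example Court",
    "full_court_link": "https://example.invalid/courts/x",
    "clean_address": "1st Floor\nPiccadilly Exchange",
})
written = buffer.getvalue()
```

The exact wording of the AttributeError differs between Python versions (2.7 complains about 'get', as in the traceback above), but the cause is the same.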
HTH
Answer 2 (score: 2)
with open('court_addresses.csv', 'w') as csvfile:
    fieldnames = ['court_name', 'full_court_link', 'clean_address']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writerow({'court_name': court_name, 'full_court_link': full_court_link, 'clean_address': clean_address})