I'm having a problem writing a CSV file after scraping data from a website. My goal is to collect the names and addresses of a list of golf courses in the US. I use .get_text(separator=' ') to break up the address text and strip out the <br> tags, but when the CSV is written I only get three entries out of my 893 iterations. What can I do to get the proper amount of scraped data, and how can I fix my script so that it scrapes everything correctly?
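For reference, here is a minimal sketch of what .get_text(separator=' ') does to an address containing a <br> (the HTML snippet is made up for illustration; the real Garmin markup isn't shown here):

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet for illustration only.
html = '<div class="location">123 Main St<br>Springfield, IL</div>'
loc = BeautifulSoup(html, "html.parser").find("div", {"class": "location"})

# Without a separator the strings around the <br> run together.
print(loc.get_text())               # -> 123 Main StSpringfield, IL
# With separator=' ' they are joined by a space.
print(loc.get_text(separator=' '))  # -> 123 Main St Springfield, IL
```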
Here is my script:
import csv
import requests
from bs4 import BeautifulSoup

courses_list = []
for i in range(893):  # 893
    url = "http://sites.garmin.com/clsearch/courses/search?course=&location=&country=US&state=&holes=&radius=&lang=en&search_submitted=1&per_page={}".format(i*20)
    r = requests.get(url)
    soup = BeautifulSoup(r.text)
    g_data2 = soup.find_all("div", {"class": "result"})
    # print g_data
    for item in g_data2:
        try:
            name = item.find_all("div", {"class": "name"})[0].text
        except:
            name = ''
            print "No Name found!"
        try:
            address = item.find_all("div", {"class": "location"})[0].get_text(separator=' ')
            print address
        except:
            address = ''
            print "No Address found!"
    course = [name, address]
    courses_list.append(course)

with open('Garmin_GC.csv', 'a') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow([s.encode("utf-8") for s in row])
Answer 0 (score: 0)
If that is really your indentation, then it is wrong: you need to append the name and address inside the inner loop. This should add all the data:
import csv
import requests
from bs4 import BeautifulSoup

courses_list = []
with open('Garmin_GC.csv', 'w') as file:
    for i in range(893):  # 893
        url = "http://sites.garmin.com/clsearch/courses/search?course=&location=&country=US&state=&holes=&radius=&lang=en&search_submitted=1&per_page={}".format(i * 20)
        r = requests.get(url)
        soup = BeautifulSoup(r.text)
        g_data2 = soup.find_all("div", {"class": "result"})
        for item in g_data2:
            try:
                name = item.find_all("div", {"class": "name"})[0].text
            except IndexError:
                name = ''
                print "No Name found!"
            try:
                address = item.find_all("div", {"class": "location"})[0].get_text(separator=' ')
                print address
            except IndexError:
                address = ''
                print "No Address found!"
            course = [name, address]
            courses_list.append(course)
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow([s.encode("utf-8") for s in row])
You can open the file outside the loop and write everything once when you are finished, or, if you don't want to store all the data in a list, just write on every iteration:
with open('Garmin_GC.csv', 'w') as file:
    writer = csv.writer(file)
    for i in range(3):  # 893
        url = "http://sites.garmin.com/clsearch/courses/search?course=&location=&country=US&state=&holes=&radius=&lang=en&search_submitted=1&per_page={}".format(i * 20)
        r = requests.get(url)
        soup = BeautifulSoup(r.text)
        g_data2 = soup.find_all("div", {"class": "result"})
        for item in g_data2:
            try:
                name = item.find_all("div", {"class": "name"})[0].text
            except IndexError:
                name = ''
                print "No Name found!"
            try:
                address = item.find_all("div", {"class": "location"})[0].get_text(separator=' ')
                print address
            except IndexError:
                address = ''
                print "No Address found!"
            writer.writerow([name.encode("utf-8"), address.encode("utf-8")])
If you want to ignore results where the name or the address (or both) is missing, you may want to add a continue in the except blocks instead of writing an empty field.
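A minimal sketch of that continue pattern (the HTML results here are made up for illustration; the second one deliberately has no "name" div, so it is skipped):

```python
from bs4 import BeautifulSoup

# Made-up results for illustration: the second has no "name" div.
html = (
    '<div class="result"><div class="name">Pine GC</div>'
    '<div class="location">1 Elm St<br>Springfield, IL</div></div>'
    '<div class="result"><div class="location">2 Oak St</div></div>'
)
soup = BeautifulSoup(html, "html.parser")

rows = []
for item in soup.find_all("div", {"class": "result"}):
    try:
        name = item.find_all("div", {"class": "name"})[0].text
    except IndexError:
        continue  # no name: skip this result instead of writing an empty field
    try:
        address = item.find_all("div", {"class": "location"})[0].get_text(separator=' ')
    except IndexError:
        continue  # no address: skip as well
    rows.append([name, address])

print(rows)  # only the complete result survives
```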