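I'm having trouble writing scraped data to a CSV file. The page loads and the first part of the script works, but writing to the CSV causes problems. I'm now trying to convert the scraped data to integers, because that worked well in another project of mine. In this project, however, it seems to cause a problem.
The error I get is: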
ValueError: invalid literal for int() with base 10: '\nNotes To A Friend: The Experience\n'
My question is: how can I write the data to the CSV file in a more sophisticated way?
Code:
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import time
from datetime import datetime
from collections import OrderedDict
import re
browser = webdriver.Firefox()
browser.get('https://www.kickstarter.com/discover?ref=nav')
categories = browser.find_elements_by_class_name('category-container')
category_links = []
for category_link in categories:
    #Each item in the list is a tuple of the category's name and its link.
    category_links.append((str(category_link.find_element_by_class_name('f3').text),
                           category_link.find_element_by_class_name('bg-white').get_attribute('href')))
scraped_data = []
now = datetime.now()
counter = 1
for category in category_links:
    browser.get(category[1])
    browser.find_element_by_class_name('sentence-open').click()
    time.sleep(2)
    browser.find_element_by_id('category_filter').click()
    time.sleep(2)
    for i in range(27):
        try:
            time.sleep(2)
            browser.find_element_by_id('category_'+str(i)).click()
            time.sleep(2)
        except:
            pass
    projects = []
    for project_link in browser.find_elements_by_class_name('clamp-3'):
        projects.append(project_link.find_element_by_tag_name('a').get_attribute('href'))
    for counter, project in enumerate(projects):
        page1 = urllib.request.urlopen(projects[counter])
        soup1 = BeautifulSoup(page1, "lxml")
        page2 = urllib.request.urlopen(projects[counter].split('?')[0]+'/community')
        soup2 = BeautifulSoup(page2, "lxml")
        time.sleep(2)
        print(str(counter)+': '+project+'\nStatus: Started.')
        project_dict = OrderedDict()
        project_dict['Category'] = category[0]
        browser.get(project)
        project_dict['Name'] = int(soup1.find(class_='type-24 type-28-sm type-38-md navy-700 medium mb3').text)
        project_dict['Home State'] = int(soup1.find(class_='nowrap navy-700 flex items-center medium type-12').text)
        try:
            project_dict['Backer State'] = int(soup2.find(class_='location-list-wrapper js-location-list-wrapper').text)
        except:
            pass
        print('Status: Done.')
        counter+=1
        scraped_data.append(project_dict)
later = datetime.now()
diff = later - now
print('The scraping took '+str(round(diff.seconds/60.0,2))+' minutes, and scraped '+str(len(scraped_data))+' projects.')
df = pd.DataFrame(scraped_data)
df.to_csv('kickstarter-data1.csv')
Answer 0 (score: 0)
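Once you stop converting the parsed text to integers, a few more changes are needed here:
Initialize BeautifulSoup with html5lib like this: BeautifulSoup(page1, "html5lib")
BeautifulSoup should be given the page markup itself (a str or bytes object) as its first argument, so read the response body first: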
response = urllib.request.urlopen(projects[counter])
page1 = response.read()
soup1 = BeautifulSoup(page1, "html5lib")
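The ValueError in the question comes from calling int() on text that is not a number (the project title). A minimal sketch of one way around it, using only the standard library: strip the scraped text instead of converting it, and parse integers only where the text really is numeric. The helper names clean_text, to_int_or_none and rows_to_csv are hypothetical, and the 'Music' and 'Chicago, IL' values are made up for illustration; only the project title comes from the error message above.

```python
import csv
import io


def clean_text(raw):
    """Collapse newlines and repeated whitespace in scraped text."""
    if raw is None:
        return ''
    return ' '.join(raw.split())


def to_int_or_none(raw):
    """Return an int for text such as '1,234'; return None if it is not numeric."""
    digits = clean_text(raw).replace(',', '')
    try:
        return int(digits)
    except ValueError:
        return None


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative row: the name is the string from the traceback,
# the other values are placeholders.
row = {
    'Category': 'Music',
    'Name': clean_text('\nNotes To A Friend: The Experience\n'),
    'Home State': clean_text('\n  Chicago, IL\n'),
}
csv_text = rows_to_csv([row], ['Category', 'Name', 'Home State'])
print(csv_text)
```

In the scraper itself, the same idea would mean replacing int(soup1.find(...).text) with clean_text(soup1.find(...).text) for the text fields; pd.DataFrame(scraped_data).to_csv(...) can then stay as it is.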