I am writing a scraper, and I want to loop over a list of links and merge all the results into the same dataframe as columns on a shared key (like a left join).
I am running this code in an IPython Notebook, and the CSV produced from the resulting dataframe makes no sense. However, if I merge df and df2 on the shared column "questions" manually after the script has run, I get exactly the join I need; it is only inside the script that it comes out wrong.
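To make the intended result concrete, here is a minimal standalone sketch of the join I am after, with toy data (the frames, values, and names are invented for illustration; each merge should add one review's answers as another column):

import pandas as pd

# Master frame with one row per question (toy data).
df = pd.DataFrame({'questions': ['Q1', 'Q2', 'Q3']})

# Each scraped review becomes a small frame keyed on 'questions'.
review_a = pd.DataFrame({'questions': ['Q1', 'Q2'], 'answers': ['yes', 'no']})
review_b = pd.DataFrame({'questions': ['Q1', 'Q3'], 'answers': ['often', 'rarely']})

# Left-join each review onto the master frame; pandas suffixes the
# overlapping 'answers' columns ('answers_x', 'answers_y', ...).
for review in (review_a, review_b):
    df = pd.merge(df, review, how='left', on='questions')

print df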
Here is the entire script. It logs in with requests, but you don't have to create a user: you can run it as a guest, you just won't get all of the answers in the reviews.
import requests
from bs4 import BeautifulSoup as bs
import csv
import pandas as pd

get_url = 'https://www.g2crowd.com/login?form=login'
post_url = 'https://www.g2crowd.com/user_sessions'
review_url = 'https://www.g2crowd.com/survey_responses/salesforce-crm-review-29972'

# Read the list of review links, skipping the header row.
links = []
with open("links.csv", "r") as f:
    spamreader = csv.reader(f, delimiter=',')
    for row in spamreader:
        links.append(row)
links = links[1:]

# Log in: fetch the CSRF token from the login form, then post the credentials.
s = requests.Session()
r = s.get(get_url)
soup = bs(r.text, "html.parser")
token = soup.select('input[name="authenticity_token"]')[0]['value']
username = 'email@gmail.com'
password = 'password'
payload = {"user_session[login_email]": username, "user_session[password]": password}
payload['authenticity_token'] = token
headers = dict(Referer=get_url)
r = s.post(post_url, data=payload, headers=headers)
print r.status_code

df = pd.read_csv("data.csv")
#df = df.set_index('questions')

for link in links:
    r = s.get(link[0])
    soup = bs(r.text, "html.parser")
    title = soup.title.contents[0]
    question_wrapper = soup.findAll("div", class_="question-wrapper")
    print len(question_wrapper)

    questions = []
    answers = []

    # Titles that were already scraped are kept in scraped_titles.csv so that
    # reruns skip them; one review page has a title that needs special-casing.
    scraped_titles = []
    tricky_title = 'Salesforce CRM Review by G2 Crowd User in Transportation/Trucking/Railroad - March 13, 2013'
    with open("scraped_titles.csv", "r") as f:
        spamreader = csv.reader(f, delimiter=',')
        for row in spamreader:
            scraped_titles.append(row[0])
    scraped_titles = set(scraped_titles)

    if title not in scraped_titles and title != tricky_title:
        # Collect the question/answer pairs on this review page.
        for question in question_wrapper:
            q = question.label.contents[0]
            a = question.div.contents[0].text
            questions.append(q)
            answers.append(a)
        #qa = zip(questions, answers)
        qa = dict(questions=questions, answers=answers)
        df2 = pd.DataFrame(qa)
        #df2 = df2.set_index('questions', inplace=True)
        #df2.to_csv(title + ".csv", encoding='utf-8')
        # Left-join this review's answers onto the master frame.
        df = pd.merge(df, df2, how='left', on='questions')
        # Record the title so the review is not scraped again.
        with open("scraped_titles.csv", "a") as f:
            writer = csv.writer(f, delimiter=',')
            writer.writerow([unicode(title).encode("utf-8")])

df.to_csv("all_data.csv", encoding='utf-8')
I also tried saving each review to its own .csv and then merging everything with pandas, but then I get a strange, rarely documented error:

Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
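That message comes from Python 2's csv module, and the fix it hints at is the universal-newline file mode 'U'. A minimal sketch of reading one of the per-review files back that way ('some_review.csv' is a placeholder for one of the files written by df2.to_csv above):

import csv

# The error above typically appears when fields contain bare carriage
# returns; 'rU' enables Python 2's universal-newline mode, which is
# what the message suggests trying.
with open("some_review.csv", "rU") as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        print row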
I have been trying to find my mistake, and it would be very helpful if someone could point it out. Also, I hope I have formatted this post according to the rules; if not, please help me correct it.