Merging pandas DataFrames on the same column

Time: 2015-07-31 16:04:13

Tags: python csv pandas

I am writing a scraper, and I want to loop over a list of links and merge all of the results, as columns, onto the same DataFrame on a shared key (like a left join).

I am running this code in an IPython Notebook, and the CSV written out from the resulting DataFrame makes no sense. However, if I merge df and df2 on the shared 'questions' column by hand after the script has run, I get exactly the join I need, so the mistake must be somewhere in the script.
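Concretely, the one-off merge that gives me the result I want when I run it by hand after the script finishes looks like this (the data below is just filler to show the shape; only the column names match the script):

import pandas as pd

# Filler frames standing in for df (loaded from data.csv) and df2 (one scraped review).
df = pd.DataFrame({"questions": ["Q1", "Q2", "Q3"], "existing": ["a", "b", "c"]})
df2 = pd.DataFrame({"questions": ["Q1", "Q2", "Q3"], "answers": ["d", "e", "f"]})

# A plain left join on the shared 'questions' column, same call as in the script.
merged = pd.merge(df, df2, how='left', on='questions')
print merged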

Here is the whole script. It logs in with requests, but you do not have to create a user; you can run it as a guest, you just will not get all of the answers in the reviews.

import requests
from bs4 import BeautifulSoup as bs
import csv
import pandas as pd

get_url = 'https://www.g2crowd.com/login?form=login'
post_url = 'https://www.g2crowd.com/user_sessions'
review_url = 'https://www.g2crowd.com/survey_responses/salesforce-crm-review-29972'

# Read the list of review URLs to scrape, dropping the header row.
links = []

with open("links.csv", "r") as f:
    spamreader = csv.reader(f, delimiter=',')
    for row in spamreader:
        links.append(row)
links = links[1:]

# Log in: fetch the login page for the CSRF token, then post the credentials.
s = requests.Session()
r = s.get(get_url)
soup = bs(r.text)
token = soup.select('input[name="authenticity_token"]')[0]['value']
username = 'email@gmail.com'
password = 'password'

payload = {"user_session[login_email]": "email@gmail.com", "user_session[password]": "password"}
payload['authenticity_token'] = token
Referer = dict(Referer=get_url)
r = s.post(post_url, data=payload, headers=Referer)
print r.status_code

# df starts from data.csv and should pick up one column of answers per review.
df = pd.read_csv("data.csv")
#df = df.set_index('questions')
for link in links:
    r = s.get(link[0])
    soup = bs(r.text)
    title = soup.title.contents[0]
    question_wrapper = soup.findAll("div", class_="question-wrapper")
    print len(question_wrapper)
    questions = []
    answers = []
    scraped_titles = []
    tricky_title = 'Salesforce CRM Review by G2 Crowd User in Transportation/Trucking/Railroad - March 13, 2013'
    # Skip reviews whose titles have already been recorded in scraped_titles.csv.
    with open("scraped_titles.csv", "r") as f:
        spamreader = csv.reader(f, delimiter=',')
        for row in spamreader:
            scraped_titles.append(row[0])
        scraped_titles = set(scraped_titles)
        if (title not in scraped_titles and title != tricky_title):
            # Collect the question/answer pairs on this page and merge them into df.
            for question in question_wrapper:
                q = question.label.contents[0]
                a = question.div.contents[0].text
                questions.append(q)
                answers.append(a)
                #qa = zip(questions, answers)
                qa = dict(questions=questions, answers=answers)
                df2 = pd.DataFrame(qa)
                #df2 = df2.set_index('questions', inplace=True)
                #df2.to_csv(title + ".csv", encoding='utf-8')
                df = pd.merge(df, df2, how='left', on='questions')

            # Remember this review's title so it is not scraped again.
            with open("scraped_titles.csv", "a") as csvwriter:
                spamreader = csv.writer(csvwriter, delimiter=',')
                spamreader.writerow([unicode(title).encode("utf-8")])
        else:
            pass
df.to_csv("all_data.csv", encoding='utf-8')
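To make the intended result clearer, here is a stripped-down sketch of just the merge step I am aiming for, with toy data instead of scraping (review_1 and review_2 are placeholder names, not anything the site returns): one df2 per review, left-joined onto the accumulating df so that each review ends up as its own column.

import pandas as pd

# Toy stand-in for data.csv: one row per survey question.
df = pd.DataFrame({"questions": ["Q1", "Q2", "Q3"]})

# Toy stand-in for the answers scraped from two different reviews.
reviews = {
    "review_1": ["a1", "a2", "a3"],
    "review_2": ["b1", "b2", "b3"],
}

for title, answers in reviews.items():
    # One frame per review: answers keyed by question, column named after the review.
    df2 = pd.DataFrame({"questions": ["Q1", "Q2", "Q3"], title: answers})
    df = pd.merge(df, df2, how='left', on='questions')

print df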

I also tried saving each review to its own .csv and then merging everything with pandas afterwards, but I get a strange, rarely documented error:

  Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
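From what I have read, this error usually means the file has bare carriage-return line endings and the csv module wants it opened in universal-newline mode, something like the snippet below (the filename is just a placeholder), but I am not sure that is really the cause here:

import csv

# 'rU' opens the file in universal-newline mode on Python 2,
# which is what the error message is asking for.
with open("review.csv", "rU") as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        print row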

I have been trying to find my mistake, and it would be very helpful if someone could point it out. Also, I hope I have formatted this post according to the rules; if not, please help me correct it.

0 Answers:

There are no answers yet.