Python web scraper and missing output data

Time: 2018-09-19 23:32:08

Tags: python python-3.x

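I am trying to scrape reviews from a website and store them in a CSV file using Python (3.7) and BeautifulSoup. The scraping itself seems to work, but when I write the file only one column contains the full data; the other columns contain just the first character.

Any tips would be much appreciated, and apologies if this is obvious. This is a brand-new hobby :)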

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

#URL to scrape
my_url = "https://www.indeed.com/cmp/Capital-One/reviews?fcountry=ALL&lang="

#open connection, grab page
uClient = uReq(my_url)
page_html = uClient

#html parsing
page_soup = soup(page_html, "lxml")

#grab all reviews on page
containers = page_soup.findAll("div",{"cmp-review-container"})
uClient.close()
#write to csv
filename = "indeedreviewtest.csv"
f=open(filename, "w")

headers = "review_id, review_score, role, review_text\n"

f.write(headers)

#loop through each review, collect review ID, rating, role & verbatim
for container in containers:
    reviewid_container = container.div["data-tn-entityid"]
    reviewid = reviewid_container[0]
    score_container = container.div.div.div.meta["content"]
    reviewscore = score_container[0]
    role_container = container.find("span", attrs={"class":"cmp-reviewer-job-title"}).text
    reviewerrole = role_container[0]
    reviewtext_container = container.find("span", attrs={"class":"cmp-review-text"}).text
    reviewtext = reviewtext_container

    f.write(reviewid + "," + reviewscore + "," + reviewerrole.replace(",", "|") + "," + reviewtext.replace(",", "|") + "\n")

f.close()

Thanks!

1 answer:

Answer 0: (score: 0)

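Perhaps you are mixing up find() and findAll().

find() stops at the first element that matches the criteria and returns that single element, whereas findAll() returns a list of every element that matches.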
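To make the difference concrete, here is a minimal sketch. The HTML snippet is made up for illustration and only borrows the class names from the question; it uses the built-in "html.parser" so no extra parser is needed, and find_all(), which is the current name for findAll().

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the class names in the question.
html = """
<div class="cmp-review-container"><span class="cmp-review-text">Great place</span></div>
<div class="cmp-review-container"><span class="cmp-review-text">Long hours</span></div>
"""

page_soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag (the first match), or None if nothing matches.
first = page_soup.find("span", attrs={"class": "cmp-review-text"})

# find_all() returns a list-like collection of every matching Tag.
every = page_soup.find_all("span", attrs={"class": "cmp-review-text"})

print(first.text)                # text of the first match only
print([s.text for s in every])   # text of every match
```

Because find() already hands back one element, its .text is a plain string and needs no [0] afterwards.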

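Since .text is already a string, role_container[0] gives you only the first character of that element's text. The same applies to reviewid_container[0] and score_container[0]: indexing an attribute such as container.div["data-tn-entityid"] returns a string, so [0] keeps just its first character.

You can try: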
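The effect can be reproduced without any scraping. In this small sketch the value "Senior Associate" is a made-up stand-in for what .text might return:

```python
# Made-up stand-in for the string that container.find(...).text would return.
role_container = "Senior Associate"

# Indexing a str with [0] yields a single character.
first_char = role_container[0]

# Using the string directly keeps the whole value, which is what the CSV needs.
whole_text = role_container

# By contrast, [0] on a list (the kind of result findAll() gives) yields a whole item.
matches = ["Senior Associate", "Branch Manager"]
first_item = matches[0]

print(first_char)
print(whole_text)
print(first_item)
```

That is probably where the [0] habit comes from: it is the right thing to do on a findAll() result, but not on the string returned by find(...).text.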

reviewerrole = container.find("span", attrs={"class":"cmp-reviewer-job-title"}).text
reviewtext = container.find("span", attrs={"class":"cmp-review-text"}).text

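Beyond that, consider using the csv module to read and write CSV files. It quotes fields that contain commas, so you would not need to replace them with "|". More information: https://docs.python.org/3/library/csv.html#csv.writer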
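As a rough sketch of that suggestion, the snippet below writes a header and two rows with csv.writer. The rows are invented sample data, and an in-memory io.StringIO buffer stands in for the real file; in the scraper you would presumably use open(filename, "w", newline="") instead.

```python
import csv
import io

# Made-up rows standing in for scraped reviews.
rows = [
    ("r1", "5", "Analyst", "Great team, good pay"),
    ("r2", "3", "Teller", 'Manager said "no" often'),
]

# In-memory stand-in for open(filename, "w", newline="").
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)

# Header row first, then one row per review.
writer.writerow(["review_id", "review_score", "role", "review_text"])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back shows that commas and quotes inside fields survive intact.
parsed = list(csv.reader(io.StringIO(csv_text, newline="")))
print(parsed)
```

The writer takes care of quoting and escaping, which is why the manual replace(",", "|") calls become unnecessary.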