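I'm trying to scrape reviews from a website and store them in a CSV file using Python (3.7) and BeautifulSoup. The scraping seems to succeed, but when I write to the file, only one column contains the full data; the others contain just the first character.
Any hints would be much appreciated, and sorry if it's obvious; this is a new hobby :)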
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
#URL to scrape
my_url = "https://www.indeed.com/cmp/Capital-One/reviews?fcountry=ALL&lang="
#open connection, grab page
uClient = uReq(my_url)
page_html = uClient
#html parsing
page_soup = soup(page_html, "lxml")
#grab all reviews on page
containers = page_soup.findAll("div",{"cmp-review-container"})
uClient.close()
#write to csv
filename = "indeedreviewtest.csv"
f=open(filename, "w")
headers = "review_id, review_score, role, review_text\n"
f.write(headers)
#loop through each review, collect review ID, rating, role & verbatim
for container in containers:
    reviewid_container = container.div["data-tn-entityid"]
    reviewid = reviewid_container[0]
    score_container = container.div.div.div.meta["content"]
    reviewscore = score_container[0]
    role_container = container.find("span", attrs={"class":"cmp-reviewer-job-title"}).text
    reviewerrole = role_container[0]
    reviewtext_container = container.find("span", attrs={"class":"cmp-review-text"}).text
    reviewtext = reviewtext_container
    f.write(reviewid + "," + reviewscore + "," + reviewerrole.replace(",", "|") + "," + reviewtext.replace(",", "|") + "\n")
f.close()
Thanks!
Answer 0 (score: 0)
You may be mixing up find() and findAll().
find() stops at the first element that matches the criteria and returns it, whereas findAll() returns every element that matches.
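To make the difference concrete, here is a minimal sketch on a made-up HTML fragment. The markup and review texts are invented for illustration; only the class names are borrowed from the question. It uses the built-in "html.parser" so lxml isn't required, and find_all(), which is the BeautifulSoup 4 spelling of findAll().

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the class names in the question.
SAMPLE_HTML = """
<div class="cmp-review-container"><span class="cmp-review-text">Great team</span></div>
<div class="cmp-review-container"><span class="cmp-review-text">Long hours</span></div>
"""

page_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns only the first matching element (or None if nothing matches).
first_span = page_soup.find("span", attrs={"class": "cmp-review-text"})

# find_all() returns a list-like ResultSet containing every matching element.
all_spans = page_soup.find_all("span", attrs={"class": "cmp-review-text"})

first_text = first_span.text
all_texts = [span.text for span in all_spans]
print(first_text)
print(all_texts)
```

So find() hands back one element whose .text is a string, while find_all() hands back a collection you would loop over or index.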
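By using role_container[0], you get only the first character of that element's text, because .text is already a string. The same goes for reviewid_container[0] and score_container[0], since those attribute values are strings too.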
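A quick illustration of why this happens, using an invented job title in place of the scraped value: indexing a Python string with [0] selects a single character, not a first "item".

```python
# Invented value standing in for container.find(...).text
role_container = "Data Analyst"

first_char = role_container[0]  # indexing a str yields one character
whole_text = role_container     # use the string itself to keep everything

print(repr(first_char))
print(repr(whole_text))
```

Indexing with [0] would only be the right move if the value were a list, such as the result of findAll().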
You could try:
reviewerrole = container.find("span", attrs={"class":"cmp-reviewer-job-title"}).text
reviewtext = container.find("span", attrs={"class":"cmp-review-text"}).text
Beyond that, consider using the csv module to read and write CSV files. More info: https://docs.python.org/3/library/csv.html#csv.writer
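As a rough sketch of what that could look like, the snippet below writes two invented review rows with csv.writer. The rows and the temp-directory location are assumptions for illustration; only the file name and header names come from the question. The writer quotes any field that contains a comma, so replacing "," with "|" should no longer be necessary.

```python
import csv
import os
import tempfile

# Invented rows standing in for the scraped values.
rows = [
    ("abc123", "5", "Data Analyst", "Great team, good benefits"),
    ("def456", "3", "Teller", 'Busy, but "fair" management'),
]

# File name from the question, placed in the temp directory for this demo.
filename = os.path.join(tempfile.gettempdir(), "indeedreviewtest.csv")

# newline="" lets the csv module control line endings itself.
with open(filename, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["review_id", "review_score", "role", "review_text"])
    writer.writerows(rows)

# Read the file back to check that commas and quotes survive the round trip.
with open(filename, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))

print(read_back)
```

Using a with block also closes the file automatically, so the explicit f.close() is not needed.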