I'm using BeautifulSoup4 to scrape information from a website and pandas to export the data to a csv file. The dictionary has 5 keys, representing 5 columns of data. However, because the website doesn't have complete data for all 5 categories, some of the lists have fewer items than the others. So when I try to export the data, pandas gives me:

ValueError: arrays must all be same length

What's the best way to handle this situation? Specifically, the lists with fewer items are "Author" and "Pages". Thanks in advance!
Code:
import requests as r
from bs4 import BeautifulSoup as soup
import pandas
#make a list of all web pages' urls
webpages=[]
for i in range(15):
    root_url = 'https://cross-currents.berkeley.edu/archives?author=&title=&type=All&issue=All&region=All&page=' + str(i)
    webpages.append(root_url)
print(webpages)
#start looping through all pages
titles = []
journals = []
authors = []
pages = []
dates = []
issues = []
for item in webpages:
    headers = {'User-Agent': 'Mozilla/5.0'}
    data = r.get(item, headers=headers)
    page_soup = soup(data.text, 'html.parser')
    #find targeted info and put them into lists to be exported to a csv file via pandas
    title_list = [title.text for title in page_soup.find_all('div', {'class':'field field-name-node-title'})]
    titles += [el.replace('\n', '') for el in title_list]
    journal_list = [journal.text for journal in page_soup.find_all('em')]
    journals += [el.replace('\n', '') for el in journal_list]
    author_list = [author.text for author in page_soup.find_all('div', {'class':'field field--name-field-citation-authors field--type-string field--label-hidden field__item'})]
    authors += [el.replace('\n', '') for el in author_list]
    pages_list = [pages.text for pages in page_soup.find_all('div', {'class':'field field--name-field-citation-pages field--type-string field--label-hidden field__item'})]
    pages += [el.replace('\n', '') for el in pages_list]
    date_list = [date.text for date in page_soup.find_all('div', {'class':'field field--name-field-date field--type-datetime field--label-hidden field__item'})]
    dates += [el.replace('\n', '') for el in date_list]
    issue_list = [issue.text for issue in page_soup.find_all('div', {'class':'field field--name-field-issue-number field--type-integer field--label-hidden field__item'})]
    issues += [el.replace('\n', '') for el in issue_list]
# export to csv file via pandas
dataset = {'Title': titles, 'Author': authors, 'Journal': journals, 'Date': dates, 'Issue': issues, 'Pages': pages}
df = pandas.DataFrame(dataset)
df.index.name = 'ArticleID'
df.to_csv('example45.csv', encoding="utf-8")
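For reference, the error is easy to reproduce in isolation (the values below are made up): the DataFrame constructor refuses a dict whose lists have different lengths, which is exactly what happens when one scraped category has fewer items than the others.

```python
import pandas

titles = ['A', 'B', 'C']
authors = ['X', 'Y']  # one entry missing

try:
    pandas.DataFrame({'Title': titles, 'Author': authors})
except ValueError as e:
    print(e)  # prints the length-mismatch message
```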
Answer 0 (score: 0)
If you're sure that, for example, the titles list always has the correct length, you could do the following:
title_list = [title.text for title in page_soup.find_all('div', {'class':'field field-name-node-title'})]
titles_to_add = [el.replace('\n', '') for el in title_list]
titles += titles_to_add
...
author_list = [author.text for author in page_soup.find_all('div', {'class':'field field--name-field-citation-authors field--type-string field--label-hidden field__item'})]
authors_to_add = [el.replace('\n', '') for el in author_list]
while len(authors_to_add) < len(titles_to_add):
    authors_to_add.append(" ")
authors += authors_to_add
pages_list = [pages.text for pages in page_soup.find_all('div', {'class':'field field--name-field-citation-pages field--type-string field--label-hidden field__item'})]
pages_to_add = [el.replace('\n', '') for el in pages_list]
while len(pages_to_add) < len(titles_to_add):
    pages_to_add.append(" ")
pages += pages_to_add
But... this only pads the columns so that they have the right length and you can create the DataFrame. In your DataFrame, the authors and pages will not be in the correct rows. You'll need to change your algorithm a bit to reach the final goal. It would be better to iterate over all the rows on the page and get the title etc. per row:
rows = page_soup.find_all('div', {'class':'views-row'})
for row in rows:
    title_list = [title.text for title in row.find_all('div', {'class':'field field-name-node-title'})]
    ...
Then you need to check whether the title, author, etc. exist for that row (len(title_list) > 0), and if not, append "None" or something else to the particular list. After that, everything in your df should line up correctly.
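That row-wise idea can be sketched with a small self-contained example. The HTML below is made up for illustration (with shortened class names), but the pattern is the one the answer describes: every row contributes exactly one entry to every list, with a fallback when a field is absent.

```python
from bs4 import BeautifulSoup as soup

# Fabricated sample HTML: the second row is missing its author field.
html = """
<div class="views-row">
  <div class="field field-name-node-title">Title One</div>
  <div class="field field--name-field-citation-authors">Author One</div>
</div>
<div class="views-row">
  <div class="field field-name-node-title">Title Two</div>
</div>
"""

def extract(row, css_class, fallback="None"):
    # Return the first matching field's text, or the fallback if the row lacks it.
    found = row.find('div', {'class': css_class})
    return found.text.replace('\n', '') if found else fallback

page_soup = soup(html, 'html.parser')
titles, authors = [], []
for row in page_soup.find_all('div', {'class': 'views-row'}):
    titles.append(extract(row, 'field field-name-node-title'))
    authors.append(extract(row, 'field field--name-field-citation-authors'))

# Both lists now have one entry per row, so the DataFrame stays aligned.
print(titles)   # ['Title One', 'Title Two']
print(authors)  # ['Author One', 'None']
```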
Answer 1 (score: 0)
You could create a dataframe from just the first list (df = pandas.DataFrame({'Title': titles})) and then add the other lists:
dataset = {'Author': authors, 'Journal': journals, 'Date': dates, 'Issue': issues, 'Pages': pages}
df2 = pandas.DataFrame(dataset)
df_final = pandas.concat([df, df2], axis=1)
This will give you blanks (or NaN) wherever you're missing data.
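A minimal demo of that behavior, with made-up values: concat along axis=1 aligns on the index, so the shorter columns are padded with NaN instead of raising.

```python
import pandas

df = pandas.DataFrame({'Title': ['A', 'B', 'C']})
df2 = pandas.DataFrame({'Author': ['X', 'Y'], 'Pages': ['1-10', '11-20']})
df_final = pandas.concat([df, df2], axis=1)
print(df_final)
#   Title Author  Pages
# 0     A      X   1-10
# 1     B      Y  11-20
# 2     C    NaN    NaN
```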
As with @WurzelseppQX's answer, the trouble with this is that the data may not be aligned, which would make it useless. So the best thing is probably to change your code so that every loop iteration always appends something to each list; if nothing is found, just append 0 or blank.