I am currently trying to fetch data from a research database, ScienceDirect. I use Beautiful Soup to retrieve the title of each research article and add it to an empty pandas DataFrame. After that, I retrieve information about the type of each article. However, when I try to append this data to the DataFrame, it is added to the bottom, i.e. instead of filling the first 100 rows it creates 100 new rows.
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

soup = bs(requests.get(browser.current_url).text, "html.parser")
# Using soup to retrieve the elements related to Title, Type of Article, Names of Authors and Abstract
elements = soup.find_all("div", {"class": "result-item-content"})
data = pd.DataFrame(columns=["Abstract", "Journal & Dates", "Names of Authors", "Title", "Type of Article"])
for element in elements:
    atag = element.find('a')
    if atag:
        atag = atag.text.split("\n")
        data = data.append({"Title": atag}, ignore_index=True)
data.head()
Abstract Journal & Dates Names of Authors Title Type of Article
0 NaN NaN NaN [Morphological, molecular identification and p... NaN
1 NaN NaN NaN [Assessment of soil erosion in a tropical moun... NaN
2 NaN NaN NaN [Ethnomedicinal assessment of Irula tribes of ... NaN
3 NaN NaN NaN [Latitudinal variation in summer monsoon rainf... NaN
4 NaN NaN NaN [IUCN greatly underestimates threat levels of ... NaN
Now I try to search for the type of each research article and append that information to the DataFrame above.
for element in elements:
    art_type = element.find("ol", {"class": "OpenAccessArchive hor"})
    if art_type:
        art_type = art_type.text.split("\n")
        data = data.append({"Type of Article": art_type}, ignore_index=True)
data.tail()
Abstract Journal & Dates Names of Authors Title Type of Article
194 NaN NaN NaN NaN [Open access, Research article, ]
195 NaN NaN NaN NaN [Research article, ]
196 NaN NaN NaN NaN [Research article, ]
197 NaN NaN NaN NaN [Research article, ]
198 NaN NaN NaN NaN [Research article, ]
If you look at the tail of the DataFrame, strangely, the article-type information has been added to the last 100 or so rows rather than the first 100. How can I correct this?
Also, I am new to scraping and Python. Any advice on the best way to store the data so that I can run analyses on it later, such as probabilistic topic modelling?
Edit: Based on the answer, I tried the following, but received an error message:
data_dict = {}
# Create keys
for key in ["Abstract", "Journal & Dates", "Names of Authors", "Title", "Type of Article"]:
    data_dict[key] = []

# Loop through the elements object
for element in elements:
    # Find all the Title tags
    atag = element.find('a')
    if atag:
        atag = atag.text.split("\n")
        data_dict["Title"].append(atag)
    # Find all article_type information
    art_type = element.find("ol", {"class": "OpenAccessArchive hor"})
    if art_type:
        art_type = art_type.text.split("\n")
        data_dict["Type of Article"].append(art_type)
    # Find Names of Authors
    author = element.find("ol", {"class": "Authors hor undefined"})
    if author:
        author = author.text.split("\n")
        data_dict["Names of Authors"].append(author)
    # Find Journal Name
    journal = element.find("ol", {"class": "SubType hor"})
    if journal:
        journal = journal.text.split("\n")
        data_dict["Journal & Dates"].append(journal)

data = pd.DataFrame(data_dict)
*ERROR*
~/miniconda/lib/python3.6/site-packages/pandas/core/frame.py in extract_index(data)
6209 lengths = list(set(raw_lengths))
6210 if len(lengths) > 1:
-> 6211 raise ValueError('arrays must all be same length')
6212
6213 if have_dicts:
ValueError: arrays must all be same length
Answer 0 (score: 0)
As per the comments, pd.DataFrame.append is used to append dataframes, which by definition adds new rows; that is why the data you tried to insert ended up appended as new rows instead. You can insert data into the dataframe cell by cell, but it isn't pretty. For example, you could use data.loc[i, 'Title'] = atag, where i is a row counter (e.g. i=3), as sketched below. Note that this also requires you to create an empty dataframe of the full size first.
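For illustration only (not part of the original answer), a minimal sketch of that row-counter approach, assuming elements has already been scraped as above:

# Hypothetical sketch: pre-size the frame, then fill individual cells by row index with .loc
data = pd.DataFrame(index=range(len(elements)),
                    columns=["Abstract", "Journal & Dates", "Names of Authors", "Title", "Type of Article"])
for i, element in enumerate(elements):
    atag = element.find('a')
    if atag:  # only fill the cell when a title link is found
        data.loc[i, "Title"] = atag.text.strip()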
Instead, I suggest first filling a dictionary with the data and then passing the dictionary to pandas. Note that the ValueError in your edit arises because a value is only appended when a tag is found, so the lists in data_dict end up with different lengths; the helper below appends np.nan whenever a field is missing, which keeps all the lists the same length. That is:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
import numpy as np

# fetch data
sample_url = "https://www.sciencedirect.com/search/advanced?qs=Climate%20Change&articleTypes=REV%2CFLA&show=100&sortBy=relevance"
soup = bs(requests.get(sample_url).text, "html.parser")
elements = soup.find_all("div", {"class": "result-item-content"})

# create data container
data_dict = {}
for key in ["Title", "Type of Article", "Names of Authors", "Journal & Dates"]:
    data_dict[key] = []
# fields yet to add: "Abstract"

# convenience function:
def add_item(item, key, data_dict):
    # if the item is found, add its text to the data
    if item is not None:
        data_dict[key].append(item.text)
    # if nothing is found, add a missing-value flag instead
    else:
        data_dict[key].append(np.nan)

# loop over elements
for element in elements:
    # find the article-type element
    art_type = element.find("ol", {"class": "OpenAccessArchive hor"})
    add_item(art_type, "Type of Article", data_dict)
    # repeat for the remaining fields
    atag = element.find('a')
    add_item(atag, 'Title', data_dict)
    author = element.find("ol", {"class": "Authors hor undefined"})
    add_item(author, 'Names of Authors', data_dict)
    # find the journal name
    journal = element.find("ol", {"class": "SubType hor"})
    add_item(journal, 'Journal & Dates', data_dict)

data = pd.DataFrame(data_dict)
Sample output:
In [21]: data.head(n=3)
Out[21]:
Journal & Dates \
0 Global Environmental Change, Volume 50, May 20...
1 International Journal of Hygiene and Environme...
2 Global Environmental Change, Volume 50, May 20...
Names of Authors \
0 [K.M. Findlater, S.D. Donner, T. Satterfield, ...
1 [Shouro Dasgupta, ]
2 [Thad Kousser, Bruce Tranter, ]
Title Type of Article
0 Integration anxiety: The cognitive isolation o... Research article,
1 Burden of climate change on malaria mortality Research article,
2 The influence of political leaders on climate ... Research article
Also note the missing values; I would look into those...
In [37]: data.isnull().sum()
Out[37]:
Journal & Dates 0
Names of Authors 15
Title 0
Type of Article 0
dtype: int64
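Not part of the original answer, but two short follow-ups may help: a quick way to inspect the rows where no author element was found, and, for the storage question in the original post, one simple option for saving the scraped frame to disk so it can be reloaded later for analysis such as topic modelling. Both are minimal sketches built on the data frame above; the file name is a placeholder.

# Inspect the rows where no author element was found
missing_authors = data[data["Names of Authors"].isnull()]
print(missing_authors[["Title", "Journal & Dates"]])

# Persist the results so they can be reloaded later without re-scraping
data.to_csv("sciencedirect_results.csv", index=False)   # placeholder file name
reloaded = pd.read_csv("sciencedirect_results.csv")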