Rows not being added correctly in Pandas

Time: 2018-04-25 20:02:12

Tags: python pandas selenium dataframe web-scraping

I'm currently trying to pull data from a research database, ScienceDirect. I use Beautiful Soup to grab the title of each research article and add it to an empty pandas DataFrame. After that, I retrieve information about the article type for each of those articles. However, when I try to append this data to the DataFrame, it gets added at the bottom: instead of filling the first 100 rows, it creates 100 new rows.

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

# browser is an already-open selenium WebDriver pointed at the search results page
soup = bs(requests.get(browser.current_url).text, "html.parser")

# Using soup to retrieve the elements related to Title, Type of Article, Names of Authors and Abstract
elements = soup.find_all("div", {"class","result-item-content"})

data = pd.DataFrame(columns = ["Abstract","Journal & Dates","Names of Authors","Title","Type of Article"])

for element in elements:
    atag = element.find('a')
    if atag:
        atag = atag.text.split("\n")
        data = data.append({"Title": atag}, ignore_index=True)

data.head()

Abstract    Journal & Dates Names of Authors    Title   Type of Article
0   NaN NaN NaN [Morphological, molecular identification and p...   NaN
1   NaN NaN NaN [Assessment of soil erosion in a tropical moun...   NaN
2   NaN NaN NaN [Ethnomedicinal assessment of Irula tribes of ...   NaN
3   NaN NaN NaN [Latitudinal variation in summer monsoon rainf...   NaN
4   NaN NaN NaN [IUCN greatly underestimates threat levels of ...   NaN

Now I search for information about the type of each research article and try to append it to the dataframe above:

for element in elements:
    art_type = element.find("ol",{"class","OpenAccessArchive hor"})
    if art_type:
        art_type = art_type.text.split("\n")
        data = data.append({"Type of Article": art_type}, ignore_index=True)

data.tail()

    Abstract    Journal & Dates Names of Authors    Title   Type of Article
194 NaN NaN NaN NaN [Open access, Research article, ]
195 NaN NaN NaN NaN [Research article, ]
196 NaN NaN NaN NaN [Research article, ]
197 NaN NaN NaN NaN [Research article, ]
198 NaN NaN NaN NaN [Research article, ]

If you look at the tail of the dataframe, the information has, oddly, been added to the last 100 or so rows instead. How do I correct this?

Also, I'm new to scraping and Python. Any suggestions on the best way to store this data so that I can run analyses on it later, such as probabilistic topic modelling?

EDIT: Based on the answer, I tried the following but got an error message:

data_dict = {}

# Create keys
for key in ["Abstract","Journal & Dates","Names of Authors","Title","Type of Article"]:
    data_dict[key] = []

# Loop through the elements object
for element in elements:

    # Find all the Title tags
    atag = element.find('a')
    if atag:
        atag = atag.text.split("\n")
        data_dict["Title"].append(atag)

    # Find all article_type information
    art_type = element.find("ol",{"class","OpenAccessArchive hor"})
    if art_type:
        art_type = art_type.text.split("\n")
        data_dict["Type of Article"].append(art_type)

    # Find Names of Authors
    author = element.find("ol",{"class","Authors hor undefined"})
    if author:
        author = author.text.split("\n")
        data_dict["Names of Authors"].append(author)

    # Find Journal Name
    journal = element.find("ol",{"class","SubType hor"})
    if journal:
        journal = journal.text.split("\n")
        data_dict["Journal & Dates"].append(journal)

data = pd.DataFrame(data_dict)

*ERROR*
~/miniconda/lib/python3.6/site-packages/pandas/core/frame.py in extract_index(data)
   6209             lengths = list(set(raw_lengths))
   6210             if len(lengths) > 1:
-> 6211                 raise ValueError('arrays must all be same length')
   6212 
   6213             if have_dicts:

ValueError: arrays must all be same length
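
Presumably the lists in data_dict end up with different lengths, since a key only gets a new entry when the corresponding tag is actually found on a result. A quick length check along these lines (my own diagnostic sketch, not part of the original attempt) should confirm that:

# Compare list lengths per key; pd.DataFrame(data_dict) requires them all to match
for key, values in data_dict.items():
    print(key, len(values))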

1 Answer:

Answer 0 (score: 0)

As per the comments, pd.DataFrame.append is for appending dataframes, which by definition adds new rows; that is why the data you are trying to insert ends up in new rows instead of filling the existing ones. You can insert data into the dataframe cell by cell, but it isn't pretty. For example, you could use data.loc[i, 'Title'] = atag, where i is a row counter (e.g. i = 3); note that this also requires you to create an empty dataframe of the full size up front.
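
A minimal sketch of that cell-by-cell approach, assuming the same elements list as above (the NaN pre-sizing and the .strip() calls are illustrative additions, not part of the original code):

import numpy as np
import pandas as pd

# Pre-create one row per scraped element, filled with NaN, so every row already exists
cols = ["Abstract", "Journal & Dates", "Names of Authors", "Title", "Type of Article"]
data = pd.DataFrame(np.nan, index=range(len(elements)), columns=cols)

for i, element in enumerate(elements):
    # i is the row counter: both fields land in the same row instead of two new ones
    atag = element.find('a')
    if atag:
        data.loc[i, "Title"] = atag.text.strip()
    art_type = element.find("ol", {"class", "OpenAccessArchive hor"})
    if art_type:
        data.loc[i, "Type of Article"] = art_type.text.strip()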

Instead, I'd recommend filling a dictionary with the data first and then passing that dictionary to pandas, i.e.:

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
import numpy as np


#fetch data
sample_url = "https://www.sciencedirect.com/search/advanced?qs=Climate%20Change&articleTypes=REV%2CFLA&show=100&sortBy=relevance"
soup = bs(requests.get(sample_url).text,"html.parser")
elements = soup.find_all("div", {"class","result-item-content"})

#create data container
data_dict = {}
for key in ["Title","Type of Article","Names of Authors","Journal & Dates"]:
    data_dict[key] = []
    #fields yet to add: "Abstract"

#convenience function:
def add_item(item,key,data_dict):
    #if item is found, add to data
    if item is not None:
        data_dict[key].append(item.text)
    #if nothing is found, add missing flag
    if item is None:
        data_dict[key].append(np.nan)

#loop over elements    
for element in elements:
    #find element
    art_type = element.find("ol",{"class","OpenAccessArchive hor"})
    add_item(art_type, "Type of Article", data_dict)

    #repeat for remainder of fields    
    atag = element.find('a')
    add_item(atag, 'Title', data_dict)

    author = element.find("ol",{"class","Authors hor undefined"})
    add_item(author, 'Names of Authors', data_dict)

    # Find Journal Name
    journal = element.find("ol",{"class","SubType hor"})
    add_item(journal,'Journal & Dates', data_dict)

data = pd.DataFrame(data_dict)

Sample output:

In [21]: data.head(n=3)
Out[21]:
                                     Journal & Dates  \
0  Global Environmental Change, Volume 50, May 20...
1  International Journal of Hygiene and Environme...
2  Global Environmental Change, Volume 50, May 20...

                                    Names of Authors  \
0  [K.M. Findlater, S.D. Donner, T. Satterfield, ...
1                                [Shouro Dasgupta, ]
2                    [Thad Kousser, Bruce Tranter, ]

                                               Title     Type of Article
0  Integration anxiety: The cognitive isolation o...  Research article,
1      Burden of climate change on malaria mortality  Research article,
2  The influence of political leaders on climate ...  Research article

Also note that there are missing values. I would look into those...

In [37]: data.isnull().sum()
Out[37]:
Journal & Dates      0
Names of Authors    15
Title                0
Type of Article      0
dtype: int64
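
For example, filtering on the null mask (standard pandas, shown here with the same column names; which rows come back will of course depend on the scrape) pulls up the offending records:

# Inspect the results that came back without an author list
missing_authors = data[data["Names of Authors"].isnull()]
print(missing_authors[["Title", "Journal & Dates"]].head())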