I am currently trying to fetch data from a research database, ScienceDirect. I use Beautiful Soup to retrieve the title of each research article and add it to an empty pandas DataFrame. After that, I retrieve information about the type of each article. However, when I try to append this data to the DataFrame, it is added to the bottom, i.e. instead of filling the first 100 rows it creates 100 new rows.
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

soup = bs(requests.get(browser.current_url).text, "html.parser")
# Using soup to retrieve the elements related to Title, Type of Article, Names of Authors and Abstract
elements = soup.find_all("div", {"class": "result-item-content"})
data = pd.DataFrame(columns=["Abstract", "Journal & Dates", "Names of Authors", "Title", "Type of Article"])
for element in elements:
    atag = element.find('a')
    if atag:
        atag = atag.text.split("\n")
        data = data.append({"Title": atag}, ignore_index=True)
data.head()
Abstract Journal & Dates Names of Authors Title Type of Article
0 NaN NaN NaN [Morphological, molecular identification and p... NaN
1 NaN NaN NaN [Assessment of soil erosion in a tropical moun... NaN
2 NaN NaN NaN [Ethnomedicinal assessment of Irula tribes of ... NaN
3 NaN NaN NaN [Latitudinal variation in summer monsoon rainf... NaN
4 NaN NaN NaN [IUCN greatly underestimates threat levels of ... NaN
Now I try to search for the type of each research article and append that information to the DataFrame above.
for element in elements:
    art_type = element.find("ol", {"class": "OpenAccessArchive hor"})
    if art_type:
        art_type = art_type.text.split("\n")
        data = data.append({"Type of Article": art_type}, ignore_index=True)
data.tail()
Abstract Journal & Dates Names of Authors Title Type of Article
194 NaN NaN NaN NaN [Open access, Research article, ]
195 NaN NaN NaN NaN [Research article, ]
196 NaN NaN NaN NaN [Research article, ]
197 NaN NaN NaN NaN [Research article, ]
198 NaN NaN NaN NaN [Research article, ]
If you look at the tail of the DataFrame, strangely, the article-type information has been added to the last 100 or so rows rather than the first 100. How can I correct this?
Also, I am new to scraping and Python. Any advice on the best way to store the data so that I can run analyses on it later, such as probabilistic topic modelling?
Edit: Based on the answer, I tried the following, but received an error message:
data_dict = {}
# Create keys
for key in ["Abstract", "Journal & Dates", "Names of Authors", "Title", "Type of Article"]:
    data_dict[key] = []

# Loop through the elements object
for element in elements:
    # Find all the Title tags
    atag = element.find('a')
    if atag:
        atag = atag.text.split("\n")
        data_dict["Title"].append(atag)
    # Find all article_type information
    art_type = element.find("ol", {"class": "OpenAccessArchive hor"})
    if art_type:
        art_type = art_type.text.split("\n")
        data_dict["Type of Article"].append(art_type)
    # Find Names of Authors
    author = element.find("ol", {"class": "Authors hor undefined"})
    if author:
        author = author.text.split("\n")
        data_dict["Names of Authors"].append(author)
    # Find Journal Name
    journal = element.find("ol", {"class": "SubType hor"})
    if journal:
        journal = journal.text.split("\n")
        data_dict["Journal & Dates"].append(journal)

data = pd.DataFrame(data_dict)
*ERROR*
~/miniconda/lib/python3.6/site-packages/pandas/core/frame.py in extract_index(data)
6209 lengths = list(set(raw_lengths))
6210 if len(lengths) > 1:
-> 6211 raise ValueError('arrays must all be same length')
6212
6213 if have_dicts:
ValueError: arrays must all be same length
Answer 0 (score: 0)
As per the comments, pd.DataFrame.append is used to append dataframes, which by definition adds new rows; that is why the data you tried to insert ended up appended as new rows instead. You can insert data into the dataframe cell by cell, but it isn't pretty. For example, you could use data.loc[i, 'Title'] = atag, where i is a row counter (e.g. i=3), as sketched below. Note that this also requires you to create an empty dataframe of the full size first.
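For illustration only (not part of the original answer), a minimal sketch of that row-counter approach, assuming elements has already been scraped as above:

# Hypothetical sketch: pre-size the frame, then fill individual cells by row index with .loc
data = pd.DataFrame(index=range(len(elements)),
                    columns=["Abstract", "Journal & Dates", "Names of Authors", "Title", "Type of Article"])
for i, element in enumerate(elements):
    atag = element.find('a')
    if atag:  # only fill the cell when a title link is found
        data.loc[i, "Title"] = atag.text.strip()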
Instead, I suggest first filling a dictionary with the data and then passing the dictionary to pandas. Note that the ValueError in your edit arises because a value is only appended when a tag is found, so the lists in data_dict end up with different lengths; the helper below appends np.nan whenever a field is missing, which keeps all the lists the same length. That is:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
import numpy as np

# fetch data
sample_url = "https://www.sciencedirect.com/search/advanced?qs=Climate%20Change&articleTypes=REV%2CFLA&show=100&sortBy=relevance"
soup = bs(requests.get(sample_url).text, "html.parser")
elements = soup.find_all("div", {"class": "result-item-content"})

# create data container
data_dict = {}
for key in ["Title", "Type of Article", "Names of Authors", "Journal & Dates"]:
    data_dict[key] = []
# fields yet to add: "Abstract"

# convenience function:
def add_item(item, key, data_dict):
    # if the item is found, add its text to the data
    if item is not None:
        data_dict[key].append(item.text)
    # if nothing is found, add a missing-value flag instead
    else:
        data_dict[key].append(np.nan)

# loop over elements
for element in elements:
    # find the article-type element
    art_type = element.find("ol", {"class": "OpenAccessArchive hor"})
    add_item(art_type, "Type of Article", data_dict)
    # repeat for the remaining fields
    atag = element.find('a')
    add_item(atag, 'Title', data_dict)
    author = element.find("ol", {"class": "Authors hor undefined"})
    add_item(author, 'Names of Authors', data_dict)
    # find the journal name
    journal = element.find("ol", {"class": "SubType hor"})
    add_item(journal, 'Journal & Dates', data_dict)

data = pd.DataFrame(data_dict)
Sample output:
In [21]: data.head(n=3)
Out[21]:
Journal & Dates \
0 Global Environmental Change, Volume 50, May 20...
1 International Journal of Hygiene and Environme...
2 Global Environmental Change, Volume 50, May 20...
Names of Authors \
0 [K.M. Findlater, S.D. Donner, T. Satterfield, ...
1 [Shouro Dasgupta, ]
2 [Thad Kousser, Bruce Tranter, ]
Title Type of Article
0 Integration anxiety: The cognitive isolation o... Research article,
1 Burden of climate change on malaria mortality Research article,
2 The influence of political leaders on climate ... Research article
Also note the missing values; I would look into those...
In [37]: data.isnull().sum()
Out[37]:
Journal & Dates 0
Names of Authors 15
Title 0
Type of Article 0
dtype: int64
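Not part of the original answer, but two short follow-ups may help: a quick way to inspect the rows where no author element was found, and, for the storage question in the original post, one simple option for saving the scraped frame to disk so it can be reloaded later for analysis such as topic modelling. Both are minimal sketches built on the data frame above; the file name is a placeholder.

# Inspect the rows where no author element was found
missing_authors = data[data["Names of Authors"].isnull()]
print(missing_authors[["Title", "Journal & Dates"]])

# Persist the results so they can be reloaded later without re-scraping
data.to_csv("sciencedirect_results.csv", index=False)   # placeholder file name
reloaded = pd.read_csv("sciencedirect_results.csv")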