Question

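Goal: I have a DataFrame, root_df, in which one column holds comma-separated words (e.g. "door, panther, salute"). I have a second DataFrame, freq_df, with two columns, WORD (string) and SCORE (float). I want to create an aggregated column in root_df that holds, for each row, the sum of the freq_df scores of the words in that row. For example: in freq_df the row for "door" has a score of 342388, the word "panther" is not in the DataFrame at all, and "salute" has a score of 9238.07. So the value in the root_df column would be 342388 + 9238.07, i.e. 351626.07.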
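To make the target concrete, here is a minimal sketch of the calculation for the single example row. The two-row freq_df is a toy stand-in built only from the scores quoted above, and total_score is a hypothetical helper name, not part of the original code.

```python
import pandas as pd

# Toy stand-in for freq_df: only the two scores quoted in the example.
freq_df = pd.DataFrame({"WORD": ["door", "salute"],
                        "SCORE": [342388.0, 9238.07]})

def total_score(cell, freq):
    """Sum the SCORE of every word in a comma-separated cell (hypothetical helper)."""
    words = [w.strip() for w in cell.split(",")]
    mask = freq["WORD"].isin(words)       # words missing from freq are simply ignored
    return freq.loc[mask, "SCORE"].sum()

total = total_score("door, panther, salute", freq_df)
print(round(total, 2))
```

"panther" contributes nothing because it has no row in the toy freq_df, so the total is just the sum of the other two scores.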

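Problem: I was able to do this with .apply() on a subset of my data, but when I run it on the whole dataset it returns "TypeError: 'float' object is not iterable". I thought this might be because there are NaNs in the "Split words" column, so I replaced all NaNs with "" to see whether that helped, and I got a new error: "TypeError: ("unhashable type: 'list'", 'occurred at index Split words')". I'm confused about why this works on a subset of my data but not on the whole thing; I thought every Series had the same data type. Can someone explain what is going on? Is there a way to return the rows that raise the error? Any help would be appreciated.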
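One likely explanation for the first error, sketched on toy data: .str.split leaves missing values as NaN, and NaN is a float, so the column ends up holding a mix of lists and floats even though its dtype is object. The Series below is invented for illustration; the isinstance check is one way to list the rows that would fail.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value, standing in for 'English examples'.
s = pd.Series(["door, panther", np.nan, "salute"])
split = s.str.split(", ")

# Rows whose split result is not a list (here, the NaN row).
bad_rows = split[~split.apply(lambda v: isinstance(v, list))]
print(bad_rows)

# Iterating over the NaN entry reproduces the error from the question.
try:
    for _ in split[1]:
        pass
except TypeError as exc:
    message = str(exc)
    print(message)
```

A five-row head() may simply contain no missing values, which would explain why the subset works while the full column does not.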

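Here is the whole code, including the DataFrames built from Wikipedia tables, to reproduce the problem. Let me know if you have any questions about or problems with my code.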

import numpy as np
import pandas as pd
import urllib.request

def get_score(field):
    words_list = []
    for word in field:
        words_list.append(word)

    mask = freq_df['Word'].isin(words_list)

    return freq_df.loc[mask, 'Count (per billion)'].sum()

#Root DataFrame
root_urls = [r"https://en.wikipedia.org/wiki/List_of_Greek_and_Latin_roots_in_English/A%E2%80%93G",
        r"https://en.wikipedia.org/wiki/List_of_Greek_and_Latin_roots_in_English/H%E2%80%93O",
        r"https://en.wikipedia.org/wiki/List_of_Greek_and_Latin_roots_in_English/P%E2%80%93Z"]

root_dfs = []

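for url in root_urls:
    dfs = pd.read_html(url, header=0)
    kept = []
    for df in dfs:
        # Keep only the 5-column tables; building a new list avoids
        # deleting from dfs while iterating over it, which skips items.
        if df.shape[1] != 5:
            print('Deleted below DataFrame(s):\n', df.head())
        else:
            kept.append(df)
    root_dfs.extend(kept)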

root_df = pd.concat(root_dfs, ignore_index=True)
root_df.replace(to_replace=r"\[.*?\]", value="", regex=True, inplace=True)

#Frequency DataFrame
url = r"https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/PG/2006/04/1-10000"

freq_dfs = pd.read_html(url, header=0)

freq_df = pd.concat(freq_dfs)

#Successful use of apply
test = root_df.head().copy()
a = pd.DataFrame(columns=test.columns)
a.loc[0] = ['Test', 'Test', 'Test', 'Test', 'door, panther, salute'] # Adding the exact example I gave above
test = pd.concat([test, a], ignore_index=True)
test['Split words'] = test['English examples'].str.split(', ')

test_score = test['Split words'].apply(get_score) # LINE IN QUESTION : SUCCESS
print("\nSuccessful test:\n\n", test_score)

#Unsuccessful use of apply
root_df['Split words'] = root_df['English examples'].str.split(', ')
score = root_df['Split words'].apply(get_score) # LINE IN QUESTION : FAIL
print(score)

Answer 1

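I don't think you need apply here. You can put all the words in English examples into one long Series, use map to look up their values from freq_df, and then sum them for each original row of English examples.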

# First get the score mapping series
score = freq_df.set_index('Word')['Count (per billion)']

# use stack to make one long series of words from
# english examples
stacked_words = root_df['English examples'].str.split(r',\s*', expand=True).stack()

# map all the english example words to their score
# and then sum up each group(original row)
stacked_words.map(score).groupby(level=0).sum().fillna(0)

0        56157.78
1            0.00
2            0.00
3            0.00
4            0.00
5            0.00
6            0.00
7            0.00
8            0.00
9            0.00
10           0.00
11           0.00
12       11422.40
13      190547.67
....
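The same steps can be tried on a small self-contained example. The two DataFrames below are toy data reusing the column names from the question's code, and the final reindex is an addition that is not in the answer above: depending on the pandas version, stack() may drop rows that were entirely NaN, and reindexing puts them back with a total of 0.

```python
import numpy as np
import pandas as pd

# Toy data; only the column names come from the question's code.
root_df = pd.DataFrame(
    {"English examples": ["door, panther, salute", np.nan, "panther"]}
)
freq_df = pd.DataFrame(
    {"Word": ["door", "salute"], "Count (per billion)": [342388.0, 9238.07]}
)

# Word -> score lookup Series (assumes each word appears only once).
score = freq_df.set_index("Word")["Count (per billion)"]

# One long Series of words; index level 0 is the original row label.
stacked_words = (
    root_df["English examples"].str.split(r",\s*", expand=True).stack()
)

# Map words to scores, sum per original row, then restore any dropped rows.
totals = (
    stacked_words.map(score)
    .groupby(level=0)
    .sum()
    .reindex(root_df.index, fill_value=0.0)
)
print(totals)
```

Because sum() skips NaN, words without a score and rows with no words at all both end up contributing 0, so the result can be assigned back as a new column aligned with root_df.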

df.apply() works on part of my df but returns "TypeError: 'float' object is not iterable" on the whole df
