I have two dataframes:
one acts as a dictionary,
the other has only a single column: "Sentence".
The goal is:
df_sentences = pd.DataFrame([["I run"],
                             ["he walks"],
                             ["we run and walk"]],
                            columns=['Sentence'])
df_dictionary = pd.DataFrame([[10, "I", "you", "he"],
                              [20, "running", "runs", "run"],
                              [30, "walking", "walk", "walks"]],
                             columns=['score', 'variantA', 'variantB', 'variantC'])
Out[1]:
Sentence Score
0 "I run" 30
1 "he walks" 40
2 "we run and walk" "error 'and' not found"
I have tried a lot with for loops and lists, but that is slow, so I am looking for an approach that lets me do all or most of the work inside the pandas dataframes.
This is my for-loop approach:
for sentence in textaslist[:1]:
    words = split_into_words(sentence)[0]   # returns list of words
    length = split_into_words(sentence)[1]  # returns number of words
    if minsentencelength <= length <= maxsentencelength:  # filter out short and long sentences
        for word in words:
            score = LookupInDictionary.lookup(word, mydictionary)
            if str(score) != "None":
                do_something()
            else:
                print(word, " not found in dictionary list")
                not_found.append(word)  # add word to not-found list
print("The following words were not found in the dictionary: ", not_found)
using

def lookup(word, df):
    if word in df.values:  # check if the dictionary contains the word
        print(word, "was found in the dictionary")
        lookupreturn = df.loc[df.values == word, 'score']  # find the score of the word (first column)
        score = lookupreturn.values[0]  # take only the first instance of the word in the dictionary
        return score
The problem is that when I use pandas' merge, I have to specify the columns to look in with the left_on/right_on parameters, and I can't find a way to efficiently search the whole dictionary dataframe and return the score of the first match.
Answer 0 (score: 2)
If you reshape your dictionary into [word, score] format, you can split the sentences into words, merge with the dictionary, and then groupby to combine the scores.
Since this approach uses pandas functions it should be fast enough for your dataset; I'm not sure whether it can be made faster.
df_sentences = pd.DataFrame([["I run"],
                             ["he walks"],
                             ["we run and walk"]],
                            columns=['Sentence'])
df_dictionary = pd.DataFrame([[10, "I", "you", "he"],
                              [20, "running", "runs", "run"],
                              [30, "walking", "walk", "walks"]],
                             columns=['score', 'variantA', 'variantB', 'variantC'])

df_dictionary = pd.melt(df_dictionary, id_vars=['score'])[['value', 'score']]
df_sentences['words'] = df_sentences['Sentence'].str.split()
df_sentences = df_sentences.explode('words')
sentence_score = df_sentences.merge(df_dictionary, how='left', left_on='words', right_on='value')[['Sentence', 'score']]
sentence_score_sum = sentence_score.fillna('NaN').groupby('Sentence').sum()
# or
sentence_score_max = sentence_score.fillna('NaN').groupby('Sentence').max()
To get the dictionary into [word, score] format, you can use melt like this:
df_dictionary = pd.DataFrame([[10, "I", "you", "he"],
                              [20, "running", "runs", "run"],
                              [30, "walking", "walk", "walks"]],
                             columns=['score', 'variantA', 'variantB', 'variantC'])
df_dictionary = pd.melt(df_dictionary, id_vars=['score'])[['value', 'score']]
This gives you:
value score
0 I 10
1 running 20
2 walking 30
3 you 10
4 runs 20
5 walk 30
6 he 10
7 run 20
8 walks 30
Now for the sentences, we want to pull out each word on its own while keeping track of the parent sentence. Let's add a new column containing the words as a list:
df_sentences = pd.DataFrame([["I run"],
                             ["he walks"],
                             ["we run and walk"]],
                            columns=['Sentence'])
df_sentences['words'] = df_sentences['Sentence'].str.split()
which gives us:
Sentence words
0 I run [I, run]
1 he walks [he, walks]
2 we run and walk [we, run, and, walk]
Then explode the words:
df_sentences = df_sentences.explode('words')
giving you:
Sentence words
0 I run I
0 I run run
1 he walks he
1 he walks walks
2 we run and walk we
2 we run and walk run
2 we run and walk and
2 we run and walk walk
Now we merge them together:
sentence_score = df_sentences.merge(df_dictionary, how='left', left_on='words', right_on='value')[['Sentence', 'score']]
giving us:
Sentence score
0 I run 10.0
1 I run 20.0
2 he walks 10.0
3 he walks 30.0
4 we run and walk NaN
5 we run and walk 20.0
6 we run and walk NaN
7 we run and walk 30.0
Now we can combine groupby with sum to add up the scores for each sentence. Note that pandas treats NaN as 0.0 when summing, which we don't want here, so we use fillna to fill the NAs with the string "NaN".
sentence_score_sum = sentence_score.fillna('NaN').groupby('Sentence').sum()
giving you:
score
Sentence
I run 30.0
he walks 40.0
we run and walk NaN
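One caveat worth hedging: on recent pandas versions, filling NaN with the string "NaN" makes the score column object-dtype, and summing mixed strings and floats may raise a TypeError. An alternative sketch that keeps the column numeric and still yields NaN whenever any word is missing; `sentence_score` is hard-coded here to match the merge output shown above:

```python
import pandas as pd

# sentence_score as produced by the merge step above
# (NaN marks words that were not in the dictionary)
sentence_score = pd.DataFrame({
    'Sentence': ['I run', 'I run', 'he walks', 'he walks',
                 'we run and walk', 'we run and walk',
                 'we run and walk', 'we run and walk'],
    'score': [10.0, 20.0, 10.0, 30.0,
              float('nan'), 20.0, float('nan'), 30.0],
})

# Sum per sentence, but return NaN if any word in the sentence had no score
sums = sentence_score.groupby('Sentence')['score'].apply(
    lambda s: s.sum() if s.notna().all() else float('nan'))
```

This avoids the string fill entirely, at the cost of a Python-level lambda per group.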
Your question says you want to give each sentence its highest word score, but your expected output shows the sum; if you do need the maximum, that's just as simple:
sentence_score_max = sentence_score.fillna('NaN').groupby('Sentence').max()
giving you:
score
Sentence
I run 20.0
he walks 30.0
we run and walk NaN
Note: this solution relies on the sentences being UNIQUE. If you have duplicate sentences, you can either drop_duplicates before you start, or apply reset_index(drop=False) before the merge to keep the index and then groupby on the old index instead of Sentence.
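If you also need the list of words that were not found (as in the asker's not_found list), one hedged sketch builds on the same melt + explode steps and uses merge's indicator parameter; the `not_found` variable is my addition, not part of the original answer:

```python
import pandas as pd

df_sentences = pd.DataFrame([["I run"],
                             ["he walks"],
                             ["we run and walk"]],
                            columns=['Sentence'])
df_dictionary = pd.DataFrame([[10, "I", "you", "he"],
                              [20, "running", "runs", "run"],
                              [30, "walking", "walk", "walks"]],
                             columns=['score', 'variantA', 'variantB', 'variantC'])

# Same reshape + split + explode as in the answer above
word_scores = pd.melt(df_dictionary, id_vars=['score'])[['value', 'score']]
exploded = df_sentences.assign(
    words=df_sentences['Sentence'].str.split()).explode('words')

# indicator=True adds a '_merge' column; 'left_only' rows had no dictionary match
merged = exploded.merge(word_scores, how='left',
                        left_on='words', right_on='value', indicator=True)
not_found = merged.loc[merged['_merge'] == 'left_only', 'words'].tolist()
```

Here `not_found` comes out as `['we', 'and']` for the sample data.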
Answer 1 (score: 1)
I would use the following regex approach:
import re

# Store scores using index
score_dict = {x: 0 for x in df_sentences.index}

# Loop through each row in the score df (df_dictionary):
for row in df_dictionary.values:
    # Access the score
    score = row[0]
    # Access the words & convert to a pattern
    words = "|".join([re.escape(x) for x in row[1:]])
    pattern = re.compile(r"\b(" + words + r")\b", re.I | re.M)
    # Loop through each row in the main df (df_sentences):
    for idx, sent_row in df_sentences.iterrows():
        # Find the number of matches in the sentence
        matches = pattern.findall(sent_row["Sentence"])
        # Multiply to get the score
        n_score = len(matches) * score
        # Store it using the index as key
        score_dict[idx] += n_score

# Now, add the dict as a column (or map it back to the main df)
df_sentences["score"] = df_sentences.index.map(score_dict)
Sentence score
0 I run 30
1 he walks 40
2 we run and walk 50
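The inner iterrows loop above can likely be replaced by a vectorized pass with Series.str.count; this is a sketch under the same data assumptions, not the original answer's code:

```python
import re
import pandas as pd

df_sentences = pd.DataFrame([["I run"],
                             ["he walks"],
                             ["we run and walk"]],
                            columns=['Sentence'])
df_dictionary = pd.DataFrame([[10, "I", "you", "he"],
                              [20, "running", "runs", "run"],
                              [30, "walking", "walk", "walks"]],
                             columns=['score', 'variantA', 'variantB', 'variantC'])

score = pd.Series(0, index=df_sentences.index)
for row in df_dictionary.values:
    words = "|".join(re.escape(x) for x in row[1:])
    pattern = r"\b(?:" + words + r")\b"
    # str.count counts regex matches per sentence in one vectorized pass
    score += df_sentences['Sentence'].str.count(pattern, flags=re.I) * row[0]
df_sentences['score'] = score
```

This keeps one Python loop per dictionary row but removes the per-sentence loop.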
Answer 2 (score: 1)
Do you care about duplicates? If I use a string like "I I I", that is technically 30 points.
Also, is there a specific reason you're using a dataframe to store the scored words?
Quick and dirty duplicate removal using set intersection:
dictionary = {
    "I": 10, "you": 10, "he": 10,
    "running": 20, "runs": 20, "run": 20,
    "walking": 30, "walk": 30, "walks": 30
}

df = pd.DataFrame({
    "Sentences": [
        "I run and he walks",
        "We walk and he runs",
        "I Run you run he runs",
        "I run he runs",
        "I I I I I"
    ]})

def split_score(sentence):
    x = sentence.split(' ')
    x = set(x)  # remove duplicate words
    y = x.intersection(set(dictionary.keys()))  # find matches in the dictionary
    z = x.difference(set(dictionary.keys()))    # find words outside the dictionary
    if len(z) > 0:
        score = -1  # if non-dictionary words are found, fail
    else:
        score = sum([dictionary[word] for word in y])
    return score

df['Points'] = df['Sentences'].apply(split_score)
df
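Tracing split_score on the sample sentences above (a self-contained, condensed re-run of the same logic; note the lookup is case-sensitive, so "I Run you run he runs" fails on "Run"):

```python
import pandas as pd

dictionary = {
    "I": 10, "you": 10, "he": 10,
    "running": 20, "runs": 20, "run": 20,
    "walking": 30, "walk": 30, "walks": 30
}

def split_score(sentence):
    words = set(sentence.split(' '))       # remove duplicate words
    unmatched = words - dictionary.keys()  # words outside the dictionary
    if unmatched:
        return -1                          # fail on any unknown word
    return sum(dictionary[w] for w in words)

df = pd.DataFrame({"Sentences": [
    "I run and he walks",     # "and" is unknown -> -1
    "We walk and he runs",    # "We", "and" unknown -> -1
    "I Run you run he runs",  # "Run" unknown (case-sensitive) -> -1
    "I run he runs",          # 10 + 20 + 10 + 20 = 60
    "I I I I I",              # deduplicated to {"I"} = 10
]})
df['Points'] = df['Sentences'].apply(split_score)
```

Whether -1 or something like NaN is the right failure marker depends on how you consume the scores downstream.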