I have two dataframes:
one acts as a dictionary,
the other has only a single column: "Sentence".
The goal is:
df_sentences = pd.DataFrame([["I run"],
                             ["he walks"],
                             ["we run and walk"]],
                            columns=['Sentence'])
df_dictionary = pd.DataFrame([[10, "I", "you", "he"],
                              [20, "running", "runs", "run"],
                              [30, "walking", "walk", "walks"]],
                             columns=['score', 'variantA', 'variantB', 'variantC'])
Out[1]:
Sentence Score
0 "I run" 30
1 "he walks" 40
2 "we run and walk" "error 'and' not found"
I have tried a lot with for loops and lists, but that is slow, so I am looking for an approach that lets me do all or most of the work inside the pandas dataframes.
This is my for-loop approach:
for sentence in textaslist[:1]:
    words = split_into_words(sentence)[0]   # returns list of words
    length = split_into_words(sentence)[1]  # returns number of words
    if minsentencelength <= length <= maxsentencelength:  # filter out short and long sentences
        for word in words:
            score = LookupInDictionary.lookup(word, mydictionary)
            if str(score) != "None":
                do_something()
            else:
                print(word, " not found in dictionary list")
                not_found.append(word)  # add word to not-found list
print("The following words were not found in the dictionary: ", not_found)
using

def lookup(word, df):
    if word in df.values:  # check if the dictionary contains the word
        print(word, "was found in the dictionary")
        lookupreturn = df.loc[df.values == word, 'score']  # find the score of the word (first column)
        score = lookupreturn.values[0]  # take only the first instance of the word in the dictionary
        return score
The problem is that when I use pandas' merge, I have to specify the columns to look in with the left_on/right_on parameters, and I can't find a way to efficiently search the whole dictionary dataframe and return the score of the first match.
Answer 0 (score: 2)
If you reshape your dictionary into [word, score] format, you can split the sentences into words, merge with the dictionary, and then groupby to combine the scores.
Since this approach uses pandas functions it should be fast enough for your dataset; I'm not sure whether it can be made faster.
df_sentences = pd.DataFrame([["I run"],
                             ["he walks"],
                             ["we run and walk"]],
                            columns=['Sentence'])
df_dictionary = pd.DataFrame([[10, "I", "you", "he"],
                              [20, "running", "runs", "run"],
                              [30, "walking", "walk", "walks"]],
                             columns=['score', 'variantA', 'variantB', 'variantC'])

df_dictionary = pd.melt(df_dictionary, id_vars=['score'])[['value', 'score']]
df_sentences['words'] = df_sentences['Sentence'].str.split()
df_sentences = df_sentences.explode('words')
sentence_score = df_sentences.merge(df_dictionary, how='left', left_on='words', right_on='value')[['Sentence', 'score']]
sentence_score_sum = sentence_score.fillna('NaN').groupby('Sentence').sum()
# or
sentence_score_max = sentence_score.fillna('NaN').groupby('Sentence').max()
To get the dictionary into [word, score] format, you can use melt like this:
df_dictionary = pd.DataFrame([[10, "I", "you", "he"],
                              [20, "running", "runs", "run"],
                              [30, "walking", "walk", "walks"]],
                             columns=['score', 'variantA', 'variantB', 'variantC'])
df_dictionary = pd.melt(df_dictionary, id_vars=['score'])[['value', 'score']]
This gives you:
value score
0 I 10
1 running 20
2 walking 30
3 you 10
4 runs 20
5 walk 30
6 he 10
7 run 20
8 walks 30
Now for the sentences, we want to pull out each word on its own while keeping track of the parent sentence. Let's add a new column containing the words as a list:
df_sentences = pd.DataFrame([["I run"],
                             ["he walks"],
                             ["we run and walk"]],
                            columns=['Sentence'])
df_sentences['words'] = df_sentences['Sentence'].str.split()
which gives us:
Sentence words
0 I run [I, run]
1 he walks [he, walks]
2 we run and walk [we, run, and, walk]
Then explode the words:
df_sentences = df_sentences.explode('words')
giving you:
Sentence words
0 I run I
0 I run run
1 he walks he
1 he walks walks
2 we run and walk we
2 we run and walk run
2 we run and walk and
2 we run and walk walk
Now we merge them together:
sentence_score = df_sentences.merge(df_dictionary, how='left', left_on='words', right_on='value')[['Sentence', 'score']]
giving us:
Sentence score
0 I run 10.0
1 I run 20.0
2 he walks 10.0
3 he walks 30.0
4 we run and walk NaN
5 we run and walk 20.0
6 we run and walk NaN
7 we run and walk 30.0
Now we can combine groupby with sum to add up the scores for each sentence. Note that pandas treats NaN as 0.0 when summing, which we don't want here, so we use fillna to fill the NAs with the string "NaN".
sentence_score_sum = sentence_score.fillna('NaN').groupby('Sentence').sum()
giving you:
score
Sentence
I run 30.0
he walks 40.0
we run and walk NaN
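One caveat worth hedging: on recent pandas versions, filling NaN with the string "NaN" makes the score column object-dtype, and summing mixed strings and floats may raise a TypeError. An alternative sketch that keeps the column numeric and still yields NaN whenever any word is missing; `sentence_score` is hard-coded here to match the merge output shown above:

```python
import pandas as pd

# sentence_score as produced by the merge step above
# (NaN marks words that were not in the dictionary)
sentence_score = pd.DataFrame({
    'Sentence': ['I run', 'I run', 'he walks', 'he walks',
                 'we run and walk', 'we run and walk',
                 'we run and walk', 'we run and walk'],
    'score': [10.0, 20.0, 10.0, 30.0,
              float('nan'), 20.0, float('nan'), 30.0],
})

# Sum per sentence, but return NaN if any word in the sentence had no score
sums = sentence_score.groupby('Sentence')['score'].apply(
    lambda s: s.sum() if s.notna().all() else float('nan'))
```

This avoids the string fill entirely, at the cost of a Python-level lambda per group.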
Your question says you want to give each sentence its highest word score, but your expected output shows the sum; if you do need the maximum, that's just as simple:
sentence_score_max = sentence_score.fillna('NaN').groupby('Sentence').max()
giving you:
score
Sentence
I run 20.0
he walks 30.0
we run and walk NaN
Note: this solution relies on the sentences being UNIQUE. If you have duplicate sentences, you can either drop_duplicates before you start, or apply reset_index(drop=False) before the merge to keep the index and then groupby on the old index instead of Sentence.
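If you also need the list of words that were not found (as in the asker's not_found list), one hedged sketch builds on the same melt + explode steps and uses merge's indicator parameter; the `not_found` variable is my addition, not part of the original answer:

```python
import pandas as pd

df_sentences = pd.DataFrame([["I run"],
                             ["he walks"],
                             ["we run and walk"]],
                            columns=['Sentence'])
df_dictionary = pd.DataFrame([[10, "I", "you", "he"],
                              [20, "running", "runs", "run"],
                              [30, "walking", "walk", "walks"]],
                             columns=['score', 'variantA', 'variantB', 'variantC'])

# Same reshape + split + explode as in the answer above
word_scores = pd.melt(df_dictionary, id_vars=['score'])[['value', 'score']]
exploded = df_sentences.assign(
    words=df_sentences['Sentence'].str.split()).explode('words')

# indicator=True adds a '_merge' column; 'left_only' rows had no dictionary match
merged = exploded.merge(word_scores, how='left',
                        left_on='words', right_on='value', indicator=True)
not_found = merged.loc[merged['_merge'] == 'left_only', 'words'].tolist()
```

Here `not_found` comes out as `['we', 'and']` for the sample data.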
Answer 1 (score: 1)
I would use the following regex approach:
import re

# Store scores using index
score_dict = {x: 0 for x in df_sentences.index}

# Loop through each row in the score df (df_dictionary):
for row in df_dictionary.values:
    # Access the score
    score = row[0]
    # Access the words & convert to a pattern
    words = "|".join([re.escape(x) for x in row[1:]])
    pattern = re.compile(r"\b(" + words + r")\b", re.I | re.M)
    # Loop through each row in the main df (df_sentences):
    for idx, sent_row in df_sentences.iterrows():
        # Find the number of matches in the sentence
        matches = pattern.findall(sent_row["Sentence"])
        # Multiply to get the score
        n_score = len(matches) * score
        # Store it using the index as key
        score_dict[idx] += n_score

# Now, add the dict as a column (or map it back to the main df)
df_sentences["score"] = df_sentences.index.map(score_dict)
Sentence score
0 I run 30
1 he walks 40
2 we run and walk 50
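The inner iterrows loop above can likely be replaced by a vectorized pass with Series.str.count; this is a sketch under the same data assumptions, not the original answer's code:

```python
import re
import pandas as pd

df_sentences = pd.DataFrame([["I run"],
                             ["he walks"],
                             ["we run and walk"]],
                            columns=['Sentence'])
df_dictionary = pd.DataFrame([[10, "I", "you", "he"],
                              [20, "running", "runs", "run"],
                              [30, "walking", "walk", "walks"]],
                             columns=['score', 'variantA', 'variantB', 'variantC'])

score = pd.Series(0, index=df_sentences.index)
for row in df_dictionary.values:
    words = "|".join(re.escape(x) for x in row[1:])
    pattern = r"\b(?:" + words + r")\b"
    # str.count counts regex matches per sentence in one vectorized pass
    score += df_sentences['Sentence'].str.count(pattern, flags=re.I) * row[0]
df_sentences['score'] = score
```

This keeps one Python loop per dictionary row but removes the per-sentence loop.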
Answer 2 (score: 1)
Do you care about duplicates? If I use a string like "I I I", that is technically 30 points.
Also, is there a specific reason you're using a dataframe to store the scored words?
Quick and dirty duplicate removal using set intersection:
dictionary = {
    "I": 10, "you": 10, "he": 10,
    "running": 20, "runs": 20, "run": 20,
    "walking": 30, "walk": 30, "walks": 30
}

df = pd.DataFrame({
    "Sentences": [
        "I run and he walks",
        "We walk and he runs",
        "I Run you run he runs",
        "I run he runs",
        "I I I I I"
    ]})

def split_score(sentence):
    x = sentence.split(' ')
    x = set(x)  # remove duplicate words
    y = x.intersection(set(dictionary.keys()))  # find matches in the dictionary
    z = x.difference(set(dictionary.keys()))    # find words outside the dictionary
    if len(z) > 0:
        score = -1  # if non-dictionary words are found, fail
    else:
        score = sum([dictionary[word] for word in y])
    return score

df['Points'] = df['Sentences'].apply(split_score)
df
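Tracing split_score on the sample sentences above (a self-contained, condensed re-run of the same logic; note the lookup is case-sensitive, so "I Run you run he runs" fails on "Run"):

```python
import pandas as pd

dictionary = {
    "I": 10, "you": 10, "he": 10,
    "running": 20, "runs": 20, "run": 20,
    "walking": 30, "walk": 30, "walks": 30
}

def split_score(sentence):
    words = set(sentence.split(' '))       # remove duplicate words
    unmatched = words - dictionary.keys()  # words outside the dictionary
    if unmatched:
        return -1                          # fail on any unknown word
    return sum(dictionary[w] for w in words)

df = pd.DataFrame({"Sentences": [
    "I run and he walks",     # "and" is unknown -> -1
    "We walk and he runs",    # "We", "and" unknown -> -1
    "I Run you run he runs",  # "Run" unknown (case-sensitive) -> -1
    "I run he runs",          # 10 + 20 + 10 + 20 = 60
    "I I I I I",              # deduplicated to {"I"} = 10
]})
df['Points'] = df['Sentences'].apply(split_score)
```

Whether -1 or something like NaN is the right failure marker depends on how you consume the scores downstream.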