Question

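I have a function that splits a string into words and looks each word up in a DataFrame. When a word is found, it searches for the matching row with a for loop. I don't want to do that, because it is too slow on a large dataset. I want to use row['value'] directly, without iterating over the whole df for every matched word.

I'm new to Python. I've searched a lot but couldn't find what I want. I found index.tolist(), but I don't want a list; I only need the index of the first matching value.

Any help would be appreciated.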

def cal_nega_mean(my_string):
  mean_tot = 0
  mean_sum = 0.00
  for word in my_string.split():
    if word in df.values:  # if the word is found here, get its index so the for loop on the next line is not needed
      for index, row in df.iterrows():  # want to change
        if word == row.word:   # this part
          if row['value'] < -0.40:
            mean_tot += 1
            mean_sum += row['value']  # accumulate into mean_sum, which is what gets divided below
            break
  if mean_tot == 0:
    return 0
  mean = mean_sum / mean_tot
  return round(mean, 2)

Sample string input; there are more than 300,000 strings:

my_string = "i have a problem with my python code" 
cal_nega_mean(my_string)
# and I am using this to compute the result for all records
df_tweets['intensity'] = df_tweets['tweets'].apply(cal_nega_mean)

The DataFrame to search:

df

index   word      value  ...
  1     python    -0.56
  2     problem   -0.78
  3     alpha     -0.91
  . . .
 9000   last      -0.41

Answer 1

You can use i = df[df.word == word].index[0] to get the index of the first row that satisfies df.word == word. Once you have the index, you can pull out that row with df.loc.

def cal_nega_mean(my_string):
    mean_tot = 0
    mean_sum = 0.00
    for word in my_string.split():
        try:
            # index label of the first row whose word matches
            i = df[df.word == word].index[0]
        except IndexError:
            # no row matches this word, move on to the next one
            continue
        row = df.loc[i]
        if row['value'] < -0.40:
            mean_tot += 1
            mean_sum += row['value']
    if mean_tot == 0:
        return 0
    mean = mean_sum / mean_tot
    return round(mean, 2)
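
To make the lookup step concrete, here is a minimal sketch of getting the first matching index and then reading the value with df.loc. The sample data and the helper name first_match_index are assumptions for illustration only; they are not part of the original code.

```python
import pandas as pd

# Hypothetical sample mirroring the layout of the question's df.
df = pd.DataFrame(
    {"word": ["python", "problem", "alpha"], "value": [-0.56, -0.78, -0.91]},
    index=[1, 2, 3],
)

def first_match_index(word):
    """Return the index label of the first row whose word matches, or None."""
    hits = df.index[df["word"] == word]
    return hits[0] if len(hits) else None

i = first_match_index("problem")        # index label of the matching row
row_value = df.loc[i, "value"]          # read the value without looping
missing = first_match_index("code")     # no such word in df
```

Checking the length of the match instead of catching an exception avoids a try/except per word, but the boolean comparison still scans the whole word column on every call.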

Answer 2

Pandas has some useful text-processing functions that can help here. I'd suggest pd.Series.str.contains().

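import re

def cal_nega_mean(my_string):
    # escape each word and anchor the pattern so that only whole entries match
    # (an unanchored 'a' would otherwise match 'alpha')
    words = '|'.join(re.escape(w) for w in my_string.split())
    matches = df['word'].str.contains('^(?:' + words + ')$', regex=True)
    # the value filter is unnecessary if rows with value >= -0.40 are dropped beforehand
    mask = (df['value'] < -0.40) & matches
    mean_tot = mask.sum()
    if mean_tot == 0:
        return 0
    mean_sum = df.loc[mask, 'value'].sum()
    return round(mean_sum / mean_tot, 2)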

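Unrelated, but I'd also suggest dropping the rows with value >= -0.40, since they are ignored anyway.

I haven't had a chance to test this, but it should do the job, and it is vectorized.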
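
Because str.contains works on regular-expression patterns, a different technique that may be simpler for exact word matches is Series.isin. The following is a sketch under assumptions, not the answer's method: the function name cal_nega_mean_isin, its parameters, and the sample data are all hypothetical.

```python
import pandas as pd

# Hypothetical sample mirroring the layout of the question's df.
df = pd.DataFrame(
    {
        "word": ["python", "problem", "alpha", "last"],
        "value": [-0.56, -0.78, -0.91, -0.41],
    }
)

def cal_nega_mean_isin(my_string, lexicon=df, threshold=-0.40):
    """Mean of the values below threshold for lexicon words present in my_string."""
    tokens = set(my_string.split())
    # isin compares whole cell values, so the token "a" does not match "alpha"
    mask = lexicon["word"].isin(tokens) & (lexicon["value"] < threshold)
    if not mask.any():
        return 0
    return round(lexicon.loc[mask, "value"].mean(), 2)

result = cal_nega_mean_isin("i have a problem with my python code")
```

Note that this counts each lexicon row once, even when a word is repeated in the input string, which differs from the loop in the question.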

Answer 3

Here is one approach using a dictionary: convert the word: value pairs into a key-value store and use it as a lookup.
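
The answer's own code is not shown here, so the following is only a sketch of what a dictionary lookup could look like. The names word_to_value and cal_nega_mean_dict, the parameters, and the sample data are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the layout of the question's df.
df = pd.DataFrame(
    {
        "word": ["python", "problem", "alpha", "last"],
        "value": [-0.56, -0.78, -0.91, -0.41],
    }
)

# Build the lookup once, outside the function, so each call avoids scanning df.
# If a word appears more than once in df, the last occurrence wins.
word_to_value = dict(zip(df["word"], df["value"]))

def cal_nega_mean_dict(my_string, lookup=word_to_value, threshold=-0.40):
    """Mean of the looked-up values below threshold, or 0 if none qualify."""
    values = [
        lookup[w]
        for w in my_string.split()
        if w in lookup and lookup[w] < threshold
    ]
    if not values:
        return 0
    return round(sum(values) / len(values), 2)

result = cal_nega_mean_dict("i have a problem with my python code")
```

Each word lookup is then a constant-time hash access on average, and the function could be applied to every record with df_tweets['tweets'].apply(cal_nega_mean_dict), reusing the df_tweets frame from the question.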
