Question

I am trying to run a for loop over a fairly long dataframe and count the number of English and non-English words in each text (each text is its own row).

+-------+--------+----+
| Index |  Text  | ID |
+-------+--------+----+
|     1 | Text 1 |  1 |
|     2 | Text 2 |  2 |
|     3 | Text 3 |  3 |
+-------+--------+----+

This is my code:

c = 0
for text in df_letters['Text_clean']:
    # Counters
    CTEXT= text
    c +=1
    eng_words = 0
    non_eng_words = 0
    text = " ".join(text.split())
    # For every word in text
    for word in text.split(' '):
      # Check if it is english
      if english_dict.check(word) == True:
        eng_words += 1
      else:
        non_eng_words += 1
    # Print the result
    # NOTE that these results are discarded each new text
    df_letters.at[text, 'eng_words'] = eng_words
    df_letters.at[text, 'non_eng_words'] = non_eng_words
    df_letters.at[text, 'Input'] = CTEXT
    #print('Index: {}; EN: {}; NON-EN: {}'.format(c, eng_words, non_eng_words))

But instead of getting the same dataframe with 3 new columns:

+-------+--------+----+---------+-------------+---------+
| Index |  Text  | ID | English | Non-English |  Input  |
+-------+--------+----+---------+-------------+---------+
|     1 | Text 1 |  1 |       1 |           0 | Text 1  |
|     2 | Text 2 |  2 |       1 |           0 | Text 2  |
|     3 | Text 3 |  3 |       0 |           1 | Text 3  |
+-------+--------+----+---------+-------------+---------+

the dataframe doubles in length, with a new row added for each text. Like this:

+--------+--------+-----+---------+-------------+--------+
| Index  |  Text  | ID  | English | Non-English | Input  |
+--------+--------+-----+---------+-------------+--------+
| 1      | Text 1 | 1   | nan     | nan         | nan    |
| 2      | Text 2 | 2   | nan     | nan         | nan    |
| 3      | Text 3 | 3   | nan     | nan         | nan    |
| Text 1 | nan    | nan | 1       | 0           | Text 1 |
| Text 2 | nan    | nan | 1       | 0           | Text 2 |
| Text 3 | nan    | nan | 0       | 1           | Text 3 |
+--------+--------+-----+---------+-------------+--------+

What am I doing wrong here?

Answer 1

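DataFrame.at accesses the DataFrame by index label. Your DataFrame's index is [1, 2, 3], not [Text 1, Text 2, Text 3], so df_letters.at[text, ...] finds no matching label and adds a new row for each text instead of filling in the existing one. I think the best solution for you is to replace your loop with something like this: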

for index, text in df_letters['Text_clean'].items():

Then you can assign by index inside the loop:

df_letters.at[index, 'eng_words'] = eng_words
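
Putting it together, here is a minimal sketch of the whole loop with that change. The question does not show how english_dict was created, so the ENGLISH_WORDS set and the is_english helper below are hypothetical stand-ins for english_dict.check, and the sample texts are made up. Creating the three columns before the loop is my own addition so that the counts stay integers; it is not required for the fix.

```python
import pandas as pd

# Hypothetical stand-in for the question's english_dict.check(word);
# swap in your real dictionary lookup here.
ENGLISH_WORDS = {"hello", "world", "the", "cat"}

def is_english(word):
    return word.lower() in ENGLISH_WORDS

# Made-up sample data with the same integer index as in the question.
df_letters = pd.DataFrame(
    {"Text_clean": ["hello world", "the  cat", "bonjour monde"], "ID": [1, 2, 3]},
    index=[1, 2, 3],
)

# Create the result columns up front (optional, keeps the counts as integers).
df_letters["eng_words"] = 0
df_letters["non_eng_words"] = 0
df_letters["Input"] = ""

# items() yields (index label, value) pairs, so .at gets a label that exists.
for index, text in df_letters["Text_clean"].items():
    words = text.split()  # split() with no argument also collapses repeated spaces
    eng_words = sum(is_english(word) for word in words)
    df_letters.at[index, "eng_words"] = eng_words
    df_letters.at[index, "non_eng_words"] = len(words) - eng_words
    df_letters.at[index, "Input"] = text

print(df_letters)
```

Because every write uses a label that is already in the index, the dataframe keeps its 3 rows and only gains the 3 new columns.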
