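I am trying to run a for loop over a fairly long DataFrame and count the number of English and non-English words in each text (each text is a separate row).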
+-------+--------+----+
| Index | Text | ID |
+-------+--------+----+
| 1 | Text 1 | 1 |
| 2 | Text 2 | 2 |
| 3 | Text 3 | 3 |
+-------+--------+----+
This is my code:
c = 0
for text in df_letters['Text_clean']:
    # Counters
    CTEXT = text
    c += 1
    eng_words = 0
    non_eng_words = 0
    text = " ".join(text.split())
    # For every word in text
    for word in text.split(' '):
        # Check if it is english
        if english_dict.check(word) == True:
            eng_words += 1
        else:
            non_eng_words += 1
    # Print the result
    # NOTE that these results are discarded each new text
    df_letters.at[text, 'eng_words'] = eng_words
    df_letters.at[text, 'non_eng_words'] = non_eng_words
    df_letters.at[text, 'Input'] = CTEXT
    #print('Index: {}; EN: {}; NON-EN: {}'.format(c, eng_words, non_eng_words))
But instead of getting the same DataFrame with 3 new columns:
+-------+--------+----+---------+-------------+---------+
| Index | Text | ID | English | Non-English | Input |
+-------+--------+----+---------+-------------+---------+
| 1 | Text 1 | 1 | 1 | 0 | Text 1 |
| 2 | Text 2 | 2 | 1 | 0 | Text 2 |
| 3 | Text 3 | 3 | 0 | 1 | Text 3 |
+-------+--------+----+---------+-------------+---------+
the DataFrame doubles in length, with a new row added for each text. Like this:
+--------+--------+-----+---------+-------------+--------+
| Index | Text | ID | English | Non-English | Input |
+--------+--------+-----+---------+-------------+--------+
| 1 | Text 1 | 1 | nan | nan | nan |
| 2 | Text 2 | 2 | nan | nan | nan |
| 3 | Text 3 | 3 | nan | nan | nan |
| Text 1 | nan | nan | 1 | 0 | Text 1 |
| Text 2 | nan | nan | 1 | 0 | Text 2 |
| Text 3 | nan | nan | 0 | 1 | Text 3 |
+--------+--------+-----+---------+-------------+--------+
What am I doing wrong here?
Answer 0 (score: 1)
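DataFrame.at accesses a single value by its index label and column name. Your DataFrame's index is [1, 2, 3], not [Text 1, Text 2, Text 3], so every assignment that uses the text as the row label creates a new row instead of updating an existing one. I think the best solution for you is to replace the loop header with this: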
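for index, text in df_letters['Text_clean'].items():
Then you can assign using the index (Series.iteritems() was removed in pandas 2.0; Series.items() is the equivalent and also works in older versions):
    df_letters.at[index, 'eng_words'] = eng_words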
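Putting the pieces together, here is a minimal sketch of the corrected loop. The TinyDict class is a hypothetical stand-in for the question's english_dict, which is assumed to be a spell-checker object with a check(word) method that returns a bool; the sample texts are made up for illustration as well.

```python
import pandas as pd


class TinyDict:
    """Hypothetical stand-in for english_dict (assumed: check(word) -> bool)."""

    def __init__(self, words):
        self._words = {w.lower() for w in words}

    def check(self, word):
        return word.lower() in self._words


english_dict = TinyDict(["hello", "world", "good", "morning"])

# Made-up sample data with the same index as in the question: [1, 2, 3]
df_letters = pd.DataFrame(
    {
        "Text_clean": ["hello world", "bonjour  world", "guten morgen"],
        "ID": [1, 2, 3],
    },
    index=[1, 2, 3],
)

# Create the result columns up front so the loop only updates existing cells
df_letters["eng_words"] = 0
df_letters["non_eng_words"] = 0
df_letters["Input"] = df_letters["Text_clean"]

# items() yields (index label, value) pairs, so .at receives a label
# that already exists and no new rows are appended
for index, text in df_letters["Text_clean"].items():
    words = text.split()  # split() without arguments also collapses repeated spaces
    eng_words = sum(1 for word in words if english_dict.check(word))
    df_letters.at[index, "eng_words"] = eng_words
    df_letters.at[index, "non_eng_words"] = len(words) - eng_words

print(df_letters)
```

The DataFrame keeps its three rows because every write targets an existing index label. With a real dictionary object in place of TinyDict, the loop body should stay the same as long as check() behaves as assumed.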