首先,我不确定是否是drop_duplicates()
的错误。
我想做什么:
从csv导入文件,对每行执行re.search
,如果匹配,则将该行保留在字典中;如果不匹配,则将该行保留在另一字典中。用字典值的长度做一个图。
问题
我在csv中有1000行,但结果返回1200。
我的代码
import pandas as pd
import re
# import data
filename = 'sample.csv'
# save data as data
data = pd.read_csv(filename, encoding='utf-8')
# create new dictionary for word that is true and false
# but doesn't have the keyword in items
wordNT = {}
wordNF = {}
kaiT = {}
kaiF = {}
# if text is True
def word_in_text(word,text,label):
match = re.search(word,text)
if match and label == True:
kaiT.setdefault('text', []).append(text)
elif match and label == False:
kaiF.setdefault('text', []).append(text)
elif label == True and not match:
wordNT.setdefault('text', []).append(text)
elif label == False and not match:
wordNF.setdefault('text', []).append(text)
# iterate every text in data
for index, row in data.iterrows():
word_in_text('foo', row['text'], row['label'])
word_in_text('bar', row['text'], row['label'])
# make pandas data frame out of dict
wordTDf = pd.DataFrame.from_dict(wordNT)
wordFDf = pd.DataFrame.from_dict(wordNF)
kaiTDf = pd.DataFrame.from_dict(kaiT)
kaiFDf = pd.DataFrame.from_dict(kaiF)
# drop duplicates
wordTDf = wordTDf.drop_duplicates()
wordFDf = wordFDf.drop_duplicates()
kaiTDf = kaiTDf.drop_duplicates()
kaiFDf = kaiFDf.drop_duplicates()
# count how many
wordTrueCount = len(wordTDf.index)
wordFalseCount = len(wordFDf.index)
kaiTrueCount = len(kaiTDf.index)
kaiFalseCount = len(kaiFDf.index)
print(wordTrueCount + wordFalseCount + kaiTrueCount + kaiFalseCount)
当我删除行
word_in_text('bar', row['text'], row['label'])
仅保留
word_in_text('foo', row['text'], row['label'])
print(wordTrueCount + wordFalseCount + kaiTrueCount + kaiFalseCount)
正确返回1000,反之亦然。
但是当我不这样做时,当它应该仅为1000时,它将返回1200?
CSV INPUT示例
文字,标签
“嘿”,TRUE
“晕”,假
“你好吗?”,是
预期输出
1000
输出
1200
答案 0 :(得分:0)
在功能word_in_text
中,您将更新四个命令:wordNT
,wordNF
,kaiT
和kaiF
。
然后您在迭代数据帧时两次调用word_in_text
:
# iterate every text in data
for index, row in data.iterrows():
word_in_text('foo', row['text'], row['label'])
word_in_text('bar', row['text'], row['label'])
因此,搜索结果是'foo'
和'bar'
的结果的组合。
相反,您应该在开始新搜索之前清理这四个字典:
def search(text):
wordNT = {}
wordNF = {}
kaiT = {}
kaiF = {}
# iterate every text in data
for index, row in data.iterrows():
word_in_text(text, row['text'], row['label'])
# make pandas data frame out of dict
wordTDf = pd.DataFrame.from_dict(wordNT)
wordFDf = pd.DataFrame.from_dict(wordNF)
kaiTDf = pd.DataFrame.from_dict(kaiT)
kaiFDf = pd.DataFrame.from_dict(kaiF)
# drop duplicates
wordTDf = wordTDf.drop_duplicates()
wordFDf = wordFDf.drop_duplicates()
kaiTDf = kaiTDf.drop_duplicates()
kaiFDf = kaiFDf.drop_duplicates()
# count how many
wordTrueCount = len(wordTDf.index)
wordFalseCount = len(wordFDf.index)
kaiTrueCount = len(kaiTDf.index)
kaiFalseCount = len(kaiFDf.index)
print(wordTrueCount + wordFalseCount + kaiTrueCount + kaiFalseCount)
search('foo')
search('bar')