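I have a TSV file that I loaded into a pandas DataFrame for some preprocessing. I want to find which rows contain a question and output 1 or 0 in a new column. Since it is a TSV, this is how I load it: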
import pandas as pd
df = pd.read_csv('queries-10k-txt-backup', sep='\t')
Here is a sample of what it looks like:
QUERY FREQ
0 hindi movies for adults 595
1 are panda dogs real 383
2 asuedraw winning numbers 478
3 sentry replacement keys 608
4 rebuilding nicad battery packs 541
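After dropping empty rows, duplicates, and the FREQ column (which is not needed here), I wrote a simple function that checks the QUERY column to see whether it contains any of the words that make a string a question: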
df_test = df.drop_duplicates()
df_test = df_test.dropna()
df_test = df_test.drop(['FREQ'], axis = 1)
def questions(row):
    questions_list = ["what","when","where","which","who","whom","whose","why","why don't",
                      "how","how far","how long","how many","how much","how old","how come","?"]
    if row['QUERY'] in questions_list:
        return 1
    else:
        return 0

df_test['QUESTIONS'] = df_test.apply(questions, axis=1)
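But when I inspect the new DataFrame, the new column has been created and yet every value is 0. I'm not sure whether the logic in my function is wrong. I have used something similar on a DataFrame column that holds a single word per row, where it outputs 1 or 0 depending on whether the word matches. The same logic doesn't seem to work when the column holds phrases or sentences, as in this case. Any input is much appreciated!
Answer 0 (score: 1)
IIUC, you want to check whether the first word of each string is in the questions list, returning 1 if it is and 0 otherwise. In your function, rather than checking whether the whole string is in the questions list, split the string and check whether its first element is in the list. (Note that "are" has been added to the list here.)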
def questions(row):
    questions_list = ["are","what","when","where","which","who","whom","whose","why","why don't","how","how far","how long","how many","how much","how old","how come","?"]
    if row['QUERY'].split()[0] in questions_list:
        return 1
    else:
        return 0

df['QUESTIONS'] = df.apply(questions, axis=1)
This gives:
QUERY FREQ QUESTIONS
0 hindi movies for adults 595 0
1 are panda dogs real 383 1
2 asuedraw winning numbers 478 0
3 sentry replacement keys 608 0
4 rebuilding nicad battery packs 541 0
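The same first-word check can also be written without apply, using pandas string methods. The following is only a sketch under a few assumptions: the sample rows (including the mixed-case and empty ones) are made up for illustration, the name question_words is hypothetical, and lowercasing the query before comparing is an addition that the answer above does not make.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's QUERY column.
df = pd.DataFrame({"QUERY": ["hindi movies for adults",
                             "are panda dogs real",
                             "How many moons does Mars have",
                             ""]})

# Single-word question starters only: multi-word entries such as "how far"
# can never equal a single first word, so they are left out here.
question_words = {"are", "what", "when", "where", "which", "who",
                  "whom", "whose", "why", "how"}

# Lowercase, split on whitespace, and take the first token
# (rows with no tokens yield NaN instead of raising IndexError).
first_word = df["QUERY"].str.lower().str.split().str[0]

# Membership test, then convert True/False to 1/0.
df["QUESTIONS"] = first_word.isin(question_words).astype(int)
print(df)
```

Unlike row['QUERY'].split()[0], this version does not fail on an empty query string, because the missing first token simply counts as a non-match.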
Answer 1 (score: 1)
If you want to check whether any substring from questions_list occurs in the strings of the DataFrame, you should use the str.contains method:
import re

questions_list = ["what","when","where","which","who","whom","whose","why",
                  "why don't", "how","how far","how long","how many",
                  "how much","how old","how come","?"]
# Build a regex from the list; re.escape keeps "?" literal (a bare "?" is not a valid regex)
pattern = "|".join(re.escape(q) for q in questions_list)
df_test['QUESTIONS'] = df_test['QUERY'].str.contains(pattern)
A simplified example:
df = pd.DataFrame({
'QUERY': ['how do you like it', 'what\'s going on?', 'quick brown fox'],
'ID': [0, 1, 2]})
Build the pattern:
pattern = '|'.join(['what', 'how'])
pattern
Out: 'what|how'
Use it:
df['QUERY'].str.contains(pattern)
Out[12]:
0 True
1 True
2 False
Name: QUERY, dtype: bool
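Since the question asks for 1 or 0 rather than True or False, the boolean result can be cast to integers. A minimal sketch, assuming the same kind of simplified data as above; the capitalized "What's" row, case=False, and na=False are illustrative additions, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({"QUERY": ["how do you like it",
                             "What's going on?",
                             "quick brown fox"]})

pattern = "|".join(["what", "how"])

# case=False ignores letter case; na=False treats missing values as non-matches.
# astype(int) turns True/False into 1/0.
df["QUESTIONS"] = df["QUERY"].str.contains(pattern, case=False, na=False).astype(int)
print(df)
```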
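If you are not familiar with regular expressions, there's a quick reference in the Python re documentation. The symbol '|' is explained there as:
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way.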
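One caveat with plain substring alternation is that "who" also matches inside "whole", and "how" inside "showtimes". A possible refinement, offered only as a sketch, is to anchor the words at word boundaries and to escape each entry. The shortened word list, the sample queries, and the name alternatives are assumptions made for illustration:

```python
import re
import pandas as pd

question_words = ["what", "who", "how", "why don't"]

# re.escape protects any regex metacharacters inside the list entries.
alternatives = "|".join(re.escape(w) for w in question_words)

# \b anchors the words at word boundaries. A literal "?" is matched separately,
# since \b is defined relative to word characters and "?" is not one.
# (?:...) is a non-capturing group, which avoids pandas' warning about match groups.
pattern = rf"\b(?:{alternatives})\b|\?"

df = pd.DataFrame({"QUERY": ["whole wheat bread recipes",
                             "who won the game",
                             "is it raining?",
                             "showtimes near me"]})

df["QUESTIONS"] = df["QUERY"].str.contains(pattern, case=False).astype(int)
print(df)
```

With this pattern only the second and third rows are flagged, whereas an unanchored 'who|how' pattern would flag the first and last rows as well.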