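I understand the concept of using & to filter DataFrame rows on multiple conditions, but how can I write this iteratively, depending on the number of conditions (**kwargs) passed to a function?
I am trying to do this inside a for loop that iterates over a list of DataFrames (replist). The keywords passed in are strings whose keys correspond to those of another dictionary (kwarg_dict). Essentially, if I pass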
func(fruit='apple', veggie='kale')
I want to build a DataFrame from the DataFrames contained in replist whose fruit column is "apple" and whose veggie column is "kale". However, if I pass
func(fruit='apple', veggie='kale', dessert='cake')
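I want to build a DataFrame that takes all three arguments into account. The user should be able to pass any number of arguments, so the number of logical &s will vary.
For example, with two keywords I would use: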
sub_df = pd.concat(
    [sub_df, replist[i][
        (replist[i][kwarg_dict[list(kwargs)[0]]] == kwargs[list(kwargs)[0]]) &
        (replist[i][kwarg_dict[list(kwargs)[1]]] == kwargs[list(kwargs)[1]])
    ]
])
But with three keywords I would use:
sub_df = pd.concat(
    [sub_df, replist[i][
        (replist[i][kwarg_dict[list(kwargs)[0]]] == kwargs[list(kwargs)[0]]) &
        (replist[i][kwarg_dict[list(kwargs)[1]]] == kwargs[list(kwargs)[1]]) &
        (replist[i][kwarg_dict[list(kwargs)[2]]] == kwargs[list(kwargs)[2]])
    ]
])
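This does what I want, but it obviously does not generalize to an arbitrary number of keywords.
It seems clear to me that I need to iterate over range(len(kwargs)), but I am not sure how to chain the output of that iteration together with logical & operators as shown above. Thanks for your help!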
Answer 0 (score: 0)
Use a list comprehension to generalize your example:
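However, I think you would be better off filtering the DataFrame with a single mask: build one comparison per keyword, combine them with &, and simply pass the result as the row indexer (a plain list of masks cannot be used as an indexer directly):
from functools import reduce
from operator import and_
f = reduce(and_, [replist[i][kwarg_dict[var]] == kwargs[var] for var in kwargs])
sub_df = pd.concat([sub_df, replist[i][f]])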
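For reference, the same idea can be written as a self-contained function. This is only a sketch under assumptions: the function name build_subset, the column names in kwarg_dict, and the sample frames are all hypothetical and not taken from the question.

```python
from functools import reduce
from operator import and_

import pandas as pd


def build_subset(replist, kwarg_dict, **kwargs):
    """Concatenate the rows of every frame in replist that match all keywords."""
    parts = []
    for df in replist:
        # One boolean Series per keyword: column kwarg_dict[key] equals the value.
        masks = [df[kwarg_dict[key]] == value for key, value in kwargs.items()]
        # Fold the masks together with &; with no keywords, keep every row.
        mask = reduce(and_, masks, pd.Series(True, index=df.index))
        parts.append(df[mask])
    return pd.concat(parts, ignore_index=True)


# Hypothetical data, for illustration only.
kwarg_dict = {"fruit": "Fruit", "veggie": "Veggie", "dessert": "Dessert"}
replist = [
    pd.DataFrame({"Fruit": ["apple", "pear"],
                  "Veggie": ["kale", "kale"],
                  "Dessert": ["cake", "pie"]}),
    pd.DataFrame({"Fruit": ["apple", "apple"],
                  "Veggie": ["kale", "leek"],
                  "Dessert": ["pie", "cake"]}),
]

two = build_subset(replist, kwarg_dict, fruit="apple", veggie="kale")
three = build_subset(replist, kwarg_dict, fruit="apple", veggie="kale", dessert="cake")
```

Collecting the filtered pieces in a list and calling pd.concat once at the end also avoids growing sub_df inside the loop.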
Answer 1 (score: 0)
You can do it this way (not very elegant, but it works):
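# Assumes string values and trusted input, because the expression is built as text and passed to eval.
test = eval(" & ".join(["(replist[{0}]['{1}'] == '{2}')".format(i, kwarg_dict[k], v) for k, v in kwargs.items()]))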
sub_df = pd.concat([sub_df, replist[i][test]])
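If you would rather avoid eval, one possible alternative is to compare all the requested columns at once and require every one of them to match. This is a sketch only: the helper name match_all and the sample data are hypothetical, and it assumes every key in kwargs is present in kwarg_dict.

```python
import pandas as pd


def match_all(df, kwarg_dict, **kwargs):
    """Return a boolean mask of rows where every requested column has its value."""
    # Map column name -> wanted value, e.g. {"Fruit": "apple", "Veggie": "kale"}.
    wanted = pd.Series({kwarg_dict[key]: value for key, value in kwargs.items()})
    # Compare the selected columns against the wanted values column by column,
    # then require a match in every column of the row.
    return df[list(wanted.index)].eq(wanted).all(axis=1)


# Hypothetical data, for illustration only.
kwarg_dict = {"fruit": "Fruit", "veggie": "Veggie"}
df = pd.DataFrame({"Fruit": ["apple", "apple", "pear"],
                   "Veggie": ["kale", "leek", "kale"]})

mask = match_all(df, kwarg_dict, fruit="apple", veggie="kale")
subset = df[mask]
```

Unlike the eval version, this does not depend on the values being strings, and no code is assembled from user input.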