How to filter a pandas DataFrame with a varying number of keyword conditions

Time: 2019-11-22 23:06:19

Tags: python python-3.x pandas dataframe

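I understand the concept of filtering DataFrame rows on multiple conditions with &, but how can I write this iteratively, depending on the number of conditions (**kwargs) passed to a function?

I am trying to do this inside a for loop that iterates over a list of DataFrames (replist). The keywords passed in are strings, and their names match the keys of another dictionary (kwarg_dict). Essentially, if I pass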

func(fruit='apple', veggie='kale')

I want to build a DataFrame from the DataFrames contained in replist whose fruit column is "apple" and whose veggie column is "kale". However, if I pass

func(fruit='apple', veggie='kale', dessert='cake')

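I want to build a DataFrame that takes all three arguments into account. The user should be able to pass any number of arguments, so the number of logical &s will vary.

For example, with two keywords I would use: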

sub_df = pd.concat(
         [sub_df, replist[i][
                             (replist[i][kwarg_dict[list(kwargs)[0]]] == kwargs[list(kwargs)[0]]) &
                             (replist[i][kwarg_dict[list(kwargs)[1]]] == kwargs[list(kwargs)[1]])
                              ]
          ])

But with three keywords I would use:

sub_df = pd.concat(
         [sub_df, replist[i][
                             (replist[i][kwarg_dict[list(kwargs)[0]]] == kwargs[list(kwargs)[0]]) &
                             (replist[i][kwarg_dict[list(kwargs)[1]]] == kwargs[list(kwargs)[1]]) &
                             (replist[i][kwarg_dict[list(kwargs)[2]]] == kwargs[list(kwargs)[2]])
                              ]
          ])

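This does what I want, but it obviously does not generalize to an arbitrary number of keywords.

It seems clear to me that I need to iterate over range(len(kwargs)), but I am not sure how to chain the output of that iteration together with logical & operators as shown above. Thanks for your help!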

2 answers:

Answer 0: (score: 0)

Use a list comprehension to generalize your example.


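I think you are best off building one condition per keyword in a list, combining the list into a single boolean mask, and passing that mask to the indexer (a list of masks cannot be used as an indexer directly; this assumes numpy is imported as np):

f = [replist[i][kwarg_dict[var]] == kwargs[var] for var in kwargs]
sub_df = pd.concat([sub_df, replist[i][np.logical_and.reduce(f)]])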
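To make the list-of-conditions idea concrete, here is a minimal self-contained sketch. The sample frames, the column names in kwarg_dict, and the body of func are assumptions made up for illustration; only the names replist, kwarg_dict and func come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from keyword names to column names.
kwarg_dict = {"fruit": "Fruit", "veggie": "Veggie", "dessert": "Dessert"}

# Hypothetical sample data standing in for the question's replist.
replist = [
    pd.DataFrame({"Fruit": ["apple", "apple", "pear"],
                  "Veggie": ["kale", "leek", "kale"],
                  "Dessert": ["cake", "cake", "pie"]}),
    pd.DataFrame({"Fruit": ["apple", "fig"],
                  "Veggie": ["kale", "kale"],
                  "Dessert": ["pie", "cake"]}),
]

def func(**kwargs):
    """Return the rows of every frame in replist that match all keywords."""
    parts = []
    for df in replist:
        # One boolean Series per keyword condition.
        conditions = [df[kwarg_dict[key]] == value
                      for key, value in kwargs.items()]
        if conditions:
            # AND all conditions together, row by row.
            mask = np.logical_and.reduce(conditions)
        else:
            # No keywords given: keep every row.
            mask = np.ones(len(df), dtype=bool)
        parts.append(df[mask])
    return pd.concat(parts, ignore_index=True)

two = func(fruit="apple", veggie="kale")
three = func(fruit="apple", veggie="kale", dessert="cake")
```

Because the conditions are collected in a list first, the same code handles two, three, or any other number of keywords. functools.reduce with operator.and_ would work in place of np.logical_and.reduce as well.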

Answer 1: (score: 0)

You can do it this way (not very elegant, but it works):


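# Build one "(column == value)" clause per keyword and join them with "&".
# kwarg_dict maps each keyword to its column name; !r quotes the values.
test = eval(" & ".join(["(replist[{0}][{1!r}] == {2!r})".format(i, kwarg_dict[k], v) for k, v in kwargs.items()]))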

sub_df = pd.concat([sub_df, replist[i][test]])
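If you would rather avoid eval, one possible alternative is to compare all requested columns at once and keep the rows where every comparison is true. This is only a sketch: the helper name select, the sample frame, and the column names in kwarg_dict are hypothetical.

```python
import pandas as pd

# Hypothetical mapping from keyword names to column names.
kwarg_dict = {"fruit": "Fruit", "veggie": "Veggie"}

def select(df, **kwargs):
    """Keep the rows of df whose mapped columns equal the given values."""
    # Series indexed by column name, holding the wanted value per column.
    wanted = pd.Series({kwarg_dict[k]: v for k, v in kwargs.items()})
    # Compare each selected column with its wanted value,
    # then require a match in every column of the row.
    mask = df[list(wanted.index)].eq(wanted, axis="columns").all(axis=1)
    return df[mask]

sample = pd.DataFrame({"Fruit": ["apple", "apple", "pear"],
                       "Veggie": ["kale", "leek", "kale"]})
result = select(sample, fruit="apple", veggie="kale")
```

This keeps the values as ordinary Python objects instead of pasting them into a string, so quoting problems and arbitrary code execution are not a concern.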