Question

I'm using Pandas with a DataFrame called df. I extract new features from it and combine the two resulting DataFrames with the original using pd.concat. Here is my function:

def get_processed_df(df, rare_cols, threshold=10):
    print("df at start", df.shape)

    df = df[pd.notnull(df["FullDescription"]) &  
            pd.notnull(df["Title"]) & 
            pd.notnull(df["SalaryNormalized"])]
    print("df after filtering nulls", df.shape)

    tfidf_desc = get_tfidf_df(df, 
                              "FullDescription", 
                              max_features=100, 
                              prefix="DESC", 
                              tokenize=tokenize)
    print("tfidf_desc shape: ", tfidf_desc.shape)

    tfidf_title = get_tfidf_df(df, 
                               "Title", 
                               max_features=100, 
                               prefix="TITLE", 
                               tokenize=tokenize)
    print("tfidf_title shape: ", tfidf_title.shape)

    df.drop("FullDescription", inplace=True, axis=1)
    df.drop("Title", inplace=True, axis=1)

    final_df = pd.concat([df, tfidf_desc, tfidf_title], axis=1)
    print("final df shape: ", final_df.shape)

    return final_df

When I run it, I get the following output:

df at start (10000, 12)
df after filtering nulls (9999, 12)
tfidf_desc shape:  (9999, 100)
tfidf_title shape:  (9999, 100)
final df shape:  (10000, 210)

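So my filtering step removed one row from the original df, and the tfidf_desc and tfidf_title DataFrames also have 9,999 rows. I then join them with pd.concat using axis=1, and somehow end up with a 10,000-row DataFrame that contains all NaNs for the "Title"- and "FullDescription"-based features.

Any idea why this happens?

Thanks!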

Answer 1

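The index is not reset after filtering, and that causes the problem when you concatenate. With axis=1, pd.concat aligns rows by index label, not by position. The filtered df keeps its original labels, with a gap where the row was dropped, while tfidf_desc and tfidf_title presumably get a fresh index running from 0 to 9998 (get_tfidf_df is not shown, so this is an assumption). The union of those labels has 10,000 entries, so the result has 10,000 rows, with NaN wherever one of the frames has no row for a label. Try this after filtering df: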

df = df.reset_index(drop=True)
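
To see the effect on a small scale, here is a minimal sketch with toy data. The `features` frame is only a hypothetical stand-in for what get_tfidf_df returns (that function is not shown in the question); the assumption is that it is built from a bare array and therefore gets a default 0..n-1 index.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the question's data: 5 rows, one of them with a null Title.
df = pd.DataFrame({
    "Title": ["a", None, "c", "d", "e"],
    "SalaryNormalized": [10, 20, 30, 40, 50],
})

# Filtering keeps the original index labels: 0, 2, 3, 4 (label 1 is gone).
filtered = df[pd.notnull(df["Title"])]

# Hypothetical stand-in for get_tfidf_df: a frame built from a plain array,
# so it receives a fresh default index 0, 1, 2, 3.
values = np.ones((len(filtered), 2))
features = pd.DataFrame(values, columns=["F_0", "F_1"])

# axis=1 aligns on index labels; the union of labels is 0..4, so 5 rows come back.
misaligned = pd.concat([filtered, features], axis=1)
print(misaligned.shape)

# Fix from the answer: reset the index after filtering so the labels match.
fixed = pd.concat([filtered.reset_index(drop=True), features], axis=1)
print(fixed.shape)

# Alternative: give the feature frame the filtered frame's index instead.
features_same_index = pd.DataFrame(values, columns=["F_0", "F_1"],
                                   index=filtered.index)
aligned = pd.concat([filtered, features_same_index], axis=1)
print(aligned.shape)
```

The second variant is useful when the original row labels need to survive. It only applies if you can change how the feature frames are constructed.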
