Sklearn: Surprising behavior using TfidfVectorizer

Time: 2018-02-20 00:33:06

Tags: python pandas scikit-learn

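I'm using Pandas and Sklearn to take part in a Kaggle competition. I'm using TfidfVectorizer to build features from the 100 most important terms in each of the Title and FullDescription columns. I prefix the words occurring in Title with "TITLE_" and the words occurring in FullDescription with "DESC_".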

Unfortunately, something goes wrong with one of my rows (the 10,000th row). It is valid before I modify anything in the DataFrame:

df.iloc[9999]

gives me:

Id                                                             66190737
Title                                      Experienced RHAD Job Lincoln
FullDescription       We provide a community hearing service to thos...
LocationRaw                                       Lincoln, Lincolnshire
LocationNormalized                                              Lincoln
ContractType                                                  full_time
ContractTime                                                        NaN
Company                                                             NaN
Category                                      Healthcare & Nursing Jobs
SalaryRaw                                     18,000 to 40,000 per year
SalaryNormalized                                                  29000
SourceName                                               careworx.co.uk
Name: 9999, dtype: object

After using TfidfVectorizer, all of the features in the 10,000th row that come from Title and FullDescription (prefixed with "TITLE_" or "DESC_") are NaN.

This is what I'm doing:

import nltk
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

STEMMER = nltk.stem.porter.PorterStemmer()
tokenizer = RegexpTokenizer(r'\w+')

def stem_tokens(tokens, stemmer=STEMMER):
    return [stemmer.stem(item) for item in tokens]
def tokenize(text, tokenizer=tokenizer):
    tokens = tokenizer.tokenize(text)
    return stem_tokens(tokens)

def get_tfidf_df(df, col, max_features, prefix, tokenize):
    vec = TfidfVectorizer(tokenizer=tokenize, stop_words="english", max_features=max_features)
    tfidf = vec.fit_transform(df[col])
    tfidf_array = tfidf.toarray()
    tfidf_df = pd.DataFrame(tfidf_array, columns=vec.get_feature_names())
    tfidf_df.columns = [prefix + "_" + str(col) for col in tfidf_df.columns]
    return tfidf_df

def group_rare_entries(df, col, threshold):
    mask = df[col].groupby(df[col]).transform("count").lt(threshold)
    df[col][mask] = "RARE"

def address_rare_entries(df, cols, threshold):
    for col in cols:
        group_rare_entries(df, col, threshold)

def get_processed_df(df, rare_cols, threshold=10):
    address_rare_entries(df, rare_cols, threshold)
    df = df[pd.notnull(df["FullDescription"]) & pd.notnull(df["Title"]) & pd.notnull(df["SalaryNormalized"])]
    tfidf_desc = get_tfidf_df(df, "FullDescription", max_features=100, prefix="DESC", tokenize=tokenize)
    tfidf_title = get_tfidf_df(df, "Title", max_features=100, prefix="TITLE", tokenize=tokenize)
    df.drop("FullDescription", inplace=True, axis=1)
    df.drop("Title", inplace=True, axis=1)
    final_df = pd.concat([df, tfidf_desc, tfidf_title], axis=1)
    return final_df


final_df = get_processed_df(df, rare_cols=["LocationNormalized", "Company", "Category", "SourceName"], threshold=10)

Then, when I look at my 10,000th row:

final_df.iloc[9999].isnull().sum()

I get back:

202

I checked manually, and all of the TITLE_ and DESC_ features in that row are NaN.

Any idea what's going on here? I'm a bit suspicious, because this only seems to happen in the last row...
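A minimal sketch of one way a NaN-only last row of this kind can arise, assuming the notnull filter in get_processed_df dropped at least one earlier row (not verified against the real data). The toy frame, its values, and the column TITLE_x are made up for illustration. The filtered frame keeps its original index labels, while a frame built from a bare array gets a fresh 0..n-1 index, and pd.concat with axis=1 pairs rows by label rather than by position.

```python
import pandas as pd

# Toy stand-in for the job-postings frame; the values are made up.
df = pd.DataFrame({
    "Title": ["nurse", None, "driver", "chef"],
    "SalaryNormalized": [1, 2, 3, 4],
})

# Same kind of filter as in get_processed_df: label 1 is dropped,
# so the surviving labels are 0, 2, 3 (not 0, 1, 2).
kept = df[pd.notnull(df["Title"])]

# Stand-in for the TF-IDF frame: built without an index, so it gets
# a fresh RangeIndex 0, 1, 2. TITLE_x is a hypothetical feature name.
features = pd.DataFrame({"TITLE_x": [0.1, 0.2, 0.3]})

# concat(axis=1) aligns on index labels. Label 3 exists only in `kept`,
# so its feature value is NaN; label 1 exists only in `features`.
misaligned = pd.concat([kept, features], axis=1)

# Giving the feature frame the filtered frame's index pairs rows correctly.
features_aligned = features.set_index(kept.index)
aligned = pd.concat([kept, features_aligned], axis=1)

print(misaligned)
print(aligned)
```

In this sketch the row with the highest label ends up with NaN features, and rows after the dropped one are silently paired with the wrong feature values. Passing index=df.index when building the feature frame, or calling reset_index(drop=True) on the filtered frame before concatenating, would be alternative ways to keep the two frames aligned.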

Thanks!

0 answers:

No answers