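I am using Pandas and Sklearn for a Kaggle competition. I am using TfidfVectorizer to extract the 100 most important features from each of the Title and FullDescription columns. I prefix words that occur in Title with "TITLE_" and words that occur in FullDescription with "DESC_".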
Unfortunately, I have one row (the 10,000th) that is valid before I modify anything in the DataFrame:
df.iloc[9999]
gives me:
Id                    66190737
Title                 Experienced RHAD Job Lincoln
FullDescription       We provide a community hearing service to thos...
LocationRaw           Lincoln, Lincolnshire
LocationNormalized    Lincoln
ContractType          full_time
ContractTime          NaN
Company               NaN
Category              Healthcare & Nursing Jobs
SalaryRaw             18,000 to 40,000 per year
SalaryNormalized      29000
SourceName            careworx.co.uk
Name: 9999, dtype: object
After running TfidfVectorizer, every feature in the 10,000th row that comes from Title or FullDescription (prefixed with "TITLE_" or "DESC_") is NaN.
This is what I am doing:
import nltk
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

STEMMER = nltk.stem.porter.PorterStemmer()
tokenizer = RegexpTokenizer(r'\w+')

def stem_tokens(tokens, stemmer=STEMMER):
    return [stemmer.stem(item) for item in tokens]

def tokenize(text, tokenizer=tokenizer):
    tokens = tokenizer.tokenize(text)
    return stem_tokens(tokens)

def get_tfidf_df(df, col, max_features, prefix, tokenize):
    vec = TfidfVectorizer(tokenizer=tokenize, stop_words="english", max_features=max_features)
    tfidf = vec.fit_transform(df[col])
    tfidf_array = tfidf.toarray()
    # Newer scikit-learn releases name this method get_feature_names_out().
    tfidf_df = pd.DataFrame(tfidf_array, columns=vec.get_feature_names())
    tfidf_df.columns = [prefix + "_" + str(col) for col in tfidf_df.columns]
    return tfidf_df

def group_rare_entries(df, col, threshold):
    # Mark values that occur fewer than `threshold` times in the column.
    mask = df[col].groupby(df[col]).transform("count").lt(threshold)
    df.loc[mask, col] = "RARE"

def address_rare_entries(df, cols, threshold):
    for col in cols:
        group_rare_entries(df, col, threshold)

def get_processed_df(df, rare_cols, threshold=10):
    address_rare_entries(df, rare_cols, threshold)
    # Keep only rows where all three columns are present.
    df = df[pd.notnull(df["FullDescription"]) & pd.notnull(df["Title"]) & pd.notnull(df["SalaryNormalized"])]
    tfidf_desc = get_tfidf_df(df, "FullDescription", max_features=100, prefix="DESC", tokenize=tokenize)
    tfidf_title = get_tfidf_df(df, "Title", max_features=100, prefix="TITLE", tokenize=tokenize)
    df.drop("FullDescription", inplace=True, axis=1)
    df.drop("Title", inplace=True, axis=1)
    final_df = pd.concat([df, tfidf_desc, tfidf_title], axis=1)
    return final_df

final_df = get_processed_df(df, rare_cols=["LocationNormalized", "Company", "Category", "SourceName"], threshold=10)
Then, when I look at my 10,000th row:
final_df.iloc[9999].isnull().sum()
I get back:
202
I checked manually, and all of the TITLE_ and DESC_ features in that row are NaN.
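To narrow this down, the symptom can be reproduced in isolation. The sketch below is a guess at the mechanism, not a confirmed diagnosis: it assumes the row filter leaves gaps in the index of df, while a frame built from the TF-IDF array gets a fresh 0..n-1 index, so pd.concat(axis=1) pairs rows by index label rather than by position. The frames `left` and `right` and the column `DESC_nurs` are made-up stand-ins, not part of the real data.

```python
import pandas as pd

# Toy stand-in for the filtered job data: index label 1 is missing,
# as it would be after rows are dropped with a boolean mask.
left = pd.DataFrame({"SalaryNormalized": [29000, 31000, 45000]}, index=[0, 2, 3])

# Toy stand-in for a TF-IDF frame built from an array: fresh 0..n-1 index.
right = pd.DataFrame({"DESC_nurs": [0.1, 0.2, 0.3]})

# concat(axis=1) aligns on index labels (outer join), not on row position,
# so labels present in only one frame produce NaN in the other's columns.
misaligned = pd.concat([left, right], axis=1)
print(misaligned)

# Giving the feature frame the same index as the filtered data keeps rows paired.
right_aligned = pd.DataFrame({"DESC_nurs": [0.1, 0.2, 0.3]}, index=left.index)
aligned = pd.concat([left, right_aligned], axis=1)
print(aligned)
```

If this is what is happening in the real pipeline, the rows whose labels fall outside the new 0..n-1 range (typically those at the end) would be the ones with NaN features, and an extra all-NaN row would show up for each label that was filtered out.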
Any idea what is going on here? I am a bit suspicious that this only seems to happen in the last row...
Thanks!