I have a CSV that looks like this:
screen_name,tweet,following,followers,is_retweet,bot
narutouz16,Grad school is lonely.,59,20,0,0
narutouz16,RT @GetMadz: Sound design in this game is 10/10 game freak lied. ,59,20,1,0
narutouz16,@hbthen3rd I know I don't.,59,20,0,0
narutouz16,"@TonyKelly95 I'm still not satisfied in the ending, even though its longer.",59,20,0,0
narutouz16,I'm currently in second place in my leaderboards in duolongo.,59,20,0,0
I can read it into a dataframe with the following:
import pandas as pd
df = pd.read_csv("file.csv")
That works fine. When I print(df.shape) I get the following dimensions:
(1223726, 6)
I have a list of usernames, as follows:
bad_names = ['BELOZEROVNIKIT', 'ALTMANBELINDA', '666STEVEROGERS', 'ALVA_MC_GHEE', 'CALIFRONIAREP', 'BECCYWILL', 'BOGDANOVAO2', 'ADELE_BROCK', 'ANN1EMCCONNELL', 'ARONHOLDEN8', 'BISHOLORINE', 'BLACKTIVISTSUS', 'ANGELITHSS', 'ANWARJAMIL22', 'BREMENBOTE', 'BEN_SAR_GENT', 'ASSUNCAOWALLAS', 'AHMADRADJAB', 'AN_N_GASTON', 'BLACK_ELEVATION', 'BERT_HENLEY', 'BLACKERTHEBERR5', 'ARTHCLAUDIA', 'ALBERTA_HAYNESS', 'ADRIANAMFTTT']
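What I want to do is iterate over the dataframe and, if the username exactly matches an entry in this list, remove those rows from df and add them to a new df called bad_names_df.
Pseudocode: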
for each row in df:
    if row.username in bad_names:
        bad_names_df.append(row)
        df.remove(row)
    else:
        continue
My attempt:
for row, col in df.iterrows():
    if row['username'] in bad_user_names:
        new_df.append(row)
    else:
        continue
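How can I (efficiently) iterate over df, which has more than 1.2 million rows, and, if the username is in the bad_names list, remove that row and add it to bad_names_df? I have not found other SO posts that address this problem.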
Answer 0 (score: 1)
You can also use isin to create a mask:
mask = df["screen_name"].isin(bad_names)
print (df[mask]) #df of bad names
print (df[~mask]) #df of good names
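To end up with the two frames the question asks for, the mask can be assigned back instead of printed. This is a minimal sketch, assuming the column is screen_name as in the sample CSV header; the small inline frame and its tweet values are made up for illustration and stand in for file.csv:

```python
import pandas as pd

# Tiny stand-in for file.csv (illustrative rows, not from the original data)
df = pd.DataFrame({
    "screen_name": ["narutouz16", "BECCYWILL", "narutouz16", "ADELE_BROCK"],
    "tweet": ["a", "b", "c", "d"],
})
bad_names = ["BECCYWILL", "ADELE_BROCK"]

# True for rows whose screen_name exactly matches an entry in bad_names
mask = df["screen_name"].isin(bad_names)

bad_names_df = df[mask].copy()   # rows moved out of df
df = df[~mask].copy()            # rows that remain

print(bad_names_df.shape, df.shape)
```

isin is vectorized, so it avoids a Python-level loop over the 1.2 million rows. The comparison is case-sensitive, so a lowercase screen_name will not match the uppercase entries in bad_names unless one side is normalized first.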
Answer 1 (score: 0)
You can apply a lambda and then filter as follows:
# The CSV column is screen_name; keep is True when the name is not in bad_names
df['keep'] = df['screen_name'].apply(lambda x: x not in bad_names)
df = df[df['keep']]