基于堆栈上的this帖子,我尝试了像这样的值计数功能
df2 = df1.join(df1.genres.str.split(",").apply(pd.value_counts).fillna(0))
除了我的数据有22种独特的流派以及分割后得到42个值之外,它的工作正常,这当然不是唯一的。 数据示例:
Action Adventure Casual Design & Illustration Early Access Education Free to Play Indie Massively Multiplayer Photo Editing RPG Racing Simulation Software Training Sports Strategy Utilities Video Production Web Publishing Accounting Action Adventure Animation & Modeling Audio Production Casual Design & Illustration Early Access Education Free to Play Indie Massively Multiplayer Photo Editing RPG Racing Simulation Software Training Sports Strategy Utilities Video Production Web Publishing nan
0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 1.0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
(我已粘贴头部和第一行)
我感觉问题是由我的原始数据引起的。好吧,我的专栏(类型)是包含括号的列表列表
示例:[Action,Indie]
所以当python读取它时,它会将[Action and Action and Action]读作不同的值,输出是303个不同的值。
所以我做的是:
for i in df1['genres'].tolist():
if str(i) != 'nan':
i = i[1:-1]
new.append(i)
else:
new.append('nan')
答案 0 :(得分:1)
您必须按功能str.strip
从列[]
中删除第一个和最后一个genres
,然后按函数str.replace
将空格替换为空字符串
import pandas as pd
df = pd.read_csv('test/Copy of AppCrawler.csv', sep="\t")
df['genres'] = df['genres'].str.strip('[]')
df['genres'] = df['genres'].str.replace(' ', '')
df = df.join(df.genres.str.split(",").apply(pd.value_counts).fillna(0))
#temporaly display 30 rows and 60 columns
with pd.option_context('display.max_rows', 30, 'display.max_columns', 60):
print df
#remove for clarity
print df.columns
Index([u'Unnamed: 0', u'appid', u'currency', u'final_price', u'genres',
u'initial_price', u'is_free', u'metacritic', u'release_date',
u'Accounting', u'Action', u'Adventure', u'Animation&Modeling',
u'AudioProduction', u'Casual', u'Design&Illustration', u'EarlyAccess',
u'Education', u'FreetoPlay', u'Indie', u'MassivelyMultiplayer',
u'PhotoEditing', u'RPG', u'Racing', u'Simulation', u'SoftwareTraining',
u'Sports', u'Strategy', u'Utilities', u'VideoProduction',
u'WebPublishing'],
dtype='object')