Read a dataset and group some information using Pandas

Date: 2018-08-04 10:51:07

Tags: python pandas dataframe dataset

I have a dataset in a .tsv file with the following structure:

user_000001 2009-05-04T23:08:57Z    f1b1cf71-bd35-4e99-8624-24a6e15f133a    Deep Dish       Fuck Me Im Famous (Pacha Ibiza)-09-28-2007
user_000001 2009-05-04T13:54:10Z    a7f7df4a-77d8-4f12-8acd-5c60c93f4de8    坂本龍一        Composition 0919 (Live_2009_4_15)
user_000002 2009-05-04T13:52:04Z    a7f7df4a-77d8-4f12-8acd-5c60c93f4de8    坂本龍一        Mc2 (Live_2009_4_15)
user_000002 2009-05-04T13:42:52Z    a7f7df4a-77d8-4f12-8acd-5c60c93f4de8    坂本龍一        Hibari (Live_2009_4_15)

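This is a dataset of users listening to music. The columns are: user ID, the date and time at which the user listened to a particular song, artist ID, artist name, track ID and track name.

Here is an example of how I read this dataset: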

import io
import csv
import pandas as pd

# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
df = pd.read_csv('lastfm-dataset-1K/fixed.tsv', sep='\t', on_bad_lines='skip')
df.columns = ['user', 'date', 'artid', 'artname', 'trackid', 'trackname']
df['date'] = pd.to_datetime(df['date'])
sessid = 0
# The new dataframe will have the following columns
newDF = pd.DataFrame(columns=['sessid', 'user', 'trackid', 'count'])
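
A side note on the reading step: if the file has no header row (an assumption, since the sample rows above start directly with data), read_csv with its default settings consumes the first record as the header, and assigning df.columns afterwards silently loses that record. The sketch below passes header=None and names instead. The three sample rows and their IDs are made up for illustration and are read from an in-memory string rather than from the real file.

```python
import io

import pandas as pd

# Hypothetical sample rows in the same six-column layout (IDs are placeholders)
sample = (
    "user_000001\t2009-05-04T23:08:57Z\tartist-1\tArtist One\ttrack-1\tSong One\n"
    "user_000001\t2009-05-04T13:54:10Z\tartist-2\tArtist Two\ttrack-2\tSong Two\n"
    "user_000002\t2009-05-04T13:52:04Z\tartist-2\tArtist Two\ttrack-3\tSong Three\n"
)

columns = ['user', 'date', 'artid', 'artname', 'trackid', 'trackname']

# header=None tells pandas that the first line is data, not column names
df = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=columns)
df['date'] = pd.to_datetime(df['date'])

print(df[['user', 'date', 'trackid']])
```

With the real file, io.StringIO(sample) would be replaced by the path to the .tsv file.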

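So, I would like to create "sessions" in order to know which songs each user listened to in each hour of each day.

A session is a simple incremental integer, starting from 0, that represents one hour of listening on a given day (grouping by day instead would be complicated, because the window would have to slide day by day).

The count column is there to know how many times the user listened to a song.

Can anyone explain how I can do this? Thanks

Edit 1

As suggested by Vivek Kalyanarangan, the expected output would be: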

SessionID, user, trackid, count
1, user_00001, id_song1, 1
1, user_00001, id_song2, 4
1, user_00001, id_song3, 2
# Different session id because of a different user, who may or may not have listened to the same songs (just an example)
2, user_00002, id_song1, 2
2, user_00002, id_song3, 1

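Edit 2

I said something wrong about sessions. A listening session is one in which two different songs are listened to less than one hour apart. So if the second song is listened to, say, 1 hour 20 minutes after the first, the session consists of the first song only.

2 answers:

Answer 0: (score: 1)

I think you need GroupBy.size: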

df = df.sort_values('date')
df['sessid'] = pd.factorize(df['date'].dt.floor('H'))[0] + 1
df = df.groupby(['sessid', 'user','trackid']).size().reset_index(name='count')
print (df)
   sessid         user            trackid  count
0       1  user_000001  Fuck Me Im Famous      1
1       2  user_000001   Composition 0919      1
2       2  user_000002             Hibari      1
3       2  user_000002                Mc2      1

Details

First, use floor to truncate each date to the hour:

print (df['date'].dt.floor('H'))
0   2009-05-04 23:00:00
1   2009-05-04 13:00:00
2   2009-05-04 13:00:00
3   2009-05-04 13:00:00
Name: date, dtype: datetime64[ns]

Then convert it to numbers with factorize:

print (pd.factorize(df['date'].dt.floor('H'))[0] + 1)
[1 2 2 2]
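
Two properties of this approach are worth seeing on the four sample rows from the question. factorize numbers the hours in order of first appearance, so the result depends on whether the frame was sorted first. It also gives two users the same number when they listen in the same hour. The sketch below reproduces the array above and then shows one possible per-user variant based on GroupBy.ngroup. The per-user variant is an assumption about what the expected output in Edit 1 asks for, not part of the answer's method.

```python
import pandas as pd

# The four sample rows from the question (only the columns needed here)
df = pd.DataFrame({
    'user': ['user_000001', 'user_000001', 'user_000002', 'user_000002'],
    'date': pd.to_datetime([
        '2009-05-04T23:08:57Z',
        '2009-05-04T13:54:10Z',
        '2009-05-04T13:52:04Z',
        '2009-05-04T13:42:52Z',
    ]),
})

# Truncate every timestamp to the start of its hour
df['hour'] = df['date'].dt.floor('h')

# factorize assigns codes in order of first appearance: 23:00 is seen first
codes, uniques = pd.factorize(df['hour'])
print(codes + 1)  # [1 2 2 2]

# After sorting by date, 13:00 is seen first, so the numbering changes
codes_sorted, _ = pd.factorize(df.sort_values('date')['hour'])
print(codes_sorted + 1)  # [1 1 1 2]

# Per-user variant: number each (user, hour) pair separately
df['sessid'] = df.groupby(['user', 'hour']).ngroup() + 1
print(df[['user', 'hour', 'sessid']])
```

With ngroup the groups are numbered in sorted key order, so the two users no longer share a session ID for the 13:00 hour.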

Answer 1: (score: 1)

I am not very comfortable with Pandas, so, starting from @jezrael's solution, I came up with a solution that addresses your second edit.

import pandas as pd

datasetPath = ""
sessionPath = ""
resultPath = r""

# Return the difference date2 - date1 in hours
def delta(date1, date2):
    # Both arguments are pandas Timestamps (the 'date' column is converted
    # with pd.to_datetime), so they can be subtracted directly
    return (date2 - date1).total_seconds() / 3600

# Read the dataset 
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
df = pd.read_csv(datasetPath, sep='\t', on_bad_lines='skip')
df.columns = ['user', 'date', 'artid', 'artname', 'trackid', 'trackname']

# Order by user and date
df = df.sort_values(by=['user', 'date'])
df['date'] = pd.to_datetime(df['date'])
# Delete useless column
df = df.drop(['artid', 'artname','trackname'], axis=1)

# Get unique user id
id_users = df['user'].unique()

# Convert the dataframe into a NumPy array in order to use a for loop
# (to_numpy() replaces as_matrix(), which was removed in pandas 1.0)
# user  date    trackid
np_matrix = df.to_numpy()
numrows = len(np_matrix) 

# Session start by one
session = 1
count = 0

# Open in write mode so that re-running the script does not append a second header
out_file = open(sessionPath, "w")
out_file.write("session" + "\t" + "user" + "\t" + "trackid\n")

strings=[]

# The rows are sorted by user and date, so one pass over the array is enough
for count in range(numrows):
    user = np_matrix[count][0]

    # Every row belongs to the current session
    strings.append(str(session) + "\t" + str(user) + "\t" + str(np_matrix[count][2]))

    # Check about overflow
    if count + 1 < numrows:
        next_user = np_matrix[count + 1][0]
        # Start a new session when the user changes or when the next song
        # is listened to more than one hour later
        if next_user != user or delta(np_matrix[count][1], np_matrix[count + 1][1]) > 1.0:
            session = session + 1

for string in strings:
  out_file.write("%s\n" % string)

out_file.close()


# Count the same song in one session
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
df2 = pd.read_csv(sessionPath, sep='\t', on_bad_lines='skip')
df2.columns = ['session', 'user', 'trackid']
df2 = df2.groupby(['session', 'user','trackid']).size().reset_index(name='count')
#print (df2)
df2.to_csv(resultPath, header=True, index=False, sep='\t', mode='w')

Does that sound good?
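
For comparison, the session rule from Edit 2 can also be expressed without an explicit loop or an intermediate file. The following is a minimal sketch on made-up rows, not the method used in either answer. It assumes that a gap of one hour or more since the same user's previous song starts a new session (following "less than one hour" in the question), and that a change of user always starts a new session. The user names and track IDs are hypothetical.

```python
import pandas as pd

# Hypothetical listening events for two users
df = pd.DataFrame({
    'user': ['user_a', 'user_a', 'user_a', 'user_a', 'user_b', 'user_b'],
    'date': pd.to_datetime([
        '2009-05-04T10:00:00Z',
        '2009-05-04T10:30:00Z',
        '2009-05-04T10:50:00Z',
        '2009-05-04T12:30:00Z',  # 1 h 40 min after the previous song
        '2009-05-04T10:05:00Z',
        '2009-05-04T10:10:00Z',
    ]),
    'trackid': ['t1', 't2', 't1', 't3', 't1', 't1'],
})

df = df.sort_values(['user', 'date']).reset_index(drop=True)

# Time since the same user's previous song (NaT for each user's first song)
gap = df.groupby('user')['date'].diff()

# A session starts at a user's first song or after a gap of one hour or more
new_session = gap.isna() | (gap >= pd.Timedelta(hours=1))

# The running total of session starts gives an incremental session number
df['session'] = new_session.cumsum()

result = (
    df.groupby(['session', 'user', 'trackid'])
      .size()
      .reset_index(name='count')
)
print(result)
```

Here user_a ends up with sessions 1 and 2 and user_b with session 3, and the count column records how many times each track was played within a session.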