I have a dataset in a .tsv file with the following structure:
user_000001 2009-05-04T23:08:57Z f1b1cf71-bd35-4e99-8624-24a6e15f133a Deep Dish Fuck Me Im Famous (Pacha Ibiza)-09-28-2007
user_000001 2009-05-04T13:54:10Z a7f7df4a-77d8-4f12-8acd-5c60c93f4de8 坂本龍一 Composition 0919 (Live_2009_4_15)
user_000002 2009-05-04T13:52:04Z a7f7df4a-77d8-4f12-8acd-5c60c93f4de8 坂本龍一 Mc2 (Live_2009_4_15)
user_000002 2009-05-04T13:42:52Z a7f7df4a-77d8-4f12-8acd-5c60c93f4de8 坂本龍一 Hibari (Live_2009_4_15)
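This is a dataset of users listening to music. The columns are: user ID, the date and time at which the user listened to a given song, artist ID, artist name, track ID, and track name.
Here is how I read this dataset: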
import io
import csv
import pandas as pd
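# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
df = pd.read_csv('lastfm-dataset-1K/fixed.tsv', sep='\t', on_bad_lines='skip')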
df.columns = ['user', 'date', 'artid', 'artname', 'trackid', 'trackname']
df['date'] = pd.to_datetime(df['date'])
sessid = 0
# The new dataframe will have the following columns
newDF = pd.DataFrame(columns=['sessid', 'user', 'trackid', 'count'])
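One detail worth checking in the reading step: if the file has no header row (an assumption here; the sample rows above suggest it), `read_csv` consumes the first data row as column names before `df.columns` is reassigned. A minimal sketch of passing the names at read time instead, using a made-up two-row sample rather than the real file:

```python
import io
import pandas as pd

# Hypothetical two-row sample in the six-column layout described above
# (the IDs and names are made up for illustration).
sample = (
    "user_000001\t2009-05-04T23:08:57Z\tart-1\tArtist A\ttrk-1\tSong One\n"
    "user_000002\t2009-05-04T13:52:04Z\tart-2\tArtist B\ttrk-2\tSong Two\n"
)

COLUMNS = ['user', 'date', 'artid', 'artname', 'trackid', 'trackname']

# header=None keeps the first data row from being consumed as a header;
# parse_dates converts the 'date' column while reading.
sample_df = pd.read_csv(io.StringIO(sample), sep='\t', header=None,
                        names=COLUMNS, parse_dates=['date'])
print(sample_df[['user', 'date', 'trackid']])
```

With the real file, the `io.StringIO(sample)` argument would be replaced by the file path.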
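So I want to create "sessions" in order to find out which songs each user listened to during each hour of each day.
A session ID is a simple incrementing integer, starting at 0, that identifies one hour of listening on a given day (grouping per day instead seems complicated, because the window would have to slide one day at a time).
The count column records how many times the user listened to a song.
Can anyone explain how to do this? Thanks.
As Vivek Kalyanarangan suggested, the expected output would be: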
SessionID, user, trackid, count
1, user_00001, id_song1, 1
1, user_00001, id_song2, 4
1, user_00001, id_song3, 2
# Different session ID because it is a different user, who may or may not have listened to the same songs (just an example)
2, user_00002, id_song1, 2
2, user_00002, id_song3, 1
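I got something wrong about sessions. A listening session is one in which two different songs are played less than one hour apart. So if the second song is played, say, 1 hour 20 minutes after the first, the session contains only the first song.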
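The gap-based definition above can be sketched in pandas without an explicit loop. This is only an illustration under stated assumptions: the `plays` frame, its user and track IDs, and the choice to start a new session when the gap is one hour or more are all made up for the example.

```python
import pandas as pd

# Hypothetical plays; user and track IDs are made up for illustration.
plays = pd.DataFrame({
    'user': ['u1', 'u1', 'u1', 'u2', 'u2'],
    'date': pd.to_datetime([
        '2009-05-04 13:00:00', '2009-05-04 13:30:00', '2009-05-04 14:50:00',
        '2009-05-04 13:10:00', '2009-05-04 13:20:00',
    ]),
    'trackid': ['a', 'a', 'b', 'a', 'c'],
})

plays = plays.sort_values(['user', 'date']).reset_index(drop=True)

# Time since the same user's previous play (NaT for each user's first play).
gap = plays.groupby('user')['date'].diff()

# A session starts at a user's first play or after a gap of one hour or more.
new_session = gap.isna() | (gap >= pd.Timedelta(hours=1))
plays['sessid'] = new_session.cumsum()

# Count how often each track occurs within each session.
session_counts = (plays.groupby(['sessid', 'user', 'trackid'])
                       .size()
                       .reset_index(name='count'))
print(session_counts)
```

Here the third play of `u1` comes 80 minutes after the second, so it opens a new session, and every user's first play opens one as well.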
Answer 0 (score: 1)
I think you need GroupBy.size:
df = df.sort_values('date')
df['sessid'] = pd.factorize(df['date'].dt.floor('H'))[0] + 1
df = df.groupby(['sessid', 'user','trackid']).size().reset_index(name='count')
print (df)
sessid user trackid count
0 1 user_000001 Fuck Me Im Famous 1
1 2 user_000001 Composition 0919 1
2 2 user_000002 Hibari 1
3 2 user_000002 Mc2 1
Details:
First, use floor to round each date down to the hour:
print (df['date'].dt.floor('H'))
0 2009-05-04 23:00:00
1 2009-05-04 13:00:00
2 2009-05-04 13:00:00
3 2009-05-04 13:00:00
Name: date, dtype: datetime64[ns]
Then convert the result to integer codes with factorize:
print (pd.factorize(df['date'].dt.floor('H'))[0] + 1)
[1 2 2 2]
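Note that IDs factorized from the hour alone are shared by every user who listened during that hour. If a distinct ID per user and hour is wanted, as in the expected output of the question, one possible sketch uses `GroupBy.ngroup`. The `plays` frame and the column names `hour_id` and `sessid` below are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical plays: two users listening during the same clock hour.
plays = pd.DataFrame({
    'user': ['u1', 'u2', 'u1'],
    'date': pd.to_datetime(['2009-05-04 13:05:00', '2009-05-04 13:40:00',
                            '2009-05-04 23:08:00']),
})

hour = plays['date'].dt.floor('h').rename('hour')

# Hour-only IDs: both users get the same ID for the 13:00 bucket.
plays['hour_id'] = pd.factorize(hour)[0] + 1

# One ID per (user, hour) pair, numbered in sorted group order.
plays['sessid'] = plays.groupby(['user', hour]).ngroup() + 1
print(plays)
```

In this sample `hour_id` is 1 for both users at 13:00, while `sessid` tells them apart.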
Answer 1 (score: 1)
I'm not very comfortable with Pandas, so starting from @jezrael's solution, I came up with a solution for your second edit.
import pandas as pd

datasetPath = ""
sessionPath = ""
resultPath = r""

# Return the difference in hours from date1 to date2 (positive when date2 is later)
def delta(date1, date2):
    # Timestamp subtraction works for naive values and for UTC-aware ones such as 2006-08-13T13:59:20Z
    return (pd.Timestamp(date2) - pd.Timestamp(date1)).total_seconds() / 3600
# Read the dataset (on_bad_lines='skip' replaces error_bad_lines=False, removed in pandas 2.0)
df = pd.read_csv(datasetPath, sep='\t', on_bad_lines='skip')
df.columns = ['user', 'date', 'artid', 'artname', 'trackid', 'trackname']
# Parse the dates, then order by user and date
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(by=['user', 'date'])
# Delete unused columns
df = df.drop(['artid', 'artname', 'trackname'], axis=1)
# Get unique user ids
id_users = df['user'].unique()
# Convert the dataframe to a NumPy array so it can be iterated with a loop
# (to_numpy replaces as_matrix, which was removed in pandas 1.0)
# Columns: user, date, trackid
np_matrix = df.to_numpy()
numrows = len(np_matrix)
# Sessions start at one
session = 1
count = 0
strings = []
for user in id_users:
    print("Cycle user: " + str(user))
    while count < numrows:
        # Same user
        if user == np_matrix[count][0]:
            # Every play belongs to the current session
            strings.append(str(session) + "\t" + str(user) + "\t" + str(np_matrix[count][2]))
            # Guard against running past the last row, and only compare plays of the same user
            if count + 1 < numrows and user == np_matrix[count + 1][0]:
                # If the next play is more than one hour later, it starts a new session
                if delta(np_matrix[count][1], np_matrix[count + 1][1]) > 1.0:
                    session = session + 1
        count = count + 1
    # A new user always starts a new session
    session = session + 1
    count = 0
# Write the header and the session rows ("w" so that reruns do not duplicate rows)
with open(sessionPath, "w") as out_file:
    out_file.write("session" + "\t" + "user" + "\t" + "trackid\n")
    for string in strings:
        out_file.write("%s\n" % string)
# Count how many times each song occurs within a session
df2 = pd.read_csv(sessionPath, sep='\t', on_bad_lines='skip')
df2.columns = ['session', 'user', 'trackid']
df2 = df2.groupby(['session', 'user', 'trackid']).size().reset_index(name='count')
#print (df2)
df2.to_csv(resultPath, header=True, index=False, sep='\t', mode='w')
Sounds good?