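I read a file and store it, with column names, in artists_tags. The file has several columns, and I need to build a new data structure that contains two columns from artists_tags plus, as a third column, the most common value from the 'Tag' column. This is what I have written so far: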
import pandas as pd
from collections import Counter
def parse_artists_tags(filename):
    df = pd.read_csv(filename, sep="|", names=["ArtistID", "ArtistName", "Tag", "Count"])
    return df
def parse_user_artists_matrix(filename):
    df = pd.read_csv(filename)
    return df
# artists_tags = parse_artists_tags(DATA_PATH + "\\artists-tags.txt")
artists_tags = parse_artists_tags("C:\\Users\\15-J001TX\\Documents\\ml_task\\artists-tags.txt")
#print(artists_tags)
user_art_mat = parse_user_artists_matrix("C:\\Users\\15-J001TX\\Documents\\ml_task\\userart-mat-training.csv")
#print ("Number of tags {0}".format(len(artists_tags))) # Change this line. Should be 952803
#print ("Number of artists {0}".format(len(user_art_mat))) # Change this line. Should be 17119
# TODO Implement this. You can change the function arguments if necessary
# Return a data structure that contains (artist id, artist name, top tag) for every artist
def calculate_top_tag(all_tags):
    temp = all_tags.Tag
    a = Counter(temp)
    a = a.most_common()
    print (a)
    top_tags = all_tags.ArtistID,all_tags.ArtistName,a
    return top_tags
top_tags = calculate_top_tag(artists_tags)
# Print the top tag for Nirvana
# Artist ID for Nirvana is 5b11f4ce-a62d-471e-81fc-a69a8278c7da
# Should be 'Grunge'
print ("Top tag for Nirvana is {0}".format(top_tags)) # Complete this line
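In the last function, calculate_top_tag, I don't understand how to pick the most common value from the 'Tag' column and return it as the third column of top_tags.
I am new to Python, and my knowledge of its syntax and data structures is limited. I tried various solutions for finding the most common value in a list, but they seem to return the whole column rather than one specific value. I know this is probably a trivial syntax issue, but after searching for a long time I still cannot figure it out.
Edit 1: I need the most common tag for each specific artist, not the most common tag overall. But again, I don't know how to do that.
Edit 2: Here is a link to the data file: https://github.com/amplab/datascience-sp14/raw/master/hw2/hw2data.tar.gz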
Answer 0 (score: 0)
There is surely a more concise way, but this should get you started:
# group the rows by ArtistID and Tag
tag_counts = artists_tags.groupby(['ArtistID', 'Tag'])
# sum up tag counts and sort in descending order
tag_counts = tag_counts['Count'].sum().sort_values(ascending=False).reset_index()
# keep only the top ranking tag per artist
top_tags = tag_counts.groupby('ArtistID').first()
# top_tags is now a dataframe which contains the top tag for every artist
# We can simply look up the top tag for Nirvana via its index:
top_tags.loc['5b11f4ce-a62d-471e-81fc-a69a8278c7da', 'Tag']
# 'Grunge'
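The question asks for (artist id, artist name, top tag), so here is a rough sketch of one way to keep the artist name as well. It is only an illustration under a few assumptions: the rows below are made-up toy data in the same four-column layout (the IDs "a1" and "a2" and the counts are hypothetical), each ArtistID is assumed to map to exactly one ArtistName, and "top tag" is taken to mean the tag with the largest summed Count.

```python
import pandas as pd

# Hypothetical toy rows in the same layout as artists-tags.txt
artists_tags = pd.DataFrame(
    {
        "ArtistID": ["a1", "a1", "a1", "a2", "a2"],
        "ArtistName": ["Nirvana", "Nirvana", "Nirvana", "Miles Davis", "Miles Davis"],
        "Tag": ["rock", "Grunge", "Grunge", "jazz", "trumpet"],
        "Count": [40, 60, 10, 90, 30],
    }
)


def calculate_top_tag(all_tags):
    # Total the counts for every (artist, tag) pair
    totals = all_tags.groupby(
        ["ArtistID", "ArtistName", "Tag"], as_index=False
    )["Count"].sum()
    # For each artist, find the row label that holds the largest total
    best_rows = totals.groupby("ArtistID")["Count"].idxmax()
    # Keep only those rows and index the result by artist id
    top = totals.loc[best_rows, ["ArtistID", "ArtistName", "Tag"]]
    return top.set_index("ArtistID")


top_tags = calculate_top_tag(artists_tags)
print("Top tag for Nirvana is {0}".format(top_tags.loc["a1", "Tag"]))
# Top tag for Nirvana is Grunge
```

With the real file you would pass the DataFrame returned by parse_artists_tags and look up Nirvana's actual ArtistID instead of "a1". Note that groupby drops rows whose key columns are missing by default, so rows without an ArtistName would be ignored in this version.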