Python: Split a .txt file into columns and find the most common data item in one of the columns

Time: 2014-03-30 05:35:44

Tags: python data-structures syntax pandas

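I read from a file and store the result, with column names, in artists_tags. The file has multiple columns, and I need to produce a new data structure that contains two of the columns from artists_tags plus, as a third column, the most frequent value from the 'Tag' column. Here is what I have written so far: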

import pandas as pd
from collections import Counter

def parse_artists_tags(filename):
    df = pd.read_csv(filename, sep="|", names=["ArtistID", "ArtistName", "Tag", "Count"])
    return df

def parse_user_artists_matrix(filename):
    df = pd.read_csv(filename)
    return df

# artists_tags = parse_artists_tags(DATA_PATH + "\\artists-tags.txt")
artists_tags = parse_artists_tags("C:\\Users\\15-J001TX\\Documents\\ml_task\\artists-tags.txt")

#print(artists_tags)
user_art_mat = parse_user_artists_matrix("C:\\Users\\15-J001TX\\Documents\\ml_task\\userart-mat-training.csv")

#print ("Number of tags {0}".format(len(artists_tags))) # Change this line. Should be 952803
#print ("Number of artists {0}".format(len(user_art_mat))) # Change this line. Should be 17119

# TODO Implement this. You can change the function arguments if necessary
# Return a data structure that contains (artist id, artist name, top tag) for every artist
def calculate_top_tag(all_tags):
    temp = all_tags.Tag
    a = Counter(temp)
    a = a.most_common()
    print (a)
    top_tags = all_tags.ArtistID,all_tags.ArtistName,a;
    return top_tags

top_tags = calculate_top_tag(artists_tags)

# Print the top tag for Nirvana
# Artist ID for Nirvana is 5b11f4ce-a62d-471e-81fc-a69a8278c7da
# Should be 'Grunge'
print ("Top tag for Nirvana is {0}".format(top_tags)) # Complete this line 

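In the last function, calculate_top_tag, I don't understand how to select the most frequent value from the 'Tag' column and add it as the third column of top_tags before returning.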

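I am new to Python, and my knowledge of its syntax and data structures is limited. I tried various solutions for finding the most common value in a list, but they seem to display the whole column rather than one specific value. I know this is probably a trivial syntax issue, but after a long search I still cannot figure out how to do it.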

Edit 1: I need to find the most common tag for a specific artist, not the most common tag overall. But again, I don't know how to do that.

Edit 2: Here is a link to the data file: https://github.com/amplab/datascience-sp14/raw/master/hw2/hw2data.tar.gz

1 answer:

Answer 0 (score: 0)

There is surely a more concise way, but this should get you started:

# group the rows by ArtistID and Tag
tag_counts = artists_tags.groupby(['ArtistID', 'Tag'])
# sum up tag counts and sort in descending order
tag_counts = tag_counts[['Count']].sum().sort_values('Count', ascending=False).reset_index()
# keep only the top ranking tag per artist
top_tags = tag_counts.groupby('ArtistID').first()
# top_tags is now a dataframe which contains the top tag for every artist
# We can simply look up the top tag for Nirvana via its index:
top_tags.loc['5b11f4ce-a62d-471e-81fc-a69a8278c7da', 'Tag']
# 'Grunge'
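
If you also want the artist name in the result, as the TODO in the question asks, something along these lines might work. This is only a sketch: it assumes each ArtistID always appears with the same ArtistName, and the sample rows are made up for illustration rather than taken from artists-tags.txt.

```python
import pandas as pd


def calculate_top_tag(all_tags):
    """Return one row per artist: ArtistName and the tag with the highest total Count."""
    # Total the counts for each (artist, tag) pair.
    # Assumption: an ArtistID always appears with the same ArtistName.
    totals = all_tags.groupby(
        ["ArtistID", "ArtistName", "Tag"], as_index=False
    )["Count"].sum()
    # For each artist, idxmax gives the row label of the largest total.
    best_rows = totals.groupby("ArtistID")["Count"].idxmax()
    top = totals.loc[best_rows, ["ArtistID", "ArtistName", "Tag"]]
    return top.set_index("ArtistID")


# Hypothetical sample rows, not real data from the file.
sample = pd.DataFrame(
    {
        "ArtistID": ["a1", "a1", "a1", "b2", "b2"],
        "ArtistName": ["Artist A", "Artist A", "Artist A", "Artist B", "Artist B"],
        "Tag": ["rock", "grunge", "rock", "jazz", "blues"],
        "Count": [3, 10, 4, 2, 5],
    }
)

top_tags = calculate_top_tag(sample)
print(top_tags.loc["a1", "Tag"])  # prints "grunge" (10 beats rock's 3 + 4 = 7)
```

With the real data you would pass artists_tags instead of sample and look up Nirvana's ID the same way. As for the Counter attempt in the question: most_common() with no argument returns every (value, count) pair, which is why it looked like the whole column was printed; most_common(1) returns only the top pair.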