Categories from a list in a DataFrame

Time: 2017-06-10 12:20:48

Tags: python pandas lambda yelp

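There is a process on a Pandas DataFrame that I am trying to do for my capstone project on the Yelp Dataset Challenge. I found a way to do it with loops, but given the large dataset I am working with, it takes a very long time. (I tried running it for 24 hours and it still had not finished.)

Is there a more efficient way to do this in Pandas without loops?

Note: business.categories (business is a DataFrame) provides, for each business, a list of categories stored as a string (e.g. "[restaurant,entertainment,bar,nightlife]"). It is written in the format of a list but saved as a string.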

import pandas as pd

# Creates a new DataFrame with businesses as rows and category tags as columns, holding 1 or 0 depending on whether the business is in that category
categories_list = []

# Replaces missing values with the string '[]' (an empty list). This prevents null errors later in the code.
business.categories = business.categories.fillna('[]')

# Creates all categories as a single list. Goes through each business's list of categories and adds any unique values to the master list, categories_list
for x in range(len(business)):
    # business.categories stores each value as a string (even though it is formatted just like a list), so this converts it to a list
    categories = eval(str(business.categories[x]))
    # Looks at each category, adding it to categories_list if it's not already there
    for category in categories:
        if category not in categories_list:
            categories_list.append(category)

# Makes the list of categories (and business_id) the columns of the new DataFrame
categories_df = pd.DataFrame(columns = ['business_id'] + categories_list, index = business.index)

# Loops through each business and each category, storing 1 if the business has that category and 0 otherwise.
for x in range(len(business)):
    for y in range(len(categories_list)):
        cat = categories_list[y]
        if cat in eval(business.categories[x]):
            # .loc writes to the frame itself; chained indexing (df[cat][x] = ...) may write to a copy
            categories_df.loc[x, cat] = 1
        else:
            categories_df.loc[x, cat] = 0

# Imports the original business_id's into the new DataFrame. This allows me to cross-reference this DataFrame with my other datasets for analysis
categories_df.business_id = business.business_id

categories_df

2 answers:

Answer 0: (score: 0)

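Given that the data is stored as list-like strings, I don't think you can avoid looping over the DataFrame at Python speed (either explicitly, or implicitly through the str methods). (This seems like an unfortunate way to store the data. Can it be avoided upstream?) However, I have some ideas for improving the approach. Since you know the resulting index ahead of time, you can start building the DataFrame right away, without needing to know all the categories in advance, for example: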

categories_df = pd.DataFrame(index=business.index)
for ix, categories in business.categories.items():
    for cat in eval(categories):
        categories_df.loc[ix, cat] = 1   
        # if cat is not already in the columns this will add it in, with null values in the other rows
categories_df.fillna(0, inplace=True)

If you know some or all of the categories in advance, adding them as columns before the loop should help as well.
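
As a rough sketch of that idea, the columns can be created up front and filled with 0, so the loop only writes into cells that already exist. The sample business frame and the known_categories list below are made up for illustration and are not from the question:

```python
import pandas as pd

# Made-up sample data standing in for the Yelp business table (assumption).
business = pd.DataFrame({
    "business_id": ["a1", "b2", "c3"],
    "categories": ["[bar, nightlife]", "[restaurant, bar]", "[]"],
})

# Hypothetical list of categories known ahead of time.
known_categories = ["bar", "nightlife", "restaurant"]

# Pre-allocate every known column with 0 so no column is added inside the loop.
categories_df = pd.DataFrame(0, index=business.index, columns=known_categories)

for ix, categories in business.categories.items():
    for cat in categories[1:-1].split(","):
        cat = cat.strip()
        if cat:  # skip the empty token produced by '[]'
            categories_df.loc[ix, cat] = 1
```

A category missing from known_categories would still be added as a new column by .loc, with NaN in the other rows, so a final fillna(0) remains a reasonable safeguard.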

Also, you could try categories[1:-1].split(', ') instead of eval(categories). A quick test tells me it should be around 15 times faster. To ensure the same result, you should do

for ix, categories in business.categories.items():
    for cat in categories[1:-1].split(','):
        if cat.strip():  # skip the empty token that '[]' produces
            categories_df.loc[ix, cat.strip()] = 1

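to be safe, since you don't know how many spaces there may be around the commas. Avoiding most of the nested loops and the membership tests with in should speed up the program considerably.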
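
One way to check the speed claim on your own data is a small timeit comparison. This is only a sketch: the sample string is an assumption (a Python-literal list with quoted items, which is the form eval needs), and with_eval and with_split are hypothetical helper names. The split version here also strips quotes so that both functions return the same list.

```python
import timeit

# Assumed sample value: a list literal with quoted items (not taken from the dataset).
s = "['Restaurants', 'Bars', 'Nightlife']"

def with_eval(text):
    # Parses the string as a Python expression.
    return eval(text)

def with_split(text):
    # Drop the brackets, split on commas, then strip spaces and quotes.
    return [tok.strip(" '\"") for tok in text[1:-1].split(",") if tok.strip()]

same = with_eval(s) == with_split(s)
t_eval = timeit.timeit(lambda: with_eval(s), number=10_000)
t_split = timeit.timeit(lambda: with_split(s), number=10_000)
print(same, t_eval, t_split)
```

The split approach assumes no category name contains a comma; if one does, it would be cut into two tokens.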

Answer 1: (score: 0)

Not entirely sure what you are ultimately trying to do... but

Consider the dataframe business

business = pd.DataFrame(dict(
        categories=['[cat, dog]', '[bird, cat]', '[dog, bird]']
    ))

You can convert these strings to lists with

business.categories.str.strip('[]').str.split(', ')
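
Those lists can also produce the master list of unique categories from the question without a Python loop. A minimal sketch, assuming pandas 0.25 or later (for Series.explode) and the same toy frame:

```python
import pandas as pd

business = pd.DataFrame(dict(
    categories=['[cat, dog]', '[bird, cat]', '[dog, bird]']
))

lists = business.categories.str.strip('[]').str.split(', ')
# explode gives one row per (business, category) pair; unique keeps the distinct values.
categories_list = sorted(lists.explode().unique())
```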

Or even get the dummy columns directly with str.get_dummies

business.categories.str.strip('[]').str.get_dummies(', ')

   bird  cat  dog
0     0    1    1
1     1    1    0
2     1    0    1
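
To get back to the shape the question asks for (business_id next to the 0/1 columns), the dummies can be concatenated with the id column. This is a sketch under assumptions: the business_id values are invented, and missing categories are filled with '[]' as in the question.

```python
import pandas as pd

business = pd.DataFrame({
    "business_id": ["a1", "b2", "c3"],  # hypothetical ids
    "categories": ["[cat, dog]", "[bird, cat]", None],
})

dummies = (
    business["categories"]
    .fillna("[]")            # missing value becomes an empty list string
    .str.strip("[]")         # drop the surrounding brackets
    .str.get_dummies(", ")   # one 0/1 column per distinct category
)

# Both pieces share the same index, so a column-wise concat lines them up.
categories_df = pd.concat([business[["business_id"]], dummies], axis=1)
print(categories_df)
```

A business with no categories ends up with 0 in every category column, which matches what the loop in the question produces.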