Question

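There is a process I'm trying to carry out in a Pandas DataFrame for my capstone project on the Yelp Dataset Challenge. I've found a way to do it using loops, but given the large dataset I'm working with, it takes a very long time. (I tried running it for 24 hours, and it still hadn't finished.)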

Is there a more efficient way to do this in Pandas without looping?

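Note: business.categories (business is a DataFrame) holds the list of categories each business belongs to, stored as a string (e.g. "[restaurant, entertainment, bar, nightlife]"). It is written in the format of a list but saved as a string.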

import pandas as pd

# Creates a new DataFrame with businesses as rows and category tags as columns, holding 1 or 0 depending on whether the business is in that category
categories_list = []

# Replaces missing values with the string '[]' (an empty list). This prevents null errors later in the code.
business.categories = business.categories.fillna('[]')

# Creates all categories as a single list. Goes through each business's list of categories and adds any unique values to the master list, categories_list
for x in range(len(business)):
    # business.categories stores each value as a string (even though it's formatted just like a list), so this converts it to a list
    categories = eval(str(business.categories[x]))
    # Looks at each category, adding it to categories_list if it's not already there
    for category in categories:
        if category not in categories_list:
            categories_list.append(category)

# Makes the list of categories (and business_id) the columns of the new DataFrame
categories_df = pd.DataFrame(columns = ['business_id'] + categories_list, index = business.index)

# Loops through every business and every category, storing 1 if the business has that category and 0 otherwise.
for x in range(len(business)):
    for y in range(len(categories_list)):
        cat = categories_list[y]
        # .loc avoids chained assignment, which may silently write to a copy
        if cat in eval(business.categories[x]):
            categories_df.loc[x, cat] = 1
        else:
            categories_df.loc[x, cat] = 0

# Imports the original business_id's into the new DataFrame. This allows me to cross-reference this DataFrame with my other datasets for analysis
categories_df.business_id = business.business_id

categories_df

Answer 1

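Given that the data is stored as list-like strings, I don't think you can avoid looping over the data frame at Python speed, whether explicitly or implicitly through the str methods. (This seems like an unfortunate way to store the data. Can it be avoided upstream?) However, I have some ideas for improving the approach. Since you know the resulting index ahead of time, you can start building the DataFrame right away without knowing all the categories in advance, e.g.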

categories_df = pd.DataFrame(index=business.index)
for ix, categories in business.categories.items():
    for cat in eval(categories):
        categories_df.loc[ix, cat] = 1   
        # if cat is not already in the columns this will add it in, with null values in the other rows
categories_df.fillna(0, inplace=True)

If you know some or all of the categories in advance, adding them as columns before the loop should also help.
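
A minimal sketch of that idea, assuming a made-up three-row business frame and a hypothetical known_categories list (neither comes from the question): the columns are created up front and filled with zeros, so the loop only writes into cells that already exist and no fillna step is needed afterwards.

```python
import pandas as pd

# Hypothetical sample data standing in for the real Yelp frame.
business = pd.DataFrame({
    "business_id": ["a1", "b2", "c3"],
    "categories": ["[bar, nightlife]", "[restaurant]", "[]"],
})

# Assumed to be known ahead of time; not part of the original question.
known_categories = ["bar", "nightlife", "restaurant"]

# Pre-allocate every column with zeros so no column is added during the loop.
categories_df = pd.DataFrame(0, index=business.index, columns=known_categories)

for ix, categories in business.categories.items():
    for cat in categories[1:-1].split(","):
        cat = cat.strip()
        if cat:  # an empty list '[]' yields one empty string; skip it
            categories_df.loc[ix, cat] = 1
```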

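Also, you could try categories[1:-1].split(', ') instead of eval(categories). A quick test tells me it should be about 15 times faster. To make sure you get the same result, you should do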

for ix, categories in business.categories.items():
    for cat in categories[1:-1].split(','):
        cat = cat.strip()
        if cat:  # skip the empty string produced by an empty list '[]'
            categories_df.loc[ix, cat] = 1

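to be safe, since you don't know how many spaces there might be around the commas. Avoiding most of the nested loops and the "in" checks can greatly speed up your program.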
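
The parsing step can also be pulled into a small helper. This is only a sketch: parse_categories is a hypothetical name, and it assumes the items are unquoted and contain no commas themselves, as in the example string from the question.

```python
def parse_categories(raw):
    """Split a string like '[a,  b ,c]' into ['a', 'b', 'c']."""
    inner = raw.strip()[1:-1]                  # drop the surrounding brackets
    parts = (p.strip() for p in inner.split(","))
    return [p for p in parts if p]             # drop empties, e.g. from '[]'


print(parse_categories("[restaurant,entertainment ,  bar, nightlife]"))
# ['restaurant', 'entertainment', 'bar', 'nightlife']
print(parse_categories("[]"))
# []
```

Splitting on a bare comma and stripping each piece tolerates any amount of whitespace, and filtering out empty pieces keeps an empty list from creating a column with an empty name.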

Answer 2

Not entirely sure what you're ultimately trying to do... but

Consider the dataframe business

business = pd.DataFrame(dict(
        categories=['[cat, dog]', '[bird, cat]', '[dog, bird]']
    ))

You can turn these strings into lists with

business.categories.str.strip('[]').str.split(', ')

Or even use str.get_dummies (the string-accessor counterpart of pd.get_dummies)

business.categories.str.strip('[]').str.get_dummies(', ')

   bird  cat  dog
0     0    1    1
1     1    1    0
2     1    0    1
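
To get the layout the question asks for, the indicator columns can be placed next to business_id. The sketch below assumes a made-up sample frame (the business_id values are invented) and reuses the question's fillna('[]') step for missing rows.

```python
import pandas as pd

# Hypothetical sample; the business_id values are made up.
business = pd.DataFrame({
    "business_id": ["a1", "b2", "c3", "d4"],
    "categories": ["[cat, dog]", "[bird, cat]", "[dog, bird]", None],
})

dummies = (
    business["categories"]
    .fillna("[]")            # same null handling as in the question
    .str.strip("[]")         # remove the surrounding brackets
    .str.get_dummies(", ")   # one indicator column per category
)

# Put business_id next to the indicator columns, aligned on the index.
categories_df = pd.concat([business[["business_id"]], dummies], axis=1)
print(categories_df)
```

A business with no categories ends up as a row of zeros, and business_id remains available for cross-referencing with other datasets.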

Categories from a list in a DataFrame
