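There is a process on a Pandas DataFrame that I am trying to do for my capstone project on the Yelp Dataset Challenge. I have found a way to do it using loops, but given the large dataset I am working with, it takes a long time. (I tried running it for 24 hours and it still had not finished.)
Is there a more efficient way to do this in Pandas without looping?
Note: business.categories (business is a DataFrame) provides the list of categories a business belongs to, stored as a string (e.g. "[restaurant,entertainment,bar,nightlife]"). It is written in the format of a list but saved as a string.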

    import pandas as pd

    # Creates a new DataFrame with businesses as rows and category tags as columns, holding 0 or 1 depending on whether the business is in that category
    categories_list = []
    # Turns empty values into the string of an empty list. This prevents null errors later in the code.
    business.categories = business.categories.fillna('[]')
    # Collects all categories into a single list. Goes through each business's list of categories and adds any unique values to the master list, categories_list
    for x in range(len(business)):
        # business.categories stores each value as a string that is formatted like a list, so this converts it to a list
        categories = eval(str(business.categories[x]))
        # Looks at each category, adding it to categories_list if it's not already there
        for category in categories:
            if category not in categories_list:
                categories_list.append(category)
    # Makes the list of categories (and business_id) the columns of the new DataFrame
    categories_df = pd.DataFrame(columns=['business_id'] + categories_list, index=business.index)
    # Loops through determining whether or not each business has each category, storing this as a 1 or 0 for that category respectively
    for x in range(len(business)):
        for y in range(len(categories_list)):
            cat = categories_list[y]
            # .loc avoids chained assignment, which may write to a temporary copy
            if cat in eval(business.categories[x]):
                categories_df.loc[x, cat] = 1
            else:
                categories_df.loc[x, cat] = 0
    # Imports the original business_ids into the new DataFrame. This allows me to cross-reference this DataFrame with my other datasets for analysis
    categories_df.business_id = business.business_id
    categories_df

Answer 0: (score: 0)
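Given that the data is stored as list-like strings, I don't think you can avoid looping over the DataFrame at Python speed, whether explicitly or implicitly through the str methods. (This seems like an unfortunate way to store the data. Can it be avoided upstream?) However, I have some ideas for improving the approach. Since you know the resulting index ahead of time, you can start building the DataFrame immediately, without needing to know all the categories in advance, e.g.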

    categories_df = pd.DataFrame(index=business.index)
    for ix, categories in business.categories.items():
        for cat in eval(categories):
            # if cat is not already in the columns this will add it in, with null values in the other rows
            categories_df.loc[ix, cat] = 1
    categories_df.fillna(0, inplace=True)

If you know some or all of the categories in advance, adding them as columns before the loop should help as well.
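
As a rough illustration of adding the columns up front, here is a minimal sketch. The sample `business` frame and the `known_categories` list are made up for the example, and the strings are parsed by slicing and splitting instead of `eval`, which assumes the values look like `[a, b, c]` without quotes.

```python
import pandas as pd

# Hypothetical sample data; the real frame comes from the Yelp dataset.
business = pd.DataFrame({
    "business_id": ["a1", "b2", "c3"],
    "categories": ["[restaurant, bar]", "[bar, nightlife]", "[]"],
})

# Hypothetical list of categories that is known before the loop starts.
known_categories = ["restaurant", "bar", "nightlife"]

# Preallocate every column with zeros so the loop only has to flip cells to 1.
categories_df = pd.DataFrame(0, index=business.index, columns=known_categories)

for ix, categories in business.categories.items():
    # Drop the surrounding brackets, then split on commas.
    for cat in categories[1:-1].split(","):
        cat = cat.strip()
        if cat:  # skip the empty string produced by "[]"
            categories_df.loc[ix, cat] = 1

print(categories_df)
```

Because the columns already exist, no column is created inside the loop and no final `fillna` is needed.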
Also, you could try categories[1:-1].split(', ') instead of eval(categories). A quick test tells me it should be about 15 times faster.
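To make sure you get the same results, you should do

    for ix, categories in business.categories.items():
        for cat in categories[1:-1].split(','):
            # skip the empty string that '[]' produces after slicing and splitting
            if cat.strip():
                categories_df.loc[ix, cat.strip()] = 1

to be safe, since you don't know how many spaces there might be around the commas. Avoiding most of the nested loops and the in statements should speed up the program considerably.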
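
The slicing-and-splitting idea can be wrapped in a small helper. This is only a sketch: `parse_categories` is a hypothetical name, and it assumes the category names themselves contain no commas or quotes.

```python
def parse_categories(raw):
    """Split a string such as "[bar,  nightlife]" into a list of clean names."""
    inner = raw.strip()[1:-1]  # drop the surrounding brackets
    # Strip whitespace around each name and discard empty pieces (e.g. from "[]").
    return [part.strip() for part in inner.split(",") if part.strip()]


print(parse_categories("[restaurant,entertainment , bar,  nightlife]"))
# ['restaurant', 'entertainment', 'bar', 'nightlife']
print(parse_categories("[]"))
# []
```

Unlike `eval`, this never executes the string as Python code, so malformed or hostile input cannot run arbitrary expressions.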
Answer 1: (score: 0)
Not entirely sure what you are ultimately trying to do... but
consider the dataframe business

    business = pd.DataFrame(dict(
        categories=['[cat, dog]', '[bird, cat]', '[dog, bird]']
    ))

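You can convert these strings to lists with

    business.categories.str.strip('[]').str.split(', ')

or even go straight to indicator columns with str.get_dummies (the Series string method, not the top-level pd.get_dummies)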

    business.categories.str.strip('[]').str.get_dummies(', ')

       bird  cat  dog
    0     0    1    1
    1     1    1    0
    2     1    0    1
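
To end up with the frame described in the question (one row per business, with business_id alongside the 0/1 columns), the indicator columns can be joined back onto the ids. The following is a sketch on made-up data; it assumes the separator is always exactly a comma followed by one space.

```python
import pandas as pd

# Hypothetical sample data with a missing value, as the question mentions nulls.
business = pd.DataFrame({
    "business_id": ["a1", "b2", "c3"],
    "categories": ["[cat, dog]", "[bird, cat]", None],
})

# Replace missing values with an empty list string, strip the brackets,
# then let str.get_dummies build one 0/1 column per category.
dummies = (
    business.categories
    .fillna("[]")
    .str.strip("[]")
    .str.get_dummies(", ")
)

# Put business_id next to the indicator columns; both share the same index.
categories_df = pd.concat([business[["business_id"]], dummies], axis=1)
print(categories_df)
```

The row for the business with no categories comes out as all zeros, which matches what the original loop produces after its `fillna('[]')` step.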