So, here is a sample of the dataset I am working with (link here):
brand  model  column1  column2  column3  category  sub category  value
Dell   a      aa       bb                cat1      sc1           aaa
Dell   a      aa       bb                cat1      sc2           bbb
Dell   a      aa       cc                cat2      sc1           ccc
Dell   b      aa       bb                cat1      sc1           ddd
Dell   b      aa       bb                cat2      sc1           eee
Dell   b      aa       bb       cc       cat2                    fff
Asus   c      aa       cc       bb       cat1      sc1           ggg
Asus   c      aa       cc       bb       cat1                    hhh
Asus   c      aa                         cat1      sc2           iii
Asus   d      aa       cc       bb       cat1      sc1           jjj
Asus   d      aa       bb       bb       cat1      sc2           kkk
Asus   d      aa       bb       bb       cat1      sc3           lll
The first thing I need to do is get the distinct models based on brand, model and columns 1-3, which I do with:
import pandas as pd
df = pd.read_csv("abhorrent.csv")
noDupes = df[["brand", "model", "column1", "column2", "column3"]].drop_duplicates().copy()
This returns a table like this:
brand  model  column1  column2  column3
Dell   a      aa       bb
Dell   a      aa       cc
Dell   b      aa       bb
Dell   b      aa       bb       cc
Asus   c      aa       cc       bb
Asus   c      aa
Asus   d      aa       cc       bb
Asus   d      aa       bb       bb
However, I then need to create columns based on category, sub category and value, and fill them with the corresponding values.
Each column name is the combination of category and sub category, and it should hold the value for that pair:
cat1_sc1
cat1_sc2
cat1_sc3
cat1_blank
cat2_sc1
cat2_blank
The columns don't need to be generated automatically; I can hardcode them.
The problem is that I don't know how to fill them with the values from the original, non-deduplicated dataframe.
The end result I'm looking for is:
brand  model  column1  column2  column3  cat1_sc1  cat1_sc2  cat1_sc3  cat1_blank  cat2_sc1  cat2_blank
Dell   a      aa       bb                aaa       bbb
Dell   a      aa       cc                                                          ccc
Dell   b      aa       bb                ddd                                       eee
Dell   b      aa       bb       cc                                                            fff
Asus   c      aa       cc       bb       ggg                           hhh
Asus   c      aa                                   iii
Asus   d      aa       cc       bb       jjj
Asus   d      aa       bb       bb                 kkk       lll
I was able to do this in PostgreSQL, where I originally developed my solution, using one UPDATE per predefined column. Something like:
-- fill the cat1_sc1 column
UPDATE transposed_table
SET cat1_sc1 = subquery.value
FROM
    (SELECT ... FROM ... WHERE category = 'cat1' AND sub_category = 'sc1') subquery
WHERE transposed_table.brand = subquery.brand AND transposed_table.model = subquery.model etc
Edit: my actual CSV file is close to 500k rows.
Answer 0 (score: 0)
You can do the following:
noDupes['cat1_sc1'] = df[(df["category"] == "cat1") & (df["sub category"] == "sc1")]["value"]
You would have to do this for all categories and sub categories, but I think you get the idea.
Full code to make it all work:
import pandas as pd

df = pd.read_csv("abhorrent.csv")

cats = df["category"].drop_duplicates().tolist()
sub_cats = df["sub category"].drop_duplicates().tolist()

cat_sc_s = []
for cat in cats:
    for sc in sub_cats:
        name = str(cat) + '_' + str(sc)
        cat_sc_s.append(name)
        df[name] = df[(df["category"] == cat) & (df["sub category"] == sc)]["value"]

noDupes = df[["brand", "model", "column1", "column2", "column3"] + cat_sc_s].drop_duplicates().copy()
print(noDupes)
It was a bit of a pain in the **** :D but at some point it got personal :D
import pandas as pd

df = pd.read_csv("abhorrent.csv")
df = df.fillna('')

cats = df["category"].drop_duplicates().tolist()
sub_cats = df["sub category"].drop_duplicates().tolist()

cat_sc_s = []
for cat in cats:
    for sc in sub_cats:
        # name blank sub categories 'blanc', but keep filtering on the original (blank) sc
        # so that their values actually end up in the *_blanc columns
        name = str(cat) + '_' + (str(sc) if sc != '' else 'blanc')
        cat_sc_s.append(name)
        df[name] = df[(df["category"] == cat) & (df["sub category"] == sc)]["value"]

df = df.fillna('')
df = df.groupby(["brand", "model", "column1", "column2", "column3"], as_index=False).agg(' '.join)
df = df.drop(['category', 'sub category', 'value'], axis=1)
print(df)
Test it and let me know.
However, it changes the row order.
Result:
   brand  model  column1  column2  column3  cat1_sc1  cat1_sc2  cat1_blanc  cat1_sc3  cat2_sc1  cat2_sc2  cat2_blanc  cat2_sc3
0  Asus   c      aa                                   iii
1  Asus   c      aa       cc       bb       ggg                 hhh
2  Asus   d      aa       bb       bb                 kkk                   lll
3  Asus   d      aa       cc       bb       jjj
4  Dell   a      aa       bb                aaa       bbb
5  Dell   a      aa       cc                                                          ccc
6  Dell   b      aa       bb                ddd                                       eee
7  Dell   b      aa       bb       cc                                                                    fff
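If the original row order matters, one possible tweak is to tag each raw row with its position before grouping and sort on the smallest position per group afterwards. This is an untested sketch of mine rather than part of the answer above (the _order helper column is my own invention); it would replace the groupby and drop lines, since columns not listed in the aggregation dict are dropped automatically:

# sketch: remember each raw row's position, keep the smallest position per group,
# then restore the original ordering and discard the helper column again
df["_order"] = range(len(df))
aggs = {col: ' '.join for col in cat_sc_s}  # same join as above for the generated columns
aggs["_order"] = "min"                      # position of each group's first row
df = (df.groupby(["brand", "model", "column1", "column2", "column3"], as_index=False)
        .agg(aggs)
        .sort_values("_order")
        .drop(columns="_order"))
print(df)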
Answer 1 (score: 0)
Courtesy of a colleague of mine...
import pandas as pd
import numpy as np

GROUP_BY = ["brand", "model", "column1", "column2", "column3"]
CATEGORY = "category"
SUB_CATEGORY = "sub category"
VALUE = "value"
GROUPING = "grouping"

def combine_model(group):
    def combine_value(value):
        return value.str.cat(sep=" || ")  # in case of multiple values for one category / sub category combination

    # one (possibly joined) value per category_subcategory key within this model group
    value = group.groupby(GROUPING)[VALUE].apply(combine_value)
    # write those values into the matching wide columns for every row of the group
    group.loc[:, value.index.tolist()] = value.values
    return group

data = pd.read_csv("abhorrent.csv")

for col in GROUP_BY + [SUB_CATEGORY, VALUE]:
    data[col].fillna("N/A", inplace=True)

# the combined key doubles as the name of the new wide column
data[GROUPING] = data[CATEGORY] + "_" + data[SUB_CATEGORY]
columns = data[GROUPING].drop_duplicates().tolist()
for col in columns:
    data[col] = np.nan

data = data.groupby(GROUP_BY).apply(combine_model)
data.drop_duplicates(subset=GROUP_BY, inplace=True)
The rest is just about dropping the unnecessary columns...
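For completeness, a minimal sketch of what that cleanup might look like (my assumption, reusing the constants defined above; it is not part of the original answer):

# drop the long-format columns once their values have been spread into the
# category_subcategory columns
data = data.drop(columns=[CATEGORY, SUB_CATEGORY, VALUE, GROUPING])
print(data)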
Thanks to everyone who suggested a solution or commented!