dfaugment = dftrain.sort_values('text', ascending=False).groupby('Category')
countdict = dict(dfaugment['Category'].count())   # rows per Category
countdictmax = max(countdict.values())            # size of the largest Category
shortdict = {}
for key, value in countdict.items():
    if value <= countdictmax:
        shortdict[key] = countdictmax - value     # rows still needed per Category
I am trying to generate duplicated rows for the different Category values, so that every Category ends up with as many rows as the largest one.
For example:
Category | text
Shoes | "aasdb"
Shoes | "frrrd"
Shoes | "ertbt"
Shoes | "erbete"
Shoes | "ervsss"
Sticks | "14345"
Sticks | "33445"
should become:
Category | text
Shoes | "aasdb"
Shoes | "frrrd"
Shoes | "ertbt"
Shoes | "erbete"
Shoes | "ervsss"
Sticks | "14345"
Sticks | "33445"
Sticks | "14345" #new row (duplicated from above data)
Sticks | "33445" #new row (duplicated from above data)
Sticks | "14345" #new row (duplicated from above data)
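For reference, here is a minimal, self-contained sketch of what the countdict/shortdict loop above computes for this example, namely how many extra rows each Category still needs (the DataFrame below is an assumed reconstruction of the example data):

import pandas as pd

# assumed reconstruction of the example data above
dftrain = pd.DataFrame({
    'Category': ['Shoes'] * 5 + ['Sticks'] * 2,
    'text': ['aasdb', 'frrrd', 'ertbt', 'erbete', 'ervsss', '14345', '33445'],
})

countdict = dftrain.groupby('Category')['text'].count().to_dict()  # {'Shoes': 5, 'Sticks': 2}
countdictmax = max(countdict.values())                              # 5
shortdict = {k: countdictmax - v for k, v in countdict.items()}
print(shortdict)  # {'Shoes': 0, 'Sticks': 3} -> rows still missing per Category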
Answer 0 (score: 1)
You can use itertools.cycle and zip.
import pandas as pd
from itertools import cycle, zip_longest

df = pd.DataFrame(
    [('Shoes', "aasdb"),
     ('Shoes', "frrrd"),
     ('Shoes', "ertbt"),
     ('Shoes', "erbete"),
     ('Shoes', "ervsss"),
     ('Sticks', "14345"),
     ('Sticks', "33445")],
    columns=['Category', 'text']
)
First find max_size, then build a list of tuples and pass it to the DataFrame constructor:
max_size = df.groupby('Category').size().max()

pd.DataFrame(
    [(a, b)
     for k in df.Category.drop_duplicates()
     for a, b in zip([k] * max_size, cycle(df.text[df.Category == k]))],
    columns=df.columns
)
This outputs:
Category text
0 Shoes aasdb
1 Shoes frrrd
2 Shoes ertbt
3 Shoes erbete
4 Shoes ervsss
5 Sticks 14345
6 Sticks 33445
7 Sticks 14345
8 Sticks 33445
9 Sticks 14345
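As a quick aside, here is a standalone sketch (values taken from the example) of why zip with cycle produces the duplicated rows: cycle repeats the group's texts indefinitely, and zip stops after max_size pairs.

from itertools import cycle

print(list(zip(['Sticks'] * 5, cycle(['14345', '33445']))))
# [('Sticks', '14345'), ('Sticks', '33445'), ('Sticks', '14345'),
#  ('Sticks', '33445'), ('Sticks', '14345')]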
Variation 1:
In case a forward fill is enough: use itertools.zip_longest on Category instead of cycle on text, then ffill:
pd.DataFrame(
    [(a, b)
     for k in df.Category.drop_duplicates()
     for a, b in zip_longest([k] * max_size, df.text[df.Category == k])],
    columns=df.columns
).ffill()
This outputs:
Category text
0 Shoes aasdb
1 Shoes frrrd
2 Shoes ertbt
3 Shoes erbete
4 Shoes ervsss
5 Sticks 14345
6 Sticks 33445
7 Sticks 33445
8 Sticks 33445
9 Sticks 33445
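Again as a standalone sketch (values from the example): zip_longest pads the shorter iterable with None, and ffill later replaces each None with the previous text value.

from itertools import zip_longest

print(list(zip_longest(['Sticks'] * 5, ['14345', '33445'])))
# [('Sticks', '14345'), ('Sticks', '33445'), ('Sticks', None),
#  ('Sticks', None), ('Sticks', None)]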
Variation 2:
Randomly pick the samples to duplicate. I am not sure exactly what you meant here, but this is one way to get a random fill. Start the same way as for the forward fill:
df2 = pd.DataFrame(
    [(a, b)
     for k in df.Category.drop_duplicates()
     for a, b in zip_longest([k] * max_size, df.text[df.Category == k])],
    columns=df.columns
)
Next, for each group take a sample of text of size max_size (with replacement) and stack the samples, then use pandas.combine_first to fill the missing text values in df2. Since no seed is set for the sample, your result will likely differ:
fill = pd.concat(
    [df.text[df.Category == k].sample(max_size, replace=True)
     for k in df.Category.drop_duplicates()]
).reset_index(drop=True)
df2.text = df2.text.combine_first(fill)
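If reproducibility matters, one option (not part of the original answer) is to pass a fixed random_state to sample; this reuses df, df2 and max_size from the blocks above, and random_state=0 is an arbitrary choice.

# same construction as above, but with a fixed random_state so the
# randomly chosen duplicates are reproducible
fill = pd.concat(
    [df.text[df.Category == k].sample(max_size, replace=True, random_state=0)
     for k in df.Category.drop_duplicates()]
).reset_index(drop=True)
df2.text = df2.text.combine_first(fill)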
Answer 1 (score: 1)
You can try duplicating each grouped DataFrame, taking the size of the largest group into account:
import numpy as np

def DuplicateRows(x, group_max):
    # number of copies of the group needed to reach at least group_max rows
    Count = int(np.ceil((group_max - len(x)) / len(x))) + 1
    return pd.concat([x] * Count)[:group_max]

group_max = df.groupby('Category').apply(len).max()
df.groupby('Category', group_keys=False).apply(lambda x: DuplicateRows(x, group_max))
Out:
Category text
0 Shoes "aasdb"
1 Shoes "frrrd"
2 Shoes "ertbt"
3 Shoes "erbete"
4 Shoes "ervsss"
5 Sticks "14345"
6 Sticks "33445"
5 Sticks "14345"
6 Sticks "33445"
5 Sticks "14345"
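Note that the result keeps the original, now duplicated, index values (5, 6, 5, ...). If a clean index is preferred, a reset_index(drop=True) can be chained on, reusing df, group_max and DuplicateRows from above:

result = df.groupby('Category', group_keys=False).apply(
    lambda x: DuplicateRows(x, group_max)
).reset_index(drop=True)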