I have a problem with my DataFrame.
My df is:
product    power             brand
product_1  3 x 1500W         brand_A
product_2  2x1000W + 1x100W
product 3  1x1500W + 1x500W  brand_B
product 4  500W
I need to repeat each row by its multiplier, which has to be split out of the power column.
My expected df:
product    power  brand    new_product
product_1  1500W  brand_A  product_1_1
product_1  1500W  brand_A  product_1_2
product_1  1500W  brand_A  product_1_3
product_2  1000W           product_2_1
product_2  1000W           product_2_2
product_2  100W            product_2_3
product 3  1500W  brand_B  product_3_1
product 3  500W   brand_B  product_3_2
product 4  500W            product_4_1
Thanks for your help.
Answer 0 (score: 3)
I would extract the multiplier and power with a string extraction, merge the result back, and then do some cleanup:
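import numpy as np

# one row per "<count> x <watts>W" match: column 0 is the count, column 1 the power
df1 = (df.power.str.extractall(r'(\d+)\s?x\s?(\d+W)')
         .reset_index(level=1, drop=True)
      )
# repeat each power value <count> times (counts cast to int) and join on the index
new_df = df.merge(df1[1].repeat(df1[0].astype(int)),
                  left_index=True,
                  right_index=True,
                  how='outer')
# update the power column; rows without a multiplier keep their original value
new_df['power'] = np.where(new_df[1].isna(), new_df['power'], new_df[1])
# drop the extra 1 column
new_df.drop(1, axis=1, inplace=True)
# new_product column: product name plus a running number within each product
new_df['new_product'] = (new_df['product'] + '_' +
                         new_df.groupby('product').cumcount().add(1).astype(str))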
Output:
     product  power    brand  new_product
0  product_1  1500W  brand_A  product_1_1
0  product_1  1500W  brand_A  product_1_2
0  product_1  1500W  brand_A  product_1_3
1  product_2  1000W     None  product_2_1
1  product_2  1000W     None  product_2_2
1  product_2   100W     None  product_2_3
2  product 3  1500W  brand_B  product 3_1
2  product 3   500W  brand_B  product 3_2
3  product 4   500W     None  product 4_1
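In the output above, new_product keeps the space from the product name ('product 3_1'), while the question asks for 'product_3_1'. The following is a self-contained sketch of the same extract-and-repeat idea that also normalises the name. The sample frame is rebuilt from the question, and the names parts, expanded, unit_power and out are only illustrative. Replacing spaces with underscores is an assumption about the desired naming, not something the answer states.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (missing brands as None).
df = pd.DataFrame({
    "product": ["product_1", "product_2", "product 3", "product 4"],
    "power": ["3 x 1500W", "2x1000W + 1x100W", "1x1500W + 1x500W", "500W"],
    "brand": ["brand_A", None, "brand_B", None],
})

# One row per "<count> x <watts>W" match; column 0 is the count, column 1 the power.
parts = (df["power"].str.extractall(r"(\d+)\s?x\s?(\d+W)")
           .reset_index(level=1, drop=True))

# Repeat each power value <count> times, keeping the original row label.
counts = parts[0].astype(int).to_numpy()
expanded = parts[1].repeat(counts).rename("unit_power")

# Left join on the index; rows without a multiplier (e.g. "500W") get NaN.
out = df.join(expanded, how="left")
out["power"] = np.where(out["unit_power"].isna(), out["power"], out["unit_power"])
out = out.drop(columns="unit_power").reset_index(drop=True)

# Assumption: spaces in product names become underscores in new_product.
base = out["product"].str.replace(" ", "_", regex=False)
suffix = out.groupby("product").cumcount().add(1).astype(str)
out["new_product"] = base + "_" + suffix

print(out)
```

The product column itself is left untouched, so 'product 3' stays as written and only new_product is normalised.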
Answer 1 (score: 1)
@Quang Hoang's answer is the more correct one, since it relies only on pandas
methods. Still, here is a solution that uses plain Python for the expansion:
import pandas as pd
import numpy as np

cols = ['product', 'power', 'brand']
data = [
    ['product_1', '3 x 1500W', 'brand_A'],
    ['product_2', '2x1000W + 1x100W', np.nan],
    ['product 3', '1x1500W + 1x500W', 'brand_B'],
    ['product 4', '500W', np.nan]
]
df = pd.DataFrame(columns=cols, data=data)
print(df)
Original data:
     product             power    brand
0  product_1         3 x 1500W  brand_A
1  product_2  2x1000W + 1x100W      NaN
2  product 3  1x1500W + 1x500W  brand_B
3  product 4              500W      NaN
items = df.power.values.tolist()
brands = df.brand.values.tolist()
new_data = []
for idx, (item, brand) in enumerate(zip(items, brands)):
    count = 0  # running unit number within the current product
    for power_model in item.split('+'):
        parts = power_model.strip().split('x')
        if len(parts) == 2:
            units, val = parts
        else:
            # no multiplier given, e.g. '500W'
            units = 1
            val = parts[0]
        for _ in range(int(units)):
            count += 1
            new_data.append(
                [
                    # names are rebuilt from the row position,
                    # so 'product 3' becomes 'product_3'
                    f'product_{idx + 1}',
                    val.strip(),
                    brand,
                    f'product_{idx + 1}_{count}'
                ]
            )
new_cols = ['product', 'power', 'brand', 'new_product']
df2 = pd.DataFrame(columns=new_cols, data=new_data)
print(df2)
     product  power    brand  new_product
0  product_1  1500W  brand_A  product_1_1
1  product_1  1500W  brand_A  product_1_2
2  product_1  1500W  brand_A  product_1_3
3  product_2  1000W      NaN  product_2_1
4  product_2  1000W      NaN  product_2_2
5  product_2   100W      NaN  product_2_3
6  product_3  1500W  brand_B  product_3_1
7  product_3   500W  brand_B  product_3_2
8  product_4   500W      NaN  product_4_1
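The parsing step of the loop can also be pulled out into a small function, which makes it easier to check on its own. This is only a sketch: parse_power and _TERM are hypothetical names, and the pattern assumes every term looks like "<count> x <watts>W" or just "<watts>W", as in the sample data.

```python
import re

# Hypothetical helper, not part of the answer above.
# Optional "<count> x" prefix followed by "<watts>W", with flexible spacing.
_TERM = re.compile(r"^\s*(?:(\d+)\s*x\s*)?(\d+W)\s*$")


def parse_power(spec):
    """Expand e.g. '2x1000W + 1x100W' into ['1000W', '1000W', '100W']."""
    units = []
    for term in spec.split("+"):
        match = _TERM.match(term)
        if match is None:
            raise ValueError(f"unrecognised power term: {term!r}")
        count = int(match.group(1) or 1)  # no multiplier means one unit
        units.extend([match.group(2)] * count)
    return units


for spec in ["3 x 1500W", "2x1000W + 1x100W", "500W"]:
    print(spec, "->", parse_power(spec))
```

Raising on an unrecognised term is a design choice for the sketch; it surfaces unexpected formats instead of silently producing wrong rows.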