I have a DataFrame like the one below. It has many feature columns, but only three are shown here:
productid | feature1   | value1   | feature2     | value2  | feature3     | value3
100001    | weight     | 130g     |              |         | price        | $140.50
100002    | weight     | 200g     | pieces       | 12 pcs  | dimensions   | 150X75cm
100003    | dimensions | 70X30cm  | price        | $22.90  |              |
100004    | price      | $12.90   | manufacturer | ABC     | calories     | 556Kcal
100005    | calories   | 1320Kcal | dimensions   | 20X20cm | manufacturer | XYZ
I would like to use pandas to restructure it like this:
productid  weight  dimensions  price    calories  no. of pieces  manufacturer
100001     130g                $140.50
100002     200g    150X75cm                       12 pcs
100003             70X30cm     $22.90
100004                         $12.90   556Kcal                  ABC
100005             20X20cm              1320Kcal                 XYZ
I have looked into various pandas methods such as reset_index and stack, but none of them transformed the data the way I need.
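Answer 0 (score: 1)
You are looking for code to unpack the DataFrame. A straightforward way (it handles many features and possibly repeated products) is: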
import pandas as pd
import numpy as np

def expand(frame):
    # Build the wide frame cell by cell: one row per product, one column per feature name.
    df = pd.DataFrame()
    for _, data in frame.iterrows():
        # After productid the columns alternate: feature, value, feature, value, ...
        for feature_name, feature_value in zip(data.iloc[1::2], data.iloc[2::2]):
            if pd.notna(feature_name):  # skip empty feature slots (None or NaN)
                df.loc[data.productid, feature_name] = feature_value
    return df.replace(np.nan, '')

df = pd.DataFrame([("100001", "weight", "130g", None, None, "price", "$140.50"),
                   ("100002", "weight", "200g", "pieces", "12 pcs", "dimensions", "150X75cm"),
                   # padded with None so that every row has all seven fields
                   ("100003", "dimensions", "70X30cm", "price", "$22.90", None, None),
                   ("100004", "price", "$12.90", "manufacturer", "ABC", "calories", "556Kcal"),
                   ("100005", "calories", "1320Kcal", "dimensions", "20X20cm", "manufacturer", "XYZ")],
                  columns=["productid", "feature1", "value1", "feature2", "value2", "feature3", "value3"])
xdf = expand(df)
print(xdf)
Output:
       weight    price  pieces dimensions manufacturer  calories
100001   130g  $140.50
100002   200g           12 pcs   150X75cm
100003          $22.90            70X30cm
100004          $12.90                              ABC   556Kcal
100005                            20X20cm           XYZ  1320Kcal
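EDIT1: a slightly more compact form (slow!):
def expand2(frame):
    # One inner dict per product, mapping feature name to value.
    return pd.DataFrame.from_dict(
        {data.productid: {f: v
                          for f, v in zip(data.iloc[1::2], data.iloc[2::2])
                          if pd.notna(f)}
         for _, data in frame.iterrows()},
        orient='index')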
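EDIT2: using a generator expression:
import itertools

def expand3(frame):
    # Yield one record (dict) per product; productid is chained in as the first key.
    return pd.DataFrame.from_records(
        ({f: v
          for f, v in itertools.chain((('productid', data.productid),),
                                      zip(data.iloc[1::2], data.iloc[2::2]))
          if pd.notna(f)}
         for _, data in frame.iterrows()),
        index='productid').replace(np.nan, '')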
Some tests (with the functions decorated with @timeit):
import functools
import sys
import time

def timeit(f):
    @functools.wraps(f)
    def timed(*args, **kwargs):
        start_time = time.time()
        try:
            return f(*args, **kwargs)
        finally:
            # Runs even if f raises, so the elapsed time is always reported.
            end_time = time.time()
            function_invocation = "x"
            sys.stdout.flush()
            print(f'Function {f.__name__}({function_invocation}), took: {end_time - start_time:2.4f} seconds.',
                  flush=True, file=sys.stderr)
    return timed
import random

def generate_wide_df(n_rows, n_features):
    possible_labels = [f'label_{i}' for i in range(n_features)]
    columns = ['productid']
    # random.randint below is inclusive on both ends, so create slots 1..n_features.
    for i in range(1, n_features + 1):
        columns.append(f'feature_{i}')
        columns.append(f'value_{i}')
    df = pd.DataFrame(columns=columns)
    for row_n in range(n_rows):
        df.loc[row_n, 'productid'] = int(1000000 + row_n)
        # Fill random slots with a random label and a random value.
        for _ in range(n_features):
            feature_num = random.randint(1, n_features)
            df.loc[row_n, f'feature_{feature_num}'] = random.choice(possible_labels)
            df.loc[row_n, f'value_{feature_num}'] = random.randint(1, 10000)
    # Replace NaN with None in the slots that were never filled.
    return df.where(df.notnull(), None)
df = generate_wide_df(4000, 30)
expand(df)
expand3(df)
expand2(df)
Results:
Function expand(x), took: 1.1576 seconds.
Function expand3(x), took: 1.1185 seconds.
Function expand2(x), took: 16.3055 seconds.
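The figures above come from a single run of each function measured with time.time(). As a hedged alternative, here is a minimal sketch of a helper that repeats the call and keeps the best time, using time.perf_counter(), the standard-library clock intended for measuring short durations. The name best_of and its repeats parameter are illustrative and not part of the answer.

```python
import time


def best_of(func, *args, repeats=3):
    """Call func(*args) `repeats` times; return (best_seconds, last_result).

    `best_of` is a hypothetical helper, not part of the original answer.
    """
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result


# Stand-in workload; in the thread this would be expand, expand2 or expand3.
seconds, total = best_of(sum, range(1_000_000))
print(f"sum: best of 3 runs took {seconds:.4f} seconds")
```

Taking the minimum over a few runs reduces the influence of one-off delays on the reported time. It does not change the ranking reported above.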
Answer 1 (score: 1)
Here is a reproducible example; see the comments for details.
Output:
   productid dimensions manufacturer pieces    price  calories weight
0     100001        NaN          NaN    NaN  $140.50       NaN   130g
1     100002   150X75cm          NaN  12pcs      NaN       NaN   200g
2     100003    70X30cm          NaN    NaN   $22.90       NaN    NaN
3     100004        NaN          ABC    NaN   $12.90   556Kcal    NaN
4     100005    20X20cm          XYZ    NaN      NaN  1320Kcal    NaN
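The code for this answer is not present in the text, so what follows is not the answer's original code. It is a minimal sketch of one way to reach a frame of the same shape, assuming the column names from the question (feature1..feature3, value1..value3) and using pd.wide_to_long followed by pivot. The intermediate column name slot is my own choice, and the column order may differ from the output shown.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ("100001", "weight", "130g", None, None, "price", "$140.50"),
        ("100002", "weight", "200g", "pieces", "12 pcs", "dimensions", "150X75cm"),
        ("100003", "dimensions", "70X30cm", "price", "$22.90", None, None),
        ("100004", "price", "$12.90", "manufacturer", "ABC", "calories", "556Kcal"),
        ("100005", "calories", "1320Kcal", "dimensions", "20X20cm", "manufacturer", "XYZ"),
    ],
    columns=["productid", "feature1", "value1", "feature2", "value2", "feature3", "value3"],
)

# Collapse feature1..feature3 and value1..value3 into two long columns,
# "feature" and "value"; the numeric suffix ends up in the "slot" column.
long_df = pd.wide_to_long(
    df, stubnames=["feature", "value"], i="productid", j="slot"
).reset_index()

# Empty slots have no feature name, so drop them before pivoting.
long_df = long_df.dropna(subset=["feature"])

# One row per product, one column per distinct feature name.
wide = long_df.pivot(index="productid", columns="feature", values="value").reset_index()
wide.columns.name = None
print(wide)
```

Features that a product does not have come out as NaN, as in the output above.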
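Answer 2 (score: 0)
The difficulty here is that you have multiple feature columns and multiple value columns. It is hard for pandas to recognise that structure without a little help. For example, if you take a sub-section of your DataFrame with just one feature column and one value column,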
subdf = df[['productid', 'feature1', 'value1']].copy()
print(subdf)
   productid    feature1    value1
0     100001      weight      130g
1     100002      weight      200g
2     100003  dimensions   70X30cm
3     100004       price    $12.90
4     100005    calories  1320Kcal
...you can use a one-liner with pivot:
print(subdf.pivot(index='productid', columns='feature1',
                  values='value1'))
feature1   calories dimensions   price weight
productid
100001         None       None    None   130g
100002         None       None    None   200g
100003         None    70X30cm    None   None
100004         None       None  $12.90   None
100005     1320Kcal       None    None   None
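In the more complex case, one way to get started is to first stack all of the feature columns and all of the value columns. Your intermediate result is then a single DataFrame with one feature column and one value column, which is the form that pivot accepts. It also avoids having to write messy functions that involve further iteration.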
features = pd.concat([df[col] for col in df.filter(like='feature')])
values = pd.concat([df[col] for col in df.filter(like='value')])
res = pd.concat((features, values), axis=1)
# unfortunately, `res` has lost its product ids but we can map them
# back from their index ids from the original df
ids = df.productid.to_dict()
res.index = res.index.map(lambda x: ids[x])
Calling pivot on res is now straightforward:
res = res.dropna().pivot(columns=0, values=1)
res.index.name = 'productid'
print(res)
          calories dimensions manufacturer pieces    price weight
productid
100001        None       None         None   None  $140.50   130g
100002        None   150X75cm         None  12pcs     None   200g
100003        None    70X30cm         None   None   $22.90   None
100004     556Kcal       None          ABC   None   $12.90   None
100005    1320Kcal    20X20cm          XYZ   None     None   None
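The advantage of this solution is that you call pivot only once rather than once per sub-frame. The only iteration involved is in pd.concat, which should give a significant speed-up on large datasets.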
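The stack-then-pivot steps can also be gathered into one function. The sketch below is a variation on the approach above, not the answer's own code: it carries productid through the concat, so the index does not have to be mapped back afterwards. The names stack_and_pivot and id_col are illustrative, and the sketch assumes that the feature and value columns appear in matching order.

```python
import pandas as pd


def stack_and_pivot(frame, id_col="productid"):
    """Stack every (featureN, valueN) pair under each other, then pivot once.

    `stack_and_pivot` and `id_col` are hypothetical names for this sketch.
    """
    feature_cols = [c for c in frame.columns if c.startswith("feature")]
    value_cols = [c for c in frame.columns if c.startswith("value")]
    # One three-column slice per pair, renamed to common labels so that
    # pd.concat lines the slices up underneath each other.
    pieces = [
        frame[[id_col, f, v]].rename(columns={f: "feature", v: "value"})
        for f, v in zip(feature_cols, value_cols)
    ]
    long_df = pd.concat(pieces, ignore_index=True)
    # Empty slots have no feature name; drop them before pivoting.
    long_df = long_df.dropna(subset=["feature"])
    return long_df.pivot(index=id_col, columns="feature", values="value")


df = pd.DataFrame(
    [
        ("100001", "weight", "130g", None, None, "price", "$140.50"),
        ("100002", "weight", "200g", "pieces", "12 pcs", "dimensions", "150X75cm"),
        ("100003", "dimensions", "70X30cm", "price", "$22.90", None, None),
        ("100004", "price", "$12.90", "manufacturer", "ABC", "calories", "556Kcal"),
        ("100005", "calories", "1320Kcal", "dimensions", "20X20cm", "manufacturer", "XYZ"),
    ],
    columns=["productid", "feature1", "value1", "feature2", "value2", "feature3", "value3"],
)
res = stack_and_pivot(df)
print(res)
```

Note that pivot raises a ValueError if the same product lists the same feature more than once, so data with such repeats would need to be de-duplicated first.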