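Restructuring a dataset from rows to columns with pandas (Python)

Time: 2017-08-21 09:16:26

Tags: python pandas dataframe

I have a DataFrame like the one below. It contains many feature columns, but only 3 are shown here: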

productid | feature1   | value1   | feature2     | value2  | feature3     | value3
100001    | weight     | 130g     |              |         | price        | $140.50
100002    | weight     | 200g     | pieces       | 12 pcs  | dimensions   | 150X75cm
100003    | dimensions | 70X30cm  | price        | $22.90  |              |
100004    | price      | $12.90   | manufacturer | ABC     | calories     | 556Kcal
100005    | calories   | 1320Kcal | dimensions   | 20X20cm | manufacturer | XYZ

I would like to restructure it with pandas as follows:

productid   weight  dimensions  price    calories  no. of pieces  manufacturer
100001      130g                $140.50
100002      200g    150X75cm                       12 pcs
100003              70X30cm     $22.90
100004                          $12.90   556Kcal                  ABC
100005              20X20cm              1320Kcal                 XYZ

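I looked into various pandas methods such as reset_index and stack, but none of them transformed the data the way I need.

3 answers:

Answer 0: (score: 1)

You are looking for code that unpacks the DataFrame. A straightforward way (which copes with many features and possibly repeated products) is: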

import pandas as pd
import numpy as np

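def expand(frame):
    df = pd.DataFrame()
    # After productid, the columns alternate: feature name, value, feature name, value, ...
    for _, data in frame.iterrows():
        for feature_name, feature_value in zip(data.iloc[1::2], data.iloc[2::2]):
            # Skip empty slots (feature name is None, NaN or '').
            if feature_name and pd.notna(feature_name):
                df.loc[data.productid, feature_name] = feature_value
    return df.replace(np.nan, '')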


df = pd.DataFrame([("100001", "weight", "130g", None, None, "price", "$140.50"),
                   ("100002", "weight", "200g", "pieces", "12 pcs", "dimensions", "150X75cm"),
                   # Pad the empty third slot so every row has 7 entries.
                   ("100003", "dimensions", "70X30cm", "price", "$22.90", None, None),
                   ("100004", "price", "$12.90", "manufacturer", "ABC", "calories", "556Kcal"),
                   ("100005", "calories", "1320Kcal", "dimensions", "20X20cm", "manufacturer", "XYZ")],
                  columns=["productid", "feature1", "value1", "feature2", "value2", "feature3", "value3"])

xdf = expand(df)
print(xdf)

Output:

       weight    price  pieces dimensions manufacturer  calories
100001   130g  $140.50                                          
100002   200g           12 pcs   150X75cm                       
100003          $22.90            70X30cm                       
100004          $12.90                             ABC   556Kcal
100005                            20X20cm          XYZ  1320Kcal

EDIT1: A slightly more compact form (slow!):

def expand2(frame):
    # Build {productid: {feature: value}} and let pandas turn it into a frame.
    return pd.DataFrame.from_dict(
        {data.productid: {f: v for f, v in zip(data.iloc[1::2], data.iloc[2::2]) if f} for _, data in frame.iterrows()},
        orient='index')

EDIT2: Using a generator expression:

import itertools

def expand3(frame):
    # Yield one {column: value} record per row, with productid as the first entry.
    return pd.DataFrame.from_records(
        ({f: v for f, v in itertools.chain((('productid', data.productid),), zip(data.iloc[1::2], data.iloc[2::2])) if f}
         for _, data
         in frame.iterrows()), index='productid').replace(np.nan, '')

Some timings (with the functions above decorated with @timeit):

import functools
import random
import sys
import time

def timeit(f):
    @functools.wraps(f)
    def timed(*args, **kwargs):
        start_time = time.time()
        try:
            return f(*args, **kwargs)
        finally:
            end_time = time.time()
            function_invocation = "x"  # placeholder for the call arguments
            sys.stdout.flush()
            print(f'Function {f.__name__}({function_invocation}), took: {end_time - start_time:2.4f} seconds.',
                  flush=True, file=sys.stderr)

    return timed

def generate_wide_df(n_rows, n_features):
    possible_labels = [f'label_{i}' for i in range(n_features)]
    columns = ['productid']
    # random.randint below is inclusive, so create slots 1..n_features.
    for i in range(1, n_features + 1):
        columns.append(f'feature_{i}')
        columns.append(f'value_{i}')

    df = pd.DataFrame(columns=columns)
    for row_n in range(n_rows):
        df.loc[row_n, 'productid'] = int(1000000 + row_n)
        # Fill randomly chosen slots; the same slot may be picked more than once.
        for _ in range(n_features):
            feature_num = random.randint(1, n_features)
            df.loc[row_n, f'feature_{feature_num}'] = random.choice(possible_labels)
            df.loc[row_n, f'value_{feature_num}'] = random.randint(1, 10000)
    # Replace NaN with None so that empty slots are falsy.
    return df.where(df.notnull(), None)


df = generate_wide_df(4000, 30)


expand(df)
expand3(df)
expand2(df)

Results:

Function expand(x), took: 1.1576 seconds.
Function expand3(x), took: 1.1185 seconds.
Function expand2(x), took: 16.3055 seconds.

Answer 1: (score: 1)

Here is a reproducible example; see the comments for details.

Output:

   productid dimensions manufacturer pieces    price  calories weight
0     100001        NaN          NaN    NaN  $140.50       NaN   130g
1     100002   150X75cm          NaN  12pcs      NaN       NaN   200g
2     100003    70X30cm          NaN    NaN   $22.90       NaN    NaN
3     100004        NaN          ABC    NaN   $12.90   556Kcal    NaN
4     100005    20X20cm          XYZ    NaN      NaN  1320Kcal    NaN
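A table like the one in this answer can be produced in several ways. The sketch below is one possibility, not necessarily the approach this answer used: it assumes the column names follow the featureN/valueN pattern from the question and uses pd.wide_to_long followed by a single pivot. The names long_df, wide and slot are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [
        (100001, "weight", "130g", None, None, "price", "$140.50"),
        (100002, "weight", "200g", "pieces", "12pcs", "dimensions", "150X75cm"),
        (100003, "dimensions", "70X30cm", "price", "$22.90", None, None),
        (100004, "price", "$12.90", "manufacturer", "ABC", "calories", "556Kcal"),
        (100005, "calories", "1320Kcal", "dimensions", "20X20cm", "manufacturer", "XYZ"),
    ],
    columns=["productid", "feature1", "value1", "feature2", "value2", "feature3", "value3"],
)

# Collapse the featureN/valueN pairs into one long (feature, value) table.
# "slot" is an illustrative name for the numeric suffix (1, 2, 3).
long_df = pd.wide_to_long(df, stubnames=["feature", "value"], i="productid", j="slot")

# Drop the empty slots and turn productid back into a regular column.
long_df = long_df.dropna(subset=["feature"]).reset_index()

# One row per product, one column per distinct feature name.
wide = long_df.pivot(index="productid", columns="feature", values="value").reset_index()
wide.columns.name = None
print(wide)
```

Note that pivot raises an error if the same feature appears twice for one product, so this sketch assumes each (productid, feature) pair is unique.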

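Answer 2: (score: 0)

The difficulty here is that you have multiple feature columns and multiple value columns. It is hard for pandas to recognize this without a little help. For example, if you had a subsection of your DataFrame with only one feature column and one value column,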

subdf = df[['productid', 'feature1', 'value1']].copy()    
print(subdf)
   productid    feature1    value1
0     100001      weight      130g
1     100002      weight      200g
2     100003  dimensions   70X30cm
3     100004       price    $12.90
4     100005    calories  1320Kcal

...you could use a one-liner with pivot:

print(subdf.pivot(index='productid', columns='feature1', 
      values='value1'))
feature1   calories dimensions   price weight
productid                                    
100001         None       None    None   130g
100002         None       None    None   200g
100003         None    70X30cm    None   None
100004         None       None  $12.90   None
100005     1320Kcal       None    None   None

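In the more complex case, one way to get started is to first stack all the feature columns and all the value columns. Your intermediate result is then a single DataFrame with one feature column and one value column, which is a form that pivot will accept. It also avoids having to build messy functions that involve further iteration.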

features = pd.concat([df[col] for col in df.filter(like='feature')])
values = pd.concat([df[col] for col in df.filter(like='value')])
res = pd.concat((features, values), axis=1)

# unfortunately, `res` has lost its product ids but we can map them
# back from their index ids from the original df
ids = df.productid.to_dict()
res.index = res.index.map(lambda x: ids[x])

Now calling pivot on res is straightforward:

res = res.dropna().pivot(columns=0, values=1)
res.index.name = 'productid'

print(res)
           calories dimensions manufacturer pieces    price weight
productid                                                         
100001         None       None         None   None  $140.50   130g
100002         None   150X75cm         None  12pcs     None   200g
100003         None    70X30cm         None   None   $22.90   None
100004      556Kcal       None          ABC   None   $12.90   None
100005     1320Kcal    20X20cm          XYZ   None     None   None

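The advantage of this solution is that you call pivot only once rather than once per sub-frame. The only iteration involved is in building the inputs to pd.concat, which should give a significant speedup on large datasets.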
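For convenience, the stack-then-pivot idea can be wrapped in a single function. The following is only a sketch under a few assumptions: the helper name features_to_columns is hypothetical, the feature and value columns are assumed to appear in matching order, and pivot_table with aggfunc="first" is swapped in for pivot so that a repeated (productid, feature) pair keeps its first value instead of raising an error.

```python
import pandas as pd


def features_to_columns(df, id_col="productid"):
    """Hypothetical helper: stack every (featureN, valueN) pair, then pivot once."""
    feature_cols = [c for c in df.columns if c.startswith("feature")]
    value_cols = [c for c in df.columns if c.startswith("value")]

    # Keep the id as a regular column in each piece, so no index-to-id
    # mapping is needed after concatenation.
    pieces = [
        df[[id_col, f, v]].rename(columns={f: "feature", v: "value"})
        for f, v in zip(feature_cols, value_cols)
    ]
    long_df = pd.concat(pieces, ignore_index=True).dropna(subset=["feature"])

    # aggfunc="first" keeps the first value seen for a duplicated pair.
    return long_df.pivot_table(
        index=id_col, columns="feature", values="value", aggfunc="first"
    )


# Small demo frame (made-up data) with product 100002 listed twice.
demo = pd.DataFrame(
    {
        "productid": [100001, 100002, 100002],
        "feature1": ["weight", "weight", "weight"],
        "value1": ["130g", "200g", "210g"],
        "feature2": [None, "pieces", None],
        "value2": [None, "12 pcs", None],
    }
)
result = features_to_columns(demo)
print(result)
```

Whether keeping the first value is the right policy for duplicates depends on the data; with unique pairs, plain pivot gives the same result and fails loudly if the assumption is violated.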