Create a new dataset based on values from another dataset in pandas

Time: 2019-07-12 04:56:24

Tags: python pandas

I have a dataset structured as follows:

[image: input dataset]

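As you can see in the headers, the value names come after the /, i.e. group_activity, revenue_freq, revenue and monthly - calc'd, and the name of the group comes before the /, i.e. dairy, livestock, etc.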

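I am writing logic in Python that first checks, for example, that in the first row no values are available for dairy and livestock, but poultry is filled in. When this is detected, I want to restructure those values as:

[image: desired output]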

Here Sr. Number is a label that tracks how many different types of activity occur in a row. It starts from IA[x], where x can be 01-13.

How should I do this? To view the above data in a spreadsheet, here is the GOOGLE SHEET LINK; it contains two sheets, one of them named Input.

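2 answers:

Answer 0: (score: 0)

The column names first need to be preprocessed into a MultiIndex: the first level holds the value before the /, and the second-level values are the same for every first-level group. To do that, create a helper DataFrame df1 with Index.to_series and Series.str.split: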

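import pandas as pd

df = pd.read_csv('Sample Dataset - Input.csv')

# Split each header on '/': column 0 is the part before it, column 1 the part after.
df1 = df.columns.to_series().str.split('/', expand=True)
# Headers without '/' are split around ' monthly ' into group, separator, rest.
df1[['a','b','c']] = df1[0].str.partition(' monthly ')
# Drop everything up to the first underscore from the part after '/'.
df1[1] = df1[1].str.split('_', n=1).str[1]
# Fill the second level for the calculated columns.
df1[1] = df1[1].fillna(df1['b'].str.cat(df1['c'].str.strip('>')))
df1['a'] = df1['a'].str.strip('<')
print(df1[['a', 1]])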
                                          a                  1
dairy/group_activity                  dairy           activity
dairy/dairy_revenue_freq              dairy       revenue_freq
dairy/dairy_revenue                   dairy            revenue
<dairy monthly - calc'd>              dairy   monthly - calc'd
livestock/group_activity          livestock           activity
livestock/livestock_revenue_freq  livestock       revenue_freq
livestock/livestock_revenue       livestock            revenue
<livestock monthly - calc'd>      livestock   monthly - calc'd
poultry/group_activity              poultry           activity
poultry/poultry_revenue_freq        poultry       revenue_freq
poultry/poultry_revenue             poultry            revenue
<poultry monthly - calc'd>          poultry   monthly - calc'd

Then create a MultiIndex with MultiIndex.from_arrays, which makes it possible to reshape with DataFrame.stack, remove rows with missing values with DataFrame.dropna, and finally apply {{ 3}}:

Sr. Number
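
A minimal sketch of the reshaping steps named above, run on a hypothetical two-group miniature of the input rather than the real file. The level names group and field are illustrative. The final labelling step uses GroupBy.cumcount, which is an assumption because the source does not name that step; the IA prefix follows the question.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the input: two groups, two fields each.
df = pd.DataFrame({
    'dairy/dairy_revenue_freq': [np.nan, 'monthly'],
    'dairy/dairy_revenue': [np.nan, 2000.0],
    'poultry/poultry_revenue_freq': ['weekly', 'monthly'],
    'poultry/poultry_revenue': [725.0, 600.0],
})

# First level: text before '/'; second level: text after the first underscore.
parts = df.columns.to_series().str.split('/', expand=True)
groups = parts[0].to_numpy()
fields = parts[1].str.split('_', n=1).str[1].to_numpy()
df.columns = pd.MultiIndex.from_arrays([groups, fields], names=['group', 'field'])

# Move the group level into the row index, drop groups with missing values,
# and sort so the rows are ordered by original row, then by group name.
long = df.stack(level=0).dropna().sort_index()

# Number the surviving groups within each original row (assumed step).
counter = long.groupby(level=0).cumcount() + 1
long.insert(0, 'Sr. Number', 'IA' + counter.astype(str).str.zfill(2))

# Turn the group level into an ordinary column and fix the column order.
long = long.reset_index(level='group')
long = long[['Sr. Number', 'group', 'revenue_freq', 'revenue']]
print(long)
```

The explicit column selection and sort_index at the end keep the result the same regardless of how a given pandas version orders the output of stack.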

Answer 1: (score: 0)

You can try the following:

import pandas as pd

df = pd.read_excel('Sample Dataset.xlsx')
# 1-based first and last column position of each group
cols = {'dairy': (1, 4), 'livestock': (5, 8), 'poultry': (9, 12)}

data = []
for index, row in df.iterrows():
    counter = 0  # restarts for every input row
    for group, (first, last) in cols.items():
        values = row.iloc[first - 1:last]
        # keep the group only if all four of its columns are filled
        if values.isna().sum() == 0:
            counter += 1
            data.append(['AI' + str(counter).zfill(2), group] + list(values.values))

pd.DataFrame(data, columns=['Sr. Number', 'Raw', 'group_activity', 'time period', 'amount', 'monthly_amount'])

Result:

  Sr. Number        Raw group_activity time period   amount  monthly_amount
0       AI01    poultry            yes     monthly  10000.0       10000.000
1       AI01    poultry            yes      weekly    725.0        2900.000
2       AI01    poultry             no      yearly   3000.0         250.000
3       AI01  livestock             no      yearly   4500.0         375.000
4       AI02    poultry             no     monthly    600.0         600.000
5       AI01  livestock             no      yearly   8000.0         666.667
6       AI02    poultry             no     monthly   2000.0        2000.000
7       AI01    poultry             no     monthly   5000.0        5000.000
8       AI01      dairy            yes     monthly   2000.0        2000.000
9       AI01    poultry             no      weekly    480.0        1920.000
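
The positional ranges in cols depend on the exact column order of the file. As a minimal sketch of an alternative, the loop below selects each group's columns by header name instead, on a hypothetical miniature input with three columns per group. The helper group_columns and the sample values are illustrative and not part of the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the input with the header styles from the question.
df = pd.DataFrame({
    'dairy/group_activity': [np.nan, 'yes'],
    'dairy/dairy_revenue': [np.nan, 2000.0],
    "<dairy monthly - calc'd>": [np.nan, 2000.0],
    'poultry/group_activity': ['yes', 'no'],
    'poultry/poultry_revenue': [725.0, 600.0],
    "<poultry monthly - calc'd>": [2900.0, 600.0],
})


def group_columns(columns, group):
    """Return the headers that belong to one group, in file order."""
    return [c for c in columns
            if c.startswith(group + '/') or c.startswith('<' + group + ' ')]


records = []
for _, row in df.iterrows():
    counter = 0  # restarts for every input row
    for group in ['dairy', 'poultry']:
        values = row[group_columns(df.columns, group)]
        # Keep the group only if every one of its columns is filled.
        if values.notna().all():
            counter += 1
            records.append([f'AI{counter:02d}', group, *values.tolist()])

result = pd.DataFrame(
    records,
    columns=['Sr. Number', 'Raw', 'group_activity', 'amount', 'monthly_amount'],
)
print(result)
```

Matching on header names means the loop keeps working if columns are added or reordered, at the cost of relying on the naming pattern staying consistent.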