Split data from one column into separate columns based on value

Asked: 2016-01-25 21:37:13

Tags: python pandas

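Below is the problem code from my larger script. I have data for 5 to 7 different categories (e.g. height, weight, BMI, etc.) under a single "Measurement" column, with the corresponding readings in a "Value" column. For downstream processing, I would like each measurement to have its own column holding its values.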

# Import Packages
# -----------------
import re
import pandas as pd


# Sample Data Input
# -----------------
result = [
'XD59876,KEN,name="height",value="5.9",name="weight",value="180",name="Ivef",value="0.09",name="o2_saturation",value="2",name="BMI",value="27",name="heart_rate",value="66"',
'FC00187,ROW,name="height",value="5.11",name="weight",value="210"',
'AN66521,ZEN,name="Ivef",value="0.7",name="o2_saturation",value="62",name="BMI",value="26"',
'NW0098,PLO,name="height",value="6.2",name="weight",value="240",name="o2_saturation",value="2.3",name="heart_rate",value="68"',
'XD57776,KIT,name="BMI",value="32"',
'FC98763,ABC,name="Ivef",value="0.87",name="o2_saturation",value="2.67",name="heart_rate",value="68"'
]


# Output List
# -----------------
output = []


# Regular Expressions Used To Pull Measurement Values
# ---------------------------------------------------
measurement_nameRegex = r'name="([^"]+)"'
measurement_valueRegex = r'value="([^"]+)"'


# Iterate through list
# ---------------------------------------------------
for line in result:
    # CSV values
    key, fac, measurements = line.split(',', 2)

    # Create list using regular expression
    measurement_name = re.findall(measurement_nameRegex, measurements)
    measurement_value = re.findall(measurement_valueRegex, measurements)

    # Check to see we collect only complete data
    if len(measurement_name) == len(measurement_value):

        # Pair each measurement name with its corresponding value
        for name, value in zip(measurement_name, measurement_value):
            output.append([key, fac, name, value])

df = pd.DataFrame(output, columns=["Key", "Facility", "Measurement", "Value"])

# df_pivot = df.pivot_table(index=["Key", "Facility"], columns="Measurement", values="Value")

print(df)

Current output:

        Key Facility    Measurement Value
0   XD59876      KEN         height   5.9
1   XD59876      KEN         weight   180
2   XD59876      KEN           Ivef  0.09
3   XD59876      KEN  o2_saturation     2
4   XD59876      KEN            BMI    27
5   XD59876      KEN     heart_rate    66
6   FC00187      ROW         height  5.11

Desired output:

Key          Facility    height   weight  Ivef  o2_saturation  BMI  heart_rate
XD59876      KEN         5.9      180     0.09  2              27   66

I tried pandas pivot and pivot_table, but they aggregate. I don't want to aggregate anything. All I want is to change how the data is organized.

4 Answers:

Answer 0 (score: 1)

This one uses the numpy module to extract all the measurement names up front, and then loops over the lines in the same way as the question's code:

import re
import pandas as pd
import numpy as np

meas_nms = [re.findall(r'\"(.+?)\"',item) for item in result]
all_names = ['Key','Facility'] + np.unique(np.concatenate(meas_nms)[::2]).tolist()

output = []
df = pd.DataFrame(output, columns=all_names)
for i,line in enumerate(result):
    K,F,meas = line.split(',',2)
    meas_split = meas.split(',')

    nms = [re.findall(r'\"(.+?)\"',item)[0] for item in meas_split[::2]]
    vals = [re.findall(r'\"(.+?)\"',item)[0] for item in meas_split[1::2]]

    df.loc[i, ['Key','Facility']] = [K,F]
    df.loc[i, nms] = vals

Output for the posted sample data:

>>> df
       Key Facility  BMI  Ivef heart_rate height o2_saturation weight
0  XD59876      KEN   27  0.09         66    5.9             2    180
1  FC00187      ROW  NaN   NaN        NaN   5.11           NaN    210
2  AN66521      ZEN   26   0.7        NaN    NaN            62    NaN
3   NW0098      PLO  NaN   NaN         68    6.2           2.3    240
4  XD57776      KIT   32   NaN        NaN    NaN           NaN    NaN
5  FC98763      ABC  NaN  0.87         68    NaN          2.67    NaN
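
As a possible variation on the loop above (not the answerer's method), each line can be turned into a dict and the whole list handed to pd.DataFrame in one go, which avoids assigning into the frame row by row. This is a minimal sketch: the combined pattern `pair_regex` and the names `records` and `wide` are made up for illustration, and it assumes every name="..." is immediately followed by its value="...", as in the sample data.

```python
import re
import pandas as pd

# Two lines in the same format as the question's sample data
result = [
    'XD59876,KEN,name="height",value="5.9",name="weight",value="180"',
    'XD57776,KIT,name="BMI",value="32"',
]

# Hypothetical combined pattern: captures (name, value) pairs together
pair_regex = re.compile(r'name="([^"]+)",value="([^"]+)"')

records = []
for line in result:
    key, fac, measurements = line.split(',', 2)
    record = {'Key': key, 'Facility': fac}
    # findall returns a list of (name, value) tuples, which dict.update accepts
    record.update(pair_regex.findall(measurements))
    records.append(record)

# Measurements missing from a line become NaN in that row
wide = pd.DataFrame(records)
print(wide)
```

The values stay strings here, just as in the original loop; convert them afterwards if numeric columns are needed.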

Answer 1 (score: 1)

A pure pandas solution:

import pandas as pd

# some sample data...
rows = [('XD59876','KEN','height','5.9'),
        ('XD59876','KEN','weight','0.09'),
        ('XD59876','KEN','o2_sat','2'),
        ('FC00187 ','ROW','height','5.11')]
df = pd.DataFrame(rows, columns=['Key','Facility','Measurement','Value'])

# move everything but Value to the index
df.set_index(['Key', 'Facility', 'Measurement'], inplace=True)
# convert the Measurement index to column labels
df = df.unstack('Measurement')
# get rid of the 'Value' level in the columns index
df.columns = df.columns.droplevel()
# get rid of the 'Measurement' name on the columns index
df.columns.name = ''
# make Key and Facility regular columns again
df.reset_index(inplace=True)

print(df)

The output is:

        Key Facility height o2_sat weight
0  FC00187       ROW   5.11    NaN    NaN
1   XD59876      KEN    5.9      2   0.09
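
The same idea can probably be written as a single chain by selecting the Value column before unstacking, so there is no extra column level to drop. The sketch below uses made-up sample rows and the illustrative names `long_df` and `wide`. It assumes each (Key, Facility, Measurement) combination occurs only once; with duplicates, unstack raises an error instead of silently aggregating.

```python
import pandas as pd

# Small long-format sample in the shape produced by the question's code
rows = [('XD59876', 'KEN', 'height', '5.9'),
        ('XD59876', 'KEN', 'weight', '180'),
        ('FC00187', 'ROW', 'height', '5.11')]
long_df = pd.DataFrame(rows, columns=['Key', 'Facility', 'Measurement', 'Value'])

wide = (long_df.set_index(['Key', 'Facility', 'Measurement'])['Value']
               .unstack('Measurement')    # one column per measurement name
               .rename_axis(None, axis=1) # clear the 'Measurement' columns name
               .reset_index())            # Key and Facility back to columns
print(wide)
```

Nothing is aggregated: the values are only moved, and missing combinations show up as NaN.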

Answer 2 (score: 1)

The solutions from Divakar and SPKoder work perfectly. Here is what I learned along the way.

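# Import Packages (the sample data in `result` comes from the question)
# -----------------
import re
import pandas as pd


# Lists
# -----------------
column_header = []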


# Regular Expressions Used To Pull Measurement Values
# ---------------------------------------------------
measurement_nameRegex = r'name="([^"]+)"'
measurement_valueRegex = r'value="([^"]+)"'


# Processing
# -----------------

# Collect Every Measurement Name That Will Become A Column
for line in result:
    measurement_name = re.findall(measurement_nameRegex, line)
    column_header.extend(measurement_name)

# Create Column Header (sorted so the column order is reproducible)
all_names = ['Key', 'Facility'] + sorted(set(column_header))

# Create Empty Dataframe With Column Header
df = pd.DataFrame(columns=all_names)


# Iterate through list
# ---------------------------------------------------

# Hold On To Index For Each Record
for index, line in enumerate(result):

    # Extract CSV values
    key, fac, measurements = line.split(',', 2)

    # Create list using regular expression
    measurement_name = re.findall(measurement_nameRegex, measurements)
    measurement_value = re.findall(measurement_valueRegex, measurements)

    # Insert Values Into Dataframe Based On Index
    df.loc[index, ['Key', 'Facility']] = [key, fac]
    df.loc[index, measurement_name] = measurement_value

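# output_file_path is not defined in this snippet; it is assumed to be set elsewhere in the larger script
df.to_csv(output_file_path)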

Answer 3 (score: 0)

I think you can do this with pandas.pivot_table:

In[75]: import pandas as pd

In[76]: df = pd.DataFrame({'Key': [1] * 9 + [2] * 9, 'Facility': (['a'] * 3 + ['b'] * 3) * 3, 'Measurement': list(range(10, 19)) * 2, 'value': range(18)})

In[77]: df

Out[77]:
   Facility  Key  Measurement  value
0         a    1           10      0
1         a    1           11      1
2         a    1           12      2
3         b    1           13      3
4         b    1           14      4
5         b    1           15      5
6         a    1           16      6
7         a    1           17      7
8         a    1           18      8
9         b    2           10      9
10        b    2           11     10
11        b    2           12     11
12        a    2           13     12
13        a    2           14     13
14        a    2           15     14
15        b    2           16     15
16        b    2           17     16
17        b    2           18     17

In[78]: pd.pivot_table(df, values='value', index=['Key', 'Facility'], columns=['Measurement'])

Out[78]:
Measurement   10  11  12  13  14  15  16  17  18
Key Facility
1   a          0   1   2 NaN NaN NaN   6   7   8
    b        NaN NaN NaN   3   4   5 NaN NaN NaN
2   a        NaN NaN NaN  12  13  14 NaN NaN NaN
    b          9  10  11 NaN NaN NaN  15  16  17

Or, if you would rather have 'Facility' and 'Key' as regular columns instead of as the index, just append reset_index():

In[79]: pd.pivot_table(df, values='value', index=['Key', 'Facility'], columns=['Measurement']).reset_index()
Out[79]:
Measurement  Key Facility  10  11  12  13  14  15  16  17  18
0              1        a   0   1   2 NaN NaN NaN   6   7   8
1              1        b NaN NaN NaN   3   4   5 NaN NaN NaN
2              2        a NaN NaN NaN  12  13  14 NaN NaN NaN
3              2        b   9  10  11 NaN NaN NaN  15  16  17

Note that all the NaN values in the result come from combinations of Key, Facility and Measurement that do not appear in my sample table.
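
The example above uses numeric values, while the question's Value column holds strings, which the default mean aggregation cannot handle. One possible workaround, sketched here under the assumption that each (Key, Facility, Measurement) combination occurs only once, is to pass aggfunc='first' so that pivot_table simply carries each value over. The sample rows and the names `long_df` and `wide` are made up for illustration.

```python
import pandas as pd

# Long-format sample with string values, as in the question
long_df = pd.DataFrame(
    [('XD59876', 'KEN', 'height', '5.9'),
     ('XD59876', 'KEN', 'BMI', '27'),
     ('XD57776', 'KIT', 'BMI', '32')],
    columns=['Key', 'Facility', 'Measurement', 'Value'])

# aggfunc='first' keeps the first value in each group; with unique
# combinations that is the only value, so nothing is actually combined
wide = pd.pivot_table(long_df, values='Value', index=['Key', 'Facility'],
                      columns='Measurement', aggfunc='first').reset_index()
print(wide)
```

If a combination did occur more than once, 'first' would silently keep only the first value, so it is worth checking for duplicates beforehand.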