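Below is the problem code from my larger script. I have data in which one column, "Measurement", holds 5 to 7 different categories (e.g. height, weight, BMI, etc.) and another column holds the corresponding values. For downstream processing, I would like the values of each category to sit together in their own column.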
# Import Packages
# -----------------
import re
import pandas as pd
# Sample Data Input
# -----------------
result = [
'XD59876,KEN,name="height",value="5.9",name="weight",value="180",name="Ivef",value="0.09",name="o2_saturation",value="2",name="BMI",value="27",name="heart_rate",value="66"',
'FC00187,ROW,name="height",value="5.11",name="weight",value="210"',
'AN66521,ZEN,name="Ivef",value="0.7",name="o2_saturation",value="62",name="BMI",value="26"',
'NW0098,PLO,name="height",value="6.2",name="weight",value="240",name="o2_saturation",value="2.3",name="heart_rate",value="68"',
'XD57776,KIT,name="BMI",value="32"',
'FC98763,ABC,name="Ivef",value="0.87",name="o2_saturation",value="2.67",name="heart_rate",value="68"'
]
# Output List
# -----------------
output = []
# Regular Expressions Used To Pull Measurement Values
# ---------------------------------------------------
measurement_nameRegex = r'name="([^"]+)"'
measurement_valueRegex = r'value="([^"]+)"'
# Iterate through list
# ---------------------------------------------------
for line in result:
    # CSV values
    key, fac, measurements = line.split(',', 2)
    # Create list using regular expression
    measurement_name = re.findall(measurement_nameRegex, measurements)
    measurement_value = re.findall(measurement_valueRegex, measurements)
    # Check to see we collect only complete data
    if len(measurement_name) == len(measurement_value):
        # Pair each measurement name with its corresponding value
        for name, value in zip(measurement_name, measurement_value):
            output.append([key, fac, name, value])
df = pd.DataFrame(output, columns=["Key", "Facility", "Measurement", "Value"])
# df_pivot = df.pivot_table(index=["Key", "Facility"], columns="Measurement", values="Value")
print(df)
Current output:
       Key  Facility    Measurement  Value
0  XD59876       KEN         height    5.9
1  XD59876       KEN         weight    180
2  XD59876       KEN           Ivef   0.09
3  XD59876       KEN  o2_saturation      2
4  XD59876       KEN            BMI     27
5  XD59876       KEN     heart_rate     66
6  FC00187       ROW         height   5.11
Desired output:
Key      Facility  height  weight  Ivef  o2_saturation  BMI  heart_rate
XD59876  KEN       5.9     180     0.09  2              27   66
I tried Pandas pivot & pivot_table, but they aggregate. I don't want to aggregate anything. All I want is to change how the data is organized.
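Answer 0 (score: 1)
This one uses the numpy module to extract all the measurement names up front, and then uses a loop like the one in the question's code, as in the solution listed below -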
import re
import pandas as pd
import numpy as np
meas_nms = [re.findall(r'\"(.+?)\"',item) for item in result]
all_names = ['Key','Facility'] + np.unique(np.concatenate(meas_nms)[::2]).tolist()
output = []
df = pd.DataFrame(output, columns=all_names)
for i,line in enumerate(result):
    K,F,meas = line.split(',',2)
    meas_split = meas.split(',')
    nms = [re.findall(r'\"(.+?)\"',item)[0] for item in meas_split[::2]]
    vals = [re.findall(r'\"(.+?)\"',item)[0] for item in meas_split[1::2]]
    df.loc[i, ['Key','Facility']] = [K,F]
    df.loc[i, nms] = vals
Output for the posted sample data -
>>> df
       Key  Facility  BMI  Ivef  heart_rate  height  o2_saturation  weight
0  XD59876       KEN   27  0.09          66     5.9              2     180
1  FC00187       ROW  NaN   NaN         NaN    5.11            NaN     210
2  AN66521       ZEN   26   0.7         NaN     NaN             62     NaN
3   NW0098       PLO  NaN   NaN          68     6.2            2.3     240
4  XD57776       KIT   32   NaN         NaN     NaN            NaN     NaN
5  FC98763       ABC  NaN  0.87          68     NaN           2.67     NaN
Answer 1 (score: 1)
A pure pandas solution:
import pandas as pd
# some sample data...
rows = [('XD59876','KEN','height','5.9'),
('XD59876','KEN','weight','0.09'),
('XD59876','KEN','o2_sat','2'),
('FC00187 ','ROW','height','5.11')]
df = pd.DataFrame(rows, columns=['Key','Facility','Measurement','Value'])
# move everything but Value to the index
df.set_index(['Key', 'Facility', 'Measurement'], inplace=True)
# convert the Measurement index to column labels
df = df.unstack('Measurement')
# drop the outer 'Value' level from the column labels
df.columns = df.columns.droplevel()
# clear the 'Measurement' name left on the columns index
df.columns.name = ''
# make Key and Facility regular columns again
df.reset_index(inplace=True)
print(df)
The output is:
       Key  Facility  height  o2_sat  weight
0  FC00187       ROW    5.11     NaN     NaN
1  XD59876       KEN     5.9       2    0.09
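The steps above can also be written as a single chain. This is only a minimal sketch, assuming the long-format frame uses the column names from the question (Key, Facility, Measurement, Value) and that each (Key, Facility, Measurement) combination occurs at most once, since unstack raises an error on duplicate index entries. The names long_df and wide_df and the three sample rows are made up for illustration.

```python
import pandas as pd

# Illustrative long-format rows (not the full sample data from the question)
rows = [
    ("XD59876", "KEN", "height", "5.9"),
    ("XD59876", "KEN", "weight", "180"),
    ("FC00187", "ROW", "height", "5.11"),
]
long_df = pd.DataFrame(rows, columns=["Key", "Facility", "Measurement", "Value"])

wide_df = (
    long_df.set_index(["Key", "Facility", "Measurement"])["Value"]  # Series of values
    .unstack("Measurement")        # measurement names become column labels
    .rename_axis(None, axis=1)     # drop the leftover 'Measurement' columns name
    .reset_index()                 # Key and Facility back to regular columns
)
print(wide_df)
```

Selecting the Value column before unstacking yields a Series, so there is no extra 'Value' level to drop afterwards. No aggregation happens at any point; values that are missing for a given key simply show up as NaN.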
Answer 2 (score: 1)
The answers from Divakar & SPKoder work perfectly. Here is what I learned along the way.
# Lists
# -----------------
column_header = []
# Regular Expressions Used To Pull Measurement Values
# ---------------------------------------------------
measurement_nameRegex = r'name="([^"]+)"'
measurement_valueRegex = r'value="([^"]+)"'
# Processing
# -----------------
# Create A List Of Values That Needs To Be Transposed
for index, line in enumerate(result):
    measurement_name = re.findall(measurement_nameRegex, line)
    column_header.extend(measurement_name)
# Create Column Header
all_names = ['Key', 'Facility'] + list(set(column_header))
# Create Empty Dataframe With Column Header
df = pd.DataFrame(columns=all_names)
# Iterate through list
# ---------------------------------------------------
# Hold On To Index For Each Record
for index, line in enumerate(result):
    # Extract CSV values
    key, fac, measurements = line.split(',', 2)
    # Create list using regular expression
    measurement_name = re.findall(measurement_nameRegex, measurements)
    measurement_value = re.findall(measurement_valueRegex, measurements)
    # Insert Values Into Dataframe Based On Index
    df.loc[index, ['Key', 'Facility']] = [key, fac]
    df.loc[index, measurement_name] = measurement_value
df.to_csv(output_file_path)
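As a variation on the loop above, here is a minimal sketch that collects one dict per input line and builds the frame in a single call, rather than growing it row by row with df.loc. The combined pattern pairRegex and the two-line sample are assumptions made for illustration; the pattern presumes every name is immediately followed by its value, as in the sample data.

```python
import re
import pandas as pd

# Two lines in the same format as the question's sample data
result = [
    'FC00187,ROW,name="height",value="5.11",name="weight",value="210"',
    'XD57776,KIT,name="BMI",value="32"',
]

# Hypothetical combined pattern: captures each name together with its value
pairRegex = r'name="([^"]+)",value="([^"]+)"'

records = []
for line in result:
    key, fac, measurements = line.split(',', 2)
    record = {'Key': key, 'Facility': fac}
    # findall returns (name, value) tuples, which dict.update accepts directly
    record.update(re.findall(pairRegex, measurements))
    records.append(record)

# Keys missing from a record become NaN in that row
df = pd.DataFrame(records)
print(df)
```

With this approach the column header does not need to be computed in a separate first pass, because the DataFrame constructor takes the union of the keys across all records.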
Answer 3 (score: 0)
I think you can do this with pandas.pivot_table:
In[75]: import pandas as pd
In[76]: df = pd.DataFrame({'Key': [1] * 9 + [2] * 9, 'Facility': (['a'] * 3 + ['b'] * 3) * 3, 'Measurement': list(range(10, 19)) * 2, 'value': range(18)})
In[77]: df
Out[77]:
    Facility  Key  Measurement  value
 0         a    1           10      0
 1         a    1           11      1
 2         a    1           12      2
 3         b    1           13      3
 4         b    1           14      4
 5         b    1           15      5
 6         a    1           16      6
 7         a    1           17      7
 8         a    1           18      8
 9         b    2           10      9
10         b    2           11     10
11         b    2           12     11
12         a    2           13     12
13         a    2           14     13
14         a    2           15     14
15         b    2           16     15
16         b    2           17     16
17         b    2           18     17
In[78]: pd.pivot_table(df, values='value', index=['Key', 'Facility'], columns=['Measurement'])
Out[78]:
Measurement    10   11   12   13   14   15   16   17   18
Key Facility
1   a           0    1    2  NaN  NaN  NaN    6    7    8
    b         NaN  NaN  NaN    3    4    5  NaN  NaN  NaN
2   a         NaN  NaN  NaN   12   13   14  NaN  NaN  NaN
    b           9   10   11  NaN  NaN  NaN   15   16   17
Or, if you would rather have 'Facility' and 'Key' as regular columns instead of as the index, just append reset_index():
In[79]: pd.pivot_table(df, values='value', index=['Key', 'Facility'], columns=['Measurement']).reset_index()
Out[79]:
Measurement  Key  Facility   10   11   12   13   14   15   16   17   18
0              1         a    0    1    2  NaN  NaN  NaN    6    7    8
1              1         b  NaN  NaN  NaN    3    4    5  NaN  NaN  NaN
2              2         a  NaN  NaN  NaN   12   13   14  NaN  NaN  NaN
3              2         b    9   10   11  NaN  NaN  NaN   15   16   17
Note that all the NaN results come from combinations of Key, Facility and Measurement that do not appear in my sample table.
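On the question's concern about aggregation: pivot_table applies an aggregation function to every cell, and the default is the mean, which is not meaningful for the string values produced by the regex extraction. One possible workaround, sketched below, is to pass aggfunc='first'. This assumes each (Key, Facility, Measurement) combination occurs only once, so that "first" simply passes the single value through unchanged; the three sample rows are made up for illustration.

```python
import pandas as pd

# Illustrative long-format rows with string values, as in the question
df = pd.DataFrame(
    [
        ("XD59876", "KEN", "height", "5.9"),
        ("XD59876", "KEN", "weight", "180"),
        ("FC00187", "ROW", "height", "5.11"),
    ],
    columns=["Key", "Facility", "Measurement", "Value"],
)

wide = pd.pivot_table(
    df,
    values="Value",
    index=["Key", "Facility"],
    columns="Measurement",
    aggfunc="first",  # take the single value per cell instead of averaging
).reset_index()
print(wide)
```

If a combination did occur more than once, 'first' would silently keep only the first value, so it is worth checking for duplicates before relying on this.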