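This is part of a series of questions about a CSV file in which I am trying to condense multiple rows per client, based on the number of medical services each client received. There is one row per service. I have included the dataframe at the bottom.
I am trying to count how many times each client (identified by an id_profile number) received each type of service, and to put that count in a column named after the service type. So if a client received 3 early intervention services, I would put the number 3 in the "eisserv" column. Once that is done, I want to combine all of a client's rows into one.
The problem I am running into is populating 3 different columns with data that is all derived from a single column. I tried iterating over the rows and comparing against some strings. The code runs, but for reasons I can't figure out, all of the strings have changed to "25" by the time it finishes.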
import pandas as pd
df = pd.read_csv('fakeRWclient.csv')
df['PrimaryServiceCategory'] = df['PrimaryServiceCategory'].map({'Referral for Health Care/Supportive Services': '33', 'Health Education/Risk reduction': '25', 'Early Intervention Services (Parts A and B)': '11'})
df['ServiceDate'] = pd.to_datetime(df['ServiceDate'], format="%m/%d/%Y")
df['id_profile'] = df['id_profile'].apply(str)
df['served'] = df['id_profile'] + " " + df['PrimaryServiceCategory']
df['count'] = df['served'].map(df['served'].value_counts())
eis = "11"
ref = "33"
her = "25"
print("Here are the string values")
print(eis)
print(ref)
print(her)
df['herrserv']=""
df['refserv']=""
df['eisserv']=""
for index in df.itertuples():
    for eis in df['PrimaryServiceCategory']:
        df['eisserv'] = df['count']
    for her in df['PrimaryServiceCategory']:
        df['herrserv'] = df['count']
    for ref in df['PrimaryServiceCategory']:
        df['refserv'] = df['count']
print("Here are the string values")
print(eis)
print(ref)
print(her)
Here is the output:
Here are the string values
11
33
25
Here are the string values
25
25
25
id_profile ServiceDate PrimaryServiceCategory served count herrserv \
0 439 2017-12-05 25 439 25 1 1
1 444654 2017-01-25 25 444654 25 2 2
2 56454 2017-12-05 33 56454 33 1 1
3 56454 2017-01-25 25 56454 25 2 2
4 444654 2017-03-01 25 444654 25 2 2
5 56454 2017-01-01 25 56454 25 2 2
6 12222 2017-01-05 11 12222 11 1 1
7 12222 2017-01-30 25 12222 25 3 3
8 12222 2017-03-01 25 12222 25 3 3
9 12222 2017-03-20 25 12222 25 3 3
refserv eisserv
0 1 1
1 2 2
2 1 1
3 2 2
4 2 2
5 2 2
6 1 1
7 3 3
8 3 3
9 3 3
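Why do the string values change? And is this the right approach for what I am trying to do?
Answer 0 (score: 2)
After mapping the integers to categories, you can use pandas.get_dummies and then join the result back onto the dataframe.
You can also add a "count" column that sums the 3 category counts.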
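import pandas as pd

df = pd.DataFrame({'id_profile': [439, 444654, 56454, 56454, 444654, 56454, 12222, 12222, 12222, 12222],
                   'ServiceDate': ['2017-12-05', '2017-01-25', '2017-12-05', '2017-01-25', '2017-03-01', '2017-01-01', '2017-01-05', '2017-01-30', '2017-03-01', '2017-03-20'],
                   'PrimaryServiceCategory': [25, 25, 33, 25, 25, 25, 11, 25, 25, 25]})
# Map the numeric category codes to short service names.
d = {11: 'eis', 33: 'ref', 25: 'her'}
df['Service'] = df['PrimaryServiceCategory'].map(d)
# One-hot encode the service and sum the indicators per client. Only id_profile
# and Service are passed in, so no other column gets summed or duplicated.
counts = pd.get_dummies(df[['id_profile', 'Service']], columns=['Service'])\
           .groupby('id_profile').sum()
# Join the per-client totals back onto every row of that client.
df = df.set_index('id_profile').join(counts)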
# ServiceDate PrimaryServiceCategory Service Service_eis \
# id_profile
# 439 2017-12-05 25 her 0
# 12222 2017-01-05 11 eis 1
# 12222 2017-01-30 25 her 1
# 12222 2017-03-01 25 her 1
# 12222 2017-03-20 25 her 1
# 56454 2017-12-05 33 ref 0
# 56454 2017-01-25 25 her 0
# 56454 2017-01-01 25 her 0
# 444654 2017-01-25 25 her 0
# 444654 2017-03-01 25 her 0
# Service_her Service_ref
# id_profile
# 439 1 0
# 12222 3 0
# 12222 3 0
# 12222 3 0
# 12222 3 0
# 56454 2 1
# 56454 2 1
# 56454 2 1
# 444654 2 0
# 444654 2 0
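The "count" column mentioned above is not shown in the answer's code. A minimal sketch of how it might look, assuming the same sample data; the column name `total` and the variable `per_client` are made up for illustration, and the join step is skipped so that the result has one row per client:

```python
import pandas as pd

# Sample data copied from the answer above (ServiceDate omitted for brevity).
df = pd.DataFrame({
    'id_profile': [439, 444654, 56454, 56454, 444654, 56454, 12222, 12222, 12222, 12222],
    'PrimaryServiceCategory': [25, 25, 33, 25, 25, 25, 11, 25, 25, 25],
})
df['Service'] = df['PrimaryServiceCategory'].map({11: 'eis', 33: 'ref', 25: 'her'})

# One indicator column per service, summed per client: one row per client.
per_client = (
    pd.get_dummies(df[['id_profile', 'Service']], columns=['Service'], dtype=int)
    .groupby('id_profile')
    .sum()
)

# Hypothetical 'total' column: the three per-service counts added together.
service_cols = ['Service_eis', 'Service_her', 'Service_ref']
per_client['total'] = per_client[service_cols].sum(axis=1)
print(per_client)
```

Because the result is not joined back onto the original rows, this also covers the "combine all client rows into one" part of the question.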
Answer 1 (score: 1)
I only made changes to your existing code.
import pandas as pd
df = pd.read_csv('fakeRWclient.csv')
df['PrimaryServiceCategory'] = df['PrimaryServiceCategory'].map({'Referral for Health Care/Supportive Services': '33', 'Health Education/Risk reduction': '25', 'Early Intervention Services (Parts A and B)': '11'})
df['ServiceDate'] = pd.to_datetime(df['ServiceDate'], format="%m/%d/%Y")
df['id_profile'] = df['id_profile'].apply(str)
print(df.groupby('id_profile').PrimaryServiceCategory.count())
The code above gives the following output:
id_profile
439 1
12222 4
56454 3
444654 2
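The output above is the total number of services per client. If a per-service breakdown is wanted, as the question describes, one possible sketch is to group on both columns and pivot the categories into columns. The sample values below are typed in by hand from the question's printed dataframe rather than read from the CSV, and the column names come from the question:

```python
import pandas as pd

# Hand-typed stand-in for the frame built from 'fakeRWclient.csv' (an assumption).
df = pd.DataFrame({
    'id_profile': ['439', '444654', '56454', '56454', '444654',
                   '56454', '12222', '12222', '12222', '12222'],
    'PrimaryServiceCategory': ['25', '25', '33', '25', '25',
                               '25', '11', '25', '25', '25'],
})

# Count rows per (client, category) pair, then turn categories into columns.
wide = (
    df.groupby(['id_profile', 'PrimaryServiceCategory'])
    .size()
    .unstack(fill_value=0)
    .rename(columns={'11': 'eisserv', '25': 'herrserv', '33': 'refserv'})
)
print(wide)
```

Each client ends up on a single row, with 0 filled in for services they never received.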
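Answer 2 (score: 1)
The values of eis, ref, and her change to "25" because you use them to loop over PrimaryServiceCategory, and the last value in that Series is "25". A statement like `for eis in ...` does not compare anything against eis; it uses eis as the loop variable, so it is reassigned on every iteration and keeps the last value once the loop ends.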
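The rebinding can be reproduced without pandas. In this small sketch the list `categories` is a made-up stand-in for df['PrimaryServiceCategory'], and `target`, `value`, and `matches` are illustrative names:

```python
eis = "11"
categories = ["25", "33", "11", "25"]  # hypothetical stand-in for the column

# The loop does not filter on eis == "11"; it assigns each element to eis in turn.
for eis in categories:
    pass
print(eis)  # now holds the last element of the list, not "11"

# To compare rather than rebind, keep the target in its own variable
# and give the loop variable a different name.
target = "11"
matches = sum(1 for value in categories if value == target)
print(matches)
```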
I also think looping like this is inefficient. It would be better to use groupby and transform:
# Select one column before transform so the result is a single Series.
df['count'] = df.groupby(['id_profile', 'PrimaryServiceCategory'])['ServiceDate'].transform('count')
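A self-contained sketch of the groupby/transform approach, using the sample data from the first answer. It selects a single column (ServiceDate) before calling transform so that the result is one Series that can be assigned to one column; the final drop_duplicates step and the name `summary` are additions for illustration, not part of the answer:

```python
import pandas as pd

df = pd.DataFrame({
    'id_profile': [439, 444654, 56454, 56454, 444654, 56454, 12222, 12222, 12222, 12222],
    'ServiceDate': ['2017-12-05', '2017-01-25', '2017-12-05', '2017-01-25', '2017-03-01',
                    '2017-01-01', '2017-01-05', '2017-01-30', '2017-03-01', '2017-03-20'],
    'PrimaryServiceCategory': [25, 25, 33, 25, 25, 25, 11, 25, 25, 25],
})

# transform returns one value per original row, aligned with df's index.
keys = ['id_profile', 'PrimaryServiceCategory']
df['count'] = df.groupby(keys)['ServiceDate'].transform('count')

# Hypothetical follow-up: keep one row per (client, category) pair.
summary = df.drop_duplicates(keys)[keys + ['count']]
print(summary)
```

Unlike an aggregation, transform keeps every original row, which is why a separate step is needed to collapse the duplicates.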