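Summing a column of values that include letters in Python

Asked: 2016-07-25 05:11:23

Tags: python csv pandas dataframe

I have an input CSV file and need to sum all the values in one of its columns, but the values are not plain integers and I am not sure how to handle them.

The total should come out at around 15k, which is the sum of the whole column. I am using a pandas DataFrame to hold the .csv file.

Here is one column from my input .csv file: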

DAMAGE_PROPERTY
0K
0K
2.5K
2.5K
.25K
.25K
2.5K
25K
2.5K
.25K
25K
25K
250K
2.5K
25K
2.5K
2.5K
2.5K
0K
2.5K
.25K
2.5K
25K

4 Answers:

Answer 0: (score: 4)

I think you need to remove the K with str.replace first, then cast to float with astype, and finally call sum:

print (df.DAMAGE_PROPERTY.str.replace('K','').astype(float).sum())
401.0

You can then multiply by 1000:

print (df.DAMAGE_PROPERTY.str.replace('K','').astype(float).sum() * 1000)
401000.0

If you need to append K:

print (str(df.DAMAGE_PROPERTY.str.replace('K','').astype(float).sum()) + 'K')
401.0K
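
To try the steps above end to end, here is a minimal self-contained sketch. It assumes the sample column from the question is typed in by hand instead of being read from the CSV file; the names `raw` and `total_k` are illustrative only and do not come from the answer.

```python
import pandas as pd

# Sample values copied from the question's DAMAGE_PROPERTY column.
raw = ["0K", "0K", "2.5K", "2.5K", ".25K", ".25K", "2.5K", "25K", "2.5K",
       ".25K", "25K", "25K", "250K", "2.5K", "25K", "2.5K", "2.5K", "2.5K",
       "0K", "2.5K", ".25K", "2.5K", "25K"]
df = pd.DataFrame({"DAMAGE_PROPERTY": raw})

# Strip the literal "K" suffix, convert to float, then sum.
# regex=False makes the replacement a plain substring replacement.
total_k = (
    df["DAMAGE_PROPERTY"]
    .str.replace("K", "", regex=False)
    .astype(float)
    .sum()
)

print(total_k)          # total expressed in thousands
print(total_k * 1000)   # total as a plain number
```

Passing `regex=False` is optional here, since "K" has no special meaning in a regular expression, but it keeps the behaviour the same across pandas versions.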

Edit, following a comment:

If the output should be in K:

print (df)
  DAMAGE_PROPERTY
0            2.5K
1            2.5K
2             25M

# create a mask that is True where the value contains `M`
mask = df.DAMAGE_PROPERTY.str.contains('M')
print (mask)
0    False
1    False
2     True
Name: DAMAGE_PROPERTY, dtype: bool

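# strip the suffix, then multiply by 1000 where the mask is True
# (regex=True is required in recent pandas versions for the character class to work)
df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.str.replace(r'[KM]', '', regex=True).astype(float)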
df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.mask(mask, df.DAMAGE_PROPERTY*1000)
print (df)
   DAMAGE_PROPERTY
0              2.5
1              2.5
2          25000.0

print (df['DAMAGE_PROPERTY'].sum())
25005.0

print (str(df['DAMAGE_PROPERTY'].sum()) + 'K' )
25005.0K

If the output should be a plain number:

df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.str.replace(r'[KM]', '', regex=True).astype(float)
df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.mask(mask, df.DAMAGE_PROPERTY*1000) * 1000
print (df)
   DAMAGE_PROPERTY
0           2500.0
1           2500.0
2       25000000.0

print (df['DAMAGE_PROPERTY'].sum())
25005000.0

EDIT1:

If some values use B:

print (df)
  DAMAGE_PROPERTY
0            2.5K
1            2.5B
2             25M

maskM = df.DAMAGE_PROPERTY.str.contains('M')
print (maskM)
0    False
1    False
2     True
Name: DAMAGE_PROPERTY, dtype: bool

maskB = df.DAMAGE_PROPERTY.str.contains('B')
print (maskB)
0    False
1     True
2    False
Name: DAMAGE_PROPERTY, dtype: bool

df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.str.replace(r'[KMB]', '', regex=True).astype(float)
df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.mask(maskM, df.DAMAGE_PROPERTY*1000)
df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.mask(maskB, df.DAMAGE_PROPERTY*1000000)
print (df)
   DAMAGE_PROPERTY
0              2.5
1        2500000.0
2          25000.0

print (df['DAMAGE_PROPERTY'])
0          2.5
1    2500000.0
2      25000.0
Name: DAMAGE_PROPERTY, dtype: float64
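
The one-mask-per-suffix steps above can also be written with a lookup table. This is only a sketch of that idea, not part of the original answer: the dictionary `MULTIPLIERS_IN_K` and the variable names are hypothetical, and it assumes every value ends in exactly one suffix character.

```python
import pandas as pd

# Hypothetical multiplier table, expressed in thousands to match the output above.
MULTIPLIERS_IN_K = {"K": 1, "M": 1_000, "B": 1_000_000}

df = pd.DataFrame({"DAMAGE_PROPERTY": ["2.5K", "2.5B", "25M"]})

col = df["DAMAGE_PROPERTY"]
numbers = col.str[:-1].astype(float)          # everything before the suffix
factors = col.str[-1].map(MULTIPLIERS_IN_K)   # suffix -> multiplier (NaN if unknown)

in_thousands = numbers * factors
print(in_thousands.tolist())
print(in_thousands.sum())
```

A suffix that is missing from the table maps to NaN, so unexpected units show up as NaN in the result; note that `sum()` skips NaN by default, so check for them explicitly if that matters.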

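Answer 1: (score: 3)

Try this:

Following this pattern, you can add 'B' for billions. Values without a recognised suffix are left out of the total (the function returns 0 for them).

def chgFormat(x):
    # Convert a string such as '2.5K' to a float expressed in thousands (K)
    newFormat = 0
    if   x[-1] == 'K': newFormat = float(x[:-1])            # thousands
    elif x[-1] == 'H': newFormat = float(x[:-1]) / 10       # hundreds
    elif x[-1] == 'M': newFormat = float(x[:-1]) * 1000     # millions
    elif x[-1] == 'B': newFormat = float(x[:-1]) * 1000000  # billions
    return newFormat

print(str(sum(df['DAMAGE_PROPERTY'].dropna().apply(chgFormat))) + 'K')
print(str(sum(df['DAMAGE_PROPERTY'].dropna().apply(chgFormat)) / 1000) + 'M')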

Results:
401.0K
0.401M

Use this if there are NaNs:

    print(str(sum(df3['DAMAGE_PROPERTY'].dropna().apply(chgFormat))) + 'K')
    print(str(sum(df3['DAMAGE_PROPERTY'].dropna().apply(chgFormat)) / 1000) + 'M')

Edit #3:

    print(sum(df3['DAMAGE_PROPERTY'].dropna().apply(chgFormat)))
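
As a variation on the same idea, the suffix checks can be driven by a dictionary so the function also tolerates missing values without a separate dropna() step. This is a hedged sketch: `FACTORS_IN_K` and `to_thousands` are made-up names, and treating missing or unsuffixed values as 0 is an assumption that mirrors chgFormat's default, not a rule from the question.

```python
# Hypothetical factors, in thousands, mirroring the branches of chgFormat above.
FACTORS_IN_K = {"H": 0.1, "K": 1.0, "M": 1_000.0, "B": 1_000_000.0}

def to_thousands(x):
    """Convert a string such as '2.5K' to a float in thousands; anything else gives 0.0."""
    if not isinstance(x, str) or not x:
        return 0.0              # NaN, None or empty string: contributes nothing
    factor = FACTORS_IN_K.get(x[-1])
    if factor is None:
        return 0.0              # no recognised suffix, same default as chgFormat
    return float(x[:-1]) * factor

# Small hand-made sample, including a NaN and a value with no suffix.
values = ["2.5K", "25M", "5H", float("nan"), "100"]
total = sum(to_thousands(v) for v in values)
print(str(total) + "K")
```

With a pandas column this could be applied as `df['DAMAGE_PROPERTY'].apply(to_thousands).sum()`.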

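Answer 2: (score: 1)

I am not familiar with pandas or DataFrames, but you can use simple Python logic. Assuming your file follows the same pattern throughout, with "K" as the last character of each line, consider the following: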

>>> float("2.0K"[:-1])
2.0
>>> float("2.0M"[:-1])
2.0

You can apply the snippet above to every line. For example:

# assuming you've read the contents into a list called "lines"
values = []
for s in lines:
    try:
        values.append(float(s[:-1]))
    except ValueError:
        # found something else; log it or something
        pass

Finally, just add them up with Python's built-in sum function:

total = sum(values)
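
Putting the pieces above together without pandas, here is a small sketch that uses the standard csv module. It assumes the file has a header row with a DAMAGE_PROPERTY column; the in-memory `csv_text` is a stand-in for the real file, and the "N/A" cell is invented to show the except branch.

```python
import csv
import io

# Stand-in for the real file; replace io.StringIO(...) with open(path, newline="").
csv_text = "DAMAGE_PROPERTY\n0K\n2.5K\n.25K\nN/A\n25K\n"

values = []
with io.StringIO(csv_text) as handle:
    for row in csv.DictReader(handle):
        cell = row["DAMAGE_PROPERTY"]
        try:
            values.append(float(cell[:-1]))   # drop the trailing "K"
        except ValueError:
            pass                              # skip cells that are not "<number>K"

total = sum(values)
print(total)
```

Because only the last character is dropped, the total is in thousands and any other suffix such as "M" would be read as if it were "K".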

Answer 3: (score: 1)

I wrote these functions:

import re

import numpy as np
import pandas as pd

mapper = dict(k=1e3, K=1e3,
              m=1e6, M=1e6,
              b=1e9, B=1e9)
pot = ('K', 'M', 'B')

def revmap(value):
    powers_of_K = int(np.log10(value) // 3)
    if powers_of_K > len(pot): 
        suffix = pot[-1]
    else:
        suffix = pot[powers_of_K - 1]

    k = mapper[suffix]
    f = ("%f" % (value / k)).rstrip('0').rstrip('.')
    return f + suffix

def sum_with_units(s):
    regex = r'(?P<value>.*)(?P<unit>k|m|b)'
    s_ = s.str.extract(regex, expand=True, flags=re.IGNORECASE)
    summed = (s_.value.astype(float) * s_.unit.map(mapper)).sum()
    return revmap(summed)

sum_with_units(df.DAMAGE_PROPERTY)

'401K'

Take a larger frame:

df_plus = pd.concat([df for _ in range(2500)])

sum_with_units(df_plus.DAMAGE_PROPERTY)

'1.0025B'