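I have an input CSV file and need to sum all the values in one of its columns, but the values are not plain integers and I'm not sure how to handle them.
The total should come out to around 15k, which is the sum of the whole column. I'm using a pandas DataFrame to hold the .csv file.
Here is one column from my input .csv file: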
DAMAGE_PROPERTY
0K
0K
2.5K
2.5K
.25K
.25K
2.5K
25K
2.5K
.25K
25K
25K
250K
2.5K
25K
2.5K
2.5K
2.5K
0K
2.5K
.25K
2.5K
25K
Answer 0 (score: 4)
I think you need to first remove K with str.replace, then cast to float with astype and add the values up with sum:
print (df.DAMAGE_PROPERTY.str.replace('K','').astype(float).sum())
401.0
You can then multiply by 1000:
print (df.DAMAGE_PROPERTY.str.replace('K','').astype(float).sum() * 1000)
401000.0
If you need to append K:
print (str(df.DAMAGE_PROPERTY.str.replace('K','').astype(float).sum()) + 'K')
401.0K
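For reference, here is a minimal self-contained sketch of the same steps. It assumes the column holds exactly the 23 values listed in the question; the `values` list and the `total_k` name are illustrative, and `regex=False` is passed so that 'K' is treated as a literal character.

```python
import pandas as pd

# Values copied from the DAMAGE_PROPERTY column shown in the question
values = ["0K", "0K", "2.5K", "2.5K", ".25K", ".25K", "2.5K", "25K",
          "2.5K", ".25K", "25K", "25K", "250K", "2.5K", "25K", "2.5K",
          "2.5K", "2.5K", "0K", "2.5K", ".25K", "2.5K", "25K"]
df = pd.DataFrame({"DAMAGE_PROPERTY": values})

# Strip the literal 'K', convert to float, and sum (result is in thousands)
total_k = (df["DAMAGE_PROPERTY"]
           .str.replace("K", "", regex=False)
           .astype(float)
           .sum())
print(total_k)         # 401.0
print(total_k * 1000)  # 401000.0
```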
Edit from the comments:
If you need the output in K:
print (df)
DAMAGE_PROPERTY
0 2.5K
1 2.5K
2 25M
# create a mask for rows whose value contains `M`
mask = df.DAMAGE_PROPERTY.str.contains('M')
print (mask)
0 False
1 False
2 True
Name: DAMAGE_PROPERTY, dtype: bool
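# strip the suffix, then multiply by 1000 where the mask is True
# (regex=True is required for the character class in pandas 2.0+)
df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.str.replace(r'[KM]', '', regex=True).astype(float)
df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.mask(mask, df.DAMAGE_PROPERTY*1000)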
print (df)
DAMAGE_PROPERTY
0 2.5
1 2.5
2 25000.0
print (df['DAMAGE_PROPERTY'].sum())
25005.0
print (str(df['DAMAGE_PROPERTY'].sum()) + 'K' )
25005.0K
If you need the output as a plain number:
df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.str.replace(r'[KM]', '', regex=True).astype(float)
df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.mask(mask, df.DAMAGE_PROPERTY*1000) * 1000
print (df)
DAMAGE_PROPERTY
0 2500.0
1 2500.0
2 25000000.0
print (df['DAMAGE_PROPERTY'].sum())
25005000.0
EDIT1:
If there are values with B:
print (df)
DAMAGE_PROPERTY
0 2.5K
1 2.5B
2 25M
maskM = df.DAMAGE_PROPERTY.str.contains('M')
print (maskM)
0 False
1 False
2 True
Name: DAMAGE_PROPERTY, dtype: bool
maskB = df.DAMAGE_PROPERTY.str.contains('B')
print (maskB)
0 False
1 True
2 False
Name: DAMAGE_PROPERTY, dtype: bool
df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.str.replace(r'[KMB]', '', regex=True).astype(float)
df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.mask(maskM, df.DAMAGE_PROPERTY*1000)
df['DAMAGE_PROPERTY'] = df.DAMAGE_PROPERTY.mask(maskB, df.DAMAGE_PROPERTY*1000000)
print (df)
DAMAGE_PROPERTY
0 2.5
1 2500000.0
2 25000.0
print (df['DAMAGE_PROPERTY'])
0 2.5
1 2500000.0
2 25000.0
Name: DAMAGE_PROPERTY, dtype: float64
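As a possible alternative to building one mask per suffix, the suffix can be looked up in a dictionary of multipliers. This is only a sketch: the `multipliers` dictionary and the `number`, `suffix`, and `in_thousands` names are assumptions for illustration, and it assumes every value ends in exactly one of K, M, or B.

```python
import pandas as pd

df = pd.DataFrame({"DAMAGE_PROPERTY": ["2.5K", "2.5B", "25M"]})

# Multipliers expressed in thousands, so that 'K' values stay unchanged
multipliers = {"K": 1, "M": 1_000, "B": 1_000_000}

number = df["DAMAGE_PROPERTY"].str[:-1].astype(float)  # numeric part
suffix = df["DAMAGE_PROPERTY"].str[-1]                  # last character

in_thousands = number * suffix.map(multipliers)
print(in_thousands.tolist())  # [2.5, 2500000.0, 25000.0]
```

A suffix missing from the dictionary maps to NaN, which makes unexpected units visible instead of silently treating them as K.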
Answer 1 (score: 3)
Try this:
Following this pattern, you can add 'B' for billions. Values without a recognized suffix are not converted; the function returns 0 for them.
def chgFormat(x):
    # convert a string such as '2.5K' to a float expressed in thousands
    newFormat = 0
    if x[-1] == 'K': newFormat = float(x[:-1])
    elif x[-1] == 'H': newFormat = float(x[:-1])/10
    elif x[-1] == 'M': newFormat = float(x[:-1])*1000
    elif x[-1] == 'B': newFormat = float(x[:-1])*1000000
    return newFormat
print(str(sum(df['DAMAGE_PROPERTY'].dropna().apply(chgFormat))) + 'K')
print(str(sum(df['DAMAGE_PROPERTY'].dropna().apply(chgFormat))/1000) + 'M')
Results:
401.0K
0.401M
Use this if there are NaNs:
print(str(sum(df3['DAMAGE_PROPERTY'].dropna().apply(chgFormat))) + 'K')
print(str(sum(df3['DAMAGE_PROPERTY'].dropna().apply(chgFormat))/1000) + 'M')
Edit #3:
print(sum(df3['DAMAGE_PROPERTY'].dropna().apply(chgFormat)))
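The same idea can be sketched with a lookup table instead of an if/elif chain. The `DIVISOR_OR_FACTOR` table and the `to_thousands` name are hypothetical, and treating a bare number as a count of single units is an assumption that the original function does not make.

```python
# Scale of each suffix relative to one thousand (hypothetical helper table)
DIVISOR_OR_FACTOR = {"H": 0.1, "K": 1.0, "M": 1_000.0, "B": 1_000_000.0}

def to_thousands(text):
    """Convert a string such as '2.5K' or '25M' to a float in thousands."""
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in DIVISOR_OR_FACTOR:
        return float(text[:-1]) * DIVISOR_OR_FACTOR[suffix]
    # Assumption: a value with no suffix is a plain count of single units
    return float(text) / 1000.0

print(to_thousands("2.5K"))  # 2.5
print(to_thousands("25M"))   # 25000.0
print(to_thousands("500"))   # 0.5
```

With pandas this could be applied the same way as above, for example `df['DAMAGE_PROPERTY'].dropna().apply(to_thousands).sum()`.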
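Answer 2 (score: 1)
I'm not familiar with pandas / DataFrames, but you can use plain Python logic. Assuming your file follows the same pattern, with "K" as the last character of every line, consider the following: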
>>> float("2.0K"[:-1])
2.0
>>> float("2.0M"[:-1])
2.0
You can apply the snippet above to each line. For example:
# assuming you've read the contents into a list called "lines"
values = []
for s in lines:
    try:
        # drop the trailing unit character and parse the rest
        values.append(float(s[:-1]))
    except ValueError:
        # found something else; log it or something
        pass
Finally, just add them up with Python's built-in sum function:
total = sum(values)
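Putting those pieces together, a minimal end-to-end sketch might look like the following. It uses the standard-library csv module, and an in-memory `sample` string stands in for the real file (both the sample data and the variable names are assumptions for illustration).

```python
import csv
import io

# Stand-in for the real file; with a file on disk, use open(path, newline="")
sample = io.StringIO("DAMAGE_PROPERTY\n0K\n2.5K\n.25K\n25K\n")

reader = csv.DictReader(sample)
lines = [row["DAMAGE_PROPERTY"] for row in reader]

values = []
for s in lines:
    try:
        # drop the trailing unit character and parse the rest
        values.append(float(s[:-1]))
    except ValueError:
        # skip anything that is not a number followed by one unit character
        pass

total = sum(values)
print(total)  # 27.75
```

Note that this ignores which unit character was removed, so it is only meaningful when every row uses the same unit.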
Answer 3 (score: 1)
I wrote these functions:
import re
import numpy as np
import pandas as pd
# multiplier for each unit suffix, in either case
mapper = dict(k=1e3, K=1e3,
              m=1e6, M=1e6,
              b=1e9, B=1e9)
pot = ('K', 'M', 'B')
def revmap(value):
    # pick the largest suffix that fits and format the number with it
    powers_of_K = int(np.log10(value) // 3)
    if powers_of_K > len(pot):
        suffix = pot[-1]
    else:
        suffix = pot[powers_of_K - 1]
    k = mapper[suffix]
    f = ("%f" % (value / k)).rstrip('0').rstrip('.')
    return f + suffix
def sum_with_units(s):
    # split each entry into its numeric part and its unit, then scale and sum
    regex = r'(?P<value>.*)(?P<unit>k|m)'
    s_ = s.str.extract(regex, expand=True, flags=re.IGNORECASE)
    summed = (s_.value.astype(float) * s_.unit.map(mapper)).sum()
    return revmap(summed)
sum_with_units(df.DAMAGE_PROPERTY)
'401K'
Taking a larger frame:
df_plus = pd.concat([df for _ in range(2500)])
sum_with_units(df_plus.DAMAGE_PROPERTY)
'1.0025B'