Question

I have a CSV file with a large amount of data, but the data in it has not been cleaned. A sample of the CSV data looks like this:

country     branch      no_of_employee     total_salary    count_DOB   count_email
  x            a            30                 2500000        20            25
  x            b            20                 350000         15            20
  y            c            30                 4500000        30            30
  z            d            40                 5500000        40            40
  z            e            10                 1000000        10            10
  z            f            15                 1500000        15            15

After applying groupby, I am not getting the correct result.

df = data_df.groupby(['country', 'customer_branch']).count()

The result looks like this:

country  branch    no of employees   
x          1           30   
x          1           20
y          1           30
z          3           65

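Country x is repeated twice. This is caused by the source data: in the source file, the country field contains both "X" and "X " (the same value with extra whitespace). That is why X shows up twice. How can I get around this problem using pandas?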

Answer 1

You can call the vectorized str.strip to trim leading and trailing whitespace:

df['country'] = df['country'].str.strip(' ')

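The above should clean your data. You can then call groupby to get the result you want, or call set_index so that you can sum on an index level, which looks like what you are really after.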
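One detail worth checking on the strip call: str.strip(' ') removes only space characters, while str.strip() with no argument removes any leading or trailing whitespace, including tabs and newlines. Here is a minimal sketch of the difference; the values in the Series are made up for illustration and are not taken from the asker's file.

```python
import pandas as pd

# Hypothetical keys: a clean value, one with a trailing space,
# and one with a leading tab.
s = pd.Series(['x', 'x ', '\tx'])

only_spaces = s.str.strip(' ')   # strips space characters only
all_whitespace = s.str.strip()   # strips spaces, tabs, newlines, ...

# The tab-prefixed value survives strip(' ') and would still form
# its own group; strip() collapses all three to the same key.
print(only_spaces.nunique())
print(all_whitespace.nunique())
```

If the file might contain tabs or other stray whitespace, the no-argument form is probably the safer choice.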

Example:

In [4]:
df = pd.DataFrame({'country':['x', 'x ','y','z','z','z'], 'branch':list('abcdef'), 'no_of_employee':[30,20,30,40,10,15]})
df

Out[4]:
  branch country  no_of_employee
0      a       x              30
1      b      x               20
2      c       y              30
3      d       z              40
4      e       z              10
5      f       z              15

In [9]:
df['country'] = df['country'].str.strip()
df.set_index(['country', 'branch']).sum(level=0)

Out[9]:
         no_of_employee
country                
x                    50
y                    30
z                    65
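Note that the level argument of sum was deprecated in pandas 1.3 and removed in pandas 2.0, so the session above may not run on a recent install. A sketch of the same computation with groupby, using the same sample frame; the agg step at the end is my guess at the per-country summary the question seems to want (branch count plus employee total), not something the question states outright.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['x', 'x ', 'y', 'z', 'z', 'z'],
    'branch': list('abcdef'),
    'no_of_employee': [30, 20, 30, 40, 10, 15],
})

# Normalize the key first so 'x' and 'x ' fall into the same group.
df['country'] = df['country'].str.strip()

# Replacement for .sum(level=0) on the MultiIndex.
by_level = df.set_index(['country', 'branch']).groupby(level=0).sum()

# The same totals without building an index first.
totals = df.groupby('country')['no_of_employee'].sum()

# Assumed target shape: branches per country and summed employees.
summary = df.groupby('country').agg(
    branch=('branch', 'count'),
    no_of_employee=('no_of_employee', 'sum'),
)
print(summary)
```

To write the cleaned data back out, df.to_csv with a path of your choosing and index=False should do it.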
