pandas csv - cleaning up data in the wrong columns

Time: 2019-12-30 06:28:35

Tags: python pandas csv

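I am working with a data set in which some rows are missing a column, so the subsequent columns are shifted left into the missing column's place. It looks something like this: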

              date    tap     time    count
0         20160730     on     02:30   415.0
1         20160730     on     02:30    18.0
2         20160730     on     02:30    24.0
3         20160730     on     02:30    31.0
4         20160730     on     13:30    64.0
...            ...    ...       ...     ...
169549    20170101  23:45        29     NaN
169550    20170101  23:45        34     NaN
169551    20170101  23:45        43     NaN
169552    20170101  23:45        42     NaN
169553    20170101  23:45        60     NaN

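Note how, in the last 5 rows, the values for 'time' are in the 'tap' column and the values for 'count' are in the 'time' column. This happens not just in the last few rows but throughout the data set.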

I am trying to create a function that does this:

for each item in the 'tap' column
   if item is neither 'on' nor 'off', then
      the value of the 'count' column in that row takes on the value of the 'time' column
      the value of the 'time' column in that row takes on the value of the 'tap' column
      the value of the 'tap' column in that row is replaced by a string "N/A"

So the end result would hopefully look like this:

              date   tap    time    count
0         20160730    on    02:30   415.0
1         20160730    on    02:30    18.0
2         20160730    on    02:30    24.0
3         20160730    on    02:30    31.0
4         20160730    on    13:30    64.0
...            ...   ...      ...     ...
169549    20170101   N/A    23:45      29
169550    20170101   N/A    23:45      34
169551    20170101   N/A    23:45      43
169552    20170101   N/A    23:45      42
169553    20170101   N/A    23:45      60

So far, I have only loaded the csv file...

import pandas as pd 

df = pd.read_csv('data.csv', dtype={
    'date': str,
    'tap': str,
    'time': str,
    'count': float})
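
To reproduce the layout without the original file, here is a minimal sketch. The CSV text is made up for illustration and assumes the affected lines simply have one field fewer than the header, in which case read_csv pads the last column with NaN:

```python
import io
import pandas as pd

# Hypothetical miniature of data.csv: the last two lines lack the 'tap' field.
csv_text = """date,tap,time,count
20160730,on,02:30,415
20160730,on,13:30,64
20170101,23:45,29
20170101,23:45,34
"""

df = pd.read_csv(io.StringIO(csv_text), dtype={
    "date": str,
    "tap": str,
    "time": str,
    "count": float})

# The short lines end up with the time in 'tap', the count (as text)
# in 'time', and NaN in 'count'.
print(df)
```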

I'm sure I'm missing something simple, but I have spent hours on Google and can't find the right syntax to do this. Please let me know how to make this work.

2 answers:

Answer 0 (score: 2)

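Use DataFrame.shift on the rows selected by a Series.isin condition. The columns must first be converted to strings, to avoid missing values caused by mismatched dtypes (as with the last column):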

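# True for rows whose 'tap' value is valid
m = df['tap'].isin(['on', 'off'])
cols = ['tap', 'time', 'count']
# In the other rows, move each value one column to the right
df.loc[~m, cols] = df.loc[~m, cols].astype(str).shift(axis=1)
df['count'] = df['count'].astype(int)
print(df)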
            date  tap   time  count
0       20160730   on  02:30    415
1       20160730   on  02:30     18
2       20160730   on  02:30     24
3       20160730   on  02:30     31
4       20160730   on  13:30     64
169549  20170101  NaN  23:45     29
169550  20170101  NaN  23:45     34
169551  20170101  NaN  23:45     43
169552  20170101  NaN  23:45     42
169553  20170101  NaN  23:45     60
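
The output above leaves NaN in 'tap', whereas the question asks for the string "N/A". A minimal sketch of one way to get that, using a made-up four-row frame rather than the real data. One caveat worth knowing: read_csv treats the text N/A as missing by default, so the label only survives a round trip through CSV if that default is switched off with keep_default_na=False.

```python
import io
import pandas as pd

# Hypothetical frame standing in for the repaired data (NaN in 'tap').
df = pd.DataFrame({"tap": ["on", None, "off", None],
                   "count": [415, 29, 18, 34]})

# Replace the missing markers with the literal string the question asks for.
df["tap"] = df["tap"].fillna("N/A")

# Round trip through CSV text to show the parsing caveat.
csv_text = df.to_csv(index=False)
reloaded = pd.read_csv(io.StringIO(csv_text))                     # "N/A" is parsed as missing
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)  # "N/A" stays a string
```

If the label only matters for display, keeping NaN in the frame and filling it at print time may be the simpler choice.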

If you want to assign the columns directly instead of shifting:

import numpy as np

m = df['tap'].isin(['on', 'off'])
# For the bad rows, copy 'tap' -> 'time' and 'time' -> 'count' in one step
df.loc[~m, ['time', 'count']] = df.loc[~m, ['tap', 'time']].to_numpy()
df.loc[~m, 'tap'] = np.nan
df['count'] = df['count'].astype(int)
print(df)
            date  tap   time  count
0       20160730   on  02:30    415
1       20160730   on  02:30     18
2       20160730   on  02:30     24
3       20160730   on  02:30     31
4       20160730   on  13:30     64
169549  20170101  NaN  23:45     29
169550  20170101  NaN  23:45     34
169551  20170101  NaN  23:45     43
169552  20170101  NaN  23:45     42
169553  20170101  NaN  23:45     60
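
Since the question asks for a function, the same idea can be wrapped up as below. This is only a sketch: the name fix_shifted_rows, its valid_taps parameter and the sample frame are made up for illustration, and it converts the count text with pd.to_numeric instead of casting the whole column afterwards.

```python
import pandas as pd

def fix_shifted_rows(df, valid_taps=("on", "off")):
    """Return a copy of df with the shifted rows moved one column right."""
    out = df.copy()
    shifted = ~out["tap"].isin(list(valid_taps))
    # Fill 'count' first, while 'time' still holds the count as text.
    out.loc[shifted, "count"] = pd.to_numeric(out.loc[shifted, "time"])
    out.loc[shifted, "time"] = out.loc[shifted, "tap"]
    out.loc[shifted, "tap"] = "N/A"
    return out

# Hypothetical miniature of the data set in the question.
sample = pd.DataFrame({
    "date": ["20160730", "20160730", "20170101", "20170101"],
    "tap": ["on", "off", "23:45", "23:45"],
    "time": ["02:30", "13:30", "29", "34"],
    "count": [415.0, 18.0, float("nan"), float("nan")],
})

fixed = fix_shifted_rows(sample)
print(fixed)
```

Working on a copy leaves the loaded frame untouched, which makes it easy to compare the rows before and after the repair.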

Answer 1 (score: 0)

Try:

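# True for rows whose 'tap' value is valid
valid = df['tap'].isin(['on', 'off'])
# Fill 'count' before 'time' is overwritten
df.loc[~valid, 'count'] = df.loc[~valid, 'time'].astype(float)
df.loc[~valid, 'time'] = df.loc[~valid, 'tap']
df.loc[~valid, 'tap'] = None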