Concatenate 3+ columns only when the values are not of a specific data type

Time: 2019-08-02 19:40:41

Tags: python pandas numpy dataframe

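I have a dataframe that was pulled from SQL Server. The data was parsed incorrectly when it was converted to .csv, and now my columns contain the wrong data. I am trying to use pandas to put everything back in its place. Specifically, I have a field that should contain a "short description". Some of the descriptions were split across separate fields, and I would like to concatenate them all back into the proper field. The problem is that some of those fields contain dates that legitimately belong there, so they need to be skipped when concatenating.

I have tried using df.apply() in a number of different ways, but I cannot seem to "skip" the values that are of the pd.Timestamp data type.

For example: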

df_test.apply(lambda x: ' '.join(x) if type(x) != pd.Timestamp else '')
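
A minimal sketch of why an expression like the one above never sees a pd.Timestamp: with the default axis=0, apply passes each whole column to the function as a Series, so the type check is made against the Series and not against the individual values. The small frame and the names `seen_types` and `joined` below are illustrative assumptions, not part of the original post.

```python
import pandas as pd

# Small frame in the same spirit as the sample data (values are illustrative).
df = pd.DataFrame(
    [["some description", pd.Timestamp("2019-01-01"), "some more text"],
     ["another description", "some other text", pd.Timestamp("2019-01-02")]],
    columns=["short_desc", "date_1", "date_2"],
)

# With the default axis=0, the function receives one whole column at a time.
seen_types = df.apply(lambda x: type(x).__name__)
print(seen_types.tolist())  # each entry names the Series type, never Timestamp

# Going row by row and checking every value lets the timestamps be skipped.
joined = df.apply(
    lambda row: " ".join(v for v in row if isinstance(v, str) and v),
    axis=1,
)
print(joined.tolist())
```

The second call works because the isinstance check is applied to each cell of a row, which is the idea both answers build on.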

Example df:

df_so_test = pd.DataFrame([[1, 2, 'some description', pd.to_datetime('2019-01-01'), 'some more text', '']
                          , [2, 3, 'another description', 'some other text', '', pd.to_datetime('2019-01-02')]
                          , [3, 4, 'a third descirption', '', pd.to_datetime('2019-01-03'), pd.to_datetime('2019-01-04')]]
                          , columns=['random_col_1','random_col_2', 'short_desc', 'date_1', 'date_2', 'random_col_3'])

Expected output:

df_expected = pd.DataFrame([[1, 2, 'some description some more text', pd.to_datetime('2019-01-01'), '', '']
                          , [2, 3, 'another description some other text', pd.to_datetime('2019-01-02'), '', '']
                          , [3, 4, 'a third descirption', pd.to_datetime('2019-01-03'), pd.to_datetime('2019-01-04'), '']]
                          , columns=['random_col_1','random_col_2', 'short_desc', 'date_1', 'date_2', 'random_col_3'])

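2 answers:

Answer 0: (score: 2)

Here is an example using apply. These are the assumptions I had to make:

  1. I assume that the only column meant to hold string objects is 'short_desc'. Otherwise it is hard to tell which text belongs in 'short_desc' and which does not, because I cannot see a regular pattern in the misaligned data.

  2. I also assume that you have two dates, which are shifted only when needed, and that your 'random_col_3' was produced by the bad read, so I drop it at the end.

If the actual column names differ from the posted example, you may need to adjust them.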

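import pandas as pd

def fixdb(row):
    # Work on a copy rather than mutating the object that apply passes in.
    row = row.copy()
    # Every string in the row is treated as description text (assumption 1).
    found = [x for x in row if isinstance(x, str)]
    if len(found) > 1:
        # Skip empty strings so the joined text has no stray spaces.
        row['short_desc'] = ' '.join(s for s in found if s)
        # Timestamps in column order; the first two become date_1 and date_2.
        dates = [x for x in row if isinstance(x, pd.Timestamp)]

        try:
            row['date_1'] = dates[0]
        except IndexError:
            row['date_1'] = pd.NaT

        try:
            row['date_2'] = dates[1]
        except IndexError:
            row['date_2'] = pd.NaT

    return row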

df_out = df_so_test.apply(fixdb, axis=1).drop('random_col_3', axis=1)

This is df_out for the sample data provided:

   random_col_1  random_col_2                            short_desc     date_1     date_2
0             1             2       some description some more text 2019-01-01        NaT
1             2             3  another description some other text  2019-01-02        NaT
2             3             4                  a third descirption  2019-01-03 2019-01-04
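
As a sketch of how the same idea could be adapted when the real column names differ from the posted example, the target columns can be passed in as parameters instead of being hard-coded. The helper name `fix_row`, its parameters `text_col` and `date_cols`, and the small frame below are hypothetical and chosen only for illustration. The sketch still relies on assumption 1: every non-empty string in a row is treated as description text.

```python
import pandas as pd

def fix_row(row, text_col, date_cols):
    """Hypothetical helper: gather text into text_col and timestamps into date_cols."""
    # Non-empty strings, in column order, make up the description.
    texts = [v for v in row if isinstance(v, str) and v]
    # Timestamps, in column order, fill the date columns from left to right.
    dates = [v for v in row if isinstance(v, pd.Timestamp)]
    out = row.copy()
    out[text_col] = " ".join(texts)
    for i, col in enumerate(date_cols):
        out[col] = dates[i] if i < len(dates) else pd.NaT
    return out

# Illustrative data with the same kind of misalignment as the question.
df = pd.DataFrame(
    [[1, "some description", pd.Timestamp("2019-01-01"), "some more text", ""],
     [2, "a third description", "", pd.Timestamp("2019-01-03"), pd.Timestamp("2019-01-04")]],
    columns=["id", "short_desc", "date_1", "date_2", "extra"],
)

# Extra keyword arguments to apply are forwarded to the function.
fixed = df.apply(fix_row, axis=1, text_col="short_desc", date_cols=["date_1", "date_2"])
# The overflow column is dropped once its contents have been moved.
fixed = fixed.drop(columns="extra")
print(fixed)
```

Unlike fixdb, this version rewrites every row, including rows that have only one text fragment, so the date columns are always shifted left when needed.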

Answer 1: (score: 2)

Here is one way to do it:

import re

import pandas as pd

def f(y):
    # Keep text only: missing values and timestamps become empty strings.
    return ['' if pd.isnull(x) or isinstance(x, pd.Timestamp) else x for x in y]

res = df_so_test[df_so_test.columns[2:]].apply(f)
res["new"] = res["short_desc"] + " " + res["date_1"] + " " + res["date_2"] + " " + res["random_col_3"]
# Collapse the whitespace left behind by the blanked-out cells.
df_so_test["short_desc"] = res["new"].apply(lambda x: re.sub(r"\s+", " ", x).strip())

# For each row, keep only the timestamps from the last three columns, in order.
res = df_so_test[df_so_test.columns[3:]].to_numpy()
times = [[x for x in y if isinstance(x, pd.Timestamp)] for y in res]
# Pad every row with empty strings up to three values.
for a in times:
    a.extend([''] * (3 - len(a)))

df_expected = df_so_test.copy()
df_expected[df_expected.columns[-3:]] = times

print(df_expected)

Output:

   random_col_1  random_col_2                            short_desc     date_1     date_2 random_col_3
0             1             2      some description some more text  2019-01-01        NaT
1             2             3  another description some other text  2019-01-02        NaT
2             3             4                  a third descirption  2019-01-03 2019-01-04