Question

I may receive text files with a variable number of columns. The data looks like this:

1,a,x
2,b
3,c,y,z

Now I have to load all the rows into a database such as Postgres or SQL Server. The table schema is as follows:

Table : test
columns : col1 (nvarchar(max)),col2 (nvarchar(max)),col3 (nvarchar(max))

The data should be loaded as follows:

col1   col2   col3
1       a       x
2       b       Null
3       c       y,z

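These are the loading rules:
1) If a line in the file has fewer columns than the table, the missing columns should be set to null.
2) If a line in the file has more columns than the table, all the extra data should be kept in the last column.

Can someone suggest the best way to achieve this?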

Answer 1

Try the following, which reads the file with pd.read_fwf:

import pandas as pd

df = pd.read_fwf(filename, delimiter=',', header=None)

Now we have to join every column from the third one onward into col3:

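# cast columns 2+ to str, join each row with commas, then strip the ',nan' padding
df.iloc[:,2] = df.iloc[:,2:].astype(str).apply(tuple, axis=1).str.join(',').str.replace(',nan', '')

# keep only the first three columns and name them after the table
df = df.iloc[:,:3]
df.columns = ['col1', 'col2', 'col3']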
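
One caveat: read_fwf is meant for fixed-width files and infers column positions from the first rows, so the call above lines up only when every field has the same width, as in the single-character sample. A minimal alternative sketch, assuming plain comma-separated lines with no quoted fields, is to split each line at most twice with Series.str.split. The lines list below is sample data standing in for the file's contents.

```python
import pandas as pd

# Sample lines from the question (assumption: in practice these would be
# read from the file, e.g. with open(filename).read().splitlines()).
lines = ["1,a,x", "2,b", "3,c,y,z", "4,d,s,f,d,s"]

# n=2 allows at most two splits, so everything after the second comma
# stays together in the third piece.
df = pd.Series(lines).str.split(",", n=2, expand=True)

# Guarantee exactly three columns even if no line had a third field.
df = df.reindex(columns=range(3))
df.columns = ["col1", "col2", "col3"]
```

With this approach a missing third field comes back as a missing value (isna() is True) rather than as text.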

Example

Data in filename:

1,a,x
2,b
3,c,y,z
4,d,s,f,d,s

The DataFrame after reading the file with pd.read_fwf:

    0   1   2   3   4   5
0   1   a   x   NaN NaN NaN
1   2   b   NaN NaN NaN NaN
2   3   c   y   z   NaN NaN
3   4   d   s   f   d   s

Output after the operations above:

   col1  col2   col3
0   1      a    x
1   2      b    nan
2   3      c    y,z
3   4      d    s,f,d,s
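
Note that in the output above, row 1 holds the text nan in col3 rather than a real null, and the answer stops before the database step. The following is a rough sketch of the whole load using only the standard library. It assumes plain comma-separated lines without quoted fields, and it uses an in-memory SQLite database purely as a stand-in for Postgres or SQL Server; parse_line and load_lines are illustrative names, not part of any library.

```python
import sqlite3

COLUMNS = ("col1", "col2", "col3")  # target table layout from the question


def parse_line(line, n_cols=len(COLUMNS)):
    """Split on at most n_cols - 1 commas and pad short rows with None."""
    parts = line.rstrip("\r\n").split(",", n_cols - 1)
    return parts + [None] * (n_cols - len(parts))


def load_lines(conn, lines):
    """Insert every non-blank line into the test table; return the row count."""
    rows = [parse_line(line) for line in lines if line.strip()]
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.executemany(
        f"INSERT INTO test ({', '.join(COLUMNS)}) VALUES ({placeholders})", rows
    )
    conn.commit()
    return len(rows)


# Assumption: in-memory SQLite replaces the real server for this demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (col1 TEXT, col2 TEXT, col3 TEXT)")

sample = ["1,a,x\n", "2,b\n", "3,c,y,z\n", "4,d,s,f,d,s\n"]
inserted = load_lines(conn, sample)
loaded = conn.execute("SELECT col1, col2, col3 FROM test ORDER BY col1").fetchall()
```

None is passed to the driver as SQL NULL, so short rows end up with real nulls. With a real server the same idea should carry over through a DB-API driver, though the placeholder style differs by driver (psycopg2, for example, uses %s instead of ?).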

Load a delimited file with a variable number of columns into a database using Python
