Question

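I have a CSV file in which all the data sits in a single column, and I want to split the numeric data in that column into several columns. After reading the file into a DataFrame, the data looks like this: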

      0
0     13:25:09 -> mm [ -5,  4,  15 ] dd [ 4, 77, 8 ]
1     13:25:09 -> mm [ -4,  9,  10 ] dd [ 8, 6, 10 ]
2     13:25:09 -> mm [ 0,  -4,  19 ] dd [ 3, 1, 66 ]

How can I do this?

Answer 1

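I believe you need Series.str.extractall and Series.unstack. Use the pattern -?\d+ rather than \d+, otherwise the minus signs are dropped:

df = df[0].str.extractall(r'(-?\d+)')[0].unstack()
print (df)
match   0   1   2   3   4   5  6   7   8
0      13  25  09  -5   4  15  4  77   8
1      13  25  09  -4   9  10  8   6  10
2      13  25  09   0  -4  19  3   1  66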
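The values returned by extractall are strings (note the leading zero in 09). As a minimal sketch, assuming integers are wanted and that the sign of each number matters, the same approach can be extended as follows. The column names are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame([
    "13:25:09 -> mm [ -5,  4,  15 ] dd [ 4, 77, 8 ]",
    "13:25:09 -> mm [ -4,  9,  10 ] dd [ 8, 6, 10 ]",
    "13:25:09 -> mm [ 0,  -4,  19 ] dd [ 3, 1, 66 ]",
])

# -? keeps an optional minus sign. The "-" in "->" is not followed by a
# digit, so it is never matched.
nums = df[0].str.extractall(r"(-?\d+)")[0].unstack().astype(int)

# Hypothetical column names, chosen only for readability.
nums.columns = ["hour", "minute", "second",
                "mm_x", "mm_y", "mm_z",
                "dd_x", "dd_y", "dd_z"]
print(nums)
```

This relies on every row containing exactly nine numbers; rows with fewer matches would produce NaN after unstack, and astype(int) would then fail.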

Answer 2

Given this CSV content:

csvfile = '''13:25:09 -> mm [ -5,  4,  15 ] dd [ 4, 77, 8 ]
13:25:09 -> mm [ -4,  9,  10 ] dd [ 8, 6, 10 ]
13:25:09 -> mm [ 0,  -4,  19 ] dd [ 3, 1, 66 ]'''

Wrong result

Doing this:

import pandas as pd

lines = csvfile.split('\n')
df = pd.DataFrame(lines)

gives you the wrong result, a single column of raw strings:

                                                0
0  13:25:09 -> mm [ -5,  4,  15 ] dd [ 4, 77, 8 ]
1  13:25:09 -> mm [ -4,  9,  10 ] dd [ 8, 6, 10 ]
2  13:25:09 -> mm [ 0,  -4,  19 ] dd [ 3, 1, 66 ]

Better result

Instead, do this:

import pandas as pd

lines = csvfile.split('\n')

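# Fixed-width slices: these rely on every line having exactly this layout.
df = pd.DataFrame({'id': range(1, len(lines) + 1),
                   'time': [line[:8] for line in lines],    # HH:MM:SS
                   'mm': [line[15:30] for line in lines],   # bracketed mm list
                   'dd': [line[34:50] for line in lines]})  # bracketed dd list

This gives you: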

   id      time               mm            dd
0   1  13:25:09  [ -5,  4,  15 ]  [ 4, 77, 8 ]
1   2  13:25:09  [ -4,  9,  10 ]  [ 8, 6, 10 ]
2   3  13:25:09  [ 0,  -4,  19 ]  [ 3, 1, 66 ]
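The slice offsets only work while every line has exactly the same width. As a hedged alternative, not part of the original answer, the same three fields can be pulled out with Series.str.extract and named groups. The pattern below is an assumption about the line format (a time, then a bracketed list after "mm", then one after "dd").

```python
import pandas as pd

csvfile = '''13:25:09 -> mm [ -5,  4,  15 ] dd [ 4, 77, 8 ]
13:25:09 -> mm [ -4,  9,  10 ] dd [ 8, 6, 10 ]
13:25:09 -> mm [ 0,  -4,  19 ] dd [ 3, 1, 66 ]'''

lines = pd.Series(csvfile.split('\n'))

# Each named group becomes a column of the result.
pattern = (
    r"(?P<time>\d{2}:\d{2}:\d{2})"   # e.g. 13:25:09
    r".*?mm\s*(?P<mm>\[[^\]]*\])"    # bracketed list after "mm"
    r".*?dd\s*(?P<dd>\[[^\]]*\])"    # bracketed list after "dd"
)
parts = lines.str.extract(pattern)
parts.insert(0, "id", list(range(1, len(parts) + 1)))
print(parts)
```

Lines that do not match the pattern come back as NaN in every column, which makes malformed rows easy to spot.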

What if I want integers instead of strings?

Note that mm is a string:

print(type(df['mm'][0]))
<class 'str'>

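A list of integers would be better:

# regex=False treats '[' and ']' as literal characters, not regex syntax
df['mm_list'] = df['mm'].str.replace('[', '', regex=False).str.replace(']', '', regex=False).str.split(',').values.tolist()
df['mm_list_int'] = [[int(i) for i in x] for x in df['mm_list']]
df

This adds a new column, mm_list_int: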

   id      time               mm            dd            mm_list  mm_list_int
0   1  13:25:09  [ -5,  4,  15 ]  [ 4, 77, 8 ]  [ -5,   4,   15 ]  [-5, 4, 15]
1   2  13:25:09  [ -4,  9,  10 ]  [ 8, 6, 10 ]  [ -4,   9,   10 ]  [-4, 9, 10]
2   3  13:25:09  [ 0,  -4,  19 ]  [ 3, 1, 66 ]  [ 0,   -4,   19 ]  [0, -4, 19]

The types are now correct:

print(type(df['mm_list_int'][0]))
<class 'list'>

print(type(df['mm_list_int'][0][0]))
<class 'int'>

So this is a list of integers.
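Because each mm string already looks like a Python list literal, another option is ast.literal_eval from the standard library. This is a sketch under that assumption, not the original answer's method. It would fail on values with leading zeros such as 09, which are not valid Python integer literals.

```python
import ast

import pandas as pd

df = pd.DataFrame({"mm": ["[ -5,  4,  15 ]",
                          "[ -4,  9,  10 ]",
                          "[ 0,  -4,  19 ]"]})

# literal_eval safely parses each string as a Python literal (here, a list of ints).
df["mm_list_int"] = df["mm"].apply(ast.literal_eval)
print(df["mm_list_int"].tolist())
# [[-5, 4, 15], [-4, 9, 10], [0, -4, 19]]
```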

What if I want the three mm values in separate columns?

Use:

objs = [df, pd.DataFrame(df['mm_list_int'].tolist(), columns=['mm_x', 'mm_y', 'mm_z'])]
df_final = pd.concat(objs, axis=1)
df_final = df_final[['id', 'time', 'mm', 'dd', 'mm_x', 'mm_y', 'mm_z']]

to get:

   id      time               mm            dd  mm_x  mm_y  mm_z
0   1  13:25:09  [ -5,  4,  15 ]  [ 4, 77, 8 ]    -5     4    15
1   2  13:25:09  [ -4,  9,  10 ]  [ 8, 6, 10 ]    -4     9    10
2   3  13:25:09  [ 0,  -4,  19 ]  [ 3, 1, 66 ]     0    -4    19
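One caveat: pd.concat with axis=1 aligns rows by index label, so it only lines up here because both frames use the default 0, 1, 2 index. A minimal sketch, assuming the DataFrame might carry a different index (the index values below are made up for illustration), passes df.index to the new frame and uses DataFrame.join:

```python
import pandas as pd

df = pd.DataFrame(
    {"id": [1, 2, 3],
     "mm_list_int": [[-5, 4, 15], [-4, 9, 10], [0, -4, 19]]},
    index=[10, 20, 30],  # deliberately not the default 0..n-1 index
)

mm_cols = pd.DataFrame(
    df["mm_list_int"].tolist(),
    columns=["mm_x", "mm_y", "mm_z"],
    index=df.index,  # reuse df's index so the rows line up
)
df_final = df.join(mm_cols)
print(df_final)
```

Without index=df.index the new frame would be labelled 0, 1, 2 and the joined mm columns would all be NaN.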

Final touch

Do the same for dd:

df['dd_list'] = df['dd'].str.replace('[', '', regex=False).str.replace(']', '', regex=False).str.split(',').values.tolist()
df['dd_list_int'] = [[int(i) for i in x] for x in df['dd_list']]

objs = [df, 
        pd.DataFrame(df['mm_list_int'].tolist(), columns=['mm_x', 'mm_y', 'mm_z']),
        pd.DataFrame(df['dd_list_int'].tolist(), columns=['dd_x', 'dd_y', 'dd_z'])]
df_final = pd.concat(objs, axis=1)
df_final = df_final[['id', 'time', 'mm_x', 'mm_y', 'mm_z', 'dd_x', 'dd_y', 'dd_z']]

Final result

   id      time  mm_x  mm_y  mm_z  dd_x  dd_y  dd_z
0   1  13:25:09    -5     4    15     4    77     8
1   2  13:25:09    -4     9    10     8     6    10
2   3  13:25:09     0    -4    19     3     1    66

How to split a single CSV cell into DataFrame columns based on a delimiter
