Question

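I have data in a CSV file in the following format (a single column in a dataframe). It is essentially an outline like one in a Word document: the letters shown here are main headings, and the numbered items are subheadings: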

A
1
2
3
B
1
2
C
1
2
3
4

I would like to convert this to the following format (two columns in a dataframe):

A 1
A 2
A 3
B 1
B 2
C 1
C 2
C 3
C 4

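I am using pandas read_csv to load the data into a dataframe, and I tried to reformat it with a for loop, but because the subheading values repeat, entries get overwritten and I am stuck. For example, A 3 is overwritten by C 3 later in the loop (so I end up with two instances of C 3, where only one is wanted, and A 3 is lost entirely). What is the best way to do this?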

Apologies for the poor formatting; I am new to this site.

Answer 1

Use:

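import pandas as pd

# if the CSV has no header row, supply the column name via the names parameter
df = pd.read_csv(file, names=['col'])

# new first column: numeric rows are masked to NaN, then filled with the heading above them
df.insert(0, 'a', df['col'].mask(df['col'].str.isnumeric()).ffill())

# drop the heading rows, where both columns hold the same value
df = df[df['a'] != df['col']]
print (df)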
    a col
1   A   1
2   A   2
3   A   3
5   B   1
6   B   2
8   C   1
9   C   2
10  C   3
11  C   4

Details:

Check which values are numeric with str.isnumeric:

print (df['col'].str.isnumeric())
0     False
1      True
2      True
3      True
4     False
5      True
6      True
7     False
8      True
9      True
10     True
11     True
Name: col, dtype: bool

Replace the values where the check is True with NaN using mask, then forward-fill the missing values:

print (df['col'].mask(df['col'].str.isnumeric()).ffill())
0     A
1     A
2     A
3     A
4     B
5     B
6     B
7     C
8     C
9     C
10    C
11    C
Name: col, dtype: object

Add the new column at the first position with DataFrame.insert:

df.insert(0, 'a', df['col'].mask(df['col'].str.isnumeric()).ffill())
print (df)
    a col
0   A   A
1   A   1
2   A   2
3   A   3
4   B   B
5   B   1
6   B   2
7   C   C
8   C   1
9   C   2
10  C   3
11  C   4

Finally, remove the rows where both columns hold the same value (the heading rows) with boolean indexing.
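
For anyone who wants to try the steps above without a file on disk, here is a minimal end-to-end sketch. It assumes the sample data from the question, fed in through io.StringIO; the names CSV_TEXT and outline_to_two_columns are made up for illustration, and dtype=str plus reset_index are small additions that are not part of the answer's original code.

```python
import io

import pandas as pd

# Sample outline from the question, inlined so the sketch runs without a file.
CSV_TEXT = "A\n1\n2\n3\nB\n1\n2\nC\n1\n2\n3\n4\n"


def outline_to_two_columns(csv_text):
    """Turn a one-column outline into (heading, subheading) rows.

    Hypothetical helper: assumes headings are non-numeric strings and
    subheadings are purely numeric strings, as in the question.
    """
    # dtype=str keeps every cell a string so .str.isnumeric() applies to all rows
    df = pd.read_csv(io.StringIO(csv_text), names=["col"], dtype=str)

    # True for subheading rows, False for heading rows
    is_sub = df["col"].str.isnumeric()

    # Blank out the subheadings, then carry each heading down over them
    df.insert(0, "a", df["col"].mask(is_sub).ffill())

    # Heading rows have the same value in both columns; drop them
    return df[df["a"] != df["col"]].reset_index(drop=True)


result = outline_to_two_columns(CSV_TEXT)
print(result)
```

Because the heading is carried forward row by row instead of being used as a key, repeated subheading numbers under different headings cannot overwrite each other, which is what went wrong with the loop in the question.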
