After importing a *.txt file via read_csv, I currently have the following data structure in a pandas DataFrame:
label text
0 ###24293578 NaN
1 INTRO Some text...
2 METHODS Some text...
3 METHODS Some text...
4 METHODS Some text...
5 RESULTS Some text...
6 ###24854809 NaN
7 BACKGROUND Some text...
8 INTRO Some text...
9 METHODS Some text...
10 METHODS Some text...
11 RESULTS Some text...
12 ###25165090 NaN
13 BACKGROUND Some text...
14 METHODS Some text...
...
What I want to achieve is a running ID for every row, retrieved from the rows whose label is marked with "###":
id label text
24293578 INTRO Some text...
24293578 METHODS Some text...
24293578 ... ...
24854809 BACKGROUND Some text...
24854809 ... ...
25165090 BACKGROUND Some text...
25165090 ... ...
I currently transform the data with the following code:
# rows whose label contains "###" are the ID marker rows
m = df['label'].str.contains("###", na=False)
# carry the marker label down to the rows that follow it
df['new'] = df['label'].where(m).ffill()
# drop the marker rows themselves
df = df[df['label'] != df['new']].copy()
# prepend the ID (without '#') to the label, then split it back apart
df['label'] = df.pop('new').str.lstrip('#') + ' ' + df['label']
df[['id','area']] = df['label'].str.split(' ', expand=True)
df = df.drop(columns=['label'])
df
Out:
text id area
1 Some text... 24293578 OBJECTIVE
...
6 Some text... 24854809 BACKGROUND
...
It gets the job done, but I feel it is not the best way to do it. Is there a more concise or more efficient way to write this? I am also curious whether such a transformation could be embedded directly into the read_csv step.
Thanks!
Answer 0 (score: 2)
You can do this in 3 steps:
# Put the label into id where text is null, stripping the leading '#';
# all other rows get NaN
df['id'] = df.loc[df['text'].isnull(), 'label'].str.strip('#')
# forward-fill the id
df['id'] = df['id'].ffill()
# Remove the rows where text is null (the ID marker rows)
df.dropna(subset=['text'], inplace=True)
>>> df
label text id
1 INTRO Some text... 24293578
2 METHODS Some text... 24293578
3 METHODS Some text... 24293578
4 METHODS Some text... 24293578
5 RESULTS Some text... 24293578
7 BACKGROUND Some text... 24854809
8 INTRO Some text... 24854809
9 METHODS Some text... 24854809
10 METHODS Some text... 24854809
11 RESULTS Some text... 24854809
13 BACKGROUND Some text... 25165090
14 METHODS Some text... 25165090
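
Regarding the read_csv part of the question: as far as I know, read_csv has no built-in hook for this kind of row-wise restructuring, but you can wrap the steps above in a small function and chain it right after the read with DataFrame.pipe. A minimal sketch, assuming the file is tab-separated with no header and the columns are named label and text (adjust sep, names and the file name to your actual data):

import pandas as pd

def extract_ids(df):
    # move the '###' marker labels into an id column and drop the marker rows
    df = df.copy()
    df['id'] = df.loc[df['text'].isnull(), 'label'].str.strip('#')
    df['id'] = df['id'].ffill()
    return df.dropna(subset=['text'])

# assumed read_csv arguments -- adjust sep / names to the actual .txt layout
df = (pd.read_csv('data.txt', sep='\t', names=['label', 'text'])
        .pipe(extract_ids))

This keeps the cleanup in one place and reads as a single pipeline, even though the work still happens after read_csv returns.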