Parsing unstructured data into a pandas DataFrame

Time: 2018-11-09 17:30:51

Tags: pandas indexing transformation

After importing a *.txt file with read_csv, I currently have the following data structure in a pandas DataFrame:

    label   text
0   ###24293578 NaN
1   INTRO   Some text...
2   METHODS Some text...
3   METHODS Some text...
4   METHODS Some text...
5   RESULTS Some text...
6   ###24854809 NaN
7   BACKGROUND  Some text...
8   INTRO   Some text...
9   METHODS Some text...
10  METHODS Some text...
11  RESULTS Some text...
12  ###25165090 NaN
13  BACKGROUND  Some text...
14  METHODS Some text...
...

What I want to achieve is a running index for each row, taken from the IDs marked with "###":

id        label       text
24293578  INTRO       Some text...
24293578  METHODS     Some text...
24293578  ...         ...
24854809  BACKGROUND  Some text...
24854809  ...         ...
25165090  BACKGROUND  Some text...
25165090  ...         ...

I currently use the following code to transform the data:

m = df['label'].str.contains("###", na=False) 
df['new'] = df['label'].where(m).ffill()
df = df[df['label'] != df['new']].copy()
df['label'] = df.pop('new').str.lstrip('#') + ' ' + df['label']
df[['id','area']] = df['label'].str.split(' ',expand=True)
df = df.drop(columns=['label'])
df

Out:

    text            id          area
1   Some text...    24293578    OBJECTIVE
...
6   Some text...    24854809    BACKGROUND
...

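It gets the job done, but I feel this is not the best approach. Is there a way to write this more concisely or make it more efficient? I am also curious whether a function could be embedded directly into the read_csv step.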

Thanks!

1 answer:

Answer 0 (score: 2)

You can do this in 3 steps:

# Copy the label into id where text is null, stripping out the '#'.
# All other rows will be NaN.
df['id'] = df.loc[df['text'].isnull(), 'label'].str.strip('#')

# Forward-fill the ID (assign the result back; calling ffill(inplace=True)
# on a selected column is not reliable in recent pandas versions)
df['id'] = df['id'].ffill()

# Remove the rows where text is null
df.dropna(subset=['text'], inplace=True)

>>> df
         label          text        id
1        INTRO  Some text...  24293578
2      METHODS  Some text...  24293578
3      METHODS  Some text...  24293578
4      METHODS  Some text...  24293578
5      RESULTS  Some text...  24293578
7   BACKGROUND  Some text...  24854809
8        INTRO  Some text...  24854809
9      METHODS  Some text...  24854809
10     METHODS  Some text...  24854809
11     RESULTS  Some text...  24854809
13  BACKGROUND  Some text...  25165090
14     METHODS  Some text...  25165090
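
On the read_csv part of the question: as far as I know there is no hook in read_csv for this kind of row-dependent transformation, but the three steps above can be wrapped in a function and chained onto the read with DataFrame.pipe. The following is only a sketch under assumptions: the helper name assign_ids, the tab separator, and the inline sample data are all hypothetical stand-ins for the real file.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real *.txt file.
# A tab-separated layout is an assumption, not something stated in the question.
RAW = (
    "###24293578\n"
    "INTRO\tSome text...\n"
    "METHODS\tSome text...\n"
    "###24854809\n"
    "BACKGROUND\tSome text...\n"
    "RESULTS\tSome text...\n"
)


def assign_ids(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: turn '###<id>' marker rows into a forward-filled id column."""
    # Marker rows are the ones that have a label but no text.
    is_marker = df["text"].isna()
    # Keep the label only on marker rows, strip the leading '#', then fill downwards.
    ids = df["label"].where(is_marker).str.lstrip("#").ffill()
    # Attach the ids, drop the marker rows, and put id first.
    return (
        df.assign(id=ids)
        .loc[~is_marker, ["id", "label", "text"]]
        .reset_index(drop=True)
    )


df = pd.read_csv(
    io.StringIO(RAW), sep="\t", header=None, names=["label", "text"]
).pipe(assign_ids)
print(df)
```

Note that the ids stay strings here; if numeric ids are needed, something like df['id'].astype(int) could be added at the end.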