How can I create a new DataFrame by tokenizing the contents of an existing DataFrame?

Time: 2019-07-16 08:22:37

Tags: python pandas dataframe

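I'm new to Python / pandas and need the community's help. Here is what I'm trying to do.

I have read a JSON file that contains the following data:

  1. content (the text of the article)
  2. ID (unique identifier)
  3. title (the article title)

using this code: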

import pandas as pd
df = pd.read_json(path_to_file, lines=True)
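
For readers without the original file, here is a minimal sketch of what `lines=True` expects: one JSON object per line. The sample records and the in-memory `io.StringIO` buffer are assumptions made for illustration; the real file is not shown in the question.

```python
import io

import pandas as pd

# Hypothetical sample in JSON Lines format: one JSON object per line.
raw = (
    '{"ID": "A1", "title": "First", "content": "Hi there. How are you?"}\n'
    '{"ID": "B2", "title": "Second", "content": "One sentence only."}\n'
)

# lines=True tells read_json to parse each line as a separate record.
df = pd.read_json(io.StringIO(raw), lines=True)

print(df.columns.tolist())
print(len(df))
```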

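Desired output: I want to create a new DataFrame with two columns:

  1. ID (unique identifier)
  2. sentence (the "content" column of df split into sentences)

So far, this is what I have managed:

I found the sentence tokenizer in nltk and worked out how to pass it to apply: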

  from nltk.tokenize import sent_tokenize
  result = df["content"].apply(sent_tokenize)

My question is how to get the result into the desired format described above.

2 answers:

Answer 0: (score: 1)

Use itertuples to iterate over the DataFrame.

import pandas as pd
from nltk.tokenize import sent_tokenize
df = pd.DataFrame([['hi how are you. i am fine. hope this help you','ABC']], columns = ['sent','ID'])

 df
                                              sent  ID
 0   hi how are you. i am fine. hope this help you  ABC

new_sent = []
for row in df.itertuples():
    # row[0] is the index, row[1] the 'sent' text, row[2] the 'ID'
    for sent in sent_tokenize(row[1]):
        new_sent.append((sent, row[2]))

# create a dataframe from new_sent
df_new = pd.DataFrame(new_sent, columns = ['tokenized_sent', 'ID'])
#o/p

    tokenized_sent      ID
0   hi how are you.     ABC
1   i am fine.          ABC
2   hope this help you  ABC

Explanation

for row in df.itertuples():
    print(row)

#o/p
Pandas(Index=0, sent='hi how are you. i am fine. hope this help you', ID='ABC')

print(row[0])
0

print(row[1])
'hi how are you. i am fine. hope this help you'

print(row[2])
'ABC'
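
As a side note, the rows yielded by itertuples are named tuples, so the same fields can also be read by column name instead of by position. This is a small sketch reusing the sample frame from this answer; the `rows` list is only there for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [["hi how are you. i am fine. hope this help you", "ABC"]],
    columns=["sent", "ID"],
)

rows = list(df.itertuples())
for row in rows:
    # Position 0 is the index; the columns follow in their original order.
    assert row[0] == row.Index
    assert row[1] == row.sent
    assert row[2] == row.ID
    print(row.Index, row.ID, row.sent)
```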

Now we tokenize the second element (row[1]) and append each sentence, together with its ID, to new_list:

new_list = []
for sent in sent_tokenize(row[1]):
    new_list.append((sent, row[2]))
    print((sent, row[2]))

#o/p
('hi how are you.', 'ABC')
('i am fine.', 'ABC')
('hope this help you', 'ABC')

# now create a dataframe from new_list
df_new = pd.DataFrame(new_list, columns = ['tokenized_sent', 'ID'])
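
The steps above could be wrapped in a small helper, sketched here under some assumptions. `split_sentences` is a hypothetical regex-based stand-in so the example runs without downloading nltk data; in practice nltk's `sent_tokenize` would be passed as `tokenizer`. The names `sentences_per_id`, `text_col` and `id_col` are also made up for this sketch.

```python
import re

import pandas as pd


def split_sentences(text):
    # Hypothetical stand-in for nltk's sent_tokenize: split on whitespace
    # that follows '.', '!' or '?'. It does not handle abbreviations.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentences_per_id(df, text_col, id_col, tokenizer=split_sentences):
    # Build one (ID, sentence) pair per sentence, then wrap them in a DataFrame.
    pairs = [
        (doc_id, sentence)
        for doc_id, text in zip(df[id_col], df[text_col])
        for sentence in tokenizer(text)
    ]
    return pd.DataFrame(pairs, columns=[id_col, "sentence"])


df = pd.DataFrame(
    [["hi how are you. i am fine. hope this help you", "ABC"]],
    columns=["sent", "ID"],
)
df_new = sentences_per_id(df, text_col="sent", id_col="ID")
print(df_new)
```

Iterating over the two columns with zip avoids depending on column positions, which is what row[1] and row[2] rely on.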

Answer 1: (score: 0)

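You can assign the return value of apply as a new column in df:

df["sentence"] = df["content"].apply(sent_tokenize) 

If you want to drop the other columns (title and content), you can also do that by assignment: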

df = df[["ID", "sentence"]]
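
Note that after this assignment each cell of "sentence" holds a list of sentences rather than a single sentence. One possible way to get one sentence per row, assuming pandas 0.25 or later, is `DataFrame.explode`. The sketch below uses hypothetical sample data and a regex-based `split_sentences` as a stand-in for nltk's `sent_tokenize`, so that it runs without nltk data.

```python
import re

import pandas as pd


def split_sentences(text):
    # Hypothetical stand-in for nltk's sent_tokenize (naive regex split).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Hypothetical sample data with the three columns described in the question.
df = pd.DataFrame({
    "ID": ["A1", "B2"],
    "title": ["First", "Second"],
    "content": ["Hi there. How are you?", "One sentence only."],
})

# Each cell of "sentence" is a list of sentences at this point.
df["sentence"] = df["content"].apply(split_sentences)

# explode turns every list element into its own row, repeating the ID.
out = df[["ID", "sentence"]].explode("sentence").reset_index(drop=True)
print(out)
```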