Question

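I'm new to Python/pandas and need the community's help. Here is what I'm trying to do.

I have read in a JSON file that contains the following data:

content (of the article)
ID (unique identifier)
title (title of the article)

using this code: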

import pandas as pd
df = pd.read_json(path_to_file, lines=True)

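Desired output: I want to create a new dataframe with two columns:

ID (unique identifier)
sentence (the "content" column of df split into sentences)

So far, I have been able to:

Find the sent_tokenize tokenizer from nltk and work out how to pass it to apply: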

  result = df["content"].apply(sent_tokenize)

My question is how to get the result into the desired format described above.

Answer 1

Iterate over the dataframe with itertuples:

import pandas as pd
from nltk.tokenize import sent_tokenize

df = pd.DataFrame([['hi how are you. i am fine. hope this help you', 'ABC']], columns=['sent', 'ID'])

 df
                                              sent  ID
 0   hi how are you. i am fine. hope this help you  ABC

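# collect one (sentence, ID) pair per sentence
new_sent = []
for row in df.itertuples():
    # row[0] is the index, row[1] the 'sent' column, row[2] the 'ID' column
    for sent in sent_tokenize(row[1]):
        new_sent.append((sent, row[2]))

# create a dataframe from new_sent
df_new = pd.DataFrame(new_sent, columns=['tokenized_sent', 'ID'])
# output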

    tokenized_sent      ID
0   hi how are you.     ABC
1   i am fine.          ABC
2   hope this help you  ABC

Explanation

for row in df.itertuples():
    print(row)

# output
Pandas(Index=0, sent='hi how are you. i am fine. hope this help you', ID='ABC')

print(row[0])
0

print(row[1])
hi how are you. i am fine. hope this help you

print(row[2])
ABC

Now we tokenize the second element (row[1]) and append each sentence, together with its ID, to new_sent:

new_sent = []
for sent in sent_tokenize(row[1]):
    new_sent.append((sent, row[2]))
    print((sent, row[2]))

# output
('hi how are you.', 'ABC')
('i am fine.', 'ABC')
('hope this help you', 'ABC')

# now create a dataframe from new_sent
df_new = pd.DataFrame(new_sent, columns=['tokenized_sent', 'ID'])
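The same loop can also be written with named fields instead of positional indices, which may be easier to read. The following is only a sketch: split_sentences is a hypothetical regex-based stand-in for nltk's sent_tokenize, used here so the example runs without downloading nltk data, and it will not handle abbreviations or other edge cases the way a real tokenizer does.

```python
import re
import pandas as pd

def split_sentences(text):
    # Hypothetical stand-in for nltk.tokenize.sent_tokenize:
    # split on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

df = pd.DataFrame(
    [["hi how are you. i am fine. hope this help you", "ABC"]],
    columns=["sent", "ID"],
)

# index=False drops the index from each tuple, so only the
# 'sent' and 'ID' fields remain and can be read by name.
rows = [
    (sentence, row.ID)
    for row in df.itertuples(index=False)
    for sentence in split_sentences(row.sent)
]

df_new = pd.DataFrame(rows, columns=["tokenized_sent", "ID"])
print(df_new)
```

With nltk installed and its sentence tokenizer data available, sent_tokenize can be passed in place of split_sentences without changing the rest of the code.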

Answer 2

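You can assign the return value of apply as a new column in df:

df["sentence"] = df["content"].apply(sent_tokenize)

Each cell of the new column then holds the list of sentences for that row. If you want to drop the other columns (title and content), you can also do that by assignment:

df = df[["ID", "sentence"]]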
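To get from a column of sentence lists to one row per sentence, as the question asks, one option is DataFrame.explode (available in pandas 0.25 and later). This is a sketch, not part of the original answer: the sample rows are made up for illustration, and split_sentences is a hypothetical regex-based stand-in for nltk's sent_tokenize so the example runs without nltk data.

```python
import re
import pandas as pd

def split_sentences(text):
    # Hypothetical stand-in for nltk.tokenize.sent_tokenize.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Made-up sample data with the three fields from the question.
df = pd.DataFrame({
    "ID": [1, 2],
    "title": ["first", "second"],
    "content": ["One. Two.", "Three!"],
})

exploded = (
    df.assign(sentence=df["content"].apply(split_sentences))  # list of sentences per row
      [["ID", "sentence"]]                                    # keep only the wanted columns
      .explode("sentence")                                    # one row per list element
      .reset_index(drop=True)                                 # renumber rows 0..n-1
)
print(exploded)
```

explode repeats the ID for every sentence that came from the same row, so each sentence stays linked to its article.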

How to create a new dataframe by tokenizing the contents of an existing dataframe?
