I am new to Python and pandas and need the community's help. Here is what I am trying to do.
I have read a JSON file containing the following data:
using this code:
import pandas as pd
df = pd.read_json(path_to_file, lines=True)
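Desired output: I want to create a new DataFrame that has two columns
What I have managed so far:
I found that the sentence tokenizer comes from nltk, and worked out how to pass it to apply: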
result = df["content"].apply(sent_tokenize)
My question is how to get the result in the desired format described above.
Answer 0: (score: 1)
Use itertuples to iterate over the DataFrame
import pandas as pd
from nltk.tokenize import sent_tokenize
df = pd.DataFrame([['hi how are you. i am fine. hope this help you','ABC']], columns = ['sent','ID'])
df
                                            sent   ID
0  hi how are you. i am fine. hope this help you  ABC
new_sent = []
for row in df.itertuples():
    # row[0] is the index, row[1] the text, row[2] the ID
    for sent in sent_tokenize(row[1]):
        new_sent.append((sent, row[2]))
# create a DataFrame from new_sent
df_new = pd.DataFrame(new_sent, columns=['tokenized_sent', 'ID'])
# output
       tokenized_sent   ID
0     hi how are you.  ABC
1          i am fine.  ABC
2  hope this help you  ABC
Explanation
for row in df.itertuples():
    print(row)
# output
Pandas(Index=0, sent='hi how are you. i am fine. hope this help you', ID='ABC')
print(row[0])
0
print(row[1])
'hi how are you. i am fine. hope this help you'
print(row[2])
'ABC'
Now we tokenize the second element (row[1]) and append each sentence, together with its ID, to new_list
new_list = []
for sent in sent_tokenize(row[1]):
    new_list.append((sent, row[2]))
    print((sent, row[2]))
# output
('hi how are you.', 'ABC')
('i am fine.', 'ABC')
('hope this help you', 'ABC')
# now create a DataFrame from new_list
df_new = pd.DataFrame(new_list, columns=['tokenized_sent', 'ID'])
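The loop above can also be wrapped in a small helper that reads columns by name instead of by position. This is only a sketch: sentences_per_row is a hypothetical helper name, and split_sentences is a simple regex stand-in (an assumption, so the example runs without downloading nltk data) that you would replace with nltk's sent_tokenize.

```python
import re

import pandas as pd


def split_sentences(text):
    # Hypothetical stand-in for nltk's sent_tokenize:
    # split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentences_per_row(df, text_col, id_col, tokenizer=split_sentences):
    # Hypothetical helper: one output row per sentence, paired with its ID.
    # getattr works here because the column names are valid Python identifiers;
    # itertuples renames columns that are not.
    rows = [
        (sent, getattr(row, id_col))
        for row in df.itertuples(index=False)
        for sent in tokenizer(getattr(row, text_col))
    ]
    return pd.DataFrame(rows, columns=["tokenized_sent", id_col])


df = pd.DataFrame(
    [["hi how are you. i am fine. hope this help you", "ABC"]],
    columns=["sent", "ID"],
)
df_new = sentences_per_row(df, "sent", "ID")
print(df_new)
```

Passing tokenizer=sent_tokenize should give the nltk behaviour from the answer, since the helper only assumes the tokenizer returns a list of strings.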
Answer 1: (score: 0)
You can assign the return value of apply as a new column in df:
df["sentence"] = df["content"].apply(sent_tokenize)
If you want to drop the other columns (title and content), you can do that by assignment as well:
df = df[["ID", "sentence"]]
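Note that after this assignment each cell of sentence holds a list of sentences. If one sentence per row is wanted, DataFrame.explode (available since pandas 0.25) is one possible follow-up. The sketch below uses made-up sample data with the ID, title and content columns mentioned above, and a regex stand-in named split_sentences (an assumption) in place of nltk's sent_tokenize so that it runs without nltk data.

```python
import re

import pandas as pd


def split_sentences(text):
    # Hypothetical stand-in for nltk's sent_tokenize:
    # split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Made-up sample row; the real data comes from the JSON file.
df = pd.DataFrame(
    {
        "ID": ["ABC"],
        "title": ["greeting"],
        "content": ["hi how are you. i am fine. hope this help you"],
    }
)

# Each cell of "sentence" is a list of sentences at this point.
df["sentence"] = df["content"].apply(split_sentences)

# Keep the two wanted columns, then turn each list element into its own row.
result = df[["ID", "sentence"]].explode("sentence").reset_index(drop=True)
print(result)
```

explode repeats the ID for every sentence taken from the same row, and reset_index(drop=True) replaces the duplicated index values with a fresh 0..n-1 index.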