I want to convert the NPS Chat Corpus into a pandas DataFrame. nps_chat contains 15 XML files:
10-19-20s_706posts.xml, 10-19-30s_705posts.xml, 10-19-40s_686posts.xml, 10-19-adults_706posts.xml, 10-24-40s_706posts.xml, 10-26-teens_706posts.xml, 11-06-adults_706posts.xml, 11-08-20s_705posts.xml, 11-08-40s_706posts.xml, 11-08-adults_705posts.xml, 11-08-teens_706posts.xml, 11-09-20s_706posts.xml, 11-09-40s_706posts.xml, 11-09-adults_706posts.xml, 11-09-teens_706posts.xml
I want all of them in a single DataFrame.
Here are some sample posts from the corpus:
<!-- edited with XMLSpy v2007 sp1 (http://www.altova.com) by Eric Forsyth (Naval Postgraduate School) -->
<Session xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="postClassPOSTagset.xsd">
<Posts>
<Post class="Statement" user="10-19-20sUser7">now im left with this gay name<terminals>
<t pos="RB" word="now"/>
<t pos="PRP" word="im"/>
<t pos="VBD" word="left"/>
<t pos="IN" word="with"/>
<t pos="DT" word="this"/>
<t pos="JJ" word="gay"/>
<t pos="NN" word="name"/>
</terminals>
</Post>
<Post class="Emotion" user="10-19-20sUser7">:P<terminals>
<t pos="UH" word=":P"/>
</terminals>
</Post>
<Post class="System" user="10-19-20sUser76">PART<terminals>
<t pos="VB" word="PART"/>
</terminals>
From this I only need the class and the associated text in the pandas DataFrame, for example:
   Class      text
1  Statement  now im left with this gay name
2  Emotion    :P
3  System     PART
I can get the text into pandas with the following:
from nltk.corpus import nps_chat as nps
import pandas as pd

# nps.posts() returns each post as a list of tokens, so join them back into one string
chatroom = nps.posts()
df = pd.DataFrame({"text": [" ".join(post) for post in chatroom]})
Is there a way to get the class as well? That is the only missing piece.
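One possible way to recover the class is to read the `class` attribute of each `<Post>` element directly. The sketch below uses the standard library's `xml.etree.ElementTree` instead of NLTK's reader, and runs on a trimmed, properly closed copy of the sample above. The names `SAMPLE_XML`, `posts_to_rows` and `files_to_frame` are made up for illustration, and the column names `Class` and `text` are taken from the desired output.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Trimmed copy of the sample posts, closed so that it parses on its own.
SAMPLE_XML = """<Session>
<Posts>
<Post class="Emotion" user="10-19-20sUser7">:P<terminals>
<t pos="UH" word=":P"/>
</terminals>
</Post>
<Post class="System" user="10-19-20sUser76">PART<terminals>
<t pos="VB" word="PART"/>
</terminals>
</Post>
</Posts>
</Session>"""


def posts_to_rows(root):
    """Return one (class, text) pair per <Post> element under root."""
    rows = []
    for post in root.iter("Post"):
        # post.text is the raw message that precedes the <terminals> child.
        text = (post.text or "").strip()
        rows.append((post.get("class"), text))
    return rows


def files_to_frame(paths):
    """Hypothetical helper: combine several corpus files into one DataFrame."""
    rows = []
    for path in paths:
        rows.extend(posts_to_rows(ET.parse(path).getroot()))
    return pd.DataFrame(rows, columns=["Class", "text"])


root = ET.fromstring(SAMPLE_XML)
df = pd.DataFrame(posts_to_rows(root), columns=["Class", "text"])
print(df)
```

If you would rather stay inside NLTK, the corpus reader also has `nps.xml_posts()`, which yields the same `Post` elements across all files. Something like `[(p.get("class"), p.text) for p in nps.xml_posts()]` should produce equivalent rows, though that variant was not run here.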