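I want to extract question/answer pairs from https://archive.org/download/stackexchange, specifically from the Posts.xml file of any dump (I picked the Anime dump more or less at random because it is small and near the top of the list). My understanding of this file is that there are two PostTypeId values: 1 is a question (with the question body, title, and other metadata) and 2 is an answer (with the score, answer body, and other metadata).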
The data is easy to relate when we have an entry such as:
<row Id="1" PostTypeId="1" AcceptedAnswerId="8" CreationDate="2012-12-11T20:37:08.823" Score="69" ViewCount="22384" Body="<p>Assuming the world in the One Piece universe is round, then there is not really a beginning or an end of the Grand Line.</p>

<p>The Straw Hats started out from the first half and are now sailing across the second half.</p>

<p>Wouldn't it have been quicker to set sail in the opposite direction from where they started? </p>
" OwnerUserId="21" LastEditorUserId="1398" LastEditDate="2015-04-17T19:06:38.957" LastActivityDate="2015-05-26T12:50:40.920" Title="The treasure in One Piece is at the end of the Grand Line. But isn't that the same as the beginning?" Tags="<one-piece>" AnswerCount="5" CommentCount="0" FavoriteCount="2" />
The corresponding answer is:
<row Id="8" PostTypeId="2" ParentId="1" CreationDate="2012-12-11T20:47:52.167" Score="60" Body="<p>No, there is a reason why they can't. </p>

<p>Basically the <a href="http://onepiece.wikia.com/wiki/New_World">New World</a> is beyond the <a href="http://onepiece.wikia.com/wiki/Red_Line">Red Line</a>, but you cannot "walk" on it, or cross it. It's a huge continent, very tall that you cannot go through. You can't cross the <a href="http://onepiece.wikia.com/wiki/Calm_Belt">Calm Belt</a> either, unless you have some form of locomotion such as the Navy or <a href="http://onepiece.wikia.com/wiki/Boa_Hancock">Boa Hancock</a>.</p>

<p>So the only way is to start from one of the Four Seas, then to go the <a href="http://onepiece.wikia.com/wiki/Reverse_Mountain">Reverse Mountain</a> and follow the Grand Line until you reach <em><a href="http://onepiece.wikia.com/wiki/Raftel">Raftel</a></em>, which supposedly is where One Piece is located.</p>

<p><img src="http://i.stack.imgur.com/69IZ0.png" alt="enter image description here"></p>
" OwnerUserId="15" LastEditorUserId="1528" LastEditDate="2013-05-06T19:21:04.703" LastActivityDate="2013-05-06T19:21:04.703" CommentCount="1" />
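In the first XML snippet, PostTypeId="1" indicates that the row is a question, and AcceptedAnswerId="8" gives the Id of the accepted answer. In the second snippet, Id="8" matches the question's AcceptedAnswerId, PostTypeId="2" indicates that the row is an answer, and ParentId is the Id of the question it belongs to.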
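With that said, how can I easily query this data for question/answer pairs? Ideally I would convert it into a SQLite3 or MySQL database, since I am familiar with those kinds of data stores. If that is not possible (either through the database's own import functions or through a wrapper script that populates the database), how would I parse this data in Ruby so that I can walk the whole XML document, extract each question's title and body, and pair them with the body of the corresponding answer?
Thank you for your time.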
Answer 0 (score: 0)
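The Stack Exchange Creative Commons Data Dump is just a (sanitized) dump of Stack Exchange's production Microsoft SQL Server database. So, given that the data comes from a SQL database and really is relational data, you can import it into one.
The database schema is described in the Data Dump's README, and you can find some older scripts for importing it into a database on Meta Stack Exchange. Of course, if all you want is a SQL-like relational query interface, you can use the Stack Exchange Data Explorer.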
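As a rough illustration of the import idea, here is a minimal sketch in Python using only the standard library. It is not one of the scripts from Meta Stack Exchange. The table name posts, its column names, and the helpers load_posts and accepted_pairs are assumptions made for this example; the README describes the real schema. The sample XML is a trimmed stand-in for Posts.xml, with the HTML in Body entity-escaped so that the file is well-formed XML.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Trimmed stand-in for Posts.xml (hypothetical sample, not the real file).
SAMPLE_XML = """<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="8" Score="69"
       Title="Isn't the end of the Grand Line the same as the beginning?"
       Body="&lt;p&gt;Wouldn't it have been quicker to sail the other way?&lt;/p&gt;" />
  <row Id="8" PostTypeId="2" ParentId="1" Score="60"
       Body="&lt;p&gt;No, there is a reason why they can't.&lt;/p&gt;" />
  <row Id="9" PostTypeId="2" ParentId="1" Score="3"
       Body="&lt;p&gt;Another answer.&lt;/p&gt;" />
</posts>
"""


def load_posts(source, conn):
    """Stream <row> elements from a file path or file object into SQLite."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS posts (
               id INTEGER PRIMARY KEY,
               post_type_id INTEGER,
               parent_id INTEGER,
               accepted_answer_id INTEGER,
               score INTEGER,
               title TEXT,
               body TEXT)"""
    )
    # iterparse avoids building the whole tree up front for large dumps.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "row":
            continue
        attrs = elem.attrib
        conn.execute(
            "INSERT INTO posts VALUES (?, ?, ?, ?, ?, ?, ?)",
            (
                int(attrs["Id"]),
                int(attrs["PostTypeId"]),
                attrs.get("ParentId"),          # only present on answers
                attrs.get("AcceptedAnswerId"),  # only on questions with one
                attrs.get("Score"),
                attrs.get("Title"),
                attrs.get("Body"),
            ),
        )
        elem.clear()  # drop the attributes we no longer need
    conn.commit()


def accepted_pairs(conn):
    """Return (title, question body, accepted answer body) tuples."""
    return conn.execute(
        """SELECT q.title, q.body, a.body
           FROM posts AS q
           JOIN posts AS a ON a.id = q.accepted_answer_id
           WHERE q.post_type_id = 1 AND a.post_type_id = 2
           ORDER BY q.id"""
    ).fetchall()


conn = sqlite3.connect(":memory:")  # use a file path to keep the database
load_posts(io.BytesIO(SAMPLE_XML.encode("utf-8")), conn)
pairs = accepted_pairs(conn)
for title, question_body, answer_body in pairs:
    print(title)
    print(question_body)
    print(answer_body)
```

To pair each question with all of its answers rather than only the accepted one, the join condition could be changed to a.parent_id = q.id. Once the data is in a SQLite file, it can be queried from Ruby or any other SQLite client.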