Question

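I want to extract question/answer pairs from https://archive.org/download/stackexchange, specifically from the Posts.xml file of any of the dumps (I picked the Anime dump arbitrarily because it is small and near the top of the list). My understanding of this file is that there are two PostTypeId values: 1 is a question (with the question body, title, and other metadata) and 2 is an answer (with the score, answer body, and other metadata).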

The data is easy to relate if we have an entry such as:

  <row Id="1" PostTypeId="1" AcceptedAnswerId="8" CreationDate="2012-12-11T20:37:08.823" Score="69" ViewCount="22384" Body="&lt;p&gt;Assuming the world in the One Piece universe is round, then there is not really a beginning or an end of the Grand Line.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;The Straw Hats started out from the first half and are now sailing across the second half.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Wouldn't it have been quicker to set sail in the opposite direction from where they started?     &lt;/p&gt;&#xA;" OwnerUserId="21" LastEditorUserId="1398" LastEditDate="2015-04-17T19:06:38.957" LastActivityDate="2015-05-26T12:50:40.920" Title="The treasure in One Piece is at the end of the Grand Line. But isn't that the same as the beginning?" Tags="&lt;one-piece&gt;" AnswerCount="5" CommentCount="0" FavoriteCount="2" />

The corresponding answer is:

  <row Id="8" PostTypeId="2" ParentId="1" CreationDate="2012-12-11T20:47:52.167" Score="60" Body="&lt;p&gt;No, there is a reason why they can't. &lt;/p&gt;&#xA;&#xA;&lt;p&gt;Basically the &lt;a href=&quot;http://onepiece.wikia.com/wiki/New_World&quot;&gt;New World&lt;/a&gt; is beyond the &lt;a href=&quot;http://onepiece.wikia.com/wiki/Red_Line&quot;&gt;Red Line&lt;/a&gt;, but you cannot &quot;walk&quot; on it, or cross it. It's a huge continent, very tall that you cannot go through. You can't cross the &lt;a href=&quot;http://onepiece.wikia.com/wiki/Calm_Belt&quot;&gt;Calm Belt&lt;/a&gt; either, unless you have some form of locomotion such as the Navy or &lt;a href=&quot;http://onepiece.wikia.com/wiki/Boa_Hancock&quot;&gt;Boa Hancock&lt;/a&gt;.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;So the only way is to start from one of the Four Seas, then to go the &lt;a href=&quot;http://onepiece.wikia.com/wiki/Reverse_Mountain&quot;&gt;Reverse Mountain&lt;/a&gt; and follow the Grand Line until you reach &lt;em&gt;&lt;a href=&quot;http://onepiece.wikia.com/wiki/Raftel&quot;&gt;Raftel&lt;/a&gt;&lt;/em&gt;, which supposedly is where One Piece is located.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;&lt;img src=&quot;http://i.stack.imgur.com/69IZ0.png&quot; alt=&quot;enter image description here&quot;&gt;&lt;/p&gt;&#xA;" OwnerUserId="15" LastEditorUserId="1528" LastEditDate="2013-05-06T19:21:04.703" LastActivityDate="2013-05-06T19:21:04.703" CommentCount="1" />

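In the first XML snippet, PostTypeId="1" indicates that the row is a question, and AcceptedAnswerId="8" gives the Id of the accepted answer. In the second XML snippet, Id="8" matches the question's AcceptedAnswerId, PostTypeId="2" indicates that the row is an answer, and ParentId is the Id of the question.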
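To make that link concrete, here is a minimal Python sketch that pairs a question with its accepted answer using only the attributes described above. The `SAMPLE` string is a trimmed copy of the two rows (most attributes and the full bodies are omitted), and the `<posts>` wrapper element and the `pair_accepted` helper are names assumed for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two rows above; the <posts> wrapper is an assumption.
SAMPLE = """<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="8" Title="The treasure in One Piece is at the end of the Grand Line. But isn't that the same as the beginning?" />
  <row Id="8" PostTypeId="2" ParentId="1" Score="60" Body="&lt;p&gt;No, there is a reason why they can't. &lt;/p&gt;" />
</posts>"""


def pair_accepted(xml_text):
    """Return (question title, accepted answer body) pairs (hypothetical helper)."""
    # Index every row by its Id so an answer can be looked up directly.
    rows = {row.get("Id"): row for row in ET.fromstring(xml_text).iter("row")}
    pairs = []
    for row in rows.values():
        if row.get("PostTypeId") != "1":
            continue  # not a question
        answer = rows.get(row.get("AcceptedAnswerId"))
        # Keep the pair only if the answer points back at this question.
        if answer is not None and answer.get("ParentId") == row.get("Id"):
            pairs.append((row.get("Title"), answer.get("Body")))
    return pairs


pairs = pair_accepted(SAMPLE)
```

This holds every row in memory, so it only illustrates the relationship; it is not meant as a way to process a full dump.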

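With that said, how can I easily query this data for question/answer pairs? Ideally I would convert it to an SQLite3 or MySQL database, since I am familiar with those kinds of data structures. If that is not possible (either through the database's own functions or through a script wrapper that populates the database), how would I parse this data in Ruby so that I can go through the whole XML document, extract each question's body and title, and then pair them with the corresponding answer body?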

Thanks for your time.

Answer 1

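The Stack Exchange Creative Commons Data Dump is just a (sanitized) dump of Stack Exchange's production Microsoft SQL Server databases. So, given that the data comes from a SQL database and really is relational, you can import it back into one.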

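The database schema is described in the Data Dump's README, and you can find some old scripts for importing it into a database on Meta Stack Exchange. Of course, if all you want is a SQL-like relational query interface, you can just use the Stack Exchange Data Explorer.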
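If you would rather roll your own import into SQLite, something along these lines might be a starting point. It is a rough Python sketch rather than a tested importer: the `posts` table, its column names, and the `load_posts` / `accepted_pairs` helpers are all made up for illustration, and only the handful of attributes visible in the sample rows are kept. The same stream-and-insert approach should carry over to Ruby.

```python
import sqlite3
import xml.etree.ElementTree as ET


def _to_int(value):
    """Convert an attribute string to int, passing missing values through."""
    return int(value) if value is not None else None


def load_posts(xml_source, db_path=":memory:"):
    """Stream <row> elements from a Posts.xml file into a SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS posts (
               id INTEGER PRIMARY KEY,
               post_type_id INTEGER,
               parent_id INTEGER,
               accepted_answer_id INTEGER,
               score INTEGER,
               title TEXT,
               body TEXT)"""
    )
    # iterparse hands over each element once its end tag has been read,
    # so the file is processed incrementally instead of loaded up front.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "row":
            a = elem.attrib
            conn.execute(
                "INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?, ?, ?, ?)",
                (_to_int(a.get("Id")), _to_int(a.get("PostTypeId")),
                 _to_int(a.get("ParentId")), _to_int(a.get("AcceptedAnswerId")),
                 _to_int(a.get("Score")), a.get("Title"), a.get("Body")),
            )
            elem.clear()  # drop the row's attribute data once it is stored
    conn.commit()
    return conn


# Each question joined to its accepted answer via AcceptedAnswerId.
ACCEPTED_PAIRS_SQL = """
    SELECT q.title, q.body, a.body
    FROM posts AS q
    JOIN posts AS a ON a.id = q.accepted_answer_id
    WHERE q.post_type_id = 1 AND a.post_type_id = 2
"""


def accepted_pairs(conn):
    """Return (question title, question body, answer body) tuples."""
    return conn.execute(ACCEPTED_PAIRS_SQL).fetchall()
```

You would call it as `conn = load_posts("Posts.xml", "anime.db")` and then `accepted_pairs(conn)`. To get every answer rather than only the accepted one, join on `a.parent_id = q.id` instead and order by `a.score`.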
