Question

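I have just started programming in Python and I am currently working with a very large dataset. It is an XML file of about 80 GB, so I cannot parse it in one go with, for example, xml.etree.ElementTree, because it simply does not fit into my RAM. (File: ftp://ftp.ebi.ac.uk/pub/databases/interpro/46.0/, see match_complete.xml.gz)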

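What I have done so far: I have been iterating over the file, always clearing the current element and its root as soon as I have found what I am looking for. This works very well (it needs less than 10 MB of RAM).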
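For reference, the clearing approach described above could look roughly like the sketch below. It assumes xml.etree.ElementTree.iterparse is used; the function name iter_proteins and the tiny in-memory sample (with made-up attribute values) are only illustrative and not taken from the real file.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the real file; the second protein's values are made up.
SAMPLE = b"""<interpromatch>
<release><dbinfo dbname="X"/></release>
<protein id="A0A000" name="A0A000_9ACTN"><match id="M1"><ipr id="I1"/></match></protein>
<protein id="A0A001" name="A0A001_9ACTN"><match id="M2"><ipr id="I2"/></match></protein>
</interpromatch>"""


def iter_proteins(source):
    """Yield the attributes of each <protein> element, one at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "protein":
            yield dict(elem.attrib)
            elem.clear()  # drop the children of the finished element
            root.clear()  # drop the root's references to finished children


for attrs in iter_proteins(io.BytesIO(SAMPLE)):
    print(attrs["id"], attrs["name"])
```

Clearing the root as well as the element matters here, because otherwise the root keeps a reference to every finished child and memory still grows with the file.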

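What I want to do now is parallelize my parsing, since I have 10 cores and 20 threads at my disposal. To do that, I plan to split this large XML file into 20 smaller files, so that I can search each small file in parallel (how to do that will probably be a second question, in another thread).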

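I do not want to do this for just one dataset, whose size I could easily look up (see release_notes.txt at the link above); I want a more general script for further use. So I am looking for the most efficient way to find out how many elements with a specific tag exist in this huge XML file, so that I can always split the file according to the number of threads I have available.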

The data structure looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE interpromatch SYSTEM "match_complete.dtd">
<interpromatch>

<release>
    <dbinfo Here is stuff I am totally not interested in>
    <dbinfo Here is stuff I am totally not interested in>
</release>
<protein id="A0A000" name="A0A000_9ACTN" length="394" crc64="F1DD0C1042811B48">
<match id=Some info about the proteins in my case>
  <ipr Some info I am actually looking for, when I am parsing the file ESSENTIAL />
  <Don't need this either />
</match>
<match id=Some info about the proteins in my case>
  <ipr Some info I am actually looking for, when I am parsing the file ESSENTIAL />
  <Don't need this either />
</match>
<match id=Some info about the proteins in my case>
  <ipr Some info I am actually looking for, when I am parsing the file ESSENTIAL />
  <Don't need this either />
</match>
</protein>
.
.
. (around 50000000 more entries in the whole db)
<protein>
<match id=Some info about the proteins in my case>
  <ipr Some info I am actually looking for, when I am parsing the file ESSENTIAL />
  <Don't need this either />
</match>
</protein>
</interpromatch>

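Let's say I am looking for the tag "protein" and my database contains 10000 such entries. I want to be able to find this number as quickly as possible (iterating over everything is, I think, not feasible at all), so that I know how many of these entries there are and can divide that number by the number of threads. In this example I would like to get something like len(tree.findall("protein")), so that I know how many entries I have to put into each of the smaller files. In this case that would be 10000 (proteins) / 20 (threads) = 500 per file.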
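To make the division concrete for counts that do not divide evenly, here is a small sketch. The helper name entries_per_file is hypothetical; it only works out how many entries each of the smaller files would receive once the total is known.

```python
def entries_per_file(total, n_files):
    """Split `total` entries over `n_files` files as evenly as possible."""
    base, extra = divmod(total, n_files)
    # The first `extra` files take one additional entry each.
    return [base + 1 if i < extra else base for i in range(n_files)]


print(entries_per_file(10000, 20)[:3])  # [500, 500, 500]
print(entries_per_file(10, 3))          # [4, 3, 3]
```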

I mainly work with Python, but I will consider anything that simply tells me, as fast as possible, how many "protein" entries are in my database.
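One idea I can think of is to skip XML parsing entirely and count the opening tags as raw bytes while streaming the file in chunks. The sketch below assumes that every opening tag appears literally as `<protein` followed by whitespace, `>` or `/`, and that this text never occurs inside comments or CDATA sections; the function name count_open_tags is hypothetical. It still has to read the whole file once.

```python
import io
import re


def count_open_tags(stream, tag, chunk_size=1 << 20):
    """Count literal opening tags `<tag` in a binary stream, chunk by chunk."""
    pattern = re.compile(b"<" + re.escape(tag.encode("ascii")) + rb"[\s>/]")
    keep = len(tag) + 1  # one byte shorter than a full match
    count = 0
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = tail + chunk
        count += len(pattern.findall(data))
        # Carry over the last bytes so a tag split across two chunks is
        # still seen, but a tag already counted is never counted twice.
        tail = data[-keep:]
    return count


sample = (b'<interpromatch><protein id="a"><match/></protein><proteinfoo/>'
          b'<protein\nid="b"></protein><protein></protein></interpromatch>')
print(count_open_tags(io.BytesIO(sample), "protein"))  # 3
```

For the compressed download, the stream could be opened with gzip.open("match_complete.xml.gz", "rb") and passed in the same way, so the 80 GB never have to be unpacked to disk first.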

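For the sake of completeness, this is what I want to do afterwards: start a script/subprocess for each of the smaller files and query a certain attribute in its "ipr" sections. There I am looking for a specific identifier, and if that identifier is present, I extract data from the parent "protein" node. Then I combine these results and work with them.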
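A rough sketch of that per-file query, under some assumptions: the attribute on "ipr" that holds the identifier is called id here (the sample above elides the real attribute names), the identifier values in the test data are made up, and the function name proteins_with_ipr is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def proteins_with_ipr(source, wanted_id, attr="id"):
    """Return the attributes of every <protein> that has a matching <ipr>."""
    hits = []
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "protein":
            # At the end of a protein, all of its matches and iprs are available.
            if any(ipr.get(attr) == wanted_id for ipr in elem.iter("ipr")):
                hits.append(dict(elem.attrib))
            elem.clear()
            root.clear()
    return hits


# Made-up miniature file: only the first protein carries the wanted identifier.
small_file = b"""<interpromatch>
<protein id="P1"><match id="M1"><ipr id="X1"/></match></protein>
<protein id="P2"><match id="M2"><ipr id="X2"/></match></protein>
</interpromatch>"""

print(proteins_with_ipr(io.BytesIO(small_file), "X1"))
```

Each subprocess would run something like this on its own smaller file, and the returned lists could then be concatenated to combine the results.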

I hope you understand what I mean and can help me. Thanks in advance!

Python XML: quickly counting elements with a specific tag

0 answers: