Question

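I am trying to parse the table at https://www.neb.com/tools-and-resources/usage-guidelines/nebuffer-performance-chart-with-restriction-enzymes using the Python library lxml, but when I try code snippets like the ones in "How to extract tables from websites in Python", I run into problems with the <a> tags and the images shown in this table. In the end I want a text file that contains the following columns of this restriction enzyme table from NEB, without any formatting, just plain text:

Enzyme | Sequence | NEBuffer | % Activity in NEBuffer | Heat Inac. | Incu. Temp.

I wanted to try extracting each td in a row myself and combining that information into one list entry: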

from urllib2 import urlopen
from lxml import etree
url = "https://www.neb.com/tools-and-resources/usage-guidelines/nebuffer-performance-chart-with-restriction-enzymes"
tree = etree.HTML(urlopen(url).read())

rows = tree.xpath('//*[@id="form1"]/div[2]/div/div/section[@class="chart"]/table/tbody/tr')

cells = [[rows.xpath('//td/a/text()'), 
          rows.xpath('//td/text()')] for tr in rows]
print cells[1]

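But this just mixes everything together into one entry, and I don't know how to deal with special characters such as the u' prefix and \u2122. The first line of the output: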

[['AatII', u'CutSmart\u2122 Buffer', 'AbaSI', 'NEBuffer 4', 'Acc65I', 'NEBuffer 3.1', 'AccI', u'CutSmart\u2122 Buffer', 'AciI', u'CutSmart\u2122 Buffer',

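I think the columns I didn't write code for, such as the images in column 2, are skipped :/

I hope my question is detailed enough for you to understand what I am trying to do.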

Answer 1

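First, \u2122 is just the ASCII-friendly representation of the ™ unicode character. If you print() the string, you'll see the character itself rather than the escape sequence. So no worries!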
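As a small illustration of that point, here is a sketch written for Python 3 (an assumption; the code in this thread is Python 2, where the same escape shows up inside u'...' when a list is printed). It shows that the escape and the character are the same one-character string, and what the text looks like once encoded for a file:

```python
# The escape \u2122 and the trademark sign are the same single character.
name = u"CutSmart\u2122 Buffer"

# ascii() gives the ASCII-only form, similar to what printing a list shows.
escaped = ascii(name)

# Encoding to UTF-8 gives the bytes that would end up in a text file.
encoded = name.encode("utf-8")

print(escaped)                    # the escaped form, quotes included
print(len(name), len(encoded))    # character count vs. byte count
```

When writing the final text file, opening it with an explicit encoding such as UTF-8 keeps these characters intact.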

Then, your code does not work for me:

tree.xpath('//*[@id="form1"]/div[2]/div/div/section[@class="chart"]/table/tbody/tr')

returns a list, which makes it impossible to do:

rows.xpath('//td/a/text()')

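So I don't know how you got a result at all. Even if it did work, there is something about XPath you missed: a leading // makes the search start from the root of the document, which is why you get the content of every a tag inside a td tag in the whole document, not just the ones inside the tr you are at.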

Instead, if you use a relative XPath, the following works:

>>> rows[0].xpath('td/a')
[<Element a at 0x2e3ff50>, <Element a at 0x2e3ff00>]
>>> rows[0].xpath('td/a/text()')
['AatII', u'CutSmart\u2122 Buffer']
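To make the difference concrete, here is a minimal sketch on a made-up two-row table (the HTML string is an assumption for illustration, not the NEB page), written for Python 3 with lxml installed:

```python
from lxml import etree

# Tiny stand-in table; the real page has many more cells per row.
HTML = """
<table>
  <tr><td><a href="/a">AatII</a></td><td><a href="/b">CutSmart Buffer</a></td></tr>
  <tr><td><a href="/c">AbaSI</a></td><td><a href="/d">NEBuffer 4</a></td></tr>
</table>
"""

tree = etree.HTML(HTML)
rows = tree.xpath("//tr")   # a plain Python list of <tr> elements

# The list itself has no .xpath() method; call it on each element instead.
assert not hasattr(rows, "xpath")

# A leading "//" searches the whole document, whichever element you call it on.
absolute = rows[1].xpath("//td/a/text()")

# A relative path only looks below the current <tr>.
relative = rows[1].xpath("td/a/text()")

print(absolute)
print(relative)
```

If you need to search deeper than the direct children of the row, a path starting with ".//" stays relative to the current element.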

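But the thing is, doing it that way is too generic: you won't be able to keep the elements in the order you care about. Sadly, there's no automatic way to pick out just the interesting parts.

So you need to look at the HTML and see that for some td cells you want the alt of the images, and for another you want the content of a span: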

<tr>
    <td>
        <a href="/products/r0117-aatii">AatII</a>
    </td>
    <td>
        <img class="product-icon" longdesc="This enzyme is purified from a recombinant source." alt="recombinant" src="/~/media/Icons/icon_recomb.gif">
        <img class="product-icon" longdesc="This enzyme is capable of digesting 1 µg of DNA in 5 minutes." alt="timesaver 5min" src="/~/media/Icons/icon_timesaver5.gif">
        <img class="product-icon" longdesc="Cleavage with this restriction enzyme is blocked when the substrate DNA is methylated by CpG methylase." alt="cpg" src="/~/media/Icons/icon_cpg.gif">
    </td>
    <td>GACGT/C</td>
    <td>
        <a href="/products/b7204-cutsmart-buffer">CutSmart™ Buffer</a>
    </td>
    <td>10</td>
    <td>50*</td>
    <td>50</td>
    <td>100</td>
    <td>
        <span style="color:red;">80°C</span>
    </td>
    <td>37°C</td>
    <td>B </td>
    <td>
        <img width="10" height="10" alt="Not Sensitive" src="/~/media/Icons/Not Sensitive.gif">
    </td>
    <td>
        <img width="10" height="10" alt="Not Sensitive" src="/~/media/Icons/Not Sensitive.gif">
    </td>
    <td>
        <img width="10" height="10" alt="Blocked" src="/~/media/Icons/Blocked.gif">
    </td>
    <td>λ DNA</td>
    <td></td>
</tr>
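One possible way to cope with these cell shapes (link text, span text, plain text, image alt) in a single place is sketched below. cell_value is a hypothetical helper, not part of lxml, and the sample row is a shortened copy of the HTML above; the sketch assumes Python 3 with lxml installed:

```python
from lxml import etree

# Shortened copy of the row above; &#176; is the degree sign.
SAMPLE_ROW = """
<table><tr>
  <td><a href="/products/r0117-aatii">AatII</a></td>
  <td>
    <img alt="recombinant" src="icon_recomb.gif">
    <img alt="cpg" src="icon_cpg.gif">
  </td>
  <td>GACGT/C</td>
  <td><span style="color:red;">80&#176;C</span></td>
  <td></td>
</tr></table>
"""


def cell_value(td):
    """Return the cell's visible text, or its image alt texts if it has none."""
    # itertext() walks the text of the cell and of everything nested in it.
    text = "".join(td.itertext()).strip()
    if text:
        return text
    # No text at all: fall back to the alt attributes of any images.
    return ", ".join(img.get("alt", "") for img in td.xpath("img"))


row = etree.HTML(SAMPLE_ROW).xpath("//tr")[0]
values = [cell_value(td) for td in row.xpath("td")]
print(ascii(values))
```

This keeps the cells in document order, at the cost of treating every column the same way; per-column handling gives you more control over each field.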

Here is how to get the values of interest from the document you linked:

>>> for row in rows: print row[0].xpath('a/text()'), [img.attrib['alt'] for img in row[1].xpath('img')], row[2].text, row[3].xpath('a/text()'), row[4].text, row[5].text, row[6].text, row[7].text, row[8].xpath('span/text()'), row[9].text, [img.attrib['alt'] for img in row[10].xpath('img')], [img.attrib['alt'] for img in row[11].xpath('img')], [img.attrib['alt'] for img in row[12].xpath('img')], row[13].text, row[14].text
['AatII'] ['recombinant', 'timesaver 5min', 'cpg'] GACGT/C [u'CutSmart\u2122 Buffer'] 10 50* 50 100 [u'80\xb0C'] 37°C [] ['Not Sensitive'] ['Not Sensitive'] None λ DNA
['AbaSI'] ['recombinant'] None ['NEBuffer 4'] 25 50 50 100 [] 25°C [] ['Not Sensitive'] ['Not Sensitive'] None None
['Acc65I'] ['recombinant', 'timesaver 5min', 'dcm', 'cpg'] G/GTACC ['NEBuffer 3.1'] 10 75* 100 25 [] 37°C [] ['Not Sensitive'] ['Blocked by Some Combinations of Overlapping'] None pBC4 DNA
...

That gets you all the fields.

Finally, to make it easy to reuse, here's what I would do:

 # Cell indexes follow the 16 td cells of the sample row shown above.
 enzymes = [{ 'enzyme'                     : row[0].xpath('a/text()'),
              'attributes'                 : [img.attrib['alt'] for img in row[1].xpath('img')],
              'Sequence'                   : row[2].text,
              'Supplied NEBuffer'          : row[3].xpath('a/text()'),
              '% Activity in NEBuffer 1.1' : row[4].text,
              '% Activity in NEBuffer 2.1' : row[5].text,
              '% Activity in NEBuffer 3.1' : row[6].text,
              'CutSmart'                   : row[7].text,
              'Heat Inac.'                 : row[8].xpath('span/text()')[0] if len(row[8].xpath('span/text()')) > 0 else row[8].text,
              'Incu. Temp.'                : row[9].text,
              'Diluent'                    : row[10].text,
              'Dam'                        : [img.attrib['alt'] for img in row[11].xpath('img')],
              'Dcm'                        : [img.attrib['alt'] for img in row[12].xpath('img')],
              'CpG'                        : [img.attrib['alt'] for img in row[13].xpath('img')],
              'Unit Substrate'             : row[14].text,
              'Note'                       : row[15].text
            } for row in rows]

And for the first enzyme, this is the result:

>>> import pprint
>>> pprint.pprint(enzymes[0])
{'% Activity in NEBuffer 1.1': '10',
 '% Activity in NEBuffer 2.1': '50*',
 '% Activity in NEBuffer 3.1': '50',
 'CpG': ['Blocked'],
 'CutSmart': '100',
 'Dam': ['Not Sensitive'],
 'Dcm': ['Not Sensitive'],
 'Diluent': 'B ',
 'Heat Inac.': u'80\xb0C',
 'Incu. Temp.': u'37\xb0C',
 'Note': None,
 'Sequence': 'GACGT/C',
 'Supplied NEBuffer': [u'CutSmart\u2122 Buffer'],
 'Unit Substrate': u'\u03bb DNA',
 'attributes': ['recombinant', 'timesaver 5min', 'cpg'],
 'enzyme': ['AatII']}
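Since the goal was a plain text file with pipe-separated columns, here is a rough sketch of that last step. The names to_text, write_table and COLUMNS, the record keys, and the output file name are all illustrative assumptions; the sketch targets Python 3 and uses only the standard library:

```python
import io
import os
import tempfile

# Hypothetical column selection, in the order it should appear in the file.
COLUMNS = ["Enzyme", "Sequence", "Supplied NEBuffer", "Heat Inac.", "Incu. Temp."]


def to_text(value):
    """Flatten None, lists and strings into one stripped string."""
    if value is None:
        return ""
    if isinstance(value, (list, tuple)):
        return ", ".join(to_text(item) for item in value)
    return str(value).strip()


def write_table(records, columns, path):
    """Write one header line, then one pipe-separated line per record."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write("|".join(columns) + "\n")
        for record in records:
            handle.write("|".join(to_text(record.get(c)) for c in columns) + "\n")


# One record shaped like the dictionaries built from the table rows.
records = [{
    "Enzyme": ["AatII"],
    "Sequence": "GACGT/C",
    "Supplied NEBuffer": [u"CutSmart\u2122 Buffer"],
    "Heat Inac.": u"80\u00b0C",
    "Incu. Temp.": u"37\u00b0C",
}]

path = os.path.join(tempfile.mkdtemp(), "enzymes.txt")
write_table(records, COLUMNS, path)
```

Opening the file with an explicit UTF-8 encoding is what lets characters such as ™ and ° end up in the file as-is.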

HTH
