Question

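I am trying to do the following:

I have a text file that lists some values, one per line.
A website that generates a list of values depending on the page number. The values are XXX & YYY in the example below.
A Python script that reads the text file (using a set for efficient O(1) lookups) and searches the site page by page, incrementing the page number by 1 each time. If a value on a page matches, it must print the page number.

The search must go like www.site.com/1 www.site.com/2 www.site.com/3 ... and so on.

HTML source: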

<pre class="values">
    <strong>A</strong>
    <strong>B</strong>
    <strong>C</strong>
    <span id="1">
        <a href="/#">+</a> 
        <span title="1">1</span>
        <a href="/#">XXX</a>
        <a href="/#">YYY</a>
    </span>
</pre>

Efficient O(1) lookups on the text file using a set:

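with open("values.txt", "r") as f1:
    # Strip trailing newlines so stored values compare equal to scraped text
    lines = {line.strip() for line in f1}  # efficient O(1) lookups using a set

# HTML is a placeholder for the values extracted from the page
for line in HTML:
    if line in lines:
        print(line)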

Answer 1

from xml.etree import ElementTree as ET

<pre class="values">
    <strong>A</strong>
    <strong>B</strong>
    <strong>C</strong>
    <span id="1">
        <a href="/#">+</a> 
        <span title="1">1</span>
        <a href="/#">XXX</a> <a href="/#">YYY</a>
    </span>
</pre>

with open('/path/to/file.html') as fp:
    html = ET.fromstring(fp.read())

# Walk every element in the tree and print the text of each <a> link
for node in html.iter():
    if node.tag == 'a':
        print(node.text)
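
The parsing above can be combined with the set lookup from the question to report matching page numbers. The following is only a minimal sketch under several assumptions: the helper names (`load_values`, `link_texts`, `find_matching_pages`) and the `fetch_page` callable are hypothetical and do not come from the question or the answer, each page is assumed to be well-formed markup that `ET.fromstring` can parse, and the range of pages to scan is assumed to be known in advance.

```python
from xml.etree import ElementTree as ET


def load_values(path):
    """Read one value per line into a set for O(1) membership tests."""
    with open(path, "r") as f:
        return {line.strip() for line in f if line.strip()}


def link_texts(html_source):
    """Return the stripped text of every <a> element in well-formed markup."""
    root = ET.fromstring(html_source)
    return [a.text.strip() for a in root.iter("a") if a.text and a.text.strip()]


def find_matching_pages(wanted, fetch_page, first_page, last_page):
    """Return the page numbers whose <a> texts contain any wanted value.

    fetch_page is a hypothetical callable: page number -> HTML string.
    """
    matches = []
    for page in range(first_page, last_page + 1):
        # Set membership keeps each lookup O(1) regardless of file size
        if any(text in wanted for text in link_texts(fetch_page(page))):
            matches.append(page)
    return matches
```

In practice `fetch_page` might wrap something like `urllib.request.urlopen` on `www.site.com/<page>`, and the result of `find_matching_pages(load_values("values.txt"), fetch_page, 1, 100)` could then be printed. Real-world HTML is often not well-formed XML, in which case `ET.fromstring` raises `ParseError` and a dedicated HTML parser would likely be needed instead.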
