Question

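I recently wrote a Python script that parses specific lines out of a web page. The code works, but every time I run it, it downloads a ".php" file and writes it to the working directory: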

#!/usr/bin/env python
import wget
import re
from HTMLParser import HTMLParser
import tempfile
url = "http://tuberculist.epfl.ch/quicksearch.php?gene+name=0009&submit=Search#sequence"
filname = wget.download(url)
a = open(filname,'r')
b = a.readlines()
f = "|Rv0009|"
for c in b:
    if f in c:
        pattern = re.compile("> >.+<br /></")
        z = pattern.findall(c)
        print z

What should I change so that I can parse the lines I need without writing a file?

Answer 1

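A few notes:

urllib.urlopen(url) gives you a file-like object without writing anything to disk (in Python 3 the same function is urllib.request.urlopen).
Your code imports two modules it never uses (HTMLParser and tempfile). Remove those imports.
The #sequence part of your URL is a fragment and is never sent to the server (that is part of the HTTP specification). You can drop it.
You are parsing HTML with a regular expression. As your use case gets more complex, this will cause you pain. Consider using lxml.html (http://lxml.de/lxmlhtml.html) or BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) instead.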
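
The first three notes can be sketched roughly as follows. This is only an illustration, not the asker's final code: it assumes Python 3 (where urlopen lives in urllib.request), assumes the page decodes as UTF-8, and the helper names find_matches and fetch_lines are made up for this example. The marker and the regular expression are taken unchanged from the question.

```python
import re
from urllib.parse import urldefrag
from urllib.request import urlopen

MARKER = "|Rv0009|"                        # marker string from the question
PATTERN = re.compile(r"> >.+<br /></")     # regex from the question, unchanged


def find_matches(lines, marker=MARKER, pattern=PATTERN):
    """Return the regex matches from every line that contains the marker."""
    matches = []
    for line in lines:
        if marker in line:
            matches.extend(pattern.findall(line))
    return matches


def fetch_lines(url, encoding="utf-8"):
    """Yield decoded lines of the HTTP response; nothing is written to disk.

    The encoding is an assumption; adjust it if the server reports another one.
    """
    # The fragment (#sequence) is never sent to the server, so strip it.
    clean_url, _fragment = urldefrag(url)
    with urlopen(clean_url) as response:
        for raw_line in response:          # the response iterates line by line
            yield raw_line.decode(encoding, errors="replace")


# Usage (requires network access, so it is left commented out):
# url = "http://tuberculist.epfl.ch/quicksearch.php?gene+name=0009&submit=Search#sequence"
# print(find_matches(fetch_lines(url)))
```

Because find_matches accepts any iterable of strings, it can be tried on a small list of sample lines without touching the network. The regular expression is kept only to mirror the question; the last note's advice to switch to a real HTML parser still applies.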

How to read and parse an HTML file without writing to disk
