Question

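I have written code to extract some text from an HTML file. It already extracts the requested line from the web page, and now I want to extract the sequence data as well. Unfortunately, I cannot extract the text; the code raises errors.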

import urllib2
from HTMLParser import HTMLParser
import nltk 
from bs4 import BeautifulSoup

# Proxy information was removed
# from these two lines

proxyOpener = urllib2.build_opener(proxyHandler)
urllib2.install_opener(proxyOpener)

response = urllib2.urlopen('http://tuberculist.epfl.ch/quicksearch.php?gene+name=Rv0470c')

################## BS Block ################################

soup = BeautifulSoup(response)
text = soup.get_text()
print text 

##########################################################

html = response.readline()

for l in html:
    if "|Rv0470c|" in l:
        print l       # code is running successfully till here 

raw = nltk.clean_html(html) 
print raw

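How can I get this code to run successfully? I have checked all the available threads and solutions, but nothing works.

I want to extract this part: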

M. tuberculosis H37Rv|Rv0470c|pcaA
MSVQLTPHFGNVQAHYDLSDDFFRLFLDPTQTYSCAYFERDDMTLQEAQIAKIDLALGKLNLEPGMTLLDIGCGWGATMRRAIEKYDVNVVGLTLSENQAGHVQKMFDQMDTPRSRRVLLEGWEKFDEPVDRIVSIGAFEHFGHQRYHHFFEVTHRTLPADGKMLLHTIVRPTFKEGREKGLTLTHELVHFTKFILAEIFPGGWLPSIPTVHEYAEKVGFRVTAVQSLQLHYARTLDMWATALEANKDQAIAIQSQTVYDRYMKYLTGCAKLFRQGYTDVDQFTLEK

Answer 1

I was able to extract the required text with the code below. It has no dependencies other than "urllib2", and in my case it works like a charm.

{{1}}
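
As an illustration of a standard-library-only approach of the kind this answer describes (a sketch, not the answer's original code), the record can be pulled out of the raw HTML with `re` and `html`. It is written for Python 3, where `urllib2` became `urllib.request`. The names `fetch_page`, `extract_fasta_record` and `SAMPLE_HTML` are hypothetical, and the sample markup is an assumption about how the page embeds the FASTA-style record, not a copy of the real page.

```python
import html
import re
import urllib.request


def fetch_page(url):
    """Download a page and return it as text (UTF-8 is an assumption)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_fasta_record(page_html, gene_id):
    """Return (header, sequence) for the record naming gene_id, or None."""
    # Turn every tag into a line break first, then decode entities such as
    # &gt; so that the FASTA '>' marker is not mistaken for part of a tag.
    text = html.unescape(re.sub(r"<[^>]+>", "\n", page_html))
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    marker = "|%s|" % gene_id
    for i, line in enumerate(lines):
        if marker in line:
            header = line.lstrip(">").strip()
            sequence_parts = []
            # Collect the uppercase-only lines that directly follow the header.
            for following in lines[i + 1:]:
                if not re.fullmatch(r"[A-Z]+", following):
                    break
                sequence_parts.append(following)
            return header, "".join(sequence_parts)
    return None


# Hypothetical markup standing in for the real page (sequence shortened).
SAMPLE_HTML = (
    "<html><body><table><tr><td><pre>"
    "&gt;M. tuberculosis H37Rv|Rv0470c|pcaA<br>"
    "MSVQLTPHFG<br>NVQAHYDLSD"
    "</pre></td></tr></table></body></html>"
)

if __name__ == "__main__":
    print(extract_fasta_record(SAMPLE_HTML, "Rv0470c"))
```

Against the live site, `extract_fasta_record(fetch_page(url), "Rv0470c")` would be the corresponding call, provided the page really has this shape.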

Answer 2

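I'm not quite sure what your overall requirement is, but here is my own take on your problem (it is actually quite similar to yours); it retrieves the part of the HTML you asked for. Maybe you can get some ideas from it. (Adjust for Python 2.)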

import requests
from bs4 import BeautifulSoup

url = 'http://tuberculist.epfl.ch/quicksearch.php?gene+name=Rv0470c'
r = requests.get(url)
html = r.content
soup = BeautifulSoup(html, "lxml")
for n in soup.find_all('tr'):
    if "|Rv0470c|" in n.text:
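        # str.replace returns a new string, so the result must be assigned
        # (a bare nt.replace(...) inside a while loop would never terminate)
        nt = n.text.replace('\n', '\t')
        nt = nt.split('\t')
        # keep the first field that holds the FASTA-style header
        nt = [x for x in nt if "|Rv0470c|" in x][0].strip()
        print(nt.lstrip('>'))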
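
The loop above keeps only the header field, while the question also asks for the sequence. A minimal sketch of how the row text could be split into both parts follows. It assumes the header is followed by one or more lines made only of uppercase residue letters. `split_fasta_row` and `SAMPLE_ROW_TEXT` are hypothetical names, and the sample text is invented for illustration.

```python
def split_fasta_row(row_text, gene_id):
    """Split the text of one table row into (header, sequence)."""
    # Treat tabs like line breaks, then drop empty fragments.
    parts = [p.strip() for p in row_text.replace("\t", "\n").split("\n")]
    parts = [p for p in parts if p]
    marker = "|%s|" % gene_id
    for i, part in enumerate(parts):
        if marker in part:
            header = part.lstrip(">").strip()
            # Every later fragment made only of uppercase letters is
            # treated as a piece of the sequence.
            residues = [p for p in parts[i + 1:] if p.isalpha() and p.isupper()]
            return header, "".join(residues)
    raise ValueError("no FASTA header for %s in this row" % gene_id)


# Invented row text; the real n.text may be laid out differently.
SAMPLE_ROW_TEXT = "\n>M. tuberculosis H37Rv|Rv0470c|pcaA\nMSVQLTPHFG\nNVQAHYDLSD\n"

if __name__ == "__main__":
    header, sequence = split_fasta_row(SAMPLE_ROW_TEXT, "Rv0470c")
    print(header)
    print(sequence)
```

Inside the loop, `split_fasta_row(n.text, "Rv0470c")` could stand in for the split-and-filter lines if the row text has this layout.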

Extracting text from an HTML file in Python
