Question

The site I am trying to extract data from is: http://www.genome.jp/dbget-bin/www_bget?ecs:ECs0037

I am trying to extract the "nt sequence":

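try:
    # .text returns the rendered text of the matched table cell
    geneSeq = browser.find_element_by_xpath("html/body/div[1]/table/tbody/tr/td/table[2]/tbody/tr/td[1]/form/table/tbody/tr/td/table/tbody/tr[11]/td").text
except Exception:
    geneSeq = "file\nnot found"
# Drop the first line (everything up to and including the first newline)
geneSeq = geneSeq[geneSeq.find('\n') + 1:]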

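I delete the first line of the input because I don't need it. However, the page source contains br tags, which show up as line breaks in the output file even though Python doesn't seem to see them. I tried .isspace(), which returns False, so .rsplit() doesn't work either. Unfortunately, when I write the sequence to a file with f.write, the line breaks are still there.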

Is there a way to remove the br tags?

Answer 1

Assuming your HTML string is named html, do the following:

html = html.replace('<br>', '')
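
Note that the replace call above only matches the exact string `<br>`. The following is a minimal sketch, assuming you are working with the raw HTML markup rather than with the rendered `.text`, that also covers `<br/>`, `<br />` and upper-case variants. The helper name `strip_br` and the sample string are made up for illustration:

```python
import re

# Matches <br>, <br/>, <br /> and their upper-case variants.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def strip_br(html):
    """Return html with every <br> variant removed (hypothetical helper)."""
    return BR_TAG.sub("", html)


# Made-up sample input, not taken from the real page.
cleaned = strip_br("ATG<br>CCA<br/>TTG<BR />A")
print(cleaned)
```

If the string comes from an element's `.text`, it probably contains no tags at all, only the line breaks the browser rendered for them, so this substitution would have nothing to match.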

Answer 2

This reads the entire HTML content of a page into the_page in Python (urllib2 is available in Python 2 only):

import urllib2

req = urllib2.Request('https://www.google.com')
response = urllib2.urlopen(req)
the_page = response.read()
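
For Python 3, where `urllib2` no longer exists, a rough equivalent might look like the sketch below. The function name `fetch_html` is hypothetical, and falling back to UTF-8 when the server declares no charset is an assumption:

```python
from urllib.request import urlopen


def fetch_html(url):
    """Download url and return its body as a str (hypothetical helper)."""
    with urlopen(url) as response:
        # Assumption: fall back to UTF-8 if no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example call (needs network access):
# the_page = fetch_html("http://www.genome.jp/dbget-bin/www_bget?ecs:ECs0037")
```

The result is the raw markup, so any `<br>` tags are still present as literal text and can be removed with a string replacement.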

Answer 3

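Thanks for all the answers. Since Python didn't seem to treat the gaps as whitespace, I just wrote a loop that keeps only the alphabetic characters, which seems to work: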

noSpace = ""
for char in geneSeq:
    if char.isalpha():
        noSpace = noSpace + char
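
The same filtering can be written more compactly. This is a sketch, assuming the extracted text contains ordinary line breaks where the page had br tags; the sample value is made up. One possible reason `.isspace()` returned False is that it only returns True when every character in the string is whitespace.

```python
gene_seq = "atgaaacgc\nattagcacc \r\naccattacc"  # made-up sample value

# Keep only letters, as in the loop above.
letters_only = "".join(ch for ch in gene_seq if ch.isalpha())

# Alternative: split() with no arguments splits on any run of whitespace
# (spaces, \n, \r, \t), so joining the pieces removes all of it.
no_whitespace = "".join(gene_seq.split())

print(letters_only)
print(no_whitespace)
```

The `isalpha()` version also drops digits and punctuation, which is fine for a plain nucleotide sequence. The `split()` version removes whitespace only.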

Removing <br/> from an element being extracted
