Question

I'm trying to access the sequence on this web page:
https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta

The sequence is stored under <div class="seq gbff">. Each line of the sequence is stored in a span like this:

<span class='ff_line' id='gi_344258949_1'> *line 1 of sequence* </span>

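When I search for the spans containing the sequence, Beautiful Soup returns None. I hit the same problem when I try to look at the children or contents of the div that holds the spans.

Here is the code: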

import requests
import re
from bs4 import BeautifulSoup
# Create a variable with the url
url = 'https://www.ncbi.nlm.nih.gov/protein/EGW15053.1?report=fasta'

# Use requests to get the contents
r = requests.get(url)

# Get the text of the contents
html_content = r.text

# Convert the html content into a beautiful soup object
soup = BeautifulSoup(html_content, 'html.parser')


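# Attempt 1: inspect the children of the div that should hold the sequence.
# find_all() returns a list of matches, so loop over it; attrs must be a dict.
for div in soup.find_all('div', attrs={'class': 'seq gbff'}):
    for each in div.children:
        print(each)

# Attempt 2: search for the sequence spans directly
print(soup.find_all('span', attrs={'class': 'ff_line'}))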

Neither approach works. I'd really appreciate any help :D

Answer 1

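This page uses JavaScript to load the data. requests only downloads the initial HTML and does not run JavaScript, so the spans are never present in r.text.

Using DevTools in Chrome/Firefox I found this URL, which returns all the <span> elements: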

https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=344258949&db=protein&report=fasta&extrafeat=0&fmt_mask=0&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000

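Now the hard part. Different pages use different parameters in this URL, so you either have to find the URL (or its parameters) in the page's HTML, or compare a few URLs and work out the pattern so that you can generate the URL yourself.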
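One way to pull the numeric id out of the page's HTML might look like the sketch below. It is only a guess at the markup: it assumes the id (344258949 here) is exposed in a <meta> tag named "ncbi_uidlist", which is not confirmed anywhere in this thread, so check the page source first and change the name if needed. MetaCollector and find_record_id are hypothetical helper names, and the sample HTML is made up for the demonstration.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/content pairs from every <meta> tag in a document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the tag
        if tag == "meta":
            attr_map = dict(attrs)
            if "name" in attr_map and "content" in attr_map:
                self.meta[attr_map["name"]] = attr_map["content"]


def find_record_id(html, meta_name="ncbi_uidlist"):
    """Return the content of the named meta tag, or None if it is absent.

    ASSUMPTION: "ncbi_uidlist" is a guess about the page markup.
    """
    collector = MetaCollector()
    collector.feed(html)
    return collector.meta.get(meta_name)


# Hypothetical page fragment, used only to exercise the parser
sample = '<html><head><meta name="ncbi_uidlist" content="344258949"></head></html>'
print(find_record_id(sample))
```

With the real page you would pass r.text from the question's code instead of sample. If the function returns None, the tag name is wrong for that page and the id has to be located some other way.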

EDIT: if you change retmode=html to retmode=xml in the URL, you get the data as XML. With retmode=text you get it as plain text without HTML tags. retmode=json doesn't work.
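A minimal sketch of the retmode=text route, assuming the parameters from the URL above stay the same and only id and retmode change. build_viewer_url, parse_fasta and fetch_sequence are hypothetical helper names, and the parser assumes the response is a single FASTA record whose first line starts with ">". It uses the standard library for the download; requests.get(url).text would do the same job.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

VIEWER_URL = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"


def build_viewer_url(record_id, retmode="text"):
    """Rebuild the viewer URL seen in DevTools for a given numeric id."""
    params = {
        "id": record_id,
        "db": "protein",
        "report": "fasta",
        "extrafeat": 0,
        "fmt_mask": 0,
        "retmode": retmode,
        "withmarkup": "on",
        "tool": "portal",
        "log$": "seqview",  # urlencode escapes "$" as %24
        "maxdownloadsize": 1000000,
    }
    return VIEWER_URL + "?" + urlencode(params)


def parse_fasta(text):
    """Split single-record FASTA text into (header, sequence).

    ASSUMPTION: exactly one record, header on the first non-empty line.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header = lines[0].lstrip(">")
    sequence = "".join(lines[1:])
    return header, sequence


def fetch_sequence(record_id):
    """Download the record as plain text (network call) and parse it."""
    with urlopen(build_viewer_url(record_id)) as response:
        text = response.read().decode("utf-8")
    return parse_fasta(text)


# Offline check with a made-up record, so nothing is downloaded here
print(parse_fasta(">example header\nMKT\nAAG\n"))
```

Calling fetch_sequence(344258949) would then request the record for the page in the question. Since the text response has no markup, there is nothing left for BeautifulSoup to parse.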
