如何使用BeautifulSoup和Python获取表/跨度后的文本?

时间:2017-08-15 05:13:32

标签: python html web-scraping beautifulsoup

我需要获取下面屏幕截图中显示的文字2,585。我对编码很新,但这是我到目前为止所做的:

import urllib2
from bs4 import BeautifulSoup

url= 'insertURL'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
span = soup.find('span', id='d21475972e793-wk-Fact -8D34B98C76EF518C788A2177E5B18DB0')
print (span.text)

任何信息都有用!!谢谢。

Website HTML

1 个答案:

答案 0 :(得分:0)

3件事,你的使用请求不是urllib2。您选择带有命名空间的XML,因此您需要使用xml作为解析器。你想要的元素不是跨度,它是ix:nonFraction。以下是使用其他网页的工作示例(您只需将其指向您的页面并使用注释行)。

# Using requests no need for urllib2.
import requests
from bs4 import BeautifulSoup

# Using this page as an example.
url= 'https://www.sec.gov/Archives/edgar/data/27904/000002790417000004/0000027904-17-000004.txt'
r = requests.get(url)
data = r.text
# use xml as the parser.
soup = BeautifulSoup(data, 'xml')
ix = soup.find('ix:nonFraction', id="Fact-7365D69E1478B0A952B8159A2E39B9D8-wk-Fact-7365D69E1478B0A952B8159A2E39B9D8")
# Your original code for your page.
# ix = soup.find('ix:nonFraction',  id='d21475972e793-wk-Fact-8D34B98C76EF518C788A2177E5B18DB0')
print (ix.text)