Scraping text from an SVG with BeautifulSoup

Date: 2018-12-30 01:40:15

Tags: python html web-scraping

I am a beginner in Python, and I am trying to scrape the actual annual spend prices with BeautifulSoup. I am having trouble figuring out what I should use to extract the text from the SVG.

The code I have written so far:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'http://abacus.realendpoints.com/ConsoleTemplate.aspx?act=qlrd&req=nav&mop=abacus!main&pk=ed5a81ad-9367-41c8-aa6b-18a08199ddcf&ab-eff=1000&ab-tox=0.1&ab-nov=1&ab-rare=1&ab-pop=1&ab-dev=1&ab-prog=1.0&ab-need=1&ab-time=1543102810'

# Fetch the raw HTML and close the connection
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# Parse the static HTML
page_soup = soup(page_html, "html.parser")
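
Note that the chart on this page is drawn client-side by Highcharts (the answer below targets Highcharts elements), so the SVG text never appears in the static HTML that urlopen returns. A quick check against the soup built above illustrates this:

# No <tspan> price labels exist in the raw HTML; the SVG is injected by JavaScript
print(page_soup.select('svg text tspan'))  # -> []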

1 Answer:

Answer 0 (score: 1)

Monthly figures:

Using Selenium, you can move to each plot line to pick up the monthly information.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains

url = 'http://abacus.realendpoints.com/ConsoleTemplate.aspx?act=qlrd&req=nav&mop=abacus!main&pk=ed5a81ad-9367-41c8-aa6b-18a08199ddcf&ab-eff=1000&ab-tox=0.1&ab-nov=1&ab-rare=1&ab-pop=1&ab-dev=1&ab-prog=1.0&ab-need=1&ab-time=1543102810'
d = webdriver.Chrome()
d.get(url)

# Wait up to 10 seconds for the Highcharts plot-line paths to render
paths = WebDriverWait(d,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".highcharts-plot-lines-0 path")))
results = []
for path in paths:
    # Build a fresh chain each pass: ActionChains queues its actions, so
    # reusing one object would replay every earlier move on each perform()
    actions = ActionChains(d)
    actions.move_to_element(path).perform()
    actions.click_and_hold(path).perform()
    # Read the tooltip text rendered next to the hovered point
    items = d.find_elements_by_css_selector('#priceChart path + text tspan')
    result = [item.text for item in items]
    if result:
        results.append(result)

print(results)
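
One usage note, not part of the original answer: WebDriverWait raises TimeoutException if the plot-line paths never appear within the 10-second window, so a guarded sketch of that wait would be:

from selenium.common.exceptions import TimeoutException

try:
    paths = WebDriverWait(d, 10).until(
        EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, ".highcharts-plot-lines-0 path")))
except TimeoutException:
    # The chart never rendered; shut the browser down instead of crashing mid-scrape
    d.quit()
    raise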



For the annual figures:

Kinda ugly, but you can regex the information out of one of the script tags. These are the annual figures, not the monthly ones.

import requests
from bs4 import BeautifulSoup as bs
import re
import locale

res = requests.get('http://abacus.realendpoints.com/ConsoleTemplate.aspx?act=qlrd&req=nav&mop=abacus!main&pk=ed5a81ad-9367-41c8-aa6b-18a08199ddcf&ab-eff=1000&ab-tox=0.1&ab-nov=1&ab-rare=1&ab-pop=1&ab-dev=1&ab-prog=1.0&ab-need=1&ab-time=1543102810')
soup = bs(res.content, 'lxml')

# The chart data lives in the 20th script tag; this index is brittle
# and will break if the page layout changes
script = soup.select('script')[19]
items = str(script).split('series:')
item = items[2].split('exporting')[0][:-15]

# p1 captures the series names, p2 captures the numeric values
p1 = re.compile(r'name:(.*)]')
p2 = re.compile(r'(\d+\.\d+)+')
it = re.finditer(p1, item)
names = [match.group(1).split(',')[0].strip().replace("'", '') for match in it]
it2 = re.finditer(p2, item)
allNumbers = [float(match.group(1)) for match in it2]

# Actual and Abacus values alternate in the script text
actualAnnuals = allNumbers[0::2]
abacusAnnuals = allNumbers[1::2]
actuals = list(zip(names, actualAnnuals))
abacus = list(zip(names, abacusAnnuals))

# Examples:
print(actuals, abacus)

# 'English' is the Windows name for this locale; on Linux/macOS use e.g. 'en_US.UTF-8'
locale.setlocale(locale.LC_ALL, 'English')
print(locale.format('%.2f', sum(actualAnnuals), True))
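
The [0::2]/[1::2] slicing assumes the actual and Abacus values strictly alternate in the script text; a small sanity check (my addition, not in the original answer) guards that assumption before zipping:

# Every series name should pair with exactly one actual and one abacus value
assert len(allNumbers) == 2 * len(names), 'values do not alternate as expected'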

Using Selenium, you can easily grab the headline annual figure with a CSS type selector:

from selenium import webdriver

d = webdriver.Chrome()
d.get('http://abacus.realendpoints.com/ConsoleTemplate.aspx?act=qlrd&req=nav&mop=abacus!main&pk=ed5a81ad-9367-41c8-aa6b-18a08199ddcf&ab-eff=1000&ab-tox=0.1&ab-nov=1&ab-rare=1&ab-pop=1&ab-dev=1&ab-prog=1.0&ab-need=1&ab-time=1543102810')
# The first tspan on the page holds the headline annual figure
print(d.find_element_by_css_selector('tspan').text)

The annual Abacus figure, price sheet and scenario:

print(d.find_elements_by_css_selector('tspan')[3].text,
      d.find_element_by_css_selector('#Options_price_sheet_id [selected]').text,
      d.find_element_by_css_selector('#Options_scenario_id [selected]').text)
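
Whichever Selenium variant you run, release the browser once the values have been read:

d.quit()  # end the WebDriver session and close Chrome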