How do I extract data from a longer string?

Time: 2018-07-03 11:34:54

Tags: python selenium beautifulsoup

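Second question of the day. This is the code I have written so far. I am trying to extract the Settl.Prices and Vol.Exchange columns from this table: https://www.eex.com/en/market-data/power/futures/phelix-at-futures#!/2018/7/3 The resulting rows are a mess. I tried to clean them up with re.sub, but I could not find a way to keep the digits, commas and dots without also losing the positions and the decimal separators. Any ideas on how to store the two columns in two lists?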

from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.common.by import By
import datetime
import time
from selenium.webdriver.support import expected_conditions as EC 
from selenium.webdriver.support.ui import WebDriverWait 

# Use a timedelta so the previous day is also correct on the first of a month
yesterday = datetime.date.today() - datetime.timedelta(days=1)
browser = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
my_url = 'https://www.eex.com/en/market-data/power/futures/phelix-at-futures#!/'+str(yesterday.year)+'/'+str(yesterday.month)+'/'+str(yesterday.day)
browser.get(my_url)
# click() returns None, so its result is not assigned to a variable
WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "ul.tabs.filter_wrap.clearfix li.ng-scope:nth-child(3)>a"))).click()

page_html = browser.page_source
page_soup = soup(page_html, "html.parser")
browser.close()
time.sleep(5)
table = page_soup.find('table')
table_rows = table.findAll('tr')

for tr in table_rows:
    # Collect the raw text of every cell in this row
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

Actual output

['\n              Cal-19\n            ', '\n              -\n            ', '\n              -\n            ', '\n              -\n            ', '\n              -\n            ', '\n              0.51\n            ', '\n              -\n            ', '\n              -\n            ', '\n              46.15\n            ', '\n              -\n            ', '\n              -\n            ', '\n              12\n            ', '', '\n\n']
['\n\nloading...\n\nan error occurred while loading the chart...\nPlease reload the chart.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nInvalid Date Format: Please use the format YYYY-MM-DD.\n\n\n\n\nx\n\n\n\n\nIntraday Prices\nSettlement Prices\n\n\n\n\n\n\nall series\n\n\n\n\n\n\n\n\n\n\n\n\n\n']
['\n              Cal-20\n            ', '\n              -\n            ', '\n              -\n            ', '\n              -\n            ', '\n              -\n            ', '\n              0.54\n            ', '\n              -\n            ', '\n              -\n            ', '\n              44.62\n            ', '\n              -\n            ', '\n              -\n            ', '\n              1\n            ', '', '\n\n']
['\n\nloading...\n\nan error occurred while loading the chart...\nPlease reload the chart.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nInvalid Date Format: Please use the format YYYY-MM-DD.\n\n\n\n\nx\n\n\n\n\nIntraday Prices\nSettlement Prices\n\n\n\n\n\n\nall series\n\n\n\n\n\n\n\n\n\n\n\n\n\n']
['\n              Cal-21\n            ', '\n              -\n            ', '\n              -\n            ', '\n              -\n            ', '\n              -\n            ', '\n              0.65\n            ', '\n              -\n            ', '\n              -\n            ', '\n              43.70\n            ', '\n              -\n            ', '\n              -\n            ', '\n              -\n            ', '', '\n\n']
['\n\nloading...\n\nan error occurred while loading the chart...\nPlease reload the chart.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nInvalid Date Format: Please use the format YYYY-MM-DD.\n\n\n\n\nx\n\n\n\n\nIntraday Prices\nSettlement Prices\n\n\n\n\n\n\nall series\n\n\n\n\n\n\n\n\n\n\n\n\n\n']
['\n              Cal-22\n            ', '\n              -\n            ', '\n              -\n            ', '\n              -\n            ', '\n              -\n            ', '\n              0.55\n            ', '\n              -\n            ', '\n              -\n            ', '\n              45.08\n            ', '\n              -\n            ', '\n              -\n            ', '\n              -\n            ', '', '\n\n']
['\n\nloading...\n\nan error occurred while loading the chart...\nPlease reload the chart.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nInvalid Date Format: Please use the format YYYY-MM-DD.\n\n\n\n\nx\n\n\n\n\nIntraday Prices\nSettlement Prices\n\n\n\n\n\n\nall series\n\n\n\n\n\n\n\n\n\n\n\n\n\n']
['\n              Cal-23\n            ', '\n              -\n            ', '\n              -\n            ', '\n              -\n            ', '\n              -\n            ', '\n              0.55\n            ', '\n              -\n            ', '\n              -\n            ', '\n              45.85\n            ', '\n              -\n            ', '\n              -\n            ', '\n              -\n            ', '', '\n\n']
['\n\nloading...\n\nan error occurred while loading the chart...\nPlease reload the chart.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nInvalid Date Format: Please use the format YYYY-MM-DD.\n\n\n\n\nx\n\n\n\n\nIntraday Prices\nSettlement Prices\n\n\n\n\n\n\nall series\n\n\n\n\n\n\n\n\n\n\n\n\n\n']
['\n              Cal-24\n            ', '\n              -\n            ', '\n              -\n            ', '\n              -\n            ', '\n              -\n            ', '\n              0.53\n            ', '\n              -\n            ', '\n              -\n            ', '\n              46.83\n            ', '\n              -\n            ', '\n              -\n            ', '\n              -\n            ', '', '\n\n']
['\n\nloading...\n\nan error occurred while loading the chart...\nPlease reload the chart.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nInvalid Date Format: Please use the format YYYY-MM-DD.\n\n\n\n\nx\n\n\n\n\nIntraday Prices\nSettlement Prices\n\n\n\n\n\n\nall series\n\n\n\n\n\n\n\n\n\n\n\n\n\n']

Desired output

46.15,- (the one from the column adjacent)
44.62,-
43.70,-              
45.08,-  
45.85,-
46.83,-

1 answer:

Answer 0 (score: 2)

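Rather than loading the whole page and pulling values out of the HTML table, it would be much easier to call the API directly with the right parameters.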

The API is as follows:

https://www.eex.com/data//view/data/detail/ws-power-futures-austrian-v1/{year}/{month}.{day}.json

Example:

https://www.eex.com/data//view/data/detail/ws-power-futures-austrian-v1/2018/06.07.json

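It returns JSON, which you can then manipulate as needed; for instance, you can use pandas to build a DataFrame from the values you want. This seems simpler than scraping the page directly, and you won't run into the problems with your values.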
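As a rough sketch of the URL pattern above, the address for a given date could be built like this. The helper names build_url and fetch_json are made up for illustration, and the zero-padding of month and day is an assumption based on the example URL (06.07).

```python
import datetime
import json
from urllib.request import urlopen

# Base path copied from the URL pattern above
BASE = "https://www.eex.com/data//view/data/detail/ws-power-futures-austrian-v1"


def build_url(day):
    """Return the JSON URL for a datetime.date, following {year}/{month}.{day}.json."""
    # Assumption: month and day are zero-padded, as in the example 2018/06.07.json
    return f"{BASE}/{day.year}/{day.month:02d}.{day.day:02d}.json"


def fetch_json(day):
    """Download and decode the JSON for the given day (performs a network request)."""
    with urlopen(build_url(day)) as response:
        return json.loads(response.read())


print(build_url(datetime.date(2018, 6, 7)))
```

The print call reproduces the example URL for June 7, 2018; fetch_json is only defined here and would need network access to run.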

Here are some links that can help you read the JSON:

Parsing JSON: Parsing values from a JSON file?

JSON to pandas DF: JSON to pandas DataFrame

Update:

I wrote a piece of code that should help you get the idea:

from urllib.request import Request, urlopen
import json

request=Request('https://www.eex.com/data//view/data/detail/ws-power-futures-austrian-v1/2018/06.07.json')
response = urlopen(request)
data = response.read()
d = json.loads(data)


# this first obj corresponds to : P-Power-F-AT-Peak-Quarter
first_obj = d["data"][0]["rows"]

values = []

for row in first_obj:
    # Only rows that carry a settlement price are collected
    if 'settlementPrice' in row["data"]:
        sp = row["data"]["settlementPrice"]
        values.append(sp)

print(values)

The fetched JSON looks like this:

    {
       "data": [
            {
               "identifier": "P-Power-F-AT-Peak-Quarter",
               "rows": [
                    {
                     "data" : {'param1': value, 'param2': value, ...},
                     "contractIdentifier": value,
                    },
                    {
                     "data" : {'param1': value, 'param2': value, ...},
                     "contractIdentifier": value,
                    },
                    ...
               ]
             },
             {
               "identifier": "P-Power-F-AT-Peak-Month",
               "rows": [
                    {
                     "data" : {'param1': value, 'param2': value, ...},
                     "contractIdentifier": value,
                    },
                    {
                     "data" : {'param1': value, 'param2': value, ...},
                     "contractIdentifier": value,
                    },
                    ...
               ]
             },
             {
               "identifier": "P-Power-F-AT-Base-Year",
               "rows": [
                    {
                     "data" : {'param1': value, 'param2': value, ...},
                     "contractIdentifier": value,
                    },
                    {
                     "data" : {'param1': value, 'param2': value, ...},
                     "contractIdentifier": value,
                    },
                    ...
               ]
             },
             ...

The result I printed is:

[53.36, 63.86, 62.63, 46.83, 47.44, 59.28, 58.7]

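So, basically, what you need to do is load the JSON, parse it, and pick out the object you want data from. In the code sample I gave you, I took the first object, at index 0, which corresponds to the identifier "P-Power-F-AT-Peak-Quarter" for June 7, 2018 (the date is the parameter in the URL string). You can choose which object to take by iterating over d['data'] and stopping at the identifier value you want values from.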
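Selecting by identifier instead of by index, and storing two columns in two lists, might look roughly like the sketch below. The helpers rows_for and column are hypothetical, the sample dict only imitates the JSON shape with placeholder values, and "volumeExchange" is an assumed key name that should be checked against the real JSON.

```python
def rows_for(d, identifier):
    """Return the "rows" list of the first object whose "identifier" matches."""
    for obj in d["data"]:
        if obj.get("identifier") == identifier:
            return obj["rows"]
    raise KeyError(identifier)


def column(rows, key):
    """Collect one parameter from every row; None where a row lacks it."""
    return [row["data"].get(key) for row in rows]


# Placeholder data in the same shape as the JSON; the values are illustrative only
sample = {
    "data": [
        {"identifier": "P-Power-F-AT-Peak-Quarter", "rows": []},
        {
            "identifier": "P-Power-F-AT-Base-Year",
            "rows": [
                # "volumeExchange" is a hypothetical key name
                {"data": {"settlementPrice": 46.15, "volumeExchange": 12}},
                {"data": {"settlementPrice": 44.62}},
            ],
        },
    ]
}

rows = rows_for(sample, "P-Power-F-AT-Base-Year")
prices = column(rows, "settlementPrice")
volumes = column(rows, "volumeExchange")
print(prices)
print(volumes)
```

Using None for a missing value keeps both lists the same length, so the entries at the same position still belong to the same contract.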

If you want to know what the parameter names are, just open the URL in your browser, or download the JSON file and open it in your favorite editor.
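The parameter names can also be listed from Python. This is a minimal sketch: parameter_names is a made-up helper, and the sample rows use placeholder keys and values rather than real data from the API.

```python
def parameter_names(rows):
    """Return the sorted set of keys found in the "data" dict of any row."""
    names = set()
    for row in rows:
        names.update(row["data"].keys())
    return sorted(names)


# Placeholder rows in the same shape as one object's "rows" list
sample_rows = [
    {"data": {"settlementPrice": 46.15, "volumeExchange": 12}},  # hypothetical keys
    {"data": {"settlementPrice": 44.62}},
]

print(parameter_names(sample_rows))
```

With the real JSON, passing d["data"][0]["rows"] would show which keys are available for that object.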

Hope this helps.