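Scraping text between non-descriptive tags

Time: 2014-08-05 21:28:39

Tags: python html web-scraping beautifulsoup

In some cases, the text I want sits between tags whose names and attributes are generic and recur many times throughout the file (for example, '' is reused).

Ultimately, I want to extract "Prev Close:" and "565.07" and put that information into a string or a list (please advise which).


Relevant part of the HTML source: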

<div class="yui-u first yfi-start-content"><div class="yfi_quote_summary"><div id="yfi_quote_summary_data" class="rtq_table"><table id="table1"><tr><th scope="row" width="48%">Prev Close:</th><td class="yfnc_tabledata1">565.07</td></tr>

My code (Python 3.4.1):

soup = BeautifulSoup(data) # data contains the HTML source

FirstTable_tag = soup.find('div', attrs={'class': '"yui-u first yfi-start-content"'})
# Should the keys (attributes) in the "findNextSibling parameters below be filled in or left empty???
next_FirstTable_tag = FirstTable_tag.findNextSibling('div', attrs={'class': '"yfi_quote_summary"'})     
next_next_FirstTable_tag = next_FirstTable_tag.findNextSibling('div', attrs={'id': '"yfi_quote_sumary_data"', 'class': '"rtq_table"'})
next_next_next_FirstTable_tag = next_next_FirstTable_tag.findNextSibling('table', attrs={'id': '"table1"'})
data = next_next_next_FirstTable_tag.get_text()

SelectSoup = BeautifulSoup(data)
print("SelectSoup:" + SelectSoup + "(should be:  Prev Close)")

Error

Traceback (most recent call last):
    next_FirstTable_tag = FirstTable_tag.findNextSibling          
AttributeError: 'NoneType' object has no attribute 'findNextSibling'
<<< Process finished. (Exit code 1)

Edit

Here is the initial and full source as requested

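Although I have since switched to the Yahoo API, which is clearly the better approach, I am still trying to get the scraping version working out of curiosity, with help from @scandinavian_.

I updated the code above, but I still get the same error.


Edit 2

From here on, this post will focus on the solution that @scandinavian_ is helping to develop: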

import sys
import urllib.request
url = "http://finance.yahoo.com/q?s=GOOG"
urlRunner = urllib.request.urlopen(url)
data = urlRunner.read()

from bs4 import BeautifulSoup
soup = BeautifulSoup(data)

import re
tables = soup.findAll("table", id = re.compile('^table'))
result = {}
for table in tables:
    for th, td in zip(table.findAll("th"), table.findAll("td")):
        result[th.text] = td.text
print(result)

Result:

{'52wk Range:': '502.80 - 604.83', 'Market Cap:': '381.04B', 'Next Earnings Date:': 'N/A', 'P/E (ttm):': '29.52', 'Avg Vol (3m):': '1,701,610', 'EPS (ttm):': '19.09', '1y Target Est:': 'N/A', 'Volume:': '561,384', 'Ask:': '563.98 x 100', 'Div & Yield:': 'N/A (N/A) ', 'Bid:': '563.56 x 100', 'Beta:': '1.144', 'Open:': '568.00', "Day's Range:": '562.53 - 569.77', 'Prev Close:': '566.37'}

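2 answers:

Answer 0 (score: 1)

This is a guess on my part; without a proper data sample it is hard to say for sure, since I cannot tell how the page is structured. From your description it sounds like the data is irregular, which is not visible in your sample.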

from bs4 import BeautifulSoup

html = """<div class="yui-u first yfi-start-content">
    <div class="yfi_quote_summary">
        <div id="yfi_quote_summary_data" class="rtq_table">
            <table id="table1">
                <tr>
                    <th scope="row" width="48%">Target Point:</th>
                    <td class="yfnc_tabledata1">200.22</td>
                </tr>
                <tr>
                    <th scope="row" width="48%">Target Point:</th>
                    <td class="yfnc_tabledata1">200.22</td>
                </tr>
                <tr>
                    <th scope="row" width="48%">Target Point:</th>
                    <td class="yfnc_tabledata1">200.22</td>
                </tr>
            </table>
        </div>
    </div>
</div>"""

# Name the parser explicitly so bs4 does not have to guess one.
bs = BeautifulSoup(html, "html.parser")

# Collect every header cell and every data cell in document order.
ths = bs.findAll("th")
tds = bs.findAll("td")

# Pair each header with the data cell at the same position.
# The built-in zip replaces itertools.izip, which does not exist in Python 3.
result = []
for x, y in zip(ths, tds):
    result.append((x.text, y.text))

print(result)
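
As for the AttributeError in the question, it appears to come from two things: the attribute values passed to find contain literal double quotes, so nothing matches and find returns None, and the target elements are nested inside one another rather than being siblings. The following is a minimal sketch, assuming Python 3 with bs4 installed and the HTML fragment from the question (closing tags added so it stands alone); the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Fragment from the question, with closing tags added (assumption).
html = (
    '<div class="yui-u first yfi-start-content"><div class="yfi_quote_summary">'
    '<div id="yfi_quote_summary_data" class="rtq_table"><table id="table1"><tr>'
    '<th scope="row" width="48%">Prev Close:</th>'
    '<td class="yfnc_tabledata1">565.07</td></tr></table></div></div></div>'
)

soup = BeautifulSoup(html, "html.parser")

# The extra double quotes are part of the search value, so no class matches
# and find() returns None, which is what triggers the AttributeError later.
missing = soup.find("div", attrs={"class": '"yfi_quote_summary"'})

# The divs are nested, so descend with find() instead of findNextSibling().
outer = soup.find("div", class_="yfi_quote_summary")
table = outer.find("table", id="table1")
label = table.find("th").get_text()
value = table.find("td").get_text()

print(missing, label, value)
```

Checking each find result for None before chaining the next call would make a failed lookup easier to spot.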

Edit:

For a Yahoo API solution, consider using this code:

import requests

URL = "https://query.yahooapis.com/v1/public/yql"

query = 'select * from yahoo.finance.quotes where symbol in ("GOOG")'

params = {
    "q": query,
    "format": "json",
    "env": "store://datatables.org/alltableswithkeys"
}

data = requests.get(URL, params=params).json()

print(data['query']['results']['quote']['PreviousClose'])
print(data['query']['results']['quote']['Open'])

This prints:

565.07
561.78
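
To pull several fields at once without repeating the long key path, a small helper can walk the nested structure once. This is only a sketch: the sample payload below is a hypothetical, trimmed stand-in for the JSON response (its shape follows the key path used above, and the values are assumed to arrive as strings), and pick_fields is an illustrative name, not part of any API.

```python
# Hypothetical, trimmed stand-in for the decoded JSON response.
sample = {
    "query": {
        "results": {
            "quote": {"PreviousClose": "565.07", "Open": "561.78"}
        }
    }
}

def pick_fields(payload, names):
    """Return {name: value} for the requested quote fields (None if absent)."""
    # "or {}" guards against a missing or null level in the response.
    results = (payload.get("query") or {}).get("results") or {}
    quote = results.get("quote") or {}
    return {name: quote.get(name) for name in names}

picked = pick_fields(sample, ["PreviousClose", "Open", "Volume"])
print(picked)
```

Fields that are not present come back as None instead of raising a KeyError, which may be convenient when a symbol lacks some of the data.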

These are the data fields available for a stock:

AfterHoursChangeRealtime
AnnualizedGain
Ask
AskRealtime
AverageDailyVolume
Bid
BidRealtime
BookValue
Change
Change_PercentChange
ChangeFromFiftydayMovingAverage
ChangeFromTwoHundreddayMovingAverage
ChangeFromYearHigh
ChangeFromYearLow
ChangeinPercent
ChangePercentRealtime
ChangeRealtime
Commission
Currency
DaysHigh
DaysLow
DaysRange
DaysRangeRealtime
DaysValueChange
DaysValueChangeRealtime
DividendPayDate
DividendShare
DividendYield
EarningsShare
EBITDA
EPSEstimateCurrentYear
EPSEstimateNextQuarter
EPSEstimateNextYear
ErrorIndicationreturnedforsymbolchangedinvalid
ExDividendDate
FiftydayMovingAverage
HighLimit
HoldingsGain
HoldingsGainPercent
HoldingsGainPercentRealtime
HoldingsGainRealtime
HoldingsValue
HoldingsValueRealtime
LastTradeDate
LastTradePriceOnly
LastTradeRealtimeWithTime
LastTradeTime
LastTradeWithTime
LowLimit
MarketCapitalization
MarketCapRealtime
MoreInfo
Name
Notes
OneyrTargetPrice
Open
OrderBookRealtime
PEGRatio
PERatio
PERatioRealtime
PercebtChangeFromYearHigh
PercentChange
PercentChangeFromFiftydayMovingAverage
PercentChangeFromTwoHundreddayMovingAverage
PercentChangeFromYearLow
PreviousClose
PriceBook
PriceEPSEstimateCurrentYear
PriceEPSEstimateNextYear
PricePaid
PriceSales
SharesOwned
ShortRatio
StockExchange
symbol
Symbol
TickerTrend
TradeDate
TwoHundreddayMovingAverage
Volume
YearHigh
YearLow
YearRange

Answer 1 (score: 1)

from bs4 import BeautifulSoup
import re
import urllib.request  # Python 3 replacement for urllib2

url = "http://finance.yahoo.com/q?s=GOOG"
html = urllib.request.urlopen(url).read()
bs = BeautifulSoup(html, "html.parser")

# Find the two tables whose IDs start with "table".
tables = bs.findAll("table", id=re.compile('^table'))

result = {}

# Iterate over the tables.
for table in tables:
    # Walk the th and td cells in parallel, in document order.
    for th, td in zip(table.findAll("th"), table.findAll("td")):
        result[th.text] = td.text

print(result)
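
If the rows turn out to be irregular, zipping all th cells against all td cells can shift the pairs: a row with no td would take its value from the next row. A possible variation, sketched here against a made-up three-row table (the missing td in the middle row is an assumption for illustration, not something seen on the real page), pairs the cells row by row instead.

```python
import re
from bs4 import BeautifulSoup

# Made-up table for illustration; the middle row deliberately has no <td>.
html = """<table id="table1">
<tr><th>Prev Close:</th><td>565.07</td></tr>
<tr><th>Open:</th></tr>
<tr><th>Bid:</th><td>563.56 x 100</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for table in soup.find_all("table", id=re.compile("^table")):
    for row in table.find_all("tr"):
        th = row.find("th")
        td = row.find("td")
        # Keep only rows that have both a label and a value.
        if th is not None and td is not None:
            pairs.append((th.get_text(strip=True), td.get_text(strip=True)))

print(pairs)
```

Here the incomplete "Open:" row is skipped, so "Bid:" keeps its own value.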
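  • 1) What determines the order of the results? In Python 3.4 (and in Python 2), plain dictionaries do not preserve insertion order, so the entries come out in arbitrary order. If you need them ordered, use an OrderedDict or a list of tuples. On the page, the data runs down the left column from top to bottom, then down the right column from top to bottom.

  • 2) I believe the data is currently in a dictionary? If I want to reuse this data later and feed certain data points into different functions, how should I do that? Also, how can I reorganize the names and values and display them in a more readable way (for example, a multi-line list in which each line starts with the description, followed by a space and a dash, and then the value)? Once I have reorganized the results, should they be stored in a tuple or in something else?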

-

for key, val in result.items():
    print(key + " - " + val)
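
For a tidier display, the labels can be padded to a common width so the dashes line up. This is a small sketch using three entries copied from the result shown in the question; format_rows is a hypothetical helper name.

```python
# Three entries copied from the result printed in the question.
result = {
    "Prev Close:": "566.37",
    "Open:": "568.00",
    "Day's Range:": "562.53 - 569.77",
}

def format_rows(data):
    """Return one 'label - value' string per entry, with labels left-aligned."""
    width = max(len(key) for key in data)
    return ["{0:<{1}} - {2}".format(key, width, val) for key, val in data.items()]

rows = format_rows(result)
for row in rows:
    print(row)
```

Returning the lines as a list rather than printing inside the helper keeps them reusable, for instance for writing to a file.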

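As for ordering, we are getting into basic programming questions, which you will understand better by reading up on the different containers in Python.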
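
As a starting point, here is a sketch of the two ordered containers mentioned above: a list of tuples and an OrderedDict built from it. The pairs are written out by hand from values in the question; in the scraper they would come from the zip over the th and td cells.

```python
from collections import OrderedDict

# Hand-written pairs for illustration, in the order they appear on the page.
pairs = [
    ("Prev Close:", "566.37"),
    ("Open:", "568.00"),
    ("Bid:", "563.56 x 100"),
]

# An OrderedDict remembers insertion order and still allows lookup by key.
ordered = OrderedDict(pairs)
keys = list(ordered)

# Single values can be pulled out and converted for use in other functions.
prev_close = float(ordered["Prev Close:"])

print(keys)
print(prev_close)
```

The list of tuples is enough when the data is only iterated in order; the OrderedDict is more convenient when individual values also need to be looked up by label.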
