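This is a very basic question about HTML parsing:
I'm new to Python (and to coding, computer science, etc.) and am teaching myself to parse HTML. I have imported the pattern and Beautiful Soup modules for parsing. I found this code on the internet to strip out all the formatting.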
import requests
import json
import urllib
from lxml import etree
from pattern import web
from bs4 import BeautifulSoup
url = "http://webrates.truefx.com/rates/connect.html?f=html"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
This produces the following output:
EUR/USD14265522866931.056661.056751.056081.057911.05686USD/JPY1426552286419121.405121.409121.313121.448121.382GBP/USD14265522866821.482291.482361.481941.483471.48281EUR/GBP14265522865290.712790.712900.712300.713460.71273USD/CHF14265522866361.008041.008291.006551.008791.00682EUR/JPY1426552286635128.284128.296128.203128.401128.280EUR/CHF14265522866551.065121.065441.063491.066281.06418USD/CAD14265522864891.278211.278321.276831.278531.27746AUD/USD14265522864960.762610.762690.761150.764690.76412GBP/JPY1426552286682179.957179.976179.854180.077179.988
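From this point, how do I parse further if I only want the string 'USD/CHF' or a specific data point?
Is there an easier way to scrape and parse web pages? Any suggestions would be great!
System specs: Windows 7 64-bit; IDE: IDLE; Python: 2.7.5
Thanks in advance, everyone. Rusty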
Answer 0 (score: 2)
Keep it simple. Find the cell by its text (e.g. USD/CHF) and get its following siblings:
text = 'USD/CHF'
cell = soup.find('td', text=text)
for td in cell.next_siblings:
    print(td.text)
This prints:
1426561775912
1.00
768
1.00
782
1.00655
1.00879
1.00682
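Answer 1 (score: 1)
In my experience, Beautiful Soup is very easy to use. I would write a regular expression to strip out the numbers that follow a string of characters. I hope this gets you on the right track.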
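A minimal sketch of the regular-expression idea, run on the flattened text from the question. It assumes each pair label has the form XXX/YYY and is followed immediately by a 13-digit millisecond timestamp, as in the sample output; the names `sample` and `PAIR_RE` are illustrative only. The price fields after the timestamp run together without separators, so this sketch does not try to split them.

```python
import re

# First part of the flattened output shown in the question.
sample = (
    "EUR/USD14265522866931.056661.056751.056081.057911.05686"
    "USD/JPY1426552286419121.405121.409121.313121.448121.382"
    "GBP/USD14265522866821.482291.482361.481941.483471.48281"
)

# Assumption: a label such as "USD/JPY" followed directly by a 13-digit timestamp.
PAIR_RE = re.compile(r"([A-Z]{3}/[A-Z]{3})(\d{13})")

# findall returns (label, timestamp) tuples; dict() turns them into a lookup table.
timestamps = dict(PAIR_RE.findall(sample))
print(timestamps["USD/JPY"])
```

If the individual prices are needed, it is probably easier to keep the table cells separate (as in the sibling-based answer) than to guess field widths from the joined string.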
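Answer 2 (score: 1)
You could try something quick and dirty like this. Obviously, code like this has to change depending on the string itself. A more advanced approach would use Python's regular expression library. But sometimes it's good to keep things simple.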
digits = []
# +7 skips past the 7-character label "USD/CHF"
remainder = text[text.find("USD/CHF") + 7:]
for ch in remainder:
    # collect characters until the first one that is neither a digit nor a dot
    if ch.isdigit() or ch == ".":
        digits.append(ch)
    else:
        break
print("".join(digits))
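For comparison, the same scan can be written with Python's `re` module, the more advanced approach this answer mentions. This is only a sketch: `digits_after` is a hypothetical helper name, and `re.escape` is used so the label is matched literally. Like the loop, it returns the whole unbroken run of digits and dots, so on the flattened text from the question the timestamp and prices come back as one string.

```python
import re

def digits_after(text, label):
    """Return the run of digits and dots that immediately follows label, or None."""
    match = re.search(re.escape(label) + r"([\d.]*)", text)
    return match.group(1) if match else None

# A fragment of the flattened output from the question.
sample = (
    "EUR/GBP14265522865290.712790.712900.712300.713460.71273"
    "USD/CHF14265522866361.008041.008291.006551.008791.00682"
    "EUR/JPY1426552286635128.284"
)
print(digits_after(sample, "USD/CHF"))
```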