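How to parse further in Python after stripping style and script elements

Time: 2015-03-17 01:47:20

Tags: python html

This is a very basic question about HTML parsing:

I am new to Python (and to coding, computer science, etc.) and am teaching myself to parse HTML. I have imported the pattern and Beautiful Soup modules for parsing. I found this code on the internet to strip out all the formatting.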

import requests
import json
import urllib
from lxml import etree
from pattern import web
from bs4 import BeautifulSoup


url = "http://webrates.truefx.com/rates/connect.html?f=html"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)


print(text)

This produces the following output:

EUR/USD14265522866931.056661.056751.056081.057911.05686USD/JPY1426552286419121.405121.409121.313121.448121.382GBP/USD14265522866821.482291.482361.481941.483471.48281EUR/GBP14265522865290.712790.712900.712300.713460.71273USD/CHF14265522866361.008041.008291.006551.008791.00682EUR/JPY1426552286635128.284128.296128.203128.401128.280EUR/CHF14265522866551.065121.065441.063491.066281.06418USD/CAD14265522864891.278211.278321.276831.278531.27746AUD/USD14265522864960.762610.762690.761150.764690.76412GBP/JPY1426552286682179.957179.976179.854180.077179.988

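From this point, how do I parse further if I only want the string 'USD/CHF' or a specific data point?

Is there a simpler way to scrape and parse a web page? Any suggestions would be great!

System specs: Windows 7 64-bit; IDE: IDLE; Python: 2.7.5

Thanks in advance, everyone,
Rusty

3 answers: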

3 个答案:

Answer 0: (score: 2)

Keep it simple. Find the cell by its text (e.g. USD/CHF) and get its following siblings:

text = 'USD/CHF'
cell = soup.find('td', text=text)
for td in cell.next_siblings:
    print td.text

Prints:

1426561775912
1.00
768
1.00
782
1.00655
1.00879
1.00682
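
The sibling cells printed above seem to split each of the first two prices across two cells (1.00 and 768, then 1.00 and 782). As a minimal sketch, the pieces could be glued back together like this. The column layout (a millisecond timestamp, a two-cell bid, a two-cell offer, then three more prices) is an assumption read off that output, and `combine_quote` and its dictionary keys are hypothetical names used only for illustration.

```python
# Cell texts copied from the output printed by the sibling loop above.
cells = ["1426561775912", "1.00", "768", "1.00", "782",
         "1.00655", "1.00879", "1.00682"]


def combine_quote(cells):
    """Rebuild numeric values from the cell texts.

    Assumed layout (not confirmed by the page itself):
    timestamp, bid in two cells, offer in two cells, three more prices.
    """
    timestamp_ms = int(cells[0])
    bid = float(cells[1] + cells[2])      # "1.00" + "768" -> 1.00768
    offer = float(cells[3] + cells[4])    # "1.00" + "782" -> 1.00782
    rest = [float(c) for c in cells[5:]]  # remaining prices, meaning not assumed
    return {"timestamp_ms": timestamp_ms, "bid": bid,
            "offer": offer, "rest": rest}


quote = combine_quote(cells)
print(quote["bid"])
print(quote["offer"])
```

If the page ever changes its column order, the indices in `combine_quote` would need to change with it, so it is worth checking the raw cells before trusting the result.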

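Answer 1: (score: 1)

In my experience, Beautiful Soup is very easy to use. I would write a regular expression to pull out the numbers that follow a given string of characters. I hope this puts you on the right track.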
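
One way the regular-expression idea might look, as a rough sketch rather than a definitive solution: search for the label and capture the run of digits and dots that follows it. The sample `text` is copied from the question's output (shortened to two pairs), and `digits_after` is a hypothetical helper name.

```python
import re

# Sample copied from the question's output, shortened to two currency pairs.
text = ("USD/CHF14265522866361.008041.008291.006551.008791.00682"
        "EUR/JPY1426552286635128.284128.296128.203128.401128.280")


def digits_after(label, text):
    """Return the run of digits and dots right after label, or None."""
    # re.escape guards against regex metacharacters in the label.
    match = re.search(re.escape(label) + r"([\d.]+)", text)
    return match.group(1) if match else None


print(digits_after("USD/CHF", text))
# -> 14265522866361.008041.008291.006551.008791.00682
```

Note that the captured run is still one glued string: because the flattened text has no separators between cells, a regular expression alone cannot reliably tell where one number ends and the next begins. Working on the table cells before flattening avoids that ambiguity.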

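Answer 2: (score: 1)

You could try something quick and dirty like this. Obviously, code like this has to change depending on the string itself. A more advanced approach would use Python's regular expression library, but sometimes it is good to keep things simple.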

digits = []
start = text.find("USD/CHF")
if start != -1:  # find() returns -1 when the tag is absent
    # +7 skips past the 7-character tag "USD/CHF"
    for ch in text[start + 7:]:
        # collect characters until the first one that is not a digit or a dot
        if ch.isdigit() or ch == ".":
            digits.append(ch)
        else:
            break
print("".join(digits))