Python is filtering out currency symbols

Time: 2016-01-10 20:18:12

Tags: python beautifulsoup html-parsing

Goal: Write a screen scraper that loops through web pages containing old and new prices, reads the prices, and writes them to a CSV file.

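Approach: A config file, urls.txt, holds the list of pages. The script opens that file and loops over the URLs. For each URL, it uses Beautiful Soup to extract the contents of any div with the class "current-price" or "old-price". Not every page has an old price, so I have made that one optional.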

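Problem: This all works fine, with one odd exception. When the price is in dollars, both the price and the dollar sign come through. When the price is in euros or pounds, the currency symbols £ and € are stripped out. I want the currency symbol to come through in every case. I suspect this is an encoding issue. (The lstrip calls below remove some stray whitespace and tabs.)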

Contents of urls.txt:

http://uk.norton.com/norton-security-for-one-device
http://uk.norton.com/norton-security-antivirus
http://uk.norton.com/norton-security-with-backup
http://us.norton.com/norton-security-for-one-device
http://us.norton.com/norton-security-antivirus
http://us.norton.com/norton-security-with-backup
http://ie.norton.com/norton-security-for-one-device
http://ie.norton.com/norton-security-antivirus
http://ie.norton.com/norton-security-with-backup

Python code:

###############################################
#
# PRICESCRAPE
# Screenscraper that checks prices on PD pages
#
###############################################

# Import the modules we need
import urllib.request
import re
import lxml
from lxml import etree
from lxml.html.soupparser import fromstring
from lxml.etree import tostring
from lxml.cssselect import CSSSelector
from bs4 import BeautifulSoup, NavigableString

# Open the files we need
out = open('out.csv', 'w')
urls=open('urls.txt','r')

# function to take a URL, open the HTML, and return it
def getPage(url):
    return urllib.request.urlopen(url).read().decode(encoding='UTF-8',errors='strict').encode('ascii','ignore')

out.write('URL,Current Price,Strikethrough Price\n')



#Loop through the URLs
for url in urls:
    print('\nExamining ' + url) 
    url=url.rstrip('\n')
    html=getPage(url)
    soup = BeautifulSoup(html,'lxml')
    currentPrice = soup.find('div', {'class': 'current-price'}).contents[0].lstrip('\n').lstrip(' ').lstrip('\t')
    oldPrice = soup.find('div', {'class': 'old-price'}).contents[0].lstrip(' ')

    out.write(url)
    out.write(',')
    out.write(str(currentPrice))
    out.write(',')
    if oldPrice:
        out.write(str(oldPrice))
    else:
        out.write('No strikethrough price')
    out.write('\n')

    if html =='':
        print('Problem reading page')

print('Done. See out.csv for output')

out.close()
urls.close()

1 answer:

Answer 0 (score: 2)

I would use two modules to make this work and to simplify the code:

  • csv, to export the results to the CSV output file
  • requests, to make the encoding part transparent to you
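
For the csv part, a minimal sketch could look like the following. The function name write_prices, the sample rows, and the example.com URLs are made up for illustration; only the column headers and the "No strikethrough price" fallback come from the question. It assumes the prices have already been extracted as strings, with None standing in for a page that has no old price.

```python
import csv
import io

HEADER = ["URL", "Current Price", "Strikethrough Price"]


def write_prices(rows, fileobj):
    """Write (url, current_price, old_price) tuples as CSV rows.

    old_price may be None when a page has no strikethrough price.
    """
    writer = csv.writer(fileobj)
    writer.writerow(HEADER)
    for url, current_price, old_price in rows:
        writer.writerow([
            url,
            # strip() removes leading and trailing spaces, tabs and newlines
            current_price.strip(),
            old_price.strip() if old_price else "No strikethrough price",
        ])


# Made-up sample rows, not real scraped prices
sample_rows = [
    ("http://uk.example.com/product", "\n\t £29.99", " £39.99"),
    ("http://ie.example.com/product", " €34.99", None),
]

# An in-memory buffer keeps the sketch self-contained
buffer = io.StringIO()
write_prices(sample_rows, buffer)
csv_text = buffer.getvalue()
```

When writing to a real file instead of the in-memory buffer, opening it with open('out.csv', 'w', newline='', encoding='utf-8') should keep the £ and € characters intact regardless of the platform's default encoding.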

If you import requests and replace the getPage implementation with:

def getPage(url):
    return requests.get(url).content

you will get the prices in euros and pounds as well.
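
As for why the symbols disappear in the first place: judging from the code in the question, the culprit appears to be the .encode('ascii','ignore') call at the end of getPage. ASCII has no £ or €, and the 'ignore' error handler drops any character it cannot encode, while $ is plain ASCII and survives. The small sketch below reproduces that on a made-up sample string (not a real page):

```python
# Bytes as a UTF-8 server might send them (made-up sample, not a real page)
raw = "£29.99 / €34.99 / $39.99".encode("utf-8")

# What the original getPage does: decode, then re-encode to ASCII,
# silently dropping every character that ASCII cannot represent.
stripped = raw.decode("utf-8").encode("ascii", "ignore")

# Decoding without the ASCII round trip keeps all three symbols.
kept = raw.decode("utf-8")
```

Here stripped no longer contains £ or €, while kept does. Returning requests.get(url).content avoids the lossy step because it hands the raw bytes to Beautiful Soup, which then works out the encoding itself.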