Question

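I am trying to scrape a list of 41 items and their prices from a website, but my output CSV is missing the 2-3 items at the end of the page. The reason is that some devices have their price listed under a different class than the others. The loop in my code pairs names and prices together, so for an item whose price sits under the other class, it takes the price value from the next device. As a result the last 2-3 items are skipped, because their prices have already been used up by earlier devices. Here is the code in question: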

# -*- coding: cp1252 -*-
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
page = urllib2.urlopen('http://www.att.com/shop/wireless/devices/smartphones.deviceListGridView.xhr.flowtype-NEW.deviceGroupType-Cellphone.paymentType-postpaid.packageType-undefined.html?taxoStyle=SMARTPHONES&showMoreListSize=1000').read()
soup = BeautifulSoup(page)
soup.prettify()
with open('AT&T_2012-12-28.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(["Date","Month","Day of Week","Device Name","Price"])
    items = soup.findAll('a', {"class": "clickStreamSingleItem"},text=True)
    prices = soup.findAll('div', {"class": "listGrid-price"})
    for item, price in zip(items, prices):
        textcontent = u' '.join(price.stripped_strings)
        if textcontent:            
            spamwriter.writerow([time.strftime("%Y-%m-%d"),time.strftime("%B"),time.strftime("%A") ,unicode(item.string).encode('utf8').replace('â„¢','').replace('Â®','').strip(),textcontent])

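Prices are normally listed under listGrid-price, but for the 2-3 items that are currently out of stock the price is under listGrid-price-outOfStock. I need to include this class in my loop as well, so that each item gets the right price and the loop runs over all devices.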
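The mispairing described above can be reproduced without the site. The sketch below is only an illustration: the device names and prices are made up, and the two lists stand in for the results of the two findAll() calls.

```python
# Hypothetical data standing in for the two findAll() results.
# "Phone C" is out of stock, so its price sits under a different class
# and is missing from the list of matched prices.
items = ["Phone A", "Phone B", "Phone C", "Phone D"]
prices = ["$199.99", "$99.99", "$0.99"]  # Phone C's price was not matched

# zip() pairs by position and stops at the shorter list.
pairs = list(zip(items, prices))

for name, price in pairs:
    print("%s -> %s" % (name, price))
```

Here "Phone C" ends up with the price that belongs to "Phone D", and "Phone D" is dropped entirely, which matches the symptom of missing rows at the end of the CSV.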

Please pardon my ignorance, as I am new to programming.

Answer 1

You can write a comparator function that does the custom comparison and pass it to findAll().

So if you change the prices assignment to:

prices = soup.findAll('div', class_=match_both)

and define the function as:

def match_both(arg):
    if arg == "listGrid-price" or arg == "listGrid-price-outOfStock":
        return True
    return False

(The function could be much more concise; it is verbose here only to show how it works.)
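As a sketch of the more concise form, the same check can be written as a membership test. PRICE_CLASSES is a name introduced here for illustration; the behaviour should be the same as the longer version, including returning False when the argument is None.

```python
# Class names taken from the question's page.
PRICE_CLASSES = ("listGrid-price", "listGrid-price-outOfStock")

def match_both(css_class):
    # True only when the class name is one of the two price classes.
    return css_class in PRICE_CLASSES
```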

This way the class is compared against both names, and a match on either one is returned.

More information can be found in the documentation (see the has_six_characters example).

Now, since you also asked how to exclude specific text:

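The text argument to findAll() can also take a custom comparator. In this case, you don't want links whose text says Write a review to match, because they shift the names out of step with the prices.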
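To see what such a text comparator keeps and drops, here is a small sketch that applies it to plain strings rather than to a parsed page. The link texts are made up for illustration, and the None entry stands in for a tag that has no single string.

```python
def not_review(text):
    # Beautiful Soup may pass None when a tag has no single string.
    if not text:
        return text
    return "Write a review" not in text

# Hypothetical link texts, standing in for the strings of the matched <a> tags.
link_texts = ["Phone A", "Write a review", None, "Phone B"]
kept = [t for t in link_texts if not_review(t)]
```

Only the device names survive, so the list of items lines up with the list of prices again.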

So here is your script, edited to exclude the review links:

# -*- coding: cp1252 -*-
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup

def match_both(arg):
    if arg == "listGrid-price" or arg == "listGrid-price-outOfStock":
        return True
    return False

def not_review(arg):
    if not arg:
        return arg
    return "Write a review" not in arg

page = urllib2.urlopen('http://www.att.com/shop/wireless/devices/smartphones.deviceListGridView.xhr.flowtype-NEW.deviceGroupType-Cellphone.paymentType-postpaid.packageType-undefined.html?taxoStyle=SMARTPHONES&showMoreListSize=1000').read()
soup = BeautifulSoup(page)
soup.prettify()
with open('AT&T_2012-12-28.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(["Date","Month","Day of Week","Device Name","Price"])
    items = soup.findAll('a', {"class": "clickStreamSingleItem"},text=not_review)
    prices = soup.findAll('div', class_=match_both)
    for item, price in zip(items, prices):
        textcontent = u' '.join(price.stripped_strings)
        if textcontent:
            spamwriter.writerow([time.strftime("%Y-%m-%d"),time.strftime("%B"),time.strftime("%A") ,unicode(item.string).encode('utf8').replace('â„¢','').replace('Â®','').strip(),textcontent])
