Question

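I wrote a ticker that tracks Walmart out-of-stock status and price changes, but I'm stuck: when I try to get the item's ID (the number at the end of the link), I can't parse it. Here is the code: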

# -*- coding: utf-8 -*-

import re
import urllib2

def walmart():
    fileprod = urllib2.urlopen("http://testh3x.altervista.org/walmart.txt").read()
    prods = fileprod.split("|")
    print prods
    lenp = len(prods)
    counter = 0
    while 1:
        while counter < lenp:
            data = urllib2.urlopen(prods[counter]).read()
            path = re.compile("class=\"Outofstock\"") # \s whitespace - \w word char - \W anything but a word char
            matching = path.match(data)
            if matching == None: 
                pass
            else:
                print "Out of stock"
            name = re.compile("\d") 
            m = name.match(str(prods[counter])).group #prods counter is the link
            print m


def main():
    walmart()

if __name__ == "__main__":
    main()

It throws:

  File "C:\Users\Leonardo\Desktop\BotDevelop\ticker.py", line 22, in walmart
    m = name.match(str(prods[counter])).group #prods counter is the link
AttributeError: 'NoneType' object has no attribute 'group'

Answer 1

You should look into BeautifulSoup, which makes parsing HTML manageable and quite simple. Regular expressions are generally a poor fit for that job.

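To answer your question, though: the error comes from the fact that no match was found, so match() returned None. In general, it is better to run a regex like this: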

m = name.match(str(prods[counter]))  # if no match is found, then None is returned
if m:
    m = m.group()  # be sure to call the method here
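
As a minimal sketch of that guard in a standalone function (the helper name leading_digits and the sample strings are hypothetical, used only for illustration, and the pattern is assumed to be \d+ rather than the question's \d):

```python
import re

DIGITS = re.compile(r"\d+")

def leading_digits(text):
    """Return the run of digits at the start of text, or None if there is none."""
    m = DIGITS.match(text)  # match() only tries at position 0
    if m is None:
        return None         # no match: nothing to call .group() on
    return m.group()        # note the parentheses: group is a method

print(leading_digits("7812821/details"))
print(leading_digits("http://example.com/ip/7812821"))
```

The second call returns None because the string starts with "h", not a digit, which is the same situation that produced the AttributeError in the question.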

Answer 2

Your regular expression does not match. You are using re.match() instead of re.search(); the former only matches at the start of the string:

m = name.search(str(prods[counter])).group()
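
To illustrate the difference, here is a small sketch. The URL is made up, and the \d+$ pattern is an assumption about the link shape (an ID made of digits at the very end); neither comes from the question's product list.

```python
import re

# Hypothetical product link, used only for illustration.
url = "http://www.example.com/ip/widget-2-pack/7812821"

digit = re.compile(r"\d")          # the pattern from the question
trailing_id = re.compile(r"\d+$")  # assumed pattern: digits at the end of the string

at_start = digit.match(url)              # None, the URL starts with "h"
first_digit = digit.search(url).group()  # first digit found anywhere in the string
item_id = trailing_id.search(url).group()

print(at_start)
print(first_digit)
print(item_id)
```

Note that \d on its own matches a single digit, so search() with the original pattern returns only the first digit it encounters. Something like \d+$ is needed to capture the whole trailing number.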

You also don't need to recompile the regular expressions inside the loop; move them out of the loop and compile them once.
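
A rough sketch of what that could look like; the names ITEM_ID and item_ids, the \d+$ pattern, and the sample URLs are all illustrative assumptions:

```python
import re

# Compiled once at module level instead of on every loop iteration.
ITEM_ID = re.compile(r"\d+$")  # assumed pattern: digits at the end of the URL

def item_ids(urls):
    """Return the trailing numeric ID of each URL, or None where there is none."""
    ids = []
    for url in urls:
        m = ITEM_ID.search(url.strip())  # strip() drops a stray newline, if any
        ids.append(m.group() if m else None)
    return ids

print(item_ids(["http://example.com/ip/a/123\n", "http://example.com/ip/b"]))
```

The re module keeps a cache of recently compiled patterns, so the speed gain is usually small; the main benefit is that the patterns are defined in one visible place.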

You really shouldn't use regular expressions to parse HTML when better tools are available. Use BeautifulSoup instead.

You should also iterate over prods directly; there is no need for a while loop:

import urllib2
from bs4 import BeautifulSoup

fileprod = urllib2.urlopen("http://testh3x.altervista.org/walmart.txt").read()
prods = fileprod.split("|")

for prod in prods:
    # split off last part of the URL for the product code
    product_code = prod.rsplit('/', 1)[-1]

    data = urllib2.urlopen(prod).read()
    soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly
    if soup.find(class_='Outofstock'):
        print product_code, 'out of stock!'
        continue

    price = soup.find('span', class_='camelPrice').text
    print product_code, price

For your first URL, this prints:

7812821 $32.98
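
The product-code step can be tried on its own without any network access. This is only a sketch: the function name product_code and the sample URLs are hypothetical, and the extra strip()/rstrip("/") calls are an added assumption to cope with a trailing newline or slash in the product list.

```python
def product_code(url):
    """Return the last path segment of url, taken as the product code."""
    cleaned = url.strip().rstrip("/")  # tolerate a trailing newline or slash
    return cleaned.rsplit("/", 1)[-1]  # split once, from the right

print(product_code("http://www.example.com/ip/widget/7812821\n"))
print(product_code("http://www.example.com/ip/widget/7812821/"))
```

Splitting from the right with a limit of 1 avoids regular expressions entirely, at the cost of assuming the code is always the final path segment.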
