Python - Regex and Web Scraping

Time: 2017-07-19 22:52:52

Tags: python regex web-scraping bs4

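I'm trying to extract the names of all the dogs from a web page. I want to scrape the page, append each name to a list if it is new, and eventually send a notification whenever the size of the list changes.

Right now I'm working on extracting the names. I'm having trouble pulling out the name that follows the phrase "My name is" on the page.

The code I have so far is: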

import requests
from bs4 import BeautifulSoup
import re

url = 'http://petharbor.com/results.asp?searchtype=ADOPT&start=3%20&friends=1&samaritans=1&nosuccess=0&rows=25&imght=200&imgres=thumb&tWidth=200&view=sysadm.v_chmp&bgcolor=b7b7b7&text=ffffff&link=ffffff&alink=4400ff&vlink=ffffff&fontface=arial&fontsize=12&col_hdr_bg=000066&col_hdr_fg=ffffff&SBG=000066&zip=61802&miles=10&shelterlist=%27CHMP%27&atype=&where=type_DOG&PAGE=1'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html, 'html.parser')

key = "My name is "
puppies = []

names = soup.find_all(text=re.compile("My name is(.*)"))

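I'm trying to slice the results so that each entry contains only the name, and then append it to my list (puppies). Is there a way to split the list and then slice each part? Or would it be better to run a regex over that list again?

I think I've found a solution for now: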

for name in names:
    puppies.append(name[10:])

This seems to work.
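
As an alternative to slicing at a fixed offset, the second option mentioned above (running a regex over the matched strings again) could look roughly like the sketch below. It assumes each string returned by find_all contains the phrase "My name is" followed by the name and then some punctuation; the sample strings, NAME_RE, and extract_names are made up for illustration and are not taken from the real page.

```python
import re

# Assumed pattern: the name is whatever follows "My name is" up to the
# first period, comma, exclamation mark, or line break.
NAME_RE = re.compile(r"My name is\s+([^.,!\n]+)")


def extract_names(strings):
    """Return the captured name from every string that contains the phrase."""
    names = []
    for text in strings:
        match = NAME_RE.search(text)
        if match:
            # group(1) is only the captured name, so no index arithmetic is needed
            names.append(match.group(1).strip())
    return names


# Hypothetical strings standing in for the results of soup.find_all(...)
sample = ["My name is BELLA.", "My name is MAX. \n", "No name on this line"]
puppies = extract_names(sample)
print(puppies)
```

Because the capture group selects the name directly, this does not depend on the exact length of the prefix or on how many trailing characters follow the name, which a slice such as name[10:] does.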

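Now that I can grab the names, I'm trying to build lists to keep track of them: old, new, and newest. I want to run the function, check the scraped names against the old list, and append any names that aren't there to the "new" list. I can't get the names to run through the check and be reassigned correctly. Can anyone help me understand a way to accomplish this: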

import requests
from bs4 import BeautifulSoup
import re
import time
from twilio.rest import Client

url = 'http://petharbor.com/results.asp?searchtype=ADOPT&start=3%20&friends=1&samaritans=1&nosuccess=0&rows=25&imght=200&imgres=thumb&tWidth=200&view=sysadm.v_chmp&bgcolor=b7b7b7&text=ffffff&link=ffffff&alink=4400ff&vlink=ffffff&fontface=arial&fontsize=12&col_hdr_bg=000066&col_hdr_fg=ffffff&SBG=000066&zip=61802&miles=10&shelterlist=%27CHMP%27&atype=&where=type_DOG&PAGE=1'
response = requests.get(url)
html = response.content

account_sid = ("XXXXXXXXXXXXXXXXXXXXXXXXXXX")
auth_token = ("XXXXXXXXXXXXXXXXXXXXXXXXXXXXX")
client = Client(account_sid, auth_token)
soup = BeautifulSoup(html, 'html.parser')

names = soup.find_all(text=re.compile("My name is(.*)"))

old = []
new = []
newest = []

def check():
    for name in names:
        if name not in old:
            old.append(name[11:-2])
            if name in old:
                continue

def new_check():
    for name in names:
        if name in old:
            continue
        if name not in new and name not in old:
            new.append(name[11:-2])

    for name in names:
        if name in old and name in new:
            continue
        if name not in old and name not in new:
            newest.append(name[11:-2])

    #client.api.account.messages.create(to = "+xxxxxxxxxxx",
                                        #from_= "+xxxxxxxxxxx",
                                        #body = "Here are some new dogs:" + str(new))

    #client.api.account.messages.create(to="+xxxxxxxxxxx",
                                       #from_="+xxxxxxxxxxx",
                                       #body=("There are " + str(num_newest) + " new puppies"), media_url = 'http://petharbor.com/results.asp?searchtype=ADOPT&start=3%20&friends=1&samaritans=1&nosuccess=0&rows=25&imght=200&imgres=thumb&tWidth=200&view=sysadm.v_chmp&bgcolor=b7b7b7&text=ffffff&link=ffffff&alink=4400ff&vlink=ffffff&fontface=arial&fontsize=12&col_hdr_bg=000066&col_hdr_fg=ffffff&SBG=000066&zip=61802&miles=10&shelterlist=%27CHMP%27&atype=&where=type_DOG&PAGE=1')

    #client.api.account.messages.create(to = "+xxxxxxxxxxx",
                                        #from_= "+xxxxxxxxxxx",
                                        #body = "Here are some new names:" + str(newest))


num_old = len(old)
num_new = len(new)
num_newest = len(newest)

while True:
    check()
    print("Old List: " + str(old))
    print("Number of old: " + str(num_old))
    time.sleep(20)
    new_check()
    print("New List: " + str(new))
    print("Number of new: " + str(num_new))
    print("Newest List: " + str(newest))
    print("Number of newest: " + str(num_newest))
    time.sleep(20)
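
One possible way to structure the "check for new names" step is sketched below. It is not a drop-in fix for the code above: find_new_names and poll are hypothetical helpers, and the fetch_names and notify arguments stand in for the scraping step and the Twilio call so that the sketch runs without network access. It assumes the names have already been cleaned, so the same form of each name is compared and stored.

```python
import time


def find_new_names(current_names, seen):
    """Return names not seen before, in order, and record them in `seen`."""
    fresh = []
    for name in current_names:
        if name not in seen:
            seen.add(name)
            fresh.append(name)
    return fresh


def poll(fetch_names, notify, seen, rounds, delay_seconds=20):
    """Fetch the names `rounds` times and notify about unseen ones.

    fetch_names: callable returning the current list of cleaned names
                 (assumed to re-download and re-parse the page on each call).
    notify:      callable taking the list of new names (for example, a
                 wrapper around a messaging client).
    """
    for i in range(rounds):
        fresh = find_new_names(fetch_names(), seen)
        if fresh:
            notify(fresh)
        if i < rounds - 1:
            time.sleep(delay_seconds)
```

Three differences from the posted code may be worth noting. The page is fetched inside the loop rather than once at import time, so later rounds can actually see new dogs. The membership test and the stored value use the same cleaned name, whereas the posted check() tests the full string but appends a slice of it. Finally, any counts would be taken with len() after each update rather than once before the loop starts.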

0 Answers:

No answers