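I'm trying to extract all the dog names from a web page. I want to scrape the page, append each name to a list if it is new, and eventually send a notification whenever the size of the list changes.
Right now I'm working on extracting the names. I'm having trouble pulling out the name that follows the phrase "My name is" on the page.
The code I have so far is: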
import requests
from bs4 import BeautifulSoup
import re
url = 'http://petharbor.com/results.asp?searchtype=ADOPT&start=3%20&friends=1&samaritans=1&nosuccess=0&rows=25&imght=200&imgres=thumb&tWidth=200&view=sysadm.v_chmp&bgcolor=b7b7b7&text=ffffff&link=ffffff&alink=4400ff&vlink=ffffff&fontface=arial&fontsize=12&col_hdr_bg=000066&col_hdr_fg=ffffff&SBG=000066&zip=61802&miles=10&shelterlist=%27CHMP%27&atype=&where=type_DOG&PAGE=1'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
key = "My name is "
puppies = []
names = soup.find_all(text=re.compile("My name is(.*)"))
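I'm trying to slice the results so that only each individual name is left, and then append it to my list (puppies). Is there a way to split the list and then slice each part? Or would it be better to run a regex over that list again?
I think I've found a solution so far: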
for name in names:
    # "My name is " is 11 characters long, so slice by len(key) to drop the trailing space too
    puppies.append(name[len(key):])
This seems to work.
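As an alternative to slicing at a fixed offset, here is a minimal sketch that uses a regex capture group to pull out the name. It assumes each matched string looks like "My name is Rex." on a single line, possibly with surrounding whitespace; the real page text may differ. `NAME_PATTERN`, `extract_name`, and `sample_strings` are illustrative names, not part of the code above.

```python
import re

# Capture whatever follows "My name is", dropping surrounding whitespace
# and an optional trailing period.
NAME_PATTERN = re.compile(r"My name is\s+(.+?)\s*\.?\s*$")


def extract_name(text):
    """Return the name found in text, or None if the phrase is absent."""
    match = NAME_PATTERN.search(text.strip())
    return match.group(1) if match else None


# Hypothetical strings standing in for what soup.find_all(...) returns.
sample_strings = ["My name is Rex.", "  My name is Bella. ", "My name is Mr. Bones"]

puppies = []
for text in sample_strings:
    name = extract_name(text)
    # Only append names that were found and are not already in the list.
    if name is not None and name not in puppies:
        puppies.append(name)

print(puppies)
```

Because the pattern anchors on the phrase rather than on a character count, it should keep working if the strings gain or lose leading whitespace.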
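Now that I can grab the names, I'm trying to build lists to hold them: old, new, and newest. I want to run the function, check for names that aren't already in the old list, and append those to the "newer" list. I can't get the names to run through the check and be reassigned. Can anyone help me understand a way to do this: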
import requests
from bs4 import BeautifulSoup
import re
import time
from twilio.rest import Client
url = 'http://petharbor.com/results.asp?searchtype=ADOPT&start=3%20&friends=1&samaritans=1&nosuccess=0&rows=25&imght=200&imgres=thumb&tWidth=200&view=sysadm.v_chmp&bgcolor=b7b7b7&text=ffffff&link=ffffff&alink=4400ff&vlink=ffffff&fontface=arial&fontsize=12&col_hdr_bg=000066&col_hdr_fg=ffffff&SBG=000066&zip=61802&miles=10&shelterlist=%27CHMP%27&atype=&where=type_DOG&PAGE=1'
response = requests.get(url)
html = response.content
account_sid = ("XXXXXXXXXXXXXXXXXXXXXXXXXXX")
auth_token = ("XXXXXXXXXXXXXXXXXXXXXXXXXXXXX")
client = Client(account_sid, auth_token)
soup = BeautifulSoup(html, 'html.parser')
names = soup.find_all(text=re.compile("My name is(.*)"))
old = []
new = []
newest = []
def check():
    for name in names:
        if name not in old:
            old.append(name[11:-2])
        if name in old:
            continue
def new_check():
    for name in names:
        if name in old:
            continue
        if name not in new and name not in old:
            new.append(name[11:-2])
    for name in names:
        if name in old and name in new:
            continue
        if name not in old and name not in new:
            newest.append(name[11:-2])
#client.api.account.messages.create(to = "+xxxxxxxxxxx",
#from_= "+xxxxxxxxxxx",
#body = "Here are some new dogs:" + str(new))
#client.api.account.messages.create(to="+xxxxxxxxxxx",
#from_="+xxxxxxxxxxx",
#body=("There are " + str(num_newest) + " new puppies"), media_url = 'http://petharbor.com/results.asp?searchtype=ADOPT&start=3%20&friends=1&samaritans=1&nosuccess=0&rows=25&imght=200&imgres=thumb&tWidth=200&view=sysadm.v_chmp&bgcolor=b7b7b7&text=ffffff&link=ffffff&alink=4400ff&vlink=ffffff&fontface=arial&fontsize=12&col_hdr_bg=000066&col_hdr_fg=ffffff&SBG=000066&zip=61802&miles=10&shelterlist=%27CHMP%27&atype=&where=type_DOG&PAGE=1')
#client.api.account.messages.create(to = "+xxxxxxxxxxx",
#from_= "+xxxxxxxxxxx",
#body = "Here are some new names:" + str(newest))
num_old = len(old)
num_new = len(new)
num_newest = len(newest)
while True:
    check()
    print("Old List: " + str(old))
    # Use len() here: num_old and friends were computed once, before the lists were filled
    print("Number of old: " + str(len(old)))
    time.sleep(20)
    new_check()
    print("New List: " + str(new))
    print("Number of new: " + str(len(new)))
    print("Newest List: " + str(newest))
    print("Number of newest: " + str(len(newest)))
    time.sleep(20)
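One possible way to do the old-versus-new check is to keep a single set of names already seen and compare each fresh scrape against it, instead of maintaining three lists. The sketch below is only an illustration under assumptions: `find_new_names`, `poll`, `fetch_names`, and `notify` are hypothetical names. In the real script, `fetch_names` would re-request the page and return the cleaned names, and `notify` would wrap the Twilio call. The page is re-fetched on every pass, since scraping once before the loop can never reveal new dogs.

```python
import time


def find_new_names(current_names, seen):
    """Return names not seen before, in page order, and record them in seen."""
    fresh = []
    for name in current_names:
        if name not in seen:
            seen.add(name)
            fresh.append(name)
    return fresh


def poll(fetch_names, notify, rounds, interval=0, seen=None):
    """Call fetch_names repeatedly and pass any unseen names to notify."""
    seen = set() if seen is None else seen
    for _ in range(rounds):
        fresh = find_new_names(fetch_names(), seen)
        if fresh:
            notify(fresh)
        time.sleep(interval)
    return seen


# Hypothetical scrape results standing in for three visits to the page.
pages = iter([["Rex", "Bella"], ["Rex", "Bella"], ["Rex", "Bella", "Max"]])
sent = []  # collects what would otherwise be sent as a notification
seen = poll(lambda: next(pages), sent.append, rounds=3)

print(sent)
print(sorted(seen))
```

Note that the first pass treats every name as new. To avoid an initial notification, `seen` could be seeded with the names from a first scrape before polling starts. The comparison uses the cleaned name on both sides, which matters because testing the raw string against a list of sliced names never matches.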