Printing a substring using the search method

Time: 2016-12-27 21:29:54

Tags: python html

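I have stored the source code of a web page in a string variable. I know the source contains a specific date, and I want to print the first link that appears before that date. The link sits between double quotes (""). Here is the code: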

import requests
from datetime import date
import re

link = "https://www.google.com.mx/search?biw=1535&bih=799&tbm=nws&q=%22New+Strong+Buy%22+site%3A+zacks.com&oq=%22New+Strong+Buy%22+site%3A+zacks.com&gs_l=serp.3...1632004.1638057.0.1638325.24.24.0.0.0.0.257.2605.0j15j2.17.0....0...1c.1.64.serp..8.0.0.Nl4BZQWwR3o"
fetch_data =requests.get(link)
content = str((fetch_data.content))

#this is the source code as a string

Months = ["January","February","March","April","May","June","July","August","September","October","November","December"]
today = date.today()
A= ("%s %s" % (Months[today.month - 1],today.day))
a=today.day
B= A in content
if B == True:
    B = ("%s %s" % (Months[today.month - 1], a))
else:
    while B == False:
        a = a - 1
        B = ("%s %s" % (Months[today.month - 1], a))

#the B variable is the string date that will appear in the variable string content

c= ('"https:')
Z= ("%s(.*)%s" % (c,B))
result = re.search(Z, content)
print (result)

This is what I tried: I search for the substring between the variables c and B, but the code does not find anything.

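If anyone looks at the source code behind the link, you will see that today's date, "December 27", appears only once, and just before it the link I am interested in appears as "https://www.zacks.com/commentary/98986/new-strong-buy-stocks-for-december-27th".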

Can someone help me automate this in Python so that it identifies this link and prints it?
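
The "last link before the date" idea described above can be sketched with plain string searches instead of a regular expression. This is only a minimal sketch run on a made-up snippet, not on the real page: the helper names `latest_date_in` and `last_link_before` and the `sample` string are hypothetical, and it assumes the link really is the last `"https:` that occurs before the date text. Unlike the loop in the question, the fallback re-checks `content` on every step.

```python
from datetime import date, timedelta


def latest_date_in(content, start, max_back=31):
    """Walk back one day at a time until a 'Month D' label occurs in content."""
    day = start
    for _ in range(max_back):
        # Build e.g. "December 27" (day without zero padding).
        label = "{} {}".format(day.strftime("%B"), day.day)
        if label in content:
            return label
        day -= timedelta(days=1)
    return None


def last_link_before(content, marker, prefix='"https:'):
    """Return the quoted https link that starts closest before marker."""
    marker_pos = content.find(marker)
    if marker_pos == -1:
        return None
    # rfind searches backwards, so this is the nearest link before the date.
    start = content.rfind(prefix, 0, marker_pos)
    if start == -1:
        return None
    start += 1  # skip the opening double quote
    end = content.find('"', start)
    if end == -1:
        return None
    return content[start:end]


# Hypothetical stand-in for the downloaded page source.
sample = (
    '<a href="https://example.com/older">older</a> '
    '<a href="https://www.zacks.com/commentary/98986/'
    'new-strong-buy-stocks-for-december-27th">'
    'New Strong Buy Stocks for December 27th</a>'
)

label = latest_date_in(sample, date(2016, 12, 29))
print(label)
print(last_link_before(sample, label))
```

One caveat with the substring test: a label such as "December 2" also matches inside "December 27", so a stricter check may be needed on real pages.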

1 answer:

Answer 0: (score: 0)

As Barmar said, you are better off using a DOM parser such as BeautifulSoup. Here is an example:

from BeautifulSoup import BeautifulSoup
import requests, urlparse
from datetime import datetime

link = "https://www.google.com.mx/search?biw=1535&bih=799&tbm=nws&q=%22New+Strong+Buy%22+site%3A+zacks.com&oq=%22New+Strong+Buy%22+site%3A+zacks.com&gs_l=serp.3...1632004.1638057.0.1638325.24.24.0.0.0.0.257.2605.0j15j2.17.0....0...1c.1.64.serp..8.0.0.Nl4BZQWwR3o"

r = requests.get(link)

soup = BeautifulSoup(r.text)

search = datetime.today().strftime("%B %d")
print("Searching for {}".format(search))

result = None
for i in soup.findAll('h3'):
    linkText = i.getText()
    if search in linkText:
        result = i.find('a').get('href')
        result = result.split('?')[-1]
        result = urlparse.parse_qs(result)['q'][0]
        break

print(result)

The output I get is:

Searching for December 27
https://www.zacks.com/commentary/98986/new-strong-buy-stocks-for-december-27th
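
The answer's code uses Python 2 era imports (`from BeautifulSoup import ...` and the `urlparse` module, which became `urllib.parse` in Python 3). Below is a rough Python 3 sketch of the same steps that relies only on the standard library, with `html.parser` swapped in for BeautifulSoup. The class `HeadingLinkParser`, the function `find_result`, and the sample markup are hypothetical, and the sketch assumes each result looks like `<h3><a href="/url?q=...">title</a></h3>`, which may not match what Google serves today.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, parse_qs


class HeadingLinkParser(HTMLParser):
    """Collect (heading text, href) for every <h3> element."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.href = None
        self.text = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3, self.href, self.text = True, None, []
        elif tag == "a" and self.in_h3 and self.href is None:
            # Keep the first link found inside the heading.
            self.href = dict(attrs).get("href")

    def handle_data(self, data):
        if self.in_h3:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self.in_h3:
            self.results.append(("".join(self.text), self.href))
            self.in_h3 = False


def find_result(html, search):
    """Return the target URL of the first heading whose text contains search."""
    parser = HeadingLinkParser()
    parser.feed(html)
    parser.close()
    for text, href in parser.results:
        if search in text and href:
            # Unwrap redirect links of the form /url?q=<target>&...
            query = parse_qs(urlsplit(href).query)
            return query.get("q", [href])[0]
    return None


# Hypothetical markup standing in for r.text.
sample_html = (
    '<h3 class="r"><a href="/url?q=https://www.zacks.com/commentary/98986/'
    'new-strong-buy-stocks-for-december-27th&amp;sa=U">'
    'New Strong Buy Stocks for December 27th</a></h3>'
)
print(find_result(sample_html, "December 27"))
```

Note that `strftime("%B %d")` zero-pads the day, so on the 5th it produces "December 05"; if the page prints "December 5", the search string would need to be built from `today.day` instead.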