Python doesn't find the search term in a string parsed by BeautifulSoup

Time: 2017-09-05 14:03:11

Tags: python beautifulsoup

In Python 3, when I want to return only the strings that contain a term I'm interested in, I can do this:

phrases = ["1. The cat was sleeping",
        "2. The dog jumped over the cat",
        "3. The cat was startled"]

for phrase in phrases:
    if "dog" in phrase:
        print(phrase)

This, of course, prints "2. The dog jumped over the cat".

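What I'm trying to do now is apply the same idea to strings parsed by BeautifulSoup. For example, Craigslist has a lot of <a> tags, but we're only interested in the <a> tags that also have "hdrlnk". So I wrote: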

import requests
from bs4 import BeautifulSoup

url = "https://chicago.craigslist.org/search/apa"
r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser") 
links = soup.find_all("a")

for link in links:
    if "hdrlnk" in link:
        print(link)

The problem is that instead of printing every <a> tag with "hdrlnk" in it, Python prints nothing, and I'm not sure what's going wrong.

4 answers:

Answer 0 (score: 4)

"hdrlnk" is a class attribute on the links. Since, as you say, you're only interested in those links, just find the links by class:

import requests
from bs4 import BeautifulSoup

url = "https://chicago.craigslist.org/search/apa"
r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("a", {"class": "hdrlnk"})

for link in links:
    print(link)

Output:

<a class="result-title hdrlnk" data-id="6293679332" href="/chc/apa/d/high-rise-2-bedroom-heated/6293679332.html">High-Rise 2 Bedroom Heated Pool Indoor Parking Fire Pit Pet Friendly!</a>
<a class="result-title hdrlnk" data-id="6285069993" href="/chc/apa/d/new-beautiful-studio-in/6285069993.html">NEW-Beautiful Studio in Uptown/free heat</a>
<a class="result-title hdrlnk" data-id="6293694090" href="/chc/apa/d/albany-park-2-bed-1-bath/6293694090.html">Albany Park 2 Bed 1 Bath Dishwasher W/D &amp; Heat + Parking Incl Pets ok</a>
<a class="result-title hdrlnk" data-id="6282289498" href="/chc/apa/d/north-center-2-bed-1-bath/6282289498.html">NORTH CENTER: 2 BED 1 BATH HDWD AC UNITS PROVIDE W/D ON SITE PRK INCLU</a>
<a class="result-title hdrlnk" data-id="6266583119" href="/chc/apa/d/beautiful-2bed-1bath-in-the/6266583119.html">Beautiful 2bed/1bath in the heart of Wrigleyville</a>
<a class="result-title hdrlnk" data-id="6286352598" href="/chc/apa/d/newly-rehabbed-2-bedroom-unit/6286352598.html">Newly Rehabbed 2 Bedroom Unit! Section 8 OK! Pets OK! (NHQ)</a>

To get a link's href or text, use:

print(link["href"])
print(link.text)
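
To see why the original `in` test prints nothing, here is a minimal sketch on a hard-coded snippet. The markup and variable names are made up for illustration and are not taken from Craigslist. As far as I can tell, `in` on a tag checks the tag's children rather than its attributes, while the class names sit in a list under the "class" attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a search results page.
html = """
<a class="result-title hdrlnk" href="/apa/1.html">Sunny studio</a>
<a class="nav" href="/about">About</a>
"""

soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

# `in` on a Tag looks at its children (here, the text node), not its attributes.
in_tag = "hdrlnk" in first

# Class names are stored as a list under the "class" attribute.
first_classes = first.get("class", [])
has_class = "hdrlnk" in first_classes

# Filtering by class keeps only the tags that carry "hdrlnk".
matches = soup.find_all("a", class_="hdrlnk")
print(in_tag, has_class, [m["href"] for m in matches])
```

Using `.get("class", [])` rather than `link["class"]` avoids a KeyError on tags that have no class at all.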

Answer 1 (score: 0)

Try:

for link in links:
    # "hdrlnk" is a class name, so test the tag's class list rather than its href
    if "hdrlnk" in link.get("class", []):
        print(link)

Answer 2 (score: 0)

Just search for the term in the link's contents; otherwise your code looks fine:

import requests
from bs4 import BeautifulSoup

url = "https://chicago.craigslist.org/search/apa"
r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser") 
links = soup.find_all("a")

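for link in links:
    # contents[0] is the first child (usually the link text), so this matches
    # text inside the link, not its attributes; skip links with no children
    if link.contents and "hdrlnk" in link.contents[0]:
        print(link)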

Or, if you want to search within the href or title instead, use link['href'] or link['title'].
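
As a rough sketch of searching attributes, the snippet below checks href, title, and text together. The markup and the `links_matching` helper are hypothetical, written only for this example. It uses `.get()` because not every <a> tag is guaranteed to have an href or title, and `link['href']` raises a KeyError when the attribute is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second anchor has no href or title at all.
html = """
<a href="/chc/apa/d/studio/1.html" title="Studio near the lake">Studio</a>
<a name="top">Back to top</a>
<a href="/about" title="About us">About</a>
"""

soup = BeautifulSoup(html, "html.parser")


def links_matching(soup, term):
    """Return <a> tags whose href, title, or text contains `term`."""
    found = []
    for link in soup.find_all("a"):
        # .get() returns the default instead of raising KeyError
        # when the attribute is missing.
        haystacks = [link.get("href", ""), link.get("title", ""), link.get_text()]
        if any(term in h for h in haystacks):
            found.append(link)
    return found


apa_links = links_matching(soup, "/apa/")
about_links = links_matching(soup, "About")
print([l.get_text() for l in apa_links])
```

Note that this matches substrings of attribute values and text, which is a different test from checking whether a tag carries a given class.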

Answer 3 (score: 0)

To get the links you want, you can use a selector in your script, which makes the scraper more robust and concise.

import requests
from bs4 import BeautifulSoup

base_link = "https://chicago.craigslist.org"
res = requests.get("https://chicago.craigslist.org/search/apa").text
soup = BeautifulSoup(res, "lxml") 
for link in soup.select(".hdrlnk"):
    print(base_link + link.get("href"))
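
A variation on the selector approach, sketched on made-up markup rather than the live page: the selector "a.hdrlnk[href]" is my own narrowing to anchors that have both the class and an href, and `urljoin` from the standard library stands in for plain string concatenation so that hrefs that are already absolute are left alone.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_link = "https://chicago.craigslist.org"

# Hypothetical markup: one relative link, one absolute link, one unrelated link.
html = """
<a class="result-title hdrlnk" href="/chc/apa/d/studio/1.html">Studio</a>
<a class="result-title hdrlnk" href="https://example.org/listing/2.html">Offsite</a>
<a class="nav" href="/about">About</a>
"""

soup = BeautifulSoup(html, "html.parser")

# "a.hdrlnk[href]" matches only anchors with the hdrlnk class and an href attribute.
urls = [urljoin(base_link, a["href"]) for a in soup.select("a.hdrlnk[href]")]
print(urls)
```

This sketch uses "html.parser" so it runs without installing lxml; the parser choice should not matter for markup this simple.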