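Python BeautifulSoup can't select a specific tag

Date: 2016-07-04 14:35:07

Tags: python beautifulsoup

My problem comes up when parsing a website and loading the data tree with BS (BeautifulSoup). How can I get the content of an <em> tag? I tried: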

for first in soup.find_all("li", class_="li-in"):
    print first.select("em.fl.in-date").string

                   #or

    print first.select("em.fl.in-date").contents

But it doesn't work. Please help.

I'm searching for cars on tutti.ch.

Here is my entire code:

#Crawl tutti.ch
import urllib
thisurl = "http://www.tutti.ch/stgallen/fahrzeuge/autos"
handle = urllib.urlopen(thisurl)
html_gunk =  handle.read()

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_gunk, 'html.parser')

for first in soup.find_all("li", class_="li-in"):
    if first.a.string and "Audi" and "BMW" in first.a.string:
        print "Geschafft: %s" % first.a.contents
        print first.select("em.fl.in-date").string
    else:
        print first.a.contents

When it finds a BMW or an Audi, it should check when the car was listed. The time is in an <em> tag, like this:

<em class="fl in-date"> Heute <br></br> 13:59 </em>

1 Answer:

Answer 0: (score: -1)

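 first.select("em.fl.in-date")[0].text

select() returns a list of matching tags rather than a single tag, so you have to index into it before reading .text. This assumes your selector is correct. You didn't provide the URL you are scraping, so I can't be sure.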

>>> url = "http://stackoverflow.com/questions/38187213/python-beautifulsoup"
>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> html = urllib2.urlopen(url).read()
>>> soup = BeautifulSoup(html)
>>> soup.find_all("p")[0].text
u'My problem is when parsing a website and then loading the data tree with BS. How can I look for the content of an <em> Tag? I tried '
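
To illustrate the point about select(), here is a minimal sketch run against the <em> snippet from the question. The surrounding <li> and <a> markup is an assumption made up for the example, not the real tutti.ch page. It shows that select() gives back a list-like result, that select_one() gives back a single tag or None, and that .string is None here because the <em> tag has more than one child.

```python
from bs4 import BeautifulSoup

# Hypothetical list item; only the <em> element is copied from the question.
html = (
    '<li class="li-in"><a>BMW 320d</a>'
    '<em class="fl in-date"> Heute <br></br> 13:59 </em></li>'
)
soup = BeautifulSoup(html, "html.parser")
item = soup.find("li", class_="li-in")

# select() returns a list-like ResultSet, even when there is only one match.
matches = item.select("em.fl.in-date")

# select_one() returns the first matching tag, or None if nothing matches.
em = item.select_one("em.fl.in-date")

# .string is None because <em> contains two text nodes and a <br> tag,
# so get_text() is used to join the text pieces instead.
listed = em.get_text(" ", strip=True) if em is not None else None
print(listed)
```

Checking for None before calling get_text() avoids an AttributeError on list items that have no date element.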

After seeing your code, I made the following change; take a look:

#Crawl tutti.ch
import urllib
thisurl = "http://www.tutti.ch/stgallen/fahrzeuge/autos"
handle = urllib.urlopen(thisurl)
html_gunk =  handle.read()

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_gunk, 'html.parser')

for first in soup.find_all("li", class_="li-in"):
    # Test each brand separately: "Audi" and "BMW" in s only checks for "BMW"
    if first.a.string and ("Audi" in first.a.string or "BMW" in first.a.string):
        print "Geschafft: %s" % first.a.contents
        print first.select("em.fl.in-date")[0].text
    else:
        print first.a.contents
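
One more thing worth checking is the condition from the question's loop. Because `in` binds more tightly than `and`, the expression `"Audi" and "BMW" in s` is evaluated as `"Audi" and ("BMW" in s)`, so it only ever tests for "BMW". The sketch below shows one possible way to test for either brand; the helper name mentions_brand and the sample title are made up for illustration.

```python
def mentions_brand(title, brands=("Audi", "BMW")):
    """Return True if any of the brand names occurs in the title."""
    if not title:
        # first.a.string can be None, so treat empty or missing titles as no match.
        return False
    return any(brand in title for brand in brands)


title = "Audi A4 Avant"

# The question's condition: the non-empty string "Audi" is always truthy,
# so the result depends only on whether "BMW" is in the title.
original_check = bool(title and "Audi" and "BMW" in title)

print(original_check)
print(mentions_brand(title))
```

With a helper like this, the loop condition could become `if mentions_brand(first.a.string):`.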