Question

I am trying to scrape text from a website's source code using BeautifulSoup. Part of the source code looks like this:

        <hr />
        <div class="see-more inline canwrap" itemprop="genre">
            <h4 class="inline">Genres:</h4>
<a href="/genre/Horror?ref_=tt_stry_gnr"
> Horror</a>&nbsp;<span>|</span>
<a href="/genre/Mystery?ref_=tt_stry_gnr"
> Mystery</a>&nbsp;<span>|</span>
<a href="/genre/Thriller?ref_=tt_stry_gnr"
> Thriller</a>
        </div>

So I have been trying to extract the text 'Horror', 'Mystery' and 'Thriller' using this code:

import requests
from bs4 import BeautifulSoup
url1='http://www.imdb.com/title/tt5308322/?ref_=inth_ov_tt'
r1=requests.get(url1)
soup1= BeautifulSoup(r1.text, 'lxml')
genre1=soup1.find('div',attrs={'itemprop':'genre'}).contents
print(genre1)

But what it returns is:

['\n', <h4 class="inline">Genres:</h4>, '\n', <a href="/genre/Horror?
ref_=tt_stry_gnr"> Horror</a>, '\xa0', <span>|</span>, '\n', <a 
href="/genre/Mystery?ref_=tt_stry_gnr"> Mystery</a>, '\xa0', <span>|</span>, 
'\n', <a href="/genre/Thriller?ref_=tt_stry_gnr"> Thriller</a>, '\n']

I am new to Python and web scraping, so I would appreciate all the help I can get. Thanks!

Answer 1

Try this; I am using html.parser. Let me know if you run into any problems:

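 # genre1 is the .contents list (tags mixed with plain strings), so search the div itself
 genre_div = soup1.find('div', attrs={'itemprop': 'genre'})
 for link in genre_div.find_all("a"):
     # link.text is e.g. " Horror"; strip() removes the leading space
     print(link.text.strip())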

Please check the indentation, as I am typing on my phone.
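
Since the live page may have changed since the question was asked, here is a minimal offline sketch of the same idea. It assumes the HTML fragment from the question (pasted in as a string) and uses html.parser; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# HTML copied from the question, so the example runs without a network request.
html = """
<div class="see-more inline canwrap" itemprop="genre">
    <h4 class="inline">Genres:</h4>
<a href="/genre/Horror?ref_=tt_stry_gnr"
> Horror</a>&nbsp;<span>|</span>
<a href="/genre/Mystery?ref_=tt_stry_gnr"
> Mystery</a>&nbsp;<span>|</span>
<a href="/genre/Thriller?ref_=tt_stry_gnr"
> Thriller</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
genre_div = soup.find("div", attrs={"itemprop": "genre"})

# find() returns None when nothing matches, which can happen if the page layout changes.
if genre_div is None:
    genres = []
else:
    # find_all("a") keeps only the link tags and skips the strings and <span> separators.
    genres = [a.get_text(strip=True) for a in genre_div.find_all("a")]

print(genres)  # ['Horror', 'Mystery', 'Thriller']
```

The None check is optional for this fixed snippet, but it avoids an AttributeError when scraping a page whose markup differs from what the question shows.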

Answer 2

Use BeautifulSoup's select() function directly to extract the required elements with a CSS selector:

import requests
from bs4 import BeautifulSoup

url1 = 'http://www.imdb.com/title/tt5308322/?ref_=inth_ov_tt'
soup = BeautifulSoup(requests.get(url1).text, 'lxml')
genres = [a.text.strip() for a in soup.select("div[itemprop='genre'] > a")]

print(genres)

Output:

['Horror', 'Mystery', 'Thriller']

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
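
To illustrate what the `>` in that selector does, here is a small sketch on made-up HTML (the nested "See more" link is a hypothetical addition, not part of the page in the question). The child combinator matches only anchors that are direct children of the div, while a plain space matches anchors at any depth.

```python
from bs4 import BeautifulSoup

# Simplified, partly hypothetical markup: the last link is nested inside a <span>.
html = """
<div itemprop="genre">
  <h4>Genres:</h4>
  <a href="/genre/Horror"> Horror</a>
  <a href="/genre/Mystery"> Mystery</a>
  <span><a href="/help">See more</a></span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "div > a": only <a> tags that are direct children of the div.
direct = [a.get_text(strip=True) for a in soup.select("div[itemprop='genre'] > a")]

# "div a": <a> tags anywhere below the div, including the nested one.
nested = [a.get_text(strip=True) for a in soup.select("div[itemprop='genre'] a")]

print(direct)  # ['Horror', 'Mystery']
print(nested)  # ['Horror', 'Mystery', 'See more']
```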

Answer 3

You can use BeautifulSoup's get_text() method instead of the .contents attribute to get what you want:

From the get_text() documentation:

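If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string: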

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')  # name the parser explicitly to avoid a warning

soup.get_text()
>>> u'\nI linked to example.com\n'
soup.i.get_text()
>>> u'example.com'

You can specify a string to be used to join the bits of text together:

soup.get_text("|")
>>> u'\nI linked to |example.com|\n'

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

soup.get_text("|", strip=True)
>>> u'I linked to|example.com'

But at that point you might want to use the .stripped_strings generator instead, and process the text yourself:

[text for text in soup.stripped_strings]
>>> [u'I linked to', u'example.com']
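
Applied to the markup in the question, this could look like the sketch below. It assumes the HTML fragment from the question pasted in as a string. Note that calling stripped_strings on the whole div also yields the "Genres:" heading and the "|" separators, so the sketch calls get_text() on each link instead.

```python
from bs4 import BeautifulSoup

# HTML copied from the question, so the example runs offline.
html = """<div class="see-more inline canwrap" itemprop="genre">
    <h4 class="inline">Genres:</h4>
<a href="/genre/Horror?ref_=tt_stry_gnr"
> Horror</a>&nbsp;<span>|</span>
<a href="/genre/Mystery?ref_=tt_stry_gnr"
> Mystery</a>&nbsp;<span>|</span>
<a href="/genre/Thriller?ref_=tt_stry_gnr"
> Thriller</a>
</div>"""

soup = BeautifulSoup(html, "html.parser")
genre_div = soup.find("div", attrs={"itemprop": "genre"})

# Every non-empty string under the div: the heading and separators are included.
all_strings = list(genre_div.stripped_strings)

# Text of the links only, with surrounding whitespace removed.
genres = [a.get_text(strip=True) for a in genre_div.find_all("a")]

print(all_strings)
print(genres)  # ['Horror', 'Mystery', 'Thriller']
```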

Answer 4

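You can do the same thing in several ways. CSS selectors are precise, easy to understand, and less error-prone, so you can also use a selector for this purpose: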

from bs4 import BeautifulSoup            
import requests

link = 'http://www.imdb.com/title/tt5308322/?ref_=inth_ov_tt'

res = requests.get(link).text
soup = BeautifulSoup(res,'lxml')
genre = ' '.join([item.text.strip() for item in soup.select(".canwrap a[href*='genre']")])
print(genre)

Result:

Horror Mystery Thriller
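
One thing to keep in mind with `[href*='genre']` is that it matches the substring anywhere in the URL. The sketch below uses partly hypothetical markup (the "/search/title?genres=horror" link is invented for illustration) to compare it with the prefix form `[href^='/genre/']`, which may be a tighter fit if the page contains other genre-related links.

```python
from bs4 import BeautifulSoup

# The third link is a made-up example of an unrelated URL that contains "genre".
html = """
<div class="see-more inline canwrap">
  <a href="/genre/Horror?ref_=tt_stry_gnr"> Horror</a>
  <a href="/genre/Mystery?ref_=tt_stry_gnr"> Mystery</a>
  <a href="/search/title?genres=horror">See all horror titles</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# *= matches "genre" anywhere in the href, so the search link is picked up too.
substring = [a.text.strip() for a in soup.select(".canwrap a[href*='genre']")]

# ^= matches only hrefs that start with "/genre/".
prefix = [a.text.strip() for a in soup.select(".canwrap a[href^='/genre/']")]

print(substring)  # ['Horror', 'Mystery', 'See all horror titles']
print(prefix)     # ['Horror', 'Mystery']
```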

Extracting text inside tags using BeautifulSoup
