Question

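I'm new to Python and I'm trying to create a Web-Crawler that prints only the article (for example, from this page - http://techcrunch.com/2014/09/15/microsoft-has-acquired-minecraft/) and not the other content on the site. I tried this (it doesn't work):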

import requests
from bs4 import BeautifulSoup

source_code = requests.get('http://techcrunch.com/2014/09/15/microsoft-has-acquired-minecraft/')
plain_text = source_code.text
soup = BeautifulSoup(plain_text)

for link in soup.findAll('div', {'class': 'article-entry text'}):
    title = link.string
    print(title)

It prints: 'None'. Thanks.

Answer 1

You only need the article, not the for loop. Instead of:

for link in soup.findAll('div', {'class': 'article-entry text'}):
  title = link.string
  print(title)

do this:

title = soup.find('h1', {'class': 'alpha tweet-title'}).get_text()
article = soup.find('div', {'class': 'article-entry text'}).get_text()
print(title)
print(article)

You will get just the title and the article. The BeautifulSoup documentation may also help.
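As for why the original loop printed None: in BeautifulSoup, `.string` returns None when a tag has more than one child, which is the usual case for an article `div` holding several paragraphs, whereas `get_text()` joins all the nested text. Here is a minimal sketch of that difference. The `SAMPLE_HTML` markup and the `extract_article` helper are made up for illustration (only the class names come from the answer above), and it assumes the `bs4` package is installed; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page. The class names are
# copied from the answer; the content is invented for illustration.
SAMPLE_HTML = """
<html><body>
<h1 class="alpha tweet-title">Sample title</h1>
<div class="article-entry text">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
</body></html>
"""


def extract_article(html):
    """Return (title, article_text); a part that is not found is None."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("h1", {"class": "alpha tweet-title"})
    body_tag = soup.find("div", {"class": "article-entry text"})
    # find() returns None when nothing matches, so guard before get_text().
    title = title_tag.get_text(strip=True) if title_tag else None
    article = body_tag.get_text(" ", strip=True) if body_tag else None
    return title, article


if __name__ == "__main__":
    soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
    div = soup.find("div", {"class": "article-entry text"})
    print(div.string)  # the div has several children, so .string is None
    title, article = extract_article(SAMPLE_HTML)
    print(title)
    print(article)
```

The None checks are there because `find()` returns None when the page has no matching tag, in which case calling `.get_text()` directly would raise an AttributeError.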

Printing an article with a Web-Crawler in Python
