Question

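I wrote a script that extracts the paragraphs from an article and writes them to a file. For some articles, it does not pull every paragraph. This is where I am lost. Any guidance would be greatly appreciated. I have included a link to a specific article where not all of the information is extracted. It scrapes everything up until the first quoted sentence.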

URL: http://www.reuters.com/article/2014/03/06/us-syria-crisis-assad-insight-idUSBREA250SD20140306

import urllib2
from bs4 import BeautifulSoup

# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")

# Open txt document for output
txt = open('ctp_output.txt', 'w')

# Parse HTML of article
soup = BeautifulSoup(urllib2.urlopen(url).read())

# retrieve all of the paragraph tags
tags = soup('p')
for tag in tags:
    txt.write(tag.get_text() + '\n' + '\n')

Answer 1

This works for me:

import urllib2
from bs4 import BeautifulSoup

url = "http://www.reuters.com/article/2014/03/06/us-syria-crisis-assad-insight-idUSBREA250SD20140306"

soup = BeautifulSoup(urllib2.urlopen(url))

with open('ctp_output.txt', 'w') as f:
    for tag in soup.find_all('p'):
        f.write(tag.text.encode('utf-8') + '\n')

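Note that you should use the with context manager when working with files. Also, you can pass urllib2.urlopen(url) directly to the BeautifulSoup constructor, since urlopen returns a file-like object.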
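To illustrate the point about the with context manager, here is a minimal sketch that uses only the standard library, so it runs without a network connection or BeautifulSoup installed. The helper name write_paragraphs and the sample strings are hypothetical stand-ins for the tag.get_text() results. It uses io.open with an explicit encoding instead of calling .encode('utf-8') by hand, which is an alternative to the code above rather than the same approach. The second sample string contains curly quotation marks, the kind of non-ASCII character that needs an explicit encoding when written to a file.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_paragraphs(paragraphs, path):
    """Write each paragraph on its own line, encoded as UTF-8.

    Hypothetical helper for illustration; not part of the original answer.
    """
    # io.open takes an encoding on both Python 2.7 and Python 3,
    # so unicode text is encoded as it is written.
    with io.open(path, 'w', encoding='utf-8') as f:
        for text in paragraphs:
            f.write(text + u'\n')
    # The with block has exited, so the file has already been closed.
    return f.closed


# Assumed sample text standing in for the scraped paragraphs.
sample = [u'First paragraph.', u'\u201cA quoted sentence,\u201d he said.']
out_path = os.path.join(tempfile.mkdtemp(), 'ctp_output.txt')
closed = write_paragraphs(sample, out_path)

# Read the file back to check that nothing was lost.
with io.open(out_path, encoding='utf-8') as f:
    read_back = f.read().splitlines()
```

Because the file is closed as soon as the with block ends, the buffered text is flushed to disk even if a later part of the script fails.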

Hope that helps.

Article scraping with BeautifulSoup: scraping all <p> tags
