Question

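I'm very green when it comes to Python, but I can see how powerful it is. I'd like to use it to try a few things, but I'm very much self-taught, so please feel free to explain in the most basic terms. :/

I tried using the Goose extraction tool to pull some text from a URL, and it worked great. It was as simple as...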

from goose import Goose

url = 'http://example.com'
g = Goose()
article = g.extract(url=url)

article.cleaned_text

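I'd like to repeat that process to extract text from hundreds of URLs. Is there a way to set it up so that I can feed in a list of URLs, extract the text from each, and then (I guess) join the results together for NLP or whatever else I want to do? Thanks in advance...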

Answer 1

Just put all the URLs in a text file, one per line, for example:

http://example1.com
http://example2.com
http://example3.com

Then loop over that list:

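from goose import Goose

# Read the list of hundreds of urls from a file, skipping blank lines
with open("url_list.txt", "r") as url_file:
    url_list = [line.strip() for line in url_file if line.strip()]

g = Goose()  # one extractor can be reused for every url
texts = []

# loop over each url
for url in url_list:
    article = g.extract(url=url)

    # process/store ...
    texts.append(article.cleaned_text)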

Once you have the text you need for analysis, store it, and then do the processing in a separate block of code.
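The "store first, process later" idea can be sketched as follows. This is only one possible layout, assuming Python 3 and a JSON Lines file (one record per line). The helper names `save_texts` and `load_texts`, the `extract_text` callback, and the file name `articles.jsonl` are all hypothetical and not part of Goose. Passing the extractor in as a function keeps the storage code independent of any particular library.

```python
import json


def save_texts(urls, extract_text, out_path):
    """Write one JSON record per URL to out_path; return the URLs that failed."""
    failed = []
    with open(out_path, "w", encoding="utf-8") as out:
        for url in urls:
            try:
                text = extract_text(url)
            except Exception as exc:  # a dead link should not stop the whole run
                failed.append((url, str(exc)))
                continue
            out.write(json.dumps({"url": url, "text": text}) + "\n")
    return failed


def load_texts(path):
    """Read the records back as a list of {"url": ..., "text": ...} dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical usage with a Goose object like the one in the answer
# (on Python 3 the import would likely be `from goose3 import Goose`):
# g = Goose()
# failed = save_texts(url_list, lambda u: g.extract(url=u).cleaned_text, "articles.jsonl")
# records = load_texts("articles.jsonl")
```

With hundreds of URLs, some requests will probably fail, so the sketch collects the failures instead of aborting. The analysis code can then start from `load_texts` without downloading anything again.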

Answer 2

Yes, you can iterate over a "list" (which is a Python object) of URLs, or read the URLs from a file:

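Reading the URLs from a file:

from goose import Goose

g = Goose()

with open(url_filename_here) as url_file:
    lines = url_file.readlines()
# each line should contain a different url
for line in lines:
    url = line.strip()  # readlines() keeps the trailing newline
    if not url:
        continue  # skip blank lines
    article = g.extract(url=url)
    # do_more_stuff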
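For the other case this answer mentions, a plain Python list held in memory, a minimal sketch might look like the following. The helper `extract_all` and the `extract_text` callback are hypothetical names introduced here for illustration; the example URLs are the placeholders used elsewhere in this thread.

```python
def extract_all(urls, extract_text):
    """Return the extracted text for each non-empty URL, in order."""
    texts = []
    for url in urls:
        url = url.strip()
        if not url:
            continue  # ignore empty entries
        texts.append(extract_text(url))
    return texts


urls = [
    "http://example1.com",
    "http://example2.com",
    "http://example3.com",
]

# Hypothetical usage with a Goose object like the one in the question:
# g = Goose()
# texts = extract_all(urls, lambda u: g.extract(url=u).cleaned_text)
# corpus = "\n\n".join(texts)  # one big string, e.g. as input for NLP
```

Keeping the texts in a list and joining them only at the end covers both needs: the per-article texts stay available, and the combined corpus is one `join` call away.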

Extracting text from multiple URLs
