I have a dataframe with a column containing more than 4000 different article URLs. I implemented the following code to extract all the text from the URLs. It seems to work for one or two URLs, but not for all of them.
    import urllib3
    from bs4 import BeautifulSoup as bsoup

    for i in df.url:
        http = urllib3.PoolManager()
        response = http.request('GET', i)
        soup = bsoup(response.data, 'html.parser')
        # kill all script and style elements
        for script in soup(["script", "style"]):
            script.extract()  # rip it out
        # get text
        text = soup.get_text()
        # break into lines and remove leading and trailing space on each
        lines = (line.strip() for line in text.splitlines())
        # break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
        # drop blank lines
        text = '\n'.join(chunk for chunk in chunks if chunk)
        print(text)
        break
Answer 0 (score: 0)
Your `for` loop ends with a `break` statement, which exits the loop after the first iteration. That is why you only see output for a single URL, not the bug you might expect from reassigning `soup` on each pass — the loop never gets past the first URL at all.

Remove the `break` and keep all of the processing inside the loop body:
    import urllib3
    from bs4 import BeautifulSoup as bsoup

    # create the connection pool once, outside the loop
    http = urllib3.PoolManager()

    for url in df.url:
        response = http.request('GET', url)
        soup = bsoup(response.data, 'html.parser')
        # kill all script and style elements
        for script in soup(["script", "style"]):
            script.extract()  # rip it out
        # get text
        text = soup.get_text()
        # break into lines and remove leading and trailing space on each
        lines = (line.strip() for line in text.splitlines())
        # break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
        # drop blank lines
        text = '\n'.join(chunk for chunk in chunks if chunk)
        print(url)
        print(text)
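With 4000+ URLs, some requests will almost certainly time out or return errors, and a single failure would crash the loop above. A minimal sketch of a more robust variant (the function names `extract_text` and `extract_all` are my own, not from the question) that skips failed requests and keeps the results aligned with the input rows:

```python
import urllib3
from bs4 import BeautifulSoup


def extract_text(html):
    """Same cleanup as the loop body above: drop script/style, collapse blank lines."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.extract()
    lines = (line.strip() for line in soup.get_text().splitlines())
    return "\n".join(line for line in lines if line)


def extract_all(urls):
    """Fetch every URL; record None for failures instead of aborting the whole run."""
    http = urllib3.PoolManager()
    texts = []
    for url in urls:
        try:
            response = http.request("GET", url, timeout=10.0)
            texts.append(extract_text(response.data))
        except urllib3.exceptions.HTTPError:
            texts.append(None)  # placeholder keeps list aligned with the input
    return texts
```

Because `extract_all` returns one entry per input row, you can store the result back on the dataframe with `df['text'] = extract_all(df.url)` and inspect the `None` rows afterwards to see which URLs failed.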