How can I extract text from multiple URLs using BeautifulSoup?

Asked: 2019-09-26 06:32:17

Tags: web-scraping

I am working on sales lead generation and want to extract the text from a number of URLs. Below is my code for extracting a single URL. How should I go about extracting several URLs and saving the results to a dataframe?

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.wdtl.com/'
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')  # name a parser explicitly

# remove script and style elements so only visible text remains
for script in soup(["script", "style"]):
    script.extract()    # rip it out

text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
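The single-URL routine above can be wrapped in a function and looped over a list of URLs, collecting one row per page into a dataframe. This is a minimal sketch of that idea; the function names, the column names, and the URL in the usage note are illustrative placeholders, not part of the original post:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

def extract_text(html):
    """Strip <script>/<style> tags and return the page's visible text."""
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup(['script', 'style']):
        tag.extract()
    lines = (line.strip() for line in soup.get_text().splitlines())
    return '\n'.join(line for line in lines if line)

def scrape_to_frame(urls):
    """Fetch each URL and return a DataFrame with 'url' and 'text' columns."""
    rows = []
    for url in urls:
        resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        rows.append({'url': url, 'text': extract_text(resp.content)})
    return pd.DataFrame(rows, columns=['url', 'text'])

# usage (requires network access):
# df = scrape_to_frame(['https://www.wdtl.com/'])
```

Keeping the fetch and the text-cleaning in separate functions also makes the cleaning step easy to test against a local HTML string.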

1 Answer:

Answer 0 (score: 0)

If I understand you correctly, this simplified approach should get you there. Let's see if it works for you:

import pandas as pd
from bs4 import BeautifulSoup as bs
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.wdtl.com/'

resp = requests.get(url, headers=headers)
soup = bs(resp.content, "lxml")

# first, find the <link> tags that carry an href attribute
links = soup.find_all('link', href=True)

# create a list to house the links
all_links = []

# find each link and add it to the list
for link in links:
    if 'http' in link['href']:  # the soup contains many non-http links; this skips them
        all_links.append(link['href'])

#finally, load the list into a dataframe
df = pd.DataFrame(all_links)
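The answer above ends with a one-column dataframe of link URLs. To connect it back to the question's goal of extracting text, a possible extension (assumed here, not part of the answer) fetches each collected link and stores its visible text next to its URL. The helper names and columns are placeholders:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

def visible_text(content):
    """Parse an HTML body and return its text with script/style removed."""
    soup = bs(content, 'html.parser')
    for tag in soup(['script', 'style']):
        tag.extract()
    return '\n'.join(line.strip() for line in soup.get_text().splitlines()
                     if line.strip())

def links_to_frame(all_links):
    """Fetch every URL in all_links and build one DataFrame row per page."""
    headers = {'User-Agent': 'Mozilla/5.0'}
    records = []
    for href in all_links:
        resp = requests.get(href, headers=headers)
        records.append({'url': href, 'text': visible_text(resp.content)})
    return pd.DataFrame(records, columns=['url', 'text'])

# usage, continuing from the answer's all_links list (requires network access):
# df = links_to_frame(all_links)
```

Note that `find_all('link', href=True)` matches `<link>` elements (stylesheets, icons, and the like); if the pages you want are in the page body, `find_all('a', href=True)` would collect anchor links instead.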