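I'm doing sales lead generation and want to extract the text from a number of URLs. Below is my code for extracting the text from a single URL. How can I do the same for multiple URLs and save the results to a DataFrame?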
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://www.wdtl.com/'
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning
for script in soup(["script", "style"]):
    script.extract()  # rip it out
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
# split on double spaces so multi-part headlines end up on separate lines
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
Answer 0 (score: 0)
If I understand you correctly, this simplified approach should get you there. Let's see whether it works for you:
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests
headers={'User-Agent':'Mozilla/5.0'}
url = 'https://www.wdtl.com/'
resp = requests.get(url,headers = headers)
soup = bs(resp.content, "lxml")
#first, find the <link> elements that carry an href (use 'a' instead to collect hyperlinks from anchor tags)
links = soup.find_all('link',href=True)
#create a list to house the links
all_links= []
#find each link and add it to the list
for link in links:
    if 'http' in link['href']: #the soup contains many non-http links; this will remove them
        all_links.append(link['href'])
#finally, load the list into a dataframe
df = pd.DataFrame(all_links, columns=['link'])
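If the goal is the one stated in the question, i.e. the visible text of several URLs with one row per URL, something along these lines might work. This is only a minimal sketch: the helper names `fetch_html`, `html_to_text` and `scrape_to_dataframe`, the `error` column and the 10-second timeout are my own assumptions, not anything from the question. It reuses the question's text-cleaning steps and fetches with the standard library's `urlopen`.

```python
from urllib.request import Request, urlopen

import pandas as pd
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page as bytes (hypothetical helper; timeout is an assumed default)."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read()


def html_to_text(html):
    """Drop <script>/<style> elements and return the remaining text, one chunk per line."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.extract()
    # strip whitespace around every line of the extracted text
    lines = (line.strip() for line in soup.get_text().splitlines())
    # split on double spaces so multi-part headlines end up on separate lines
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop empty chunks
    return "\n".join(chunk for chunk in chunks if chunk)


def scrape_to_dataframe(urls, fetch=fetch_html):
    """Return a DataFrame with one row per URL and the columns url, text, error."""
    rows = []
    for url in urls:
        try:
            rows.append({"url": url, "text": html_to_text(fetch(url)), "error": None})
        except Exception as exc:  # record the failure and continue with the next URL
            rows.append({"url": url, "text": None, "error": str(exc)})
    return pd.DataFrame(rows, columns=["url", "text", "error"])


# Example usage (performs real HTTP requests):
# df = scrape_to_dataframe(["https://www.wdtl.com/"])
# df.to_csv("leads.csv", index=False)
```

Passing `fetch` as a parameter keeps the parsing testable without network access, and recording failures in an `error` column means one unreachable site does not abort the whole run.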