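I am using Selenium WebDriver to extract data points from LinkedIn profiles. In this example, I want to extract each skill from the "Skills" section, but the data comes back as HTML.
When I try to convert the HTML to text, I get the attached error message.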
from parsel import Selector
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
driver = webdriver.Chrome('/Users/davidcraven/Downloads/chromedriver')
# get profile URL
driver.get('https://www.linkedin.com/AnyProfileURL')
# assigning the source code for the web page to variable sel
sel = Selector(text=driver.page_source)
# get skills
skills = sel.xpath('//*[starts-with(@class, "skills searchable has-several ")]').extract()
newtext = BeautifulSoup(skills, "lxml").text
Answer 0 (score: 0)
You can use Selenium to get all of the text from the page.
Try this. The following code prints the text to the console.
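from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(executable_path="chromedriver.exe")
driver.get("https://www.linkedin.com/in/profile")
# find_element(By.TAG_NAME, ...) replaces find_element_by_tag_name, which was removed in Selenium 4
elem = driver.find_element(By.TAG_NAME, "body")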
print(elem.text)
driver.quit()
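Edit:
In your code, sel.xpath().extract() returns a list, which is assigned to skills. BeautifulSoup expects a single string of markup, so passing the whole list to it fails.
You have to iterate over the list and convert each item to text. The following code prints the text it finds to the console.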
from parsel import Selector
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path="chromedriver")
# get profile URL
driver.get('https://www.linkedin.com/in/AnyProfile')
# assigning the source code for the web page to variable sel
sel = Selector(text=driver.page_source)
# get skills
skills = sel.xpath('//*[starts-with(@class, "skills searchable has-several ")]').extract()
# newtext = BeautifulSoup(skills, "lxml").text
for skl in skills:
    print(BeautifulSoup(skl, "lxml").text)
driver.quit()
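To try the "iterate over the list" idea without a browser or a LinkedIn login, here is a minimal sketch that uses only the standard library. It swaps in html.parser for BeautifulSoup, and the skills list is a hypothetical stand-in for what extract() might return; the class name TextExtractor and the helper fragment_to_text are made up for illustration and are not part of parsel, Selenium, or bs4.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of one HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        stripped = data.strip()
        if stripped:
            self.parts.append(stripped)


def fragment_to_text(fragment):
    """Return the visible text of a single HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join(parser.parts)


# Assumption: a hand-written list standing in for the result of .extract().
skills = [
    '<li class="skill"><span>Python</span></li>',
    '<li class="skill"><span>Data</span> <span>Analysis</span></li>',
]

# Convert each fragment on its own; the list as a whole is not valid parser input.
texts = [fragment_to_text(fragment) for fragment in skills]
for text in texts:
    print(text)
```

The point is the same as in the answer's loop: each element of the list is a separate HTML string, so the conversion to text has to happen per element.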