I'm trying to scrape this website, follow each article's href, and grab the comments that come after the body text. However, the results I get are blank. I've also tried pulling every li with soup.find_all('li') to check whether any comments exist at all, and found that even extracting all the li elements doesn't surface any comments for the article. Can anyone advise? I suspect the site makes this text harder to get at.
import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd

urls = [
    'https://hypebeast.com/brands/jordan-brand'
]

with requests.Session() as s:
    for url in urls:
        driver = webdriver.Chrome('/Users/Documents/python/Selenium/bin/chromedriver')
        driver.get(url)
        products = [element for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='post-box ']")))]
        soup = bs(driver.page_source, 'lxml')
        element = soup.select('.post-box ')
        time.sleep(1)
        ahref = [item.find('a')['href'] for item in element]
        results = list(zip(ahref))
        df = pd.DataFrame(results)
        for result in results:
            res = driver.get(result[0])
            soup = bs(driver.page_source, 'lxml')
            time.sleep(6)
            comments_href = soup.find_all('ul', {'id': 'post-list'})
            print(comments_href)
Answer 0 (score: 1)
The posts/comments are inside an <iframe> tag. That tag also has a dynamic attribute that starts with dsq-app. So what you need to do is find that iframe, switch to it, and then parse. I chose to use BeautifulSoup to pull out the script tag, read it in as JSON, and navigate within that. This should get you going on finding what you're after:
import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
import json

urls = [
    'https://hypebeast.com/brands/jordan-brand'
]

with requests.Session() as s:
    for url in urls:
        driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
        driver.get(url)
        products = [element for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='post-box ']")))]
        soup = bs(driver.page_source, 'lxml')
        element = soup.select('.post-box ')
        time.sleep(1)
        ahref = [item.find('a')['href'] for item in element]
        results = list(zip(ahref))
        df = pd.DataFrame(results)

        for result in ahref:
            driver.get(result)
            time.sleep(6)

            # The Disqus comments live in an <iframe> whose name starts with "dsq-app"
            # (find_element(By.XPATH, ...) is the Selenium 4 form of find_element_by_xpath)
            iframe = driver.find_element(By.XPATH, '//iframe[starts-with(@name, "dsq-app")]')
            driver.switch_to.frame(iframe)

            soup = bs(driver.page_source, 'html.parser')
            scripts = soup.find_all('script')
            for script in scripts:
                if 'response' in script.text:
                    jsonStr = script.text
                    jsonData = json.loads(jsonStr)
                    for each in jsonData['response']['posts']:
                        author = each['author']['username']
                        message = each['raw_message']
                        print('%s: %s' % (author, message))
Output:
annvee: Lemme get them BDSM jordans fam
deathb4designer: Lmao
zenmasterchen: not sure why this model needed to exist in the first place
Spawnn: Issa flop.
disqus_lEPADa2ZPn: looks like an AF1
Lekkerdan: Hoodrat shoes.
rubnalntapia: Damn this are sweet
marcellusbarnes: Dope, and I hate Jordan lows
marcellusbarnes: The little jumpman on the back is dumb
chickenboihotsauce: copping those CPFM gonna be aids
lowercasegod: L's inbound
monalisadiamante: Sold out in 4 minutes.
nickpurita: Those CPFM’s r overhyped AF.
...
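Once you're inside the iframe, the JSON navigation in the loop above is plain dict access. Here is a minimal, self-contained sketch of that step using an invented sample payload (the 'response'/'posts'/'author'/'raw_message' field names are taken from the code above; the data itself is made up for illustration):

```python
import json

# Invented sample payload mirroring the shape navigated in the answer:
# response -> posts -> [{author: {username}, raw_message}]
jsonStr = """
{"response": {"posts": [
    {"author": {"username": "alice"}, "raw_message": "Lemme get those"},
    {"author": {"username": "bob"}, "raw_message": "Issa flop."}
]}}
"""

jsonData = json.loads(jsonStr)
for each in jsonData['response']['posts']:
    # Same traversal as in the answer's inner loop
    print('%s: %s' % (each['author']['username'], each['raw_message']))
```

On the live page, the fixed time.sleep(6) could also be replaced with an explicit wait such as EC.frame_to_be_available_and_switch_to_it(...), which waits for the iframe and switches into it in one step.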