How to extract hidden li text

Date: 2019-05-21 09:24:04

Tags: python-3.x web-scraping selenium-chromedriver

I am trying to scrape this website, follow the href of each article, and scrape the comments that come after the body text. However, the results I get are blank. I have also printed everything returned by soup.find_all('li') to check whether any comments are present, and found that even the full list of li elements does not contain any of the article's comments. Can someone advise? I suspect the site makes it harder to get at the text.

import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd

urls = [
    'https://hypebeast.com/brands/jordan-brand'
]

with requests.Session() as s:
    for url in urls:
        driver = webdriver.Chrome('/Users/Documents/python/Selenium/bin/chromedriver')
        driver.get(url)
        products = [element for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='post-box    ']")))]
        soup = bs(driver.page_source, 'lxml')
        element = soup.select('.post-box    ')
        time.sleep(1)
        ahref = [item.find('a')['href']  for item in element]
        results = list(zip(ahref))
        df = pd.DataFrame(results)
        for result in results:
            driver.get(result[0])
            time.sleep(6)
            soup = bs(driver.page_source, 'lxml')
            comments_href = soup.find_all('ul', {'id': 'post-list'})
            print(comments_href)

1 Answer:

Answer 0 (score: 1)

The posts/comments are located in an <iframe> tag. That tag also has a dynamic attribute that starts with dsq-app. So what you need to do is find that iframe, switch to it, and then parse it. I chose to use BeautifulSoup to pull out the script tag, read it in as JSON, and navigate within that. This should get you going toward what you're looking for:

import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
import json

urls = [
    'https://hypebeast.com/brands/jordan-brand'
]

with requests.Session() as s:
    for url in urls:
        driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
        driver.get(url)
        products = [element for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='post-box    ']")))]
        soup = bs(driver.page_source, 'lxml')
        element = soup.select('.post-box    ')
        time.sleep(1)
        ahref = [item.find('a')['href']  for item in element]
        results = list(zip(ahref))
        df = pd.DataFrame(results)
        for result in ahref:

            driver.get(result)
            time.sleep(6)

            # the Disqus comments live in an iframe whose name starts with "dsq-app"
            iframe = driver.find_element_by_xpath('//iframe[starts-with(@name, "dsq-app")]')

            driver.switch_to.frame(iframe)
            soup = bs(driver.page_source, 'html.parser')

            # the comment data is embedded as JSON inside one of the iframe's <script> tags
            scripts = soup.find_all('script')
            for script in scripts:
                if 'response' in script.text:
                    jsonStr = script.text
                    jsonData = json.loads(jsonStr)

                    for each in jsonData['response']['posts']:
                        author = each['author']['username']
                        message = each['raw_message']
                        print('%s: %s' %(author, message))

Output:

annvee: Lemme get them BDSM jordans fam
deathb4designer: Lmao
zenmasterchen: not sure why this model needed to exist in the first place
Spawnn: Issa flop.
disqus_lEPADa2ZPn: looks like an AF1
Lekkerdan: Hoodrat shoes.
rubnalntapia: Damn this are sweet
marcellusbarnes: Dope, and I hate Jordan lows
marcellusbarnes: The little jumpman on the back is dumb
chickenboihotsauce: copping those CPFM gonna be aids
lowercasegod: L's inbound
monalisadiamante: Sold out in 4 minutes. 
nickpurita: Those CPFM’s r overhyped AF.
...
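
The extract-JSON-from-a-script-tag step in the answer can be exercised without a browser. Below is a minimal sketch of that step using only the standard library's html.parser, run against a made-up stand-in for the iframe's page source (the real payload comes from driver.page_source after switching into the Disqus iframe; the structure of the sample JSON mirrors the response/posts layout the answer navigates):

```python
import json
from html.parser import HTMLParser

class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> tag in a page."""
    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts.append(data)

# Hypothetical stand-in for the iframe's HTML; the real source is
# driver.page_source after driver.switch_to.frame(iframe).
html = '''<html><body>
<script>{"response": {"posts": [
  {"author": {"username": "annvee"},
   "raw_message": "Lemme get them BDSM jordans fam"}
]}}</script>
</body></html>'''

parser = ScriptCollector()
parser.feed(html)

comments = []
for script in parser.scripts:
    # same heuristic as the answer: the payload script mentions 'response'
    if 'response' in script:
        data = json.loads(script)
        for post in data['response']['posts']:
            comments.append((post['author']['username'], post['raw_message']))

print(comments)
```

This isolates the parsing logic so it can be tested offline; in the full scraper, swap the hard-coded html string for the iframe's page source.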