Scraping data displayed on an overlay / new window

Date: 2019-05-25 22:03:33

Tags: python web-scraping beautifulsoup python-requests

I am completely new to web scraping, and I want to scrape the reviews and property replies from https://www.hostelworld.com/hosteldetails.php/HI-NYC-Hostel/New-York/1850#reviews

However, the HTML I get back seems to be for the hostel page itself rather than for the overlay that holds the reviews, and I would like to know how to fetch and scrape the review panel instead.

I can scrape the user reviews themselves, but the property replies appear to come from a different source than the review panel: I cannot find any class or text in the fetched HTML that corresponds to them. Any help or suggestions would be appreciated.

1 Answer:

Answer 0 (score: 1)

If you want to scrape the entire review panel (all pages), I suggest you use the following:

import requests
import pandas as pd

numb_of_pages = 10  # enter the number of pages you want to scrape
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0"}
pages = []

for nmb in range(1, numb_of_pages + 1):
    url = f"https://www.hostelworld.com/properties/1850/reviews?sort=newest&page={nmb}&monthCount=36"
    # each response is JSON; the "reviews" key holds the list of review records
    data_raw = requests.get(url, headers=headers).json()
    pages.append(pd.DataFrame(data_raw["reviews"]))
    print(f"page: {nmb} out of {numb_of_pages}")

df = pd.concat(pages, ignore_index=True)
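The block above still needs the page count chosen up front. If you truly want every page, a minimal sketch is to keep requesting until the endpoint stops returning reviews. This assumes the same unofficial endpoint as above, and assumes that an exhausted page comes back with an empty `reviews` list (worth verifying in the browser's network tab):

```python
import requests
import pandas as pd

def fetch_all_reviews(property_id, fetch_page, max_pages=100):
    """Collect pages of reviews until one comes back empty."""
    pages = []
    for page in range(1, max_pages + 1):
        reviews = fetch_page(property_id, page)
        if not reviews:
            break  # assumption: an empty list means we have run out of pages
        pages.append(pd.DataFrame(reviews))
    return pd.concat(pages, ignore_index=True) if pages else pd.DataFrame()

def fetch_page_http(property_id, page):
    # Same unofficial endpoint as in the answer's code; its parameters
    # are not documented and may change without notice.
    url = (f"https://www.hostelworld.com/properties/{property_id}"
           f"/reviews?sort=newest&page={page}&monthCount=36")
    headers = {"User-Agent": "Mozilla/5.0"}
    return requests.get(url, headers=headers).json().get("reviews", [])

# df = fetch_all_reviews(1850, fetch_page_http)
```

Passing the page fetcher in as a function also lets you test the stopping logic without hitting the site.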

Alternatively, if you only need the reviews from a few pages, you can use this code:

import requests
import pandas as pd

numb_of_pages = 10  # enter the number of pages you want to scrape
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0"}
pages = []

for nmb in range(1, numb_of_pages + 1):
    url = f"https://www.hostelworld.com/properties/1850/reviews?sort=newest&page={nmb}&monthCount=36"
    data_raw = requests.get(url, headers=headers).json()
    pages.append(pd.DataFrame(data_raw["reviews"]))
    print(f"page: {nmb} out of {numb_of_pages}")

df = pd.concat(pages, ignore_index=True)
print(df)

(PS: the reviews come back as JSON, so you do not need BeautifulSoup)
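Since the property replies the question asks about should live somewhere in that same JSON, one way to locate them is to flatten a record and inspect the resulting columns. The field names below (`notes`, `ownerComment`, the nested `rating`) are purely hypothetical placeholders, not the real Hostelworld schema; the actual names have to be read off the response in the browser's network tab:

```python
import pandas as pd

# Hypothetical shape of a review record -- NOT the real Hostelworld schema;
# check the actual JSON response for the true field names.
sample_reviews = [
    {"rating": {"overall": 9.0}, "notes": "Great stay!",
     "ownerComment": "Thanks for visiting!"},
    {"rating": {"overall": 7.5}, "notes": "A bit noisy.",
     "ownerComment": None},
]

# json_normalize flattens nested dicts into dotted column names,
# e.g. "rating.overall", so every available field is visible at a glance.
df = pd.json_normalize(sample_reviews)

# Keep only the reviews that actually carry a property reply
replies = df[df["ownerComment"].notna()][["notes", "ownerComment"]]
```

Printing `df.columns` on one real page of data would reveal which field, if any, holds the property reply.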

I hope this helps.