I'm completely new to web scraping, and I'd like to scrape the reviews and property replies from https://www.hostelworld.com/hosteldetails.php/HI-NYC-Hostel/New-York/1850#reviews. However, the HTML I get back appears to be for the hostel page itself rather than the overlay that contains the reviews, and I'd like to know how to fetch and scrape the review panel.
I can scrape user reviews with the following snippet,
$adsql = "SELECT meta_value FROM ue_usermeta WHERE meta_key LIKE 'billing_first_name' AND user_id IN (SELECT user_id FROM ue_stm_lms_user_quizzes WHERE status LIKE '%pass%')";
$sorgulaad = mysqli_query($baglanti, $adsql);
$soyadsql = "SELECT meta_value FROM ue_usermeta WHERE meta_key LIKE 'billing_last_name' AND user_id IN (SELECT user_id FROM ue_stm_lms_user_quizzes WHERE status LIKE '%pass%')";
$sorgulasoyad = mysqli_query($baglanti, $soyadsql);
$mailsql = "SELECT meta_value FROM ue_usermeta WHERE meta_key LIKE 'billing_email' AND user_id IN (SELECT user_id FROM ue_stm_lms_user_quizzes WHERE status LIKE '%pass%')";
$sorgulamail = mysqli_query($baglanti, $mailsql);
for ($i = 1; ($sonucad = mysqli_fetch_array($sorgulaad, MYSQLI_ASSOC)) && ($sonucsoyad = mysqli_fetch_array($sorgulasoyad, MYSQLI_ASSOC)) && ($sonucmail = mysqli_fetch_array($sorgulamail, MYSQLI_ASSOC)); $i++) {
    echo '<tr class="data">';
    echo '
        <td class="icerik"><span class="icerik">' . $i . '</span></td>
        <td class="icerik"><span class="icerik">' . $sonucad['meta_value'] . '</span></td>
        <td class="icerik"><span class="icerik">' . $sonucsoyad['meta_value'] . '</span></td>
        <td class="icerik"><span class="icerik">' . $sonucmail['meta_value'] . '</span></td>
    </tr>
    ';
}
, but the review panel seems to come from a different source, because I can't see any class or text corresponding to the property replies. Any help or advice would be appreciated.
Answer 0 (score: 1)
If you want to scrape the entire review panel (all pages), I suggest using the following code:
import requests
import pandas as pd

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0"}
frames = []
page = 1
while True:
    url = f"https://www.hostelworld.com/properties/1850/reviews?sort=newest&page={page}&monthCount=36"
    data_raw = requests.get(url, headers=headers).json()
    if not data_raw["reviews"]:  # stop once a page comes back empty
        break
    frames.append(pd.DataFrame(data_raw["reviews"]))
    print(f"page: {page}")
    page += 1

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
df = pd.concat(frames, ignore_index=True)
Alternatively, if you only need a few pages of reviews, you can use this code:
import requests
import pandas as pd

numb_of_pages = 10  # enter the number of pages you want to scrape
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0"}
frames = []
for nmb in range(1, numb_of_pages + 1):  # range's upper bound is exclusive
    url = f"https://www.hostelworld.com/properties/1850/reviews?sort=newest&page={nmb}&monthCount=36"
    data_raw = requests.get(url, headers=headers).json()
    frames.append(pd.DataFrame(data_raw["reviews"]))
    print(f"page: {nmb} out of {numb_of_pages}")

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
df = pd.concat(frames, ignore_index=True)
print(df)
(PS: the reviews come back as JSON, so you don't need BeautifulSoup.)
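Since each page is just a list of review dicts under a "reviews" key, the accumulation step can be sketched offline with mock data (the field names below are illustrative, not Hostelworld's actual schema):

```python
import pandas as pd

# Two mock "pages" shaped like the JSON the endpoint returns:
# a list of review dicts under the "reviews" key (fields are hypothetical).
pages = [
    {"reviews": [{"rating": 9.1, "notes": "Great hostel"},
                 {"rating": 7.5, "notes": "A bit noisy"}]},
    {"reviews": [{"rating": 8.0, "notes": "Clean rooms"}]},
]

# Collect one DataFrame per page, then concatenate once at the end;
# DataFrame.append was removed in pandas 2.0.
frames = [pd.DataFrame(page["reviews"]) for page in pages]
df = pd.concat(frames, ignore_index=True)

print(len(df))  # 3 rows across both pages
```

A single `pd.concat` at the end is also much faster than growing the DataFrame page by page, since each append-style call copies all the data accumulated so far.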
I hope this helps.