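I am trying to scrape customer complaint data from a website. I can get the titles and dates, but I can't figure out how to get the view counts.
Here is the code that gets the titles: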
import requests
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Mozilla/5.0'}
complaints = []
time = []
for i in range(100):
    # Fetch one listing page and collect its titles and timestamps
    r = requests.get(f'https://www.sikayetvar.com/sikayetler?brand=bosch&page={i}', headers=headers)
    soup = bs(r.content, 'html.parser')
    complaints += soup.find_all('h5', {"class": "card-title"})
    time += soup.find_all('span', {"class": "info-icn time-tooltip"})
Complaint count:
I tried to get the number "479" with the following code:
site = 'https://www.sikayetvar.com/sikayetler?brand=bosch'
r = requests.get(site)
soup = bs(r.content, 'html.parser')
time = soup.find_all('span', {"class":"count"})
It returns:
print(time[0])
<span class="count">-</span>
print(time[0].text)
-
I only get "-" instead of "479". Can anyone tell me what I am doing wrong, or how to get that number?
Thanks in advance.
Answer 0 (score: 0)
There may be other elements with that class. Try retrieving the enclosing element instead; it will contain a span that holds the count.
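Answer 1 (score: 0)
Most likely the value is rendered into the DOM by JavaScript and does not exist in the original HTML. Check the HTML source rather than the inspector. I checked the source of page 3 (view-source:https://www.sikayetvar.com/sikayetler?brand=bosch&page=3): where the page displays 377, the count in the source is actually just "-".
This answer may help you with pre-rendering.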
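A quick way to confirm this without opening the browser is to look at what the count spans contain in the HTML that the server actually sends. The following is only a minimal sketch using the standard library; the helper names (CountSpanCollector, count_values, looks_client_rendered) are made up for illustration, and the sample HTML is just the span from the question rather than a full page.

```python
from html.parser import HTMLParser


class CountSpanCollector(HTMLParser):
    """Collect the text inside every <span class="count"> element."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "count" in classes:
            self._inside = True
            self.values.append("")

    def handle_data(self, data):
        if self._inside:
            self.values[-1] += data

    def handle_endtag(self, tag):
        # Simplification: assumes count spans do not contain nested spans.
        if tag == "span":
            self._inside = False


def count_values(html):
    """Return the stripped text of each span.count found in html."""
    parser = CountSpanCollector()
    parser.feed(html)
    parser.close()
    return [value.strip() for value in parser.values]


def looks_client_rendered(html, placeholder="-"):
    """True if count spans exist but all still hold the placeholder."""
    values = count_values(html)
    return bool(values) and all(value == placeholder for value in values)


# Hypothetical sample: the span as it appears in the raw response.
raw_html = '<div><span class="count">-</span></div>'
print(count_values(raw_html))
print(looks_client_rendered(raw_html))
```

If looks_client_rendered returns True for r.text from requests, the numbers are being filled in after page load, so no choice of selector on the static HTML will find them.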
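Answer 2 (score: 0)
Those view counts are rendered by JavaScript, and there is usually a good chance of finding the related XHR request in the browser's Network tab. I found the endpoint that returns the view counts as JSON: https://collector.sikayetvar.com/complaints/view-count?complaints=14793971%2C14792534%2C14791327%2C14789107%2C14787731%2C14787333%2C14787160%2C14784871%2C14784832%2C14784823%2C14784535%2C14783714%2C14762547%2C14783312%2C14783264%2C14782822
The complaints parameter holds the ID of each complaint, separated by commas. The IDs can be found in the source of https://www.sikayetvar.com/sikayetler?brand=bosch; we only need to replace the commas with '%2C' and join them together.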
import requests
from bs4 import BeautifulSoup

headers = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36 OPR/68.0.3618.125'}
brand = 'bosch'
site = f'https://www.sikayetvar.com/sikayetler?brand={brand}'
req = requests.get(site, headers=headers)
# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(req.content, 'lxml')
# Each view-count span carries its complaint ID in the data-id attribute
complaints = soup.find_all('span', {'class':'js-view-count'})
complaints = '%2C'.join(i['data-id'] for i in complaints)
print(complaints)
# Request the view counts for all IDs on the page in a single call
site = f'https://collector.sikayetvar.com/complaints/view-count?complaints={complaints}'
r = requests.get(site)
print(r.json())
Output:
[{'id': 14793971, 'viewCount': 43, 'uniqueViewCount': 29}, {'id': 14792534,'viewCount': 501, 'uniqueViewCount': 263}, {'id': 14791327, 'viewCount': 218,'uniqueViewCount': 134}, {'id': 14789107, 'viewCount': 492, 'uniqueViewCount': 286},{'id': 14787731, 'viewCount': 581, 'uniqueViewCount': 342}, {'id': 14787333, 'viewCount':506, 'uniqueViewCount': 287}, {'id': 14787160, 'viewCount': 428,'uniqueViewCount': 249}, {'id': 14784871, 'viewCount': 25, 'uniqueViewCount': 16},{'id': 14784832, 'viewCount': 29, 'uniqueViewCount': 15}, {'id': 14784823, 'viewCount':20, 'uniqueViewCount': 8}, {'id': 14784535, 'viewCount': 99, 'uniqueViewCount': 57}, {'id':14783714, 'viewCount': 510, 'uniqueViewCount': 280}, {'id': 14762547, 'viewCount':139, 'uniqueViewCount': 85}, {'id': 14783312, 'viewCount': 142,'uniqueViewCount': 78}, {'id': 14783264, 'viewCount': 190, 'uniqueViewCount': 114},{'id': 14782822, 'viewCount': 216, 'uniqueViewCount': 120}]
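The other approach is to visit each complaint individually, which would take a lot of time, and the site may block you if you keep scraping.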
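When scraping many pages, it may help to separate the URL building and the response handling from the network calls. The sketch below makes some assumptions: the helper names (chunked, build_view_count_url, index_view_counts) are made up for illustration, and whether the endpoint accepts more IDs per request than the 16 shown above is not known, so the batch size is left as a parameter. urllib.parse.urlencode takes care of turning the commas into %2C.

```python
from urllib.parse import urlencode

VIEW_COUNT_URL = "https://collector.sikayetvar.com/complaints/view-count"


def chunked(items, size):
    """Yield consecutive slices of items with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_view_count_url(ids):
    """Build the endpoint URL; urlencode escapes the commas as %2C."""
    joined = ",".join(str(complaint_id) for complaint_id in ids)
    return VIEW_COUNT_URL + "?" + urlencode({"complaints": joined})


def index_view_counts(payload):
    """Map complaint ID to viewCount from the endpoint's JSON list."""
    return {entry["id"]: entry["viewCount"] for entry in payload}


ids = [14793971, 14792534, 14791327]
print(build_view_count_url(ids))

# Two entries copied from the output shown above.
sample = [
    {"id": 14793971, "viewCount": 43, "uniqueViewCount": 29},
    {"id": 14792534, "viewCount": 501, "uniqueViewCount": 263},
]
print(index_view_counts(sample))
```

Each URL from build_view_count_url can then be passed to requests.get, with r.json() fed to index_view_counts, one batch at a time.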
Answer 3 (score: 0)
Here is one way to parse the names and their associated counts from that site:
import requests
from bs4 import BeautifulSoup

link = 'https://collector.sikayetvar.com/complaints/view-count'
url = 'https://www.sikayetvar.com/sikayetler?brand=bosch&page=1'

def get_count(data_id):
    # Ask the view-count endpoint for a single complaint ID
    r = requests.get(link, params={'complaints': f'{data_id}'})
    complaints = r.json()[0]['viewCount']
    return complaints

r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, "html.parser")
# Each complaint card carries its ID in the data-id attribute
for item in soup.select(".filter-cards > article.card"):
    name = item.select_one("h5.card-title > a").get_text(strip=True)
    complaints = get_count(item['data-id'])
    print(name, complaints)
The output looks like this:
Ürünleri Alamıyoruz Ve Bosch Hiçbir Sorumluluk Kabul Etmiyor 50
Bosch Buzdolabından Gelen Ses 508
Bosch Online Mağaza Sipariş Gecikmesi 225
Bosch Buzdolabı Soğutmuyor Alarm Ötüyor 499
Bosch Buzdolabı Sebze Ve Meyveliği Su Doluyor 589
Bosch Bulaşık Makinesi E09 Hatası 526
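The loop above sends one request per complaint. If the endpoint is instead called once with all IDs on the page, the titles and counts have to be matched up afterwards. The following is only a sketch of that matching step: attach_counts is a made-up helper, and the IDs 101 to 103 are hypothetical placeholders, not real complaint IDs. Note that data-id comes out of the HTML as a string while the JSON uses integers, so both sides are converted before the lookup.

```python
def attach_counts(cards, payload):
    """Pair each (data_id, title) card with its viewCount, or None if absent."""
    counts = {int(entry["id"]): entry["viewCount"] for entry in payload}
    return [(title, counts.get(int(data_id))) for data_id, title in cards]


# Hypothetical IDs; titles and counts are taken from the output above.
cards = [
    ("101", "Bosch Buzdolabından Gelen Ses"),
    ("102", "Bosch Bulaşık Makinesi E09 Hatası"),
    ("103", "Bosch Online Mağaza Sipariş Gecikmesi"),
]
payload = [
    {"id": 102, "viewCount": 526},
    {"id": 101, "viewCount": 508},
]

for title, views in attach_counts(cards, payload):
    print(title, views)
```

A card whose ID is missing from the response gets None rather than raising, which makes gaps easy to spot when the results are written out.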