I'm learning Python by trying to write a script for xHamster. If anyone is familiar with the site, I'm specifically trying to write all of the URLs for a given user's videos to a .txt file.
Currently I've managed to scrape the URLs from a given page, but there are multiple pages, and I'm struggling to loop through the number of pages.
In my attempt below I've commented where I try to read the URL of the next page, but it currently prints None. Any ideas why, and how to fix it?
Current script:
#!/usr/bin/env python
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--incognito")
driver = webdriver.Chrome(chrome_options=chrome_options)
username = **ANY_USERNAME**
##page = 1
url = "https://xhams***.com/user/video/" + username + "/new-1.html"
driver.implicitly_wait(10)
driver.get(url)
links = [];
links = driver.find_elements_by_class_name('hRotator')
#nextPage = driver.find_elements_by_class_name('last')
noOfLinks = len(links)
count = 0
file = open('x--' + username + '.txt','w')
while count < noOfLinks:
    #print links[count].get_attribute('href')
    file.write(links[count].get_attribute('href') + '\n');
    count += 1
file.close()
driver.close()
My attempt at looping through the pages:
#!/usr/bin/env python
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--incognito")
driver = webdriver.Chrome(chrome_options=chrome_options)
username = **ANY_USERNAME**
##page = 1
url = "https://xhams***.com/user/video/" + username + "/new-1.html"
driver.implicitly_wait(10)
driver.get(url)
links = [];
links = driver.find_elements_by_class_name('hRotator')
#nextPage = driver.find_elements_by_class_name('colR')
## TRYING TO READ THE NEXT PAGE HERE
print driver.find_element_by_class_name('last').get_attribute('href')
noOfLinks = len(links)
count = 0
file = open('x--' + username + '.txt','w')
while count < noOfLinks:
    #print links[count].get_attribute('href')
    file.write(links[count].get_attribute('href') + '\n');
    count += 1
file.close()
driver.close()
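For reference, here is a minimal, untested sketch of one way the page loop could be written in Selenium. The class names 'last' and 'hRotator', the URL pattern, and the assumption that the last pager link's href ends in the page number are all taken from the question and may not match the live markup; note that get_attribute('href') returns None when the matched element carries no href attribute, which is the most likely reason the print above shows None.

#!/usr/bin/env python
# Rough sketch: read the final page number from the pager's "last" link once,
# then visit each listing page by URL and collect the 'hRotator' hrefs.
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--incognito")
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.implicitly_wait(10)

username = 'SOME_USERNAME'  # placeholder, as in the question
base_url = "https://xhams***.com/user/video/" + username + "/new-{}.html"

driver.get(base_url.format(1))
# get_attribute('href') returns None if the matched element has no href,
# so check that the '.last' element really is an <a> tag before relying on it.
last_href = driver.find_element_by_class_name('last').get_attribute('href')
# Assumes the href ends in "...-<n>.html"; fall back to a single page otherwise.
last_page = int(last_href.rsplit('-', 1)[-1].split('.')[0]) if last_href else 1

with open('x--' + username + '.txt', 'w') as f:
    for page in range(1, last_page + 1):
        driver.get(base_url.format(page))
        for link in driver.find_elements_by_class_name('hRotator'):
            href = link.get_attribute('href')
            if href:
                f.write(href + '\n')

driver.close()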
UPDATE
I've gone with Philippe Oger's answer below, but modified the two methods below to handle single-page results:
def find_max_pagination(self):
    start_url = 'https://www.xhamster.com/user/video/{}/new-1.html'.format(self.user)
    r = requests.get(start_url)
    tree = html.fromstring(r.content)
    pager_links = tree.xpath('//div[@class="pager"]/table/tr/td/div/a')
    if pager_links:
        self.max_page = max(
            [int(x.text) for x in pager_links if x.text not in [None, '...']]
        )
    else:
        self.max_page = 1
    return self.max_page

def generate_listing_urls(self):
    if self.max_page == 1:
        pages = [self.paginated_listing_page(str(page)) for page in range(0, 1)]
    else:
        pages = [self.paginated_listing_page(str(page)) for page in range(0, self.max_page)]
    return pages
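As a quick illustration of why the explicit check works (using a toy HTML string, not the real page markup): when the pager div is absent, the XPath simply returns an empty list, which is falsy, so max_page falls back to 1.

from lxml import html

# Toy markup standing in for a single-page listing -- no "pager" div present.
tree = html.fromstring('<html><body><div class="video-list"></div></body></html>')
pager_links = tree.xpath('//div[@class="pager"]/table/tr/td/div/a')
print(pager_links)  # [] -> falsy, so the else branch sets self.max_page = 1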
Answer 0 (score: 1)
On the user's page we can actually find out how far the pagination goes, so rather than looping through the pagination we can generate every one of the user's URLs with a list comprehension and then scrape them one by one.
Here are my two cents using LXML. If you just copy/paste this code, it will return every video URL in a TXT file. You only need to change the username.
from lxml import html
import requests

class XXXVideosScraper(object):

    def __init__(self, user):
        self.user = user
        self.max_page = None
        self.video_urls = list()

    def run(self):
        self.find_max_pagination()
        pages_to_crawl = self.generate_listing_urls()
        for page in pages_to_crawl:
            self.capture_video_urls(page)
        with open('results.txt', 'w') as f:
            for video in self.video_urls:
                f.write(video)
                f.write('\n')

    def find_max_pagination(self):
        start_url = 'https://www.xhamster.com/user/video/{}/new-1.html'.format(self.user)
        r = requests.get(start_url)
        tree = html.fromstring(r.content)
        try:
            self.max_page = max(
                [int(x.text) for x in tree.xpath('//div[@class="pager"]/table/tr/td/div/a') if x.text not in [None, '...']]
            )
        except ValueError:
            self.max_page = 1
        return self.max_page

    def generate_listing_urls(self):
        pages = [self.paginated_listing_page(page) for page in range(1, self.max_page + 1)]
        return pages

    def paginated_listing_page(self, pagination):
        return 'https://www.xhamster.com/user/video/{}/new-{}.html'.format(self.user, str(pagination))

    def capture_video_urls(self, url):
        r = requests.get(url)
        tree = html.fromstring(r.content)
        video_links = tree.xpath('//a[@class="hRotator"]/@href')
        self.video_urls += video_links

if __name__ == '__main__':
    sample_user = 'wearehairy'
    scraper = XXXVideosScraper(sample_user)
    scraper.run()
I haven't checked the case where the user has only one page in total. Let me know if that works fine.
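For what it's worth, the except ValueError branch above should already cover the case where the pager markup is missing entirely: the list comprehension is then empty, max() raises ValueError, and max_page falls back to 1. A quick way to confirm the Python behaviour this relies on:

try:
    max([])  # max() on an empty sequence
except ValueError as exc:
    print(exc)  # "max() arg is an empty sequence"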