我正在尝试抓取一个网站website / post-sitemap.xml,其中包含针对wordpress网站发布的所有网址。第一步,我需要列出后站点地图中存在的所有URL的列表。当我使用request.get并检查输出时,它也会同时打开所有内部url,这很奇怪。我的意图是首先列出所有url,然后使用循环,在下一个函数中将抓取单个url。下面是我到目前为止完成的代码。如果python专家可以帮助,我将所有url作为列表作为最终输出。
我尝试使用request.get和openurl,但是似乎没有任何东西只能打开/post-sitemap.xml的基本URL
import pandas as pd
import numpy as np
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import re
class wordpress_ext_url_cleanup(object):
def __init__(self,wp_url):
self.wp_url_raw = wp_url
self.wp_url = wp_url + '/post-sitemap.xml/'
def identify_ext_url(self):
html = requests.get(self.wp_url)
print(self.wp_url)
print(html.text)
soup = BeautifulSoup(html.text,'lxml')
#print(soup.get_text())
raw_data = soup.find_all('tr')
print (raw_data)
#for link in raw_data:
#print(link.get("href"))
def main():
print ("Inside Main Function");
url="http://punefirst dot com" #(knowingly removed the . so it doesnt look spammy)
first_call = wordpress_ext_url_cleanup(url)
first_call.identify_ext_url()
if __name__ == '__main__':
main()
我需要发布站点地图中存在的所有548个URL作为列表,我将其用于下一个进一步抓取的功能。
答案 0 :(得分:0)
从服务器返回的文档为XML,并使用XSLT转换为HTML格式(more info here)。要解析此XML中的所有链接,可以使用以下脚本:
import requests
from bs4 import BeautifulSoup
url = 'http://punefirst.com/post-sitemap.xml/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for loc in soup.select('url > loc'):
print(loc.text)
打印:
http://punefirst.com
http://punefirst.com/hospitals/pcmc-hospitals/aditya-birla-memorial-hospital-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/saijyoti-hospital-and-icu-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/niramaya-hospital-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/chetna-hospital-chinchwad-pune
http://punefirst.com/hospitals/hadapsar-hospitals/pbmas-h-v-desai-eye-hospital
http://punefirst.com/hospitals/punecentral-hospitals/shree-sai-prasad-hospital
http://punefirst.com/hospitals/punecentral-hospitals/sadhu-vaswani-missions-medical-complex
http://punefirst.com/hospitals/katraj-kondhwa-hospitals/shivneri-hospital
http://punefirst.com/hospitals/punecentral-hospitals/kelkar-nursing-home
http://punefirst.com/hospitals/pcmc-hospitals/shrinam-hospital
http://punefirst.com/hospitals/pcmc-hospitals/dhanwantari-hospital-nigdi
http://punefirst.com/hospitals/punecentral-hospitals/dr-tarabai-limaye-hospital
http://punefirst.com/hospitals/katraj-kondhwa-hospitals/satyanand-hospital-kondhwa-pune
...and so on.