Web scraping /post-sitemap.xml/ using Python and Beautiful Soup

Time: 2019-08-18 07:21:23

Tags: python web screen-scraping

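I am trying to scrape a website's /post-sitemap.xml, which contains all the URLs published for a WordPress site. As a first step, I need to build a list of all the URLs present in the post sitemap. When I use requests.get and check the output, it also opens all the internal URLs, which is strange. My intention is to build the list of URLs first and then, in a loop, scrape each individual URL in the next function. Below is the code I have so far. If any Python expert can help, I need all the URLs as a list as the final output.

I have tried requests.get and urlopen, but nothing seems to open only the base URL for /post-sitemap.xml.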

import pandas as pd
import numpy as np
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import re

class wordpress_ext_url_cleanup(object):
    def __init__(self,wp_url):
        self.wp_url_raw = wp_url
        self.wp_url = wp_url + '/post-sitemap.xml/'

    def identify_ext_url(self):
        html = requests.get(self.wp_url)
        print(self.wp_url)
        print(html.text)
        soup = BeautifulSoup(html.text,'lxml')
        #print(soup.get_text())
        raw_data = soup.find_all('tr')
        print (raw_data)
        #for link in raw_data:
            #print(link.get("href"))

def main():
    print ("Inside Main Function");
    url="http://punefirst dot com" #(knowingly removed the . so it doesnt look spammy)
    first_call = wordpress_ext_url_cleanup(url)
    first_call.identify_ext_url()


if __name__ == '__main__':
    main()

I need all 548 URLs present in the post sitemap as a list, which I will pass to the next function for further scraping.

1 Answer:

Answer 0 (score: 0)

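The document returned by the server is XML, which is transformed into HTML for display using XSLT. To parse all the links from this XML, you can use the following script: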

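import requests
from bs4 import BeautifulSoup

url = 'http://punefirst.com/post-sitemap.xml/'

# Fetch the sitemap and parse the response body with the lxml parser
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Each page URL sits in a <loc> that is a direct child of a <url> element
for loc in soup.select('url > loc'):
    print(loc.text)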

Prints:

http://punefirst.com
http://punefirst.com/hospitals/pcmc-hospitals/aditya-birla-memorial-hospital-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/saijyoti-hospital-and-icu-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/niramaya-hospital-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/chetna-hospital-chinchwad-pune
http://punefirst.com/hospitals/hadapsar-hospitals/pbmas-h-v-desai-eye-hospital
http://punefirst.com/hospitals/punecentral-hospitals/shree-sai-prasad-hospital
http://punefirst.com/hospitals/punecentral-hospitals/sadhu-vaswani-missions-medical-complex
http://punefirst.com/hospitals/katraj-kondhwa-hospitals/shivneri-hospital
http://punefirst.com/hospitals/punecentral-hospitals/kelkar-nursing-home
http://punefirst.com/hospitals/pcmc-hospitals/shrinam-hospital
http://punefirst.com/hospitals/pcmc-hospitals/dhanwantari-hospital-nigdi
http://punefirst.com/hospitals/punecentral-hospitals/dr-tarabai-limaye-hospital
http://punefirst.com/hospitals/katraj-kondhwa-hospitals/satyanand-hospital-kondhwa-pune

...and so on.
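
Since the question asks for the URLs as a list rather than printed output, here is a minimal sketch of one way to collect them, using only the standard library's xml.etree.ElementTree instead of Beautiful Soup. The function name `sitemap_urls` and the sample XML are made up for illustration, and the sketch assumes the sitemap declares the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
# Assumption: the target sitemap declares this namespace on <urlset>.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_data):
    """Return the text of every <loc> directly under a <url> as a list."""
    root = ET.fromstring(xml_data)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample standing in for the real response body.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/a-post</loc></url>
</urlset>"""

urls = sitemap_urls(SAMPLE)
for page_url in urls:
    # Placeholder for the per-page scraping step
    print(page_url)
```

Against the real site, something like `sitemap_urls(requests.get(url).content)` should give the same kind of list, which can then be handed to the next function. Matching only `url/loc` mirrors the `url > loc` selector above, so any nested `loc` elements (for example image entries) are skipped.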