How to parse multiple XML (RSS) feeds from different websites for a single analysis

Date: 2015-10-28 16:41:22

Tags: python xml parsing rss

I am trying to parse multiple XML feeds (RSS, not an API), each from a different website, for a single analysis (multiple inputs, one set of results). Each XML differs slightly in the xpath needed to extract the text.

I also want to filter out some words that should not appear. At the moment, the word-frequency count works for a single online XML.

How can I do this in a simpler way?

import re
import requests
import xml.etree.ElementTree as ET
from collections import Counter


def main(n=10):

    # Download the content (uncomment one feed block at a time)

    # NYArtbeat
    # contents = requests.get('http://www.nyartbeat.com/list/event_type_print_painting.en.xml')
    # root = ET.fromstring(contents.content)
    # descs = [element.text for element in root.findall('.//description')]

    # FriezeMag
    # contents = requests.get('http://feeds.feedburner.com/FriezeMagazineUniversal?format=xml')
    # root = ET.fromstring(contents.content)
    # descs = [element.text for element in root.findall('.//description')]

    # Art Education
    contents = requests.get('http://www.artandeducation.net/category/announcement/feed/')
    root = ET.fromstring(contents.content)
    descs = [element.text for element in root.findall('.//description')]

    # Blouinartinfo
    # contents = requests.get('http://www.blouinartinfo.com/rss/visual-arts.xml')
    # root = ET.fromstring(contents.content)
    # descs = [element.text for element in root.findall('.//description')]

    # Art Agenda (the original './//.* ' is not a valid path expression)
    # contents = requests.get('http://www.art-agenda.com/category/reviews/feed/')
    # root = ET.fromstring(contents.content)
    # descs = [element.text for element in root.findall('.//description')]

    # Clean the content a little
    filterWords = set(['artist', 'artists'])

    contents = ",".join(map(str, descs))
    contents = re.sub(r'\s+', ' ', contents)
    contents = re.sub(r'[^A-Za-z ]+', '', contents)

    # Keep words of six or more letters, then drop the filtered words
    words = [w.lower() for w in contents.split() if len(w) >= 6]
    words = [w for w in words if w not in filterWords]

    # Start counting
    word_count = Counter(words)

    # The Top-N words
    print("The Top {0} words".format(n))
    for word, count in word_count.most_common(n):
        print("{0}: {1}".format(word, count))


if __name__ == "__main__":
    main()

1 Answer:

Answer 0: (score: 0)

You probably want to create a list of feeds together with their xpaths, so you can loop over them and process them all with one function. Below is an example that does what you want; note how easily you can add any number of feeds and specify an xpath for each. All of the examples you provided use the xpath .//description, but if one of them had actually needed .//Description, .//body, or anything else, you could handle that feed simply by adding the right entry to the feeds list.

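The answer's example code was not preserved on this page. The sketch below is a minimal reconstruction of the approach it describes, a feeds list of (url, xpath) pairs processed in a loop by one function, reusing the URLs and cleaning steps from the question; the helper name fetch_words is invented here for illustration.

import re
from collections import Counter

import requests
import xml.etree.ElementTree as ET

# One (url, xpath) pair per feed; a feed whose text lives under
# .//Description or .//body just gets a different second element.
feeds = [
    ('http://www.nyartbeat.com/list/event_type_print_painting.en.xml', './/description'),
    ('http://feeds.feedburner.com/FriezeMagazineUniversal?format=xml', './/description'),
    ('http://www.artandeducation.net/category/announcement/feed/', './/description'),
    ('http://www.blouinartinfo.com/rss/visual-arts.xml', './/description'),
    ('http://www.art-agenda.com/category/reviews/feed/', './/description'),
]

filterWords = {'artist', 'artists'}


def fetch_words(url, xpath):
    # Download one feed and return its cleaned, lowercased words.
    root = ET.fromstring(requests.get(url).content)
    descs = [el.text for el in root.findall(xpath) if el.text]
    text = ",".join(descs)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^A-Za-z ]+', '', text)
    return [w.lower() for w in text.split() if len(w) >= 6]


def main(n=10):
    # Accumulate counts across every feed for a single set of results.
    word_count = Counter()
    for url, xpath in feeds:
        word_count.update(w for w in fetch_words(url, xpath)
                          if w not in filterWords)

    print("The Top {0} words".format(n))
    for word, count in word_count.most_common(n):
        print("{0}: {1}".format(word, count))


if __name__ == "__main__":
    main()

Keeping the xpath next to its URL is the point of the design: adding a new source, or one with a differently named element, means editing one tuple instead of duplicating the whole download-and-parse block.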