Word frequency from web XML, used to filter the DeviantArt API

Time: 2015-10-24 05:45:56

Tags: python xml api frequency word

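I am working on this code to build a word-frequency list that excludes certain frequent words, with some exceptions. I would then like to use the most frequent words as keywords to search for and pull images from the DeviantArt API.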

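I have been trying to produce a word-frequency list that leaves out the words in filterWords, but the order of the results gets messed up. Also, can I use one of these words to search for and fetch images from the DeviantArt API?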

Best,

    from collections import Counter
    import re
    import requests  # for more pleasant http, use http://bit.ly/python-requests
    import xml.etree.ElementTree as ET


    def main(n=100):

        # Download the content
        response = requests.get('http://www.nyartbeat.com/list/event_type_print_painting.en.xml')
        root = ET.fromstring(response.content)
        # Skip empty <Description> elements, whose text is None
        descs = [element.text for element in root.findall('.//Description') if element.text]

        # Words to leave out of the frequency list
        filterWords = set(['artist', 'artists'])

        # Clean the content a little
        # Join with a space so words from adjacent descriptions do not run together
        contents = " ".join(descs)
        contents = re.sub(r'\s+', ' ', contents)  # condense all whitespace
        contents = re.sub(r'[^A-Za-z ]+', '', contents)  # remove non-alpha chars

        # Keep lowercase words of 6+ letters that are not in filterWords.
        # Filtering the list (instead of subtracting sets) keeps every
        # occurrence, so the counts and the most_common() order stay correct.
        words = [w.lower() for w in contents.split()
                 if len(w) >= 6 and w.lower() not in filterWords]

        # Start counting
        word_count = Counter(words)

        # The Top-N words
        print("The Top {0} words".format(n))
        for word, count in word_count.most_common(n):
            print("{0}: {1}".format(word, count))

    #    with open("Output.txt", "w") as text_file:
    #        print("The Top {0} words".format(n), file=text_file)
    #        for word, count in word_count.most_common(n):
    #            print("{0}: {1}".format(word, count), file=text_file)


    if __name__ == "__main__":
        main()
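
On the second part of the question (using one of the top words to search DeviantArt), here is a rough sketch rather than a tested integration. It assumes DeviantArt's public RSS search feed at `https://backend.deviantart.com/rss.xml` with `type` and `q` query parameters, and it assumes the feed lists images as Media RSS `<media:content url="...">` elements. Both assumptions should be checked against DeviantArt's current documentation. `build_search_url` and `extract_image_urls` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumption: DeviantArt's public RSS search endpoint. Verify the URL and
# the query parameters against DeviantArt's documentation before relying on it.
RSS_ENDPOINT = "https://backend.deviantart.com/rss.xml"

# Media RSS namespace, needed to find <media:content> elements.
MEDIA_NS = {"media": "http://search.yahoo.com/mrss/"}


def build_search_url(keyword):
    """Build a search URL for one keyword (hypothetical helper)."""
    return RSS_ENDPOINT + "?" + urlencode({"type": "deviation", "q": keyword})


def extract_image_urls(rss_text):
    """Return the url attribute of every <media:content> element in the feed."""
    root = ET.fromstring(rss_text)
    return [
        element.get("url")
        for element in root.findall(".//media:content", MEDIA_NS)
        if element.get("url")
    ]
```

With a keyword such as `word_count.most_common(1)[0][0]`, the feed could be fetched with `requests.get(build_search_url(keyword))` and its `.content` passed to `extract_image_urls`.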

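As for the messed-up order: the commented-out line in the code subtracts sets, and a set keeps only one copy of each word in no particular order, so the frequencies are lost before `Counter` ever sees them. A small sketch with made-up sample words shows the difference, along with one possible alternative (count first, then drop the unwanted keys):

```python
from collections import Counter

# Made-up sample data, for illustration only
words = ["painting", "artist", "gallery", "painting", "artist", "painting"]
filter_words = {"artist", "artists"}

# Set subtraction: one copy per word, arbitrary order, no counts left
unique_only = set(words) - filter_words

# Alternative: count every word first, then remove the filtered keys
word_count = Counter(words)
for w in filter_words:
    word_count.pop(w, None)  # no error if the word never occurred

# most_common() still orders the remaining words by frequency
top = word_count.most_common(2)
print(top)
```
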
0 answers:

No answers