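I'm working on this code to extract the most frequent words, with some exceptions. I then want to use the most common words as keywords to search for and fetch images from the DeviantArt API.
I have been building a word-frequency list that leaves out the words in filterWords, but the order of the results gets messed up. Also, can I use one of those words to search the DeviantArt API and retrieve images?
Best,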
from collections import Counter
import re
import requests # for more pleasant http, use http://bit.ly/python-requests
import xml.etree.ElementTree as ET
def main(n=100):
    # Download the content
    contents = requests.get('http://www.nyartbeat.com/list/event_type_print_painting.en.xml')
    root = ET.fromstring(contents.content)
    descs = [element.text for element in root.findall('.//Description')]

    # Clean the content a little
    filterWords = set(['artist', 'artists'])
    # Join with spaces (not commas) so words from adjacent descriptions
    # do not merge once punctuation is stripped; skip empty descriptions
    contents = " ".join(d for d in descs if d)
    contents = re.sub(r'\s+', ' ', contents)  # condense all whitespace
    contents = re.sub('[^A-Za-z ]+', '', contents)  # remove non-alpha chars
    words = [w.lower() for w in contents.split() if len(w) >= 6]
    # Filter the list itself: set(words) - filterWords would discard the
    # duplicates and the order, so every count would come out as 1
    filteredWords = [w for w in words if w not in filterWords]

    # Start counting
    word_count = Counter(filteredWords)

    # The Top-N words
    print("The Top {0} words".format(n))
    for word, count in word_count.most_common(n):
        print("{0}: {1}".format(word, count))

    # with open("Output.txt", "w") as text_file:
    #     print("The Top {0} words".format(n), file=text_file)
    #     for word, count in word_count.most_common(n):
    #         print("{0}: {1}".format(word, count), file=text_file)


if __name__ == "__main__":
    main()
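
To make both parts of the question concrete, here is a minimal sketch, not a definitive implementation. The helper names (`top_words`, `deviantart_rss_url`, `image_urls_from_rss`) are hypothetical. The first function filters the word list before counting, which keeps the counts and the `most_common` order intact. The other two assume that DeviantArt's RSS backend at `backend.deviantart.com/rss.xml` accepts `type` and `q` parameters and returns Media RSS items with `media:content` elements; that endpoint and its parameters are assumptions here and should be checked against the DeviantArt developer documentation.

```python
from collections import Counter
import re
import urllib.parse
import xml.etree.ElementTree as ET

# Media RSS namespace, used by <media:content> elements
MEDIA_NS = "{http://search.yahoo.com/mrss/}"


def top_words(text, filter_words, n=10, min_len=6):
    """Return the n most common words of at least min_len letters,
    skipping anything listed in filter_words."""
    words = re.findall(r"[a-z]+", text.lower())
    # Filter the list itself. Converting it to a set would drop the
    # duplicates, so every count would be 1 and the order arbitrary.
    kept = [w for w in words if len(w) >= min_len and w not in filter_words]
    return Counter(kept).most_common(n)


def deviantart_rss_url(keyword):
    """Build a search URL for the RSS backend.
    ASSUMPTION: the endpoint and the 'type'/'q' parameters."""
    query = urllib.parse.urlencode({"type": "deviation", "q": keyword})
    return "https://backend.deviantart.com/rss.xml?" + query


def image_urls_from_rss(xml_text):
    """Collect the url attribute of every media:content element."""
    root = ET.fromstring(xml_text)
    return [el.get("url")
            for el in root.iter(MEDIA_NS + "content")
            if el.get("url")]
```

With the script above, one could take the first word returned by `top_words`, fetch `requests.get(deviantart_rss_url(word)).text`, and pass the response body to `image_urls_from_rss` to get a list of image URLs to download.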