Create a text file by repeating words according to their frequency

Date: 2015-12-06 20:26:21

Tags: python parsing text frequency word

I know this question may not fit Stack Overflow's standards, but I have been practising coding for a few months now, parsing and analysing text despite never having programmed before, and this forum has helped me a lot.

I ran a frequency analysis over several XML files and stored the results in a MySQL database as [word, count] pairs.

I would like to produce a text file in which each word is repeated according to its frequency (e.g. breakfast, 6 => breakfast breakfast breakfast breakfast breakfast breakfast), with a single space between repetitions, and with the words ordered from the lowest frequency (at the beginning of the text) up to the highest ('a' or 'the' would be the most frequent and should end up at the very end of the text).

Could you give me some ideas, libraries, or coding examples? Thank you.

import re
import requests
import MySQLdb as mdb
from xml.etree import ElementTree
from collections import Counter




### MYSQL ###

db = mdb.connect(host="****", user="****", passwd="****", db="****")

cursor = db.cursor()
sql = "DROP TABLE IF EXISTS Table1"
cursor.execute(sql)
db.commit()
sql = "CREATE TABLE Table1(Id INT PRIMARY KEY AUTO_INCREMENT, keyword TEXT, frequency INT)"
cursor.execute(sql)
db.commit()



## XML PARSING
def main(n=1000):

    # A list of feeds to process and their xpath


    feeds = [
        {'url': 'http://www.nyartbeat.com/list/event_type_print_painting.en.xml', 'xpath': './/Description'},
        {'url': 'http://feeds.feedburner.com/FriezeMagazineUniversal?format=xml', 'xpath': './/description'},
        {'url': 'http://www.artandeducation.net/category/announcement/feed/', 'xpath': './/description'},
        {'url': 'http://www.blouinartinfo.com/rss/visual-arts.xml', 'xpath': './/description'},
        {'url': 'http://feeds.feedburner.com/ContemporaryArtDaily?format=xml', 'xpath': './/description'}
    ]



    # A place to hold all feed results
    results = []

    # Loop all the feeds
    for feed in feeds:
        # Append feed results together
        results = results + process(feed['url'], feed['xpath'])

    # Join all results into a big string
    contents=",".join(map(str, results))

    # Remove double+ spaces
    contents = re.sub('\s+', ' ', contents)

    # Remove everything that is not a character or whitespace
    contents = re.sub('[^A-Za-z ]+', '', contents)

    # Create a list of lower-case words (keep anything at least one character long)
    words = [w.lower() for w in contents.split() if len(w) >= 1]


    # Count the words
    word_count = Counter(words)

    # Clean the content a little
    filter_words = ['art', 'artist']
    for word in filter_words:
        if word in word_count:
            del word_count[word]



    # Add to DB
    for word, count in word_count.most_common(n):
        sql = """INSERT INTO Table1 (keyword, frequency) VALUES(%s, %s)"""
        cursor.execute(sql, (word, count))
        db.commit()

def process(url, xpath):
    """
    Downloads a feed url and extracts the results with a variable path
    :param url: string
    :param xpath: string
    :return: list
    """
    contents = requests.get(url)
    root = ElementTree.fromstring(contents.content)
    return [element.text.encode('utf8') if element.text is not None else '' for element in root.findall(xpath)]





if __name__ == "__main__":
    main()
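
As a quick standalone check, the process() helper above can be exercised on a single feed; a minimal sketch, assuming the first feed URL is reachable (the printed fields are illustrative only):

texts = process('http://www.nyartbeat.com/list/event_type_print_painting.en.xml', './/Description')
print(len(texts))         # how many Description elements were extracted
if texts:
    print(texts[0][:80])  # the start of the first extracted description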

1 Answer:

Answer 0 (score: 0)

Assuming that the word_count.most_common(n) you use in your for loop returns a list of (word, count) tuples in order:

Let's store it in a variable:

words = word_count.most_common(n)
# Ex: [('a',5),('apples',2),('the',4)]

Sort it by count with itemgetter, ascending so the least frequent words come first:

from operator import itemgetter
words = sorted(words, key = itemgetter(1))
# words = [('apples', 2), ('the', 4), ('a', 5)]

Now go through each entry and append the word to a list, repeated count times:

out = []
for word, count in words:
    out += [word]*count
# out = ['apples', 'apples', 'the', 'the', 'the', 'the', 'a', 'a', 'a', 'a', 'a']

The following line turns it into one long string:

final = " ".join(out)
# final = "apples apples the the the the a a a a a"

Now just write it to a file:

with open("filename.txt","w+") as f:
    f.write(final)

Putting it together, the code looks like this:

from operator import itemgetter

words = word_count.most_common(n)
words = sorted(words, key = itemgetter(1))

out = []
for word, count in words:
    out += [word]*count

final = " ".join(out)

with open("filename.txt","w+") as f:
    f.write(final)
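
If the word/frequency pairs are read back out of the MySQL table rather than taken from word_count in memory, the same approach applies. A minimal sketch, assuming the Table1 schema and connection settings from the question (the output filename is arbitrary):

import MySQLdb as mdb

db = mdb.connect(host="****", user="****", passwd="****", db="****")
cursor = db.cursor()

# Fetch every stored word with its frequency, least frequent first
cursor.execute("SELECT keyword, frequency FROM Table1 ORDER BY frequency ASC")

# Repeat each word according to its count, separated by single spaces
out = []
for word, count in cursor.fetchall():
    out += [word] * count

with open("repeated_words.txt", "w+") as f:
    f.write(" ".join(out))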