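Finding the most common words on a website in Python 3

Time: 2014-06-24 21:13:35

Tags: python beautifulsoup web-crawler nltk

I need to find and copy the words that appear more than 5 times on a given website using Python 3 code, and I'm not sure how to do it. I've looked through the Stack Overflow archives, but the other solutions rely on Python 2 code. Here is the little code I have so far: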

   from urllib.request import urlopen
   website = urlopen("http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart")

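Does anyone have any suggestions on how to do this? I have NLTK installed and I've looked at Beautiful Soup, but for the life of me I can't figure out how to install it correctly (I'm very new to Python)! Any explanation as I learn would also be much appreciated. Thanks :)

4 answers:

Answer 0 (score: 7)

This isn't perfect, but it gives an idea of how to get started using requests, BeautifulSoup, and collections.Counter: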

import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation

r = requests.get("http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart")

soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly so bs4 does not have to guess one

text = (''.join(s.findAll(text=True))for s in soup.findAll('p'))

c = Counter((x.rstrip(punctuation).lower() for y in text for x in y.split()))
print(c.most_common())  # prints all words, starting with the most common

[('the', 279), ('and', 192), ('in', 175), ('of', 168), ('his', 140), ('a', 124), ('to', 103), ('mozart', 82), ('was', 77), ('he', 70), ('with', 53), ('as', 50), ('for', 40), ("mozart's", 39), ('on', 35), ('from', 34), ('at', 31), ('by', 31), ('that', 26), ('is', 23), ('k.', 21), ('an', 20), ('had', 20), ('were', 20), ('but', 19), ('which',.............

print ([x for x in c if c.get(x) > 5]) # words appearing more than 5 times

['there', 'but', 'both', 'wife', 'for', 'musical', 'salzburg', 'it', 'more', 'first', 'this', 'symphony', 'wrote', 'one', 'during', 'mozart', 'vienna', 'joseph', 'in', 'later', 'salzburg,', 'other', 'such', 'last', 'needed]', 'only', 'their', 'including', 'by', 'music,', 'at', "mozart's", 'mannheim,', 'composer', 'and', 'are', 'became', 'four', 'premiered', 'time', 'did', 'the', 'not', 'often', 'is', 'have', 'began', 'some', 'success', 'court', 'that', 'performed', 'work', 'him', 'leopold', 'these', 'while', 'been', 'new', 'most', 'were', 'father', 'opera', 'as', 'who', 'classical', 'k.', 'to', 'of', 'has', 'many', 'was', 'works', 'which', 'early', 'three', 'family', 'on', 'a', 'when', 'had', 'december', 'after', 'he', 'no.', 'year', 'from', 'great', 'period', 'music', 'with', 'his', 'composed', 'minor', 'two', 'number', '1782', 'an', 'piano']
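
As a hedged sketch of the counting step only, the snippet below runs on a hard-coded sample string instead of the downloaded page, so it needs nothing beyond the standard library. The names `count_words`, `words_over`, and `sample` are illustrative assumptions, not part of the answer. It uses `strip(punctuation)` rather than `rstrip(punctuation)`, so leading punctuation such as an opening parenthesis is trimmed as well.

```python
from collections import Counter
from string import punctuation


def count_words(text):
    """Count lowercase words, trimming punctuation from both ends of each token."""
    words = (token.strip(punctuation).lower() for token in text.split())
    # Tokens made only of punctuation become empty strings; skip them.
    return Counter(word for word in words if word)


def words_over(counter, min_count):
    """Return words seen more than min_count times, most frequent first."""
    return [word for word, n in counter.most_common() if n > min_count]


# Hypothetical sample text standing in for the page's paragraphs.
sample = "Mozart wrote music. (Mozart) loved music, and music loved Mozart!"
counts = count_words(sample)
print(counts["mozart"])
print(words_over(counts, 2))
```

For the question as asked, `words_over(counts, 5)` would keep only words that appear more than 5 times.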

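Answer 1 (score: 3)

So, this comes from a newbie, but if you just need a quick answer, I think this might work. Note that with this method you can't have the program take a URL as input; you have to paste the text into the code manually. (Sorry.)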

text = '''INSERT TEXT HERE'''.split() #Where you see "INSERT TEXT HERE", that's where the text goes.
#also note the .split() method at the end. This converts the text into a list, splitting every word in between the spaces. 
#for example, "red dog food".split() would be ['red','dog','food']
overusedwords = [] #this is where the words that are used 5 or more times are going to be held.
for i in text: #this will iterate through every single word of the text
    if text.count(i) >= 5 and overusedwords.count(i) == 0: #(1. Read below)
        overusedwords.append(i) #this adds the word to the list of words used 5 or more times
if len(overusedwords) > 0: #if there are no words used 5 or more times, it doesn't print anything useless.
    print('The overused words are:')
    for i in overusedwords:
        print(i)
else:
    print('No words used 5 or more times.') #just in case there are no words used 5 or more times

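An explanation of the "text.count(i) >= 5" part: on each iteration of the for loop, it checks whether the current word occurs five or more times in the text. The "and overusedwords.count(i) == 0" part just makes sure the same word isn't added to the list of overused words twice. Hope I helped. I figure you probably wanted a way to get this information directly from a URL input, but this may help other beginners with a similar problem.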
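
Each `text.count(i)` call rescans the whole list, so the loop above does work that grows with the square of the number of words. Here is a minimal sketch of the same check using `collections.Counter`, which tallies every word in a single pass. The names `overused_words`, `sample_text`, and `threshold` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def overused_words(text, threshold=5):
    """Return words used `threshold` or more times, in order of first appearance."""
    counts = Counter(text.split())
    # Counter keeps insertion order (Python 3.7+), so each word is listed
    # once, in the order it first showed up in the text.
    return [word for word, n in counts.items() if n >= threshold]


# Hypothetical sample: three words repeated five times, two words used once.
sample_text = "red dog food " * 5 + "blue cat"
print(overused_words(sample_text))
```

Because a `Counter` holds each word only once, the separate `overusedwords.count(i) == 0` duplicate check is no longer needed.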

Answer 2 (score: 3)

I would do it like this:

  • Install BeautifulSoup, as explained here
  • You will need these imports:

    from bs4 import BeautifulSoup
    import re
    from collections import Counter
    
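  • Use BeautifulSoup to scrape the visible text on the website, as explained on Stack Overflow here

  • Get the list of words, lst, from the visible text using

    lst = re.findall(r'\b\w+', visible_text_string)
    
  • Convert every word to lowercase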

    lst = [x.lower() for x in lst]
    
  • Count the occurrences of each word and build a list of (word, count) tuples.

    counter = Counter(lst)
    occs = [(word,count) for word,count in counter.items() if count > 5]
    
  • Sort occs by number of occurrences:

    occs.sort(key=lambda x:x[1])
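
The steps above can be sketched end to end as follows. Because the linked visible-text recipe is not reproduced here, this sketch substitutes the standard library's `html.parser.HTMLParser` for BeautifulSoup. The class `VisibleText`, the function `frequent_words`, and the sample HTML are assumptions made for illustration, not the answerer's code.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring the contents of non-visible elements."""

    HIDDEN = {"script", "style", "head", "title"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._hidden_depth = 0  # > 0 while inside a hidden element

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if self._hidden_depth == 0:
            self.chunks.append(data)


def frequent_words(html, min_count=5):
    """Return (word, count) pairs for words seen more than min_count times."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    visible_text_string = " ".join(parser.chunks)
    lst = [w.lower() for w in re.findall(r"\b\w+", visible_text_string)]
    occs = [(word, count) for word, count in Counter(lst).items() if count > min_count]
    occs.sort(key=lambda x: x[1], reverse=True)  # most frequent first
    return occs


# Hypothetical page: "opera" is visible 6 times; the title and script are hidden.
sample_html = (
    "<html><head><title>opera opera</title><style>p {color: red}</style></head>"
    + "<body><p>" + "Opera! " * 6 + "</p><p>" + "piano " * 3 + "</p>"
    + "<script>var opera = 1;</script></body></html>"
)
print(frequent_words(sample_html))
```

The `reverse=True` argument is an addition so that the most frequent words come first; sorting with `key=lambda x:x[1]` alone orders the list from least to most frequent.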
    

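Answer 3 (score: 0)

scrapy, urllib, urllib2, and BeautifulSoup are your friends when it comes to pulling data off websites.

It depends on the individual site and where the site's authors put the text on the page. Most of the time, you can find the text inside <p>...</p>.

For example, on this site (http://www.yoursingapore.com/content/traveller/en/browse/see-and-do/nightlife/dance-clubs/zouk.html), the text you need is: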

  

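If you only have time for one club in Singapore, then it simply has to be Zouk. Probably Singapore’s only nightspot of international repute, Zouk remains both an institution and a rite of passage for young people in the city-state.

It has spawned several other clubs in neighbouring countries like Malaysia, and even has its own dance festival – Sentosa’s ZoukOut. Zouk is made up of three clubs and a wine bar, with the main room showcasing techno and house music. Velvet Underground is more relaxed and exclusive, while Phuture is experimental and racier than the rest, just as its name suggests.

Zouk’s global reputation means it’s home to all manner of leading world DJs, from Carl Cox and Paul Oakenfold to the Chemical Brothers and Primal Scream. Zouk also holds its famous Mambo Jambo retro nights on Wednesdays, another reason why a night at Zouk is one to savour.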

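There is other text on the page, but usually you only need the main text, not the navigation bars and boilerplate on the page.

You can get it as follows: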

>>> import urllib2
>>> from bs4 import BeautifulSoup as bsoup
>>> url = "http://www.yoursingapore.com/content/traveller/en/browse/see-and-do/nightlife/dance-clubs/zouk.html"
>>> page = urllib2.urlopen(url).read()
>>> for i in bsoup(page).find_all('p'):
...     print i.text.strip()
... 

If you only have time for one club in Singapore, then it simply has to be Zouk. Probably Singapore’s only nightspot of international repute, Zouk remains both an institution and a rite of passage for young people in the city-state.
It has spawned several other clubs in neighbouring countries like Malaysia, and even has its own dance festival – Sentosa’s ZoukOut. Zouk is made up of three clubs and a wine bar, with the main room showcasing techno and house music. Velvet Underground is more relaxed and exclusive, while Phuture is experimental and racier than the rest, just as its name suggests.
Zouk’s global reputation means it’s home to all manner of leading world DJs, from Carl Cox and Paul Oakenfold to the Chemical Brothers and Primal Scream. Zouk also holds its famous Mambo Jambo retro nights on Wednesdays, another reason why a night at Zouk is one to savour.
Find us on       Facebook      Twitter      Youtube      Wikipedia     Singapore Reviews

Copyright © 2013 Singapore Tourism Board. Website Terms of Use   |   Privacy Statement   |   Photo Credits

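You'll notice that you get more than you actually need, so you can filter further by using bsoup(page).find_all() to select the <div class="paragraph section">...</div> before accessing the paragraphs inside it: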

>>> for i in bsoup(page).find_all(attrs={'class':'paragraph section'}):
...     print i.text.strip()
... 
If you only have time for one club in Singapore, then it simply has to be Zouk. Probably Singapore’s only nightspot of international repute, Zouk remains both an institution and a rite of passage for young people in the city-state. 
It has spawned several other clubs in neighbouring countries like Malaysia, and even has its own dance festival – Sentosa’s ZoukOut. Zouk is made up of three clubs and a wine bar, with the main room showcasing techno and house music. Velvet Underground is more relaxed and exclusive, while Phuture is experimental and racier than the rest, just as its name suggests.
Zouk’s global reputation means it’s home to all manner of leading world DJs, from Carl Cox and Paul Oakenfold to the Chemical Brothers and Primal Scream. Zouk also holds its famous Mambo Jambo retro nights on Wednesdays, another reason why a night at Zouk is one to savour.

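Voilà, you have the text. But as noted earlier, how you extract the main text from a page depends on how the page is written.

Here is the full code: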

>>> import urllib2
>>> from collections import Counter
>>> from nltk import word_tokenize
>>> from bs4 import BeautifulSoup as bsoup
>>> page = urllib2.urlopen(url).read()
>>> text = " ".join([i.text.strip() for i in bsoup(page).find_all(attrs={'class':'paragraph section'})])
>>> word_freq = Counter(word_tokenize(text))
>>> word_freq['Zouk']
4
>>> word_freq.most_common()
[(u',', 8), (u'and', 8), (u'to', 4), (u'of', 4), (u'Zouk', 4), (u'is', 4), (u'the', 4), (u'its', 3), (u'has', 3), (u'in', 3), (u'a', 3), (u'only', 2), (u'for', 2), (u'one', 2), (u'clubs', 2), (u'exclusive', 1), (u'all', 1), (u'Velvet', 1), (u'just', 1), (u'dance', 1), (u'global', 1), (u'rest', 1), (u'Chemical', 1), (u'Oakenfold', 1), (u'it\u2019s', 1), (u'young', 1), (u'passage', 1), (u'main', 1), (u'neighbouring', 1), (u'then', 1), (u'than', 1), (u'means', 1), (u'famous', 1), (u'made', 1), (u'world', 1), (u'like', 1), (u'DJs', 1), (u'bar', 1), (u'name', 1), (u'countries', 1), (u'night', 1), (u'showcasing', 1), (u'Paul', 1), (u'people', 1), (u'house', 1), (u'ZoukOut.', 1), (u'up', 1), (u'\u2013', 1), (u'Underground', 1), (u'home', 1), (u'even', 1), (u'Singapore', 1), (u'city-state.', 1), (u'retro', 1), (u'international', 1), (u'rite', 1), (u'be', 1), (u'institution', 1), (u'reason', 1), (u'techno', 1), (u'both', 1), (u'nightspot', 1), (u'festival', 1), (u'experimental', 1), (u'Singapore\u2019s', 1), (u'own', 1), (u'savour', 1), (u'suggests.', 1), (u'Zouk\u2019s', 1), (u'simply', 1), (u'another', 1), (u'Probably', 1), (u'Jambo', 1), (u'spawned', 1), (u'from', 1), (u'Brothers', 1), (u'remains', 1), (u'leading', 1), (u'.', 1), (u'Phuture', 1), (u'Carl', 1), (u'more', 1), (u'on', 1), (u'club', 1), (u'relaxed', 1), (u'If', 1), (u'with', 1), (u'Wednesdays', 1), (u'room', 1), (u'Primal', 1), (u'while', 1), (u'three', 1), (u'at', 1), (u'racier', 1), (u'it', 1), (u'an', 1), (u'Zouk.', 1), (u'as', 1), (u'manner', 1), (u'have', 1), (u'nights', 1), (u'Malaysia', 1), (u'holds', 1), (u'also', 1), (u'other', 1), (u'repute', 1), (u'you', 1), (u'several', 1), (u'Sentosa\u2019s', 1), (u'Cox', 1), (u'Mambo', 1), (u'why', 1), (u'It', 1), (u'reputation', 1), (u'time', 1), (u'Scream.', 1), (u'music.', 1), (u'wine', 1)]

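The example above comes from:

Liling Tan and Francis Bond. 2011. Building and annotating the linguistically diverse NTU-MC (NTU-Multilingual Corpus). In Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation (PACLIC 25). Singapore.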
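
The interactive session in this answer is Python 2 (`urllib2`, the `print` statement, `u''` literals), while the question asks for Python 3. A rough Python 3 sketch of the same idea follows, assuming only the standard library: `html.parser.HTMLParser` stands in for BeautifulSoup and a regular expression stands in for NLTK's `word_tokenize`, so token boundaries will differ from the output shown in the answer. `ClassText`, `fetch_html`, `word_freq_for_class`, and the sample markup are hypothetical names and data.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.request import urlopen

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassText(HTMLParser):
    """Collect text inside elements whose class attribute matches exactly.

    Assumes well-formed markup in which every non-void tag is closed.
    """

    def __init__(self, class_value):
        super().__init__()
        self.class_value = class_value
        self.depth = 0  # > 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("class") == self.class_value:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def fetch_html(url):
    """Download a page and decode it; urlopen().read() returns bytes in Python 3."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def word_freq_for_class(html, class_value):
    """Count word tokens found inside elements with the given class attribute."""
    parser = ClassText(class_value)
    parser.feed(html)
    parser.close()
    text = " ".join(chunk.strip() for chunk in parser.chunks)
    return Counter(re.findall(r"\w+", text))


# Hypothetical markup imitating the page layout described in the answer.
sample = ('<div class="nav">Zouk Zouk</div>'
          '<div class="paragraph section"><p>Zouk is a club.</p>'
          '<p>Zouk has <b>three</b> clubs.</p></div>'
          '<div class="footer">Copyright Zouk</div>')
word_freq = word_freq_for_class(sample, "paragraph section")
print(word_freq["Zouk"])
print(word_freq.most_common(3))
```

To try it against a live page, pass `fetch_html(url)` as the first argument to `word_freq_for_class`; that call needs network access and is not exercised in the sample above.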