Question

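I need to use a pipeline (plus whatever Python scripts are necessary) to extract the 10 most frequently used words from a text; the output is a block of all-uppercase words separated by spaces. The pipeline needs to be able to pull text from any external file: I've managed to get it working on .txt files, but I also need to be able to give it a URL and have it do the same thing.

I have the following code: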

alias words="tr a-z A-Z | tr -cs A-Z | tr ' ' '\012' | sort -n | uniq -c | sort -r | head -n 10 | awk '{printf \"%s \", \$2}END{print \"\"}'"

Running cat hamlet.txt | words gives me:

TO THE AND A  'TIS THAT OR OF IS

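To make it more complicated, I need to exclude any "function" words: these are "non-lexical" words such as 'a', 'the', 'of', 'is', any pronouns (I, you, him) and any prepositions (there, at, from).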
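
One way to handle the function-word requirement is to do the counting in Python instead of sort/uniq. The following is only a minimal sketch: the name top_words and the FUNCTION_WORDS set are made up for illustration, and the stop list is deliberately tiny, so a real one would need to be much longer.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop list (an assumption, not a standard list).
FUNCTION_WORDS = {
    "A", "AN", "THE", "OF", "IS", "TO", "AND", "OR", "THAT",
    "I", "YOU", "HE", "HIM", "IN", "AT", "FROM",
}


def top_words(text, n=10, stopwords=FUNCTION_WORDS):
    """Return the n most frequent upper-cased words that are not stop words."""
    # Upper-case first, then take runs of letters; an apostrophe is kept only
    # when it sits between letters, so DON'T stays one word.
    words = re.findall(r"[A-Z]+(?:'[A-Z]+)*", text.upper())
    counts = Counter(word for word in words if word not in stopwords)
    # most_common sorts by count, highest first.
    return [word for word, _ in counts.most_common(n)]
```

At the end of a pipe, a script could read sys.stdin.read(), pass it to top_words, and print the result joined with spaces, which would give the same shape of output as the words alias.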

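I need to be able to type htmlstrip http://www.google.com.au | words and have it print out as above.

For opening the URL: the Python script I'm trying to figure out (let's call it htmlstrip) strips any tags from the text, leaving only the "human-readable" text. It should be able to open any given URL, but I can't figure out how to get it to work. What I have so far: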

import re
import urllib2
filename = raw_input('File name: ')
filehandle = open(filename)
html = filehandle.read()

f = urllib2.urlopen('http://') #???
print f.read()

text = [ ]
inTag = False


for ch in html:
    if ch == '<':
        inTag = True
    if not inTag:
        text.append(ch)
    if ch == '>':
        inTag = False

print ''.join(text)

I know this is incomplete and possibly incorrect; any guidance would be much appreciated.

Answer 1

Use re.sub:

import re

text = re.sub(r"<[^>]+>", " ", html)  # html holds the page source; [^>]+ stops at the first ">"

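For special cases such as scripts, you can add a regular expression like the one below. Use it with re.DOTALL so that . also matches newlines, and apply it before removing the remaining tags:

<script.*?>.*?</script>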
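
To make the regex approach concrete, here is a rough sketch using only the standard library. The function name strip_tags is invented for this example. A greedy pattern such as <.+> would match from the first < to the last > on a line, so the sketch uses [^>]+ and non-greedy .*? instead. Regular expressions are not a real HTML parser, so odd markup (for example a > inside a comment or attribute value) can still slip through.

```python
import html
import re

# <script>/<style> elements including their contents.
# DOTALL lets "." span newlines; IGNORECASE also covers <SCRIPT>.
_BLOCKS = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.DOTALL | re.IGNORECASE)
# Any remaining tag: "<", then everything up to the next ">".
_TAGS = re.compile(r"<[^>]+>")


def strip_tags(markup):
    """Return markup with tags removed and whitespace collapsed."""
    text = _BLOCKS.sub(" ", markup)   # drop scripts and styles first
    text = _TAGS.sub(" ", text)       # then every other tag
    text = html.unescape(text)        # turn &amp; and friends into characters
    return re.sub(r"\s+", " ", text).strip()
```

Removing the script and style blocks before the generic tag pass matters: otherwise their contents would be left behind as "text" and end up in the word counts.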

Answer 2

You can use scrape.py and a regular expression, like this:

#!/usr/bin/env python
# Requires the third-party scrape.py module to be importable.

from scrape import s
import re
import sys

if len(sys.argv) < 2:
    print("Usage: words.py url")
    sys.exit(1)  # non-zero exit status signals incorrect usage

s.go(sys.argv[1])  # fetch content
text = s.doc.text  # extract readable text
text = re.sub(r"\W+", " ", text)  # collapse each run of non-word characters into a single space
print(text)

Then just run: ./words.py http://whatever.com
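
If installing scrape.py is not an option, the "extract readable text" step can be approximated with the standard library's html.parser. This is a sketch under that assumption, not a drop-in replacement for s.doc.text; the names TextExtractor and readable_text are made up here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text nodes.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.parts.append(data)


def readable_text(markup):
    """Return the visible text of markup with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

Text pieces are joined with a space, so a word that is split by inline markup (such as foo<b>bar</b>) comes out as two words. For a top-10 word count that is usually an acceptable trade-off.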

Answer 3

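Update: Sorry, I just read the comment about pure Python without any additional modules. Yes, in that case I think re would be the best way.

Might it be easier, and more correct, to use pycURL rather than stripping tags with re?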

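from io import BytesIO  # pycurl delivers the body as bytes
import pycurl

url = 'http://www.google.com/'

storage = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, url)
c.setopt(c.WRITEFUNCTION, storage.write)
c.perform()
c.close()
# Assumes the page is UTF-8; undecodable bytes are replaced rather than raising.
content = storage.getvalue().decode('utf-8', 'replace')
print(content)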
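
For the pure-Python case, the same fetch can be sketched with urllib.request from the Python 3 standard library (the successor of the urllib2 module used in the question). The function name fetch, the 10-second timeout, and the UTF-8 fallback are assumptions made for this example.

```python
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Return the body at url decoded to a str."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Prefer the charset from the Content-Type header;
        # UTF-8 is an assumed fallback when none is declared.
        charset = response.headers.get_content_charset() or "utf-8"
    # Replace undecodable bytes instead of raising an error.
    return raw.decode(charset, errors="replace")
```

A hypothetical htmlstrip script could call fetch(sys.argv[1]), run the result through whichever tag-stripping step is chosen, and print the text so that it can be piped into words.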

Bash / Python: Open a URL & print the top 10 words

3 answers: