Removing words from a Python list?

Asked: 2015-04-05 19:00:28

Tags: python web-scraping

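I'm a complete newbie to Python and web scraping, and I ran into some problems very early on. I've been able to scrape the titles from a Dutch news site and split them into words. Now my goal is to remove certain words from the results. For example, I don't want words like "het" and "om" in the list. Does anyone know how I can do this? (I'm using Python requests and BeautifulSoup.)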



import requests
from bs4 import BeautifulSoup

url="http://www.nu.nl"
r=requests.get(url)

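# Name the parser explicitly so the result does not depend on which parsers are installed.
soup=BeautifulSoup(r.content, "html.parser")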

g_data=soup.find_all("span" , {"class": "title"})


for item in g_data:
    print(item.text.split())

 




1 Answer:

Answer 0: (score: 0)

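In natural language processing, common words that get excluded are called "stop words".

Do you want to preserve the order and count of each word, or do you just want the set of words that appear on the page?

If you only want the set of words that appear on the page, a set is probably the way to go. Something like the following might work: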

# It's probably more common to define your STOP_WORDS in a file and then read
# them into your data structure to keep things simple for large numbers of those
# words.
STOP_WORDS = set([
    'het',
    'om'
])

all_words = set()
for item in g_data:
    all_words |= set(item.text.split())
all_words -= STOP_WORDS
print(all_words)
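
The comment above mentions keeping the stop words in a file. Here is a minimal sketch of that idea, assuming Python 3 and a plain text file with one word per line. The `load_stop_words` helper and the demo file are made up for illustration; they are not part of the original code.

```python
import os
import tempfile


def load_stop_words(path):
    """Read one stop word per line, skipping blank lines; return a lowercase set."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


# Write a small demo file so the sketch runs on its own; in practice the
# file would already exist next to your script.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("het\nom\n\nde\n")
    demo_path = tmp.name

stop_words = load_stop_words(demo_path)
os.remove(demo_path)  # clean up the demo file

print(sorted(stop_words))
```

The resulting set can be used in place of the hard-coded `STOP_WORDS` above.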

On the other hand, if you care about order, you can simply avoid adding the stop words to your list.

words_in_order = []
for item in g_data:
    words_from_span = item.text.split()
    # You might want to break this out into its own function for modularity.
    for word in words_from_span:
        if word not in STOP_WORDS:
            words_in_order.append(word)
print(words_in_order)
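
As the comment suggests, the filtering could live in its own function. The sketch below is one way to do that, assuming Python 3. The function name `remove_stop_words` and the sample headline are hypothetical. It also ignores case and surrounding punctuation when matching, which the loop above does not do, since a headline may start with "Het" rather than "het".

```python
import string

STOP_WORDS = {"het", "om"}


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Return words in their original order, minus any stop words.

    Matching ignores case and leading/trailing punctuation, so "Het" and
    "om," both count as stop words. Tokens made only of punctuation are
    dropped as well. Kept words are returned unchanged.
    """
    kept = []
    for word in words:
        normalized = word.strip(string.punctuation).lower()
        if normalized and normalized not in stop_words:
            kept.append(word)
    return kept


# A made-up headline standing in for item.text from the scraped spans.
headline = "Het kabinet wil geld lenen om wegen te bouwen"
print(remove_stop_words(headline.split()))
```

Inside the scraping loop you would call it as `remove_stop_words(item.text.split())`.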

If you don't care about order but do want frequency, you can create a dict (or a defaultdict, for convenience) that maps words to counts.

from collections import defaultdict
word_counts = defaultdict(int)
for item in g_data:
    # You might want to break this out into its own function for modularity.
    for word in item.text.split():
        if word not in STOP_WORDS:
            word_counts[word] += 1
# items() works on both Python 2 and 3; iteritems() exists only on Python 2.
for word, count in word_counts.items():
    print('%s: %d' % (word, count))
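
The standard library also has `collections.Counter`, which does the same counting with less code. This is an alternative to the defaultdict approach, not something the original answer used, and the headlines below are made up to stand in for the `item.text` values from `g_data`.

```python
from collections import Counter

STOP_WORDS = {"het", "om"}

# Made-up headlines standing in for the item.text values from g_data.
headlines = [
    "het weer wordt beter",
    "beter weer om te fietsen",
]

# Count every word that is not a stop word.
word_counts = Counter(
    word
    for headline in headlines
    for word in headline.split()
    if word not in STOP_WORDS
)

# most_common(n) returns the n highest counts, largest first.
for word, count in word_counts.most_common(2):
    print("%s: %d" % (word, count))
```

`most_common()` with no argument returns every word sorted by count, which is handy if you want a ranking of the most frequent words in the headlines.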