Tokenizing words from text scraped from the web

Date: 2019-05-01 08:30:24

Tags: python nltk

I'm trying to get my code to scrape http://www.pythonscraping.com/pages/warandpeace.html and then print out the 10 most frequently used English words. However, my code only finds the most common paragraphs/sentences rather than words, so I get this garbage:

[("Well, Prince, so Genoa and Lucca are now just family estates of the\nBuonapartes. But I warn you, if you don't tell me that this means war,\nif you still try to defend the infamies and horrors perpetrated by\nthat Antichrist- I really believe he is Antichrist- I will have\nnothing more to do with you and you are no longer my friend, no longer\nmy 'faithful slave,' as you call yourself! But how do you do? I see\nI have frightened you- sit down and tell me all the news.", 1),
('If you have nothing better to do, Count [or Prince], and if the\nprospect of spending an evening with a poor invalid is not too\nterrible, I shall be very charmed to see you tonight between 7 and 10-\nAnnette Scherer.',   1),
('Heavens! what a virulent attack!', 1),
("First of all, dear friend, tell me how you are. Set your friend's\nmind at rest,",   1),
('Can one be well while suffering morally? Can one be calm in times\nlike these if one has any feeling?',   1),
('You are\nstaying the whole evening, I hope?', 1), 
("And the fete at the English ambassador's? Today is Wednesday. I\nmust put in an appearance there,",   1),
('My daughter is\ncoming for me to take me there.', 1),
("I thought today's fete had been canceled. I confess all these\nfestivities and fireworks are becoming wearisome.",   1),

My code is:

import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from urllib.request import urlopen
from bs4 import BeautifulSoup

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
soup = BeautifulSoup(html, 'html.parser')
nameList = [tag.text for tag in soup.findAll("span", {"class": "red"})]
filtered_words = [word for word in nameList if word not in stopwords.words('english')]
fdist1 = nltk.FreqDist(nameList)
fdist1.most_common(10)

I tried to tokenize nameList by adding "token = nltk.word_tokenize(nameList)", but that ended with TypeError: expected string or bytes-like object.

Can a tokenizer be used on web-scraped text at all? I also tried splitting with split(), but that ended with AttributeError: 'list' object has no attribute 'split'.

How can I break this body of text down into individual words?

2 Answers:

Answer 0 (score: 1):

nameList is a list of texts. By itself it does not contain individual words, so it cannot be processed correctly as-is. You are making the following mistakes:

  1. You are searching over whole texts, not over the words inside them
  2. FreqDist is computed over nameList (the texts), not over filtered_words

You should replace the last code block with this:

# Remember filtered words between texts
filtered_words = []
# Check all texts
for text in nameList:
    # Replace EOLs with ' ', split by ' ' and filter stopwords
    filtered_words += [word for word in text.replace('\n', ' ').split(' ') if word not in stopwords.words('english')]

# Count word frequencies
fdist1 = nltk.FreqDist(filtered_words)
fdist1.most_common(10)
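
Side note (not part of the original answer): since the question's code already builds stop_words = set(stopwords.words('english')), a minimal variant of the loop above can reuse that set so the stopword list is not re-read on every iteration:

filtered_words = []
for text in nameList:
    # Same splitting as above, but skip empty tokens and use the precomputed stop_words set
    filtered_words += [word for word in text.replace('\n', ' ').split(' ')
                       if word and word not in stop_words]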

Also, nltk has a tokenize submodule that can (and should) be used instead of manual splitting. It works best on natural-language text:

nltk.tokenize.casual_tokenize(nameList[2])

This returns:

['Heavens', '!', 'what', 'a', 'virulent', 'attack', '!']
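
To tie this back to the original goal, here is a minimal sketch (my addition, assuming the nameList and stop_words variables from the question's code) that tokenizes every scraped paragraph and prints the 10 most common words:

tokens = []
for text in nameList:
    # Tokenize each scraped paragraph, keep alphabetic tokens, drop stopwords
    tokens += [w.lower() for w in nltk.tokenize.casual_tokenize(text)
               if w.isalpha() and w.lower() not in stop_words]

fdist = nltk.FreqDist(tokens)
print(fdist.most_common(10))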

Answer 1 (score: 1):

Maybe this will help you:

First, use re.split() on each element (sentence) of your nameList:

import re
nameList_splitted=[re.split(';|,|\n| ',x) for x in nameList]

This gives you a list of lists of single words, which can then be merged into one final list like this:

# Flatten the list of lists into one list of words
list_of_words = []
for list_ in nameList_splitted:
    list_of_words += list_

The result is:

['Well',
 '',
 'Prince',
 '',
 'so',
 'Genoa',
 'and',
 'Lucca',
 'are',
 'now',
...
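
As the output shows, re.split leaves empty strings in the list. A small follow-up sketch (my addition, reusing stop_words and nltk.FreqDist from the question's code) that drops them and counts the most common words:

# Drop the empty strings produced by re.split, lowercase, and filter stopwords
words = [w.lower() for w in list_of_words if w and w.lower() not in stop_words]

fdist = nltk.FreqDist(words)
print(fdist.most_common(10))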