Lemmatization of web-scraped data

Date: 2019-03-22 10:28:41

Tags: python nlp text-parsing stemming lemmatization

Suppose I have a text file like the following:

document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'

(or, as a more complex text example:

document = '<p>Forde Education are looking to recruit a Teacher of Geography for an immediate start in a Doncaster Secondary school.</p> <p>The school has a thriving and welcoming environment with very high expectations of students both in progress and behaviour.&nbsp; This position will be working&nbsp;until Easter with a&nbsp;<em><strong>likely extension until July 2011.</strong></em></p> <p>The successful candidates will need to demonstrate good practical subject knowledge  but also possess the knowledge and experience to teach to GCSE level with the possibility of teaching to A’Level to smaller groups of students.</p> <p>All our candidate will be required to hold a relevant teaching qualifications with QTS  successful applicants will be required to provide recent relevant references and undergo a Enhanced CRB check.</p> <p>To apply for this post or to gain information regarding similar roles please either submit your CV in application or Call Debbie Slater for more information.&nbsp;</p>' 

I am applying a series of NLP pre-processing techniques to get a "cleaner" version of the document, and I am also stemming every word.

I am using the following code for this:

import re
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

stemmer_1 = PorterStemmer()
stemmer_2 = LancasterStemmer()
stemmer_3 = SnowballStemmer(language='english')

# Remove all the special characters
document = re.sub(r'\W', ' ', document)

# remove all single characters
document = re.sub(r'\b[a-zA-Z]\b', ' ', document)

# Substituting multiple spaces with single space
document = re.sub(r' +', ' ', document, flags=re.I)

# Converting to lowercase
document = document.lower()

# Tokenisation
document = document.split()

# Stemming
document = [stemmer_3.stem(word) for word in document]

# Join the words back to a single document
document = ' '.join(document)

This produces the following output for the first document above:

'am sent am anoth sent am third sent'

(and this output for the more complex example:

'ford educ are look to recruit teacher of geographi for an immedi start in doncast secondari school the school has thrive and welcom environ with veri high expect of student both in progress and behaviour nbsp this posit will be work nbsp until easter with nbsp em strong like extens until juli 2011 strong em the success candid will need to demonstr good practic subject knowledg but also possess the knowledg and experi to teach to gcse level with the possibl of teach to level to smaller group of student all our candid will be requir to hold relev teach qualif with qts success applic will be requir to provid recent relev refer and undergo enhanc crb check to appli for this post or to gain inform regard similar role pleas either submit your cv in applic or call debbi slater for more inform nbsp'

What I would like to do now is to get exactly the same kind of output as above, but after applying lemmatization instead of stemming.

However, unless I am missing something, this requires splitting the original document into (sensible) sentences, applying POS tagging and then applying the lemmatization.

Things get a bit complicated here, though, because the text data comes from web scraping, so you run into many HTML tags such as <br>, <p>, etc.

My idea is that a sequence of words should be treated as a separate sentence every time it ends either with some common punctuation mark (full stop, exclamation mark, etc.) or with an HTML tag such as <br>, <p>, etc.

For example, the original document above:

document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'

should be split into the following:

['I am a sentence', 'I am another sentence', 'I am a third sentence']
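
A rough sketch of that splitting idea, assuming sentence-final punctuation and HTML tags are the only boundaries that matter (the regex below is my own rough attempt, not tested beyond this example):

import re

document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'

# Split on '.', '!' or '?' as well as on any HTML tag such as <p> or <br>,
# then drop empty fragments and surrounding whitespace.
parts = re.split(r'[.!?]|<[^>]+>', document)
sentences = [part.strip() for part in parts if part.strip()]

print(sentences)
# ['I am a sentence', 'I am another sentence', 'I am a third sentence']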

Then, I think, we would apply POS tagging on each sentence, split each sentence into words, apply lemmatization and .join() the words back into a single document, as I did with the code above.

How can I do this?

1 Answer:

Answer 0 (score: 1)

Removing HTML tags is a common part of text refinement. You can write your own rules, e.g. text.replace('<p>', '.'), but there is a better solution: html2text. This library can do all the dirty HTML-refining work for you, for example:

>>> import html2text
>>> h = html2text.HTML2Text()
>>> h.ignore_links = True
>>> print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))
Hello, world!

You can import this library in your Python code, or use it as a standalone program.

Edit: here is a small example that splits your text into sentences:

>>> import re, html2text, nltk
>>> document = '<p> I am a sentence. I am another sentence <p> I am a third sentence.'
>>> text_without_html = html2text.html2text(document)
>>> refined_text = re.sub(r'\n+', '. ', text_without_html)
>>> sentences = nltk.sent_tokenize(refined_text)
>>> sentences

['I am a sentence.', 'I am another sentence.', 'I am a third sentence..']
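
From here, one way to finish the pipeline the question describes is to POS-tag each sentence and pass the tags to a lemmatizer. Below is a minimal sketch, assuming NLTK's pos_tag and WordNetLemmatizer are available (with the relevant NLTK data downloaded) and reusing the sentences list from the snippet above; the tag-mapping helper to_wordnet_pos is my own, not an NLTK function:

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def to_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags to the WordNet POS categories
    # that WordNetLemmatizer.lemmatize() accepts.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmas = []
for sentence in sentences:
    for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
        if word.isalpha():  # drop punctuation tokens such as '.'
            lemmas.append(lemmatizer.lemmatize(word.lower(), to_wordnet_pos(tag)))

document = ' '.join(lemmas)

For the three example sentences this should produce something like 'i be a sentence i be another sentence i be a third sentence', since 'am' is reduced to its lemma 'be' once it is tagged as a verb.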