Web scraping: how do I delete a word if it appears in the first 20 characters of a document?

Time: 2015-10-06 00:45:24

Tags: python html web-scraping beautifulsoup

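I'm scraping a bunch of speeches from http://www.millercenter.org. The speeches are scraped and formatted the way I want, except for one small piece. Every document (all 911 of them) has the word "transcript" at the very beginning, and I don't want it in the documents when I move on to some NLP. I haven't been able to remove it, and I have tried both the replace and remove methods. I even tried extending the find method to the <h2>Transcript</h2> piece of HTML at the start of each document.

Here is a sample of the documents I'm looking at: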

transcript
to the senate and house of representatives
i lay before congress several dispatches from his

transcript
the period for a new election of a citizen to administer the executive government

Here is my code:

import urllib2,sys,os
from bs4 import BeautifulSoup,NavigableString
from string import punctuation as p
from multiprocessing import Pool
import re, nltk
import requests
reload(sys)

chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752'
chester_3752 = urllib2.urlopen(chester_url).read()
chester_3752 = BeautifulSoup(chester_3752)

# find the speech itself within the HTML
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'})

# removes extraneous characters (e.g. '<br/>')
chester_3752 = chester_3752.text.lower()

# for further text analysis, remove punctuation
punctuation = re.compile('[{}]+'.format(re.escape(p)))

chester_3752 = punctuation.sub('', chester_3752)
chester_3752 = chester_3752.replace('—',' ')
chester_3752 = chester_3752.replace('transcript','')

As I said, the final replace call doesn't seem to work. Any ideas?

2 answers:

Answer 0 (score: 1)

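I'm not sure what your problem is, but when I ran this with Python 3.4 and bs4, it removed "transcript" along with a bunch of punctuation. (I took out a bunch of the imports and changed urllib2 to urllib.request.)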

import urllib.request
import re
from string import punctuation as p

from bs4 import BeautifulSoup

chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752'
chester_3752 = urllib.request.urlopen(chester_url).read()
chester_3752 = BeautifulSoup(chester_3752)

# find the speech itself within the HTML
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'})

# removes extraneous characters (e.g. '<br/>')
chester_3752 = chester_3752.text.lower()

# for further text analysis, remove punctuation
punctuation = re.compile('[{}]+'.format(re.escape(p)))

chester_3752 = punctuation.sub('', chester_3752)
chester_3752 = chester_3752.replace('—',' ')
chester_3752 = chester_3752.replace('transcript','')

print(chester_3752)
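
For anyone wondering why the code needs a separate replace for the long dash: string.punctuation only contains ASCII characters, so the regex built from it leaves the em dash (U+2014) in place. Here is a minimal sketch of that step on a made-up sample string (the sample text is my own, not taken from the site), assuming Python 3:

```python
import re
from string import punctuation

# Character class built from every ASCII punctuation character.
punct_re = re.compile('[{}]+'.format(re.escape(punctuation)))

# Hypothetical sample; "\u2014" is the em dash.
sample = "transcript\nto the senate, and house\u2014of representatives!"

# Strips ',' and '!' but leaves the em dash, which is not ASCII punctuation.
no_punct = punct_re.sub('', sample)

# The em dash therefore has to be handled on its own.
no_dash = no_punct.replace('\u2014', ' ')
print(no_dash)
```

Replacing the dash with a space rather than an empty string keeps the two neighbouring words from being glued together.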

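Answer 1 (score: 1)

I tried your code and it works fine, but I would suggest a small tweak. Instead of using replace, use startswith to make sure the string really does begin with transcript. replace removes every occurrence of transcript from the whole string, whereas what you actually want is to remove transcript only at the beginning of the string.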

import urllib2
import re
from string import punctuation as p

from bs4 import BeautifulSoup  # needed for the BeautifulSoup(...) call below

chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752'
chester_3752 = urllib2.urlopen(chester_url).read()
chester_3752 = BeautifulSoup(chester_3752)

# find the speech itself within the HTML
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'})

# removes extraneous characters (e.g. '<br/>')
chester_3752 = chester_3752.text.lower()

# for further text analysis, remove punctuation
punctuation = re.compile('[{}]+'.format(re.escape(p)))

chester_3752 = punctuation.sub('', chester_3752)
chester_3752 = chester_3752.replace('-',' ')
print(chester_3752)

# chester_3752 = chester_3752.replace('transcript','') #avoid this as it will delete all instances of transcript in the string

if chester_3752.startswith("transcript"):  # only remove the word when it is at the very beginning of the string, which is what you want
    chester_3752 = chester_3752[len("transcript"):].strip()
print(chester_3752)
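
One caveat: if the scraped text begins with a newline or spaces, a bare startswith check will not match. Below is a rough sketch of a helper that tolerates leading whitespace and only removes the word when it falls within the first 20 characters, as the question title asks. The function name strip_leading_word, its window parameter, and the sample string are all my own inventions for illustration, not part of the original code:

```python
def strip_leading_word(text, word="transcript", window=20):
    """Remove `word` only when it opens `text` (ignoring leading
    whitespace) and ends within the first `window` characters."""
    stripped = text.lstrip()
    offset = len(text) - len(stripped)  # amount of leading whitespace skipped
    if stripped.startswith(word) and offset + len(word) <= window:
        return stripped[len(word):].lstrip()
    return text  # word is not at the start, so leave the text untouched


# Hypothetical sample: the heading word, then a body that mentions the word again.
sample = "\ntranscript\nto the senate and house of representatives\nthe transcript was read aloud"
cleaned = strip_leading_word(sample)
print(cleaned)
```

Unlike replace, this leaves the later "transcript" in the body alone. Note that a plain prefix check would also match a longer word such as "transcripts", so a stricter version might want a word-boundary test.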