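I am scraping a batch of speeches from http://www.millercenter.org. The speeches are scraped and formatted the way I want, except for one small thing. Every document (all 911 of them) has the word "transcript" at the beginning, and I don't want it in the documents when I move on to some NLP. I can't get rid of it, and I have tried both the replace and remove methods. I even tried extending the find call to target the <h2>Transcript</h2> part of the HTML at the start of each document.
Here are examples of the documents I'm looking at: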
transcript
to the senate and house of representatives
i lay before congress several dispatches from his
and
transcript
the period for a new election of a citizen to administer the executive government
Here is my code:
import urllib2,sys,os
from bs4 import BeautifulSoup,NavigableString
from string import punctuation as p
from multiprocessing import Pool
import re, nltk
import requests
reload(sys)
chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752'
chester_3752 = urllib2.urlopen(chester_url).read()
chester_3752 = BeautifulSoup(chester_3752)
# find the speech itself within the HTML
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'})
# removes extraneous characters (e.g. '<br/>')
chester_3752 = chester_3752.text.lower()
# for further text analysis, remove punctuation
punctuation = re.compile('[{}]+'.format(re.escape(p)))
chester_3752 = punctuation.sub('', chester_3752)
chester_3752 = chester_3752.replace('—',' ')
chester_3752 = chester_3752.replace('transcript','')
As I said, the final replace call doesn't seem to work. Any ideas?
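Answer 0 (score: 1)
I'm not sure what your problem is, but when I ran this with Python 3.4 and bs4, it removed "transcript" along with a bunch of punctuation. (I took out a bunch of the imports and changed urllib2 to urllib.request.)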
import urllib.request
import re
from bs4 import BeautifulSoup
from string import punctuation as p
chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752'
chester_3752 = urllib.request.urlopen(chester_url).read()
chester_3752 = BeautifulSoup(chester_3752)
# find the speech itself within the HTML
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'})
# removes extraneous characters (e.g. '<br/>')
chester_3752 = chester_3752.text.lower()
# for further text analysis, remove punctuation
punctuation = re.compile('[{}]+'.format(re.escape(p)))
chester_3752 = punctuation.sub('', chester_3752)
chester_3752 = chester_3752.replace('—',' ')
chester_3752 = chester_3752.replace('transcript','')
print(chester_3752)
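Answer 1 (score: 1)
I tried your code and it works fine, but I suggest a small adjustment. Instead of replace, use startswith to make sure the string really begins with transcript. replace removes every occurrence of "transcript" from the whole string, whereas what you actually need is to remove it only at the beginning of the string.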
import urllib2
import sys
from string import punctuation as p
import re
from bs4 import BeautifulSoup  # needed for the BeautifulSoup(...) call below
reload(sys)
chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752'
chester_3752 = urllib2.urlopen(chester_url).read()
chester_3752 = BeautifulSoup(chester_3752)
# find the speech itself within the HTML
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'})
# removes extraneous characters (e.g. '<br/>')
chester_3752 = chester_3752.text.lower()
# for further text analysis, remove punctuation
punctuation = re.compile('[{}]+'.format(re.escape(p)))
chester_3752 = punctuation.sub('', chester_3752)
chester_3752 = chester_3752.replace('-',' ')
print(chester_3752)
# chester_3752 = chester_3752.replace('transcript','')  # avoid this: it deletes every occurrence of "transcript" in the string
if chester_3752.startswith("transcript"):  # only strip "transcript" when it sits at the very beginning, which is what you want
    chester_3752 = chester_3752[10:].strip()  # 10 == len("transcript")
print(chester_3752)
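One caveat with a bare startswith check: if the extracted text begins with whitespace or a newline, the check fails and nothing is removed. Below is a minimal sketch of a variant that tolerates leading whitespace. It swaps startswith for a regular expression, and the helper name strip_leading_transcript is my own, not something from the original code. It assumes the text has already been lowercased, as in the code above.

```python
import re

# Matches optional leading whitespace, then the whole word "transcript",
# then any whitespace that follows it. Anchored at the start of the string.
_LEADING_TRANSCRIPT = re.compile(r'^\s*transcript\b\s*')


def strip_leading_transcript(text):
    """Remove a leading "transcript" heading; leave later occurrences alone.

    Hypothetical helper for illustration. Expects lowercased text.
    """
    return _LEADING_TRANSCRIPT.sub('', text, count=1)


if __name__ == '__main__':
    sample = '\n transcript\nto the senate and house of representatives'
    print(strip_leading_transcript(sample))
```

Because of the \b word boundary, a document that happens to start with a longer word such as "transcripts" is left untouched, and an occurrence of "transcript" in the middle of a speech is never removed.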