Question

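I'm scraping a bunch of speeches from http://www.millercenter.org. Apart from one small piece, my speeches are scraped and formatted the way I want. Every document (all 911 of them) has the word "transcript" at the beginning, and I don't want it in the documents when I move on to some NLP. I can't get rid of it, and I've tried both the replace and remove methods. I even tried extending the find method to the <h2>Transcript</h2> piece of HTML at the start of each document.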

Here are examples of the documents I'm looking at:

transcript
to the senate and house of representatives
i lay before congress several dispatches from his

and

transcript
the period for a new election of a citizen to administer the executive government

Here is my code:

import urllib2,sys,os
from bs4 import BeautifulSoup,NavigableString
from string import punctuation as p
from multiprocessing import Pool
import re, nltk
import requests
reload(sys)

chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752'
chester_3752 = urllib2.urlopen(chester_url).read()
chester_3752 = BeautifulSoup(chester_3752)

# find the speech itself within the HTML
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'})

# removes extraneous characters (e.g. '<br/>')
chester_3752 = chester_3752.text.lower()

# for further text analysis, remove punctuation
punctuation = re.compile('[{}]+'.format(re.escape(p)))

chester_3752 = punctuation.sub('', chester_3752)
chester_3752 = chester_3752.replace('—',' ')
chester_3752 = chester_3752.replace('transcript','')

Like I said, the final replace call doesn't seem to work. Thoughts?

Answer 1

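Not sure what your problem is, but when I ran it with Python 3.4 and bs4, it removed "transcript" along with a bunch of punctuation. (I took out a bunch of the imports and changed urllib2 to urllib.request.)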

import re
import urllib.request
from string import punctuation as p

from bs4 import BeautifulSoup

chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752'
chester_3752 = urllib.request.urlopen(chester_url).read()
chester_3752 = BeautifulSoup(chester_3752)

# find the speech itself within the HTML
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'})

# removes extraneous characters (e.g. '<br/>')
chester_3752 = chester_3752.text.lower()

# for further text analysis, remove punctuation
punctuation = re.compile('[{}]+'.format(re.escape(p)))

chester_3752 = punctuation.sub('', chester_3752)
chester_3752 = chester_3752.replace('—',' ')
chester_3752 = chester_3752.replace('transcript','')

print(chester_3752)

Answer 2

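I tried your code and it runs fine, but I'd suggest a small tweak. Instead of using replace, use startswith to make sure the string really does begin with transcript. replace removes every occurrence of transcript from the whole string, but what you actually need is to remove it only at the start of the string.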

import urllib2
import sys
from string import punctuation as p
import re

from bs4 import BeautifulSoup  # needed for the BeautifulSoup(...) call below

reload(sys)

chester_url = 'http://millercenter.org/president/arthur/speeches/speech-3752'
chester_3752 = urllib2.urlopen(chester_url).read()
chester_3752 = BeautifulSoup(chester_3752)

# find the speech itself within the HTML
chester_3752 = chester_3752.find('div',{'id': 'transcript'},{'class': 'displaytext'})

# removes extraneous characters (e.g. '<br/>')
chester_3752 = chester_3752.text.lower()

# for further text analysis, remove punctuation
punctuation = re.compile('[{}]+'.format(re.escape(p)))

chester_3752 = punctuation.sub('', chester_3752)
chester_3752 = chester_3752.replace('-',' ')
print(chester_3752)

# chester_3752 = chester_3752.replace('transcript','')  # avoid this: it deletes every instance of transcript in the string

if chester_3752.startswith("transcript"):  # only removes transcript at the beginning of the string, which is what you want
    chester_3752 = chester_3752[len("transcript"):].strip()
print(chester_3752)
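
For anyone who wants to try the startswith idea without hitting the site, here is a minimal standalone sketch in Python 3. The helper name strip_leading_word, its window parameter (taken from the "first 20 characters" in the question title), and the sample string are my own assumptions for illustration; none of them come from the code above. It only drops the word when it sits at the very start of the text (ignoring leading whitespace) and leaves later occurrences alone.

```python
def strip_leading_word(text, word="transcript", window=20):
    """Remove `word` only when it opens the document.

    `window` is a hypothetical cut-off: the word has to end within the
    first `window` characters of the original text to be removed.
    """
    stripped = text.lstrip()
    offset = len(text) - len(stripped)  # amount of leading whitespace skipped
    if stripped.startswith(word) and offset + len(word) <= window:
        # Drop the heading word and any whitespace that followed it.
        return stripped[len(word):].lstrip()
    return text  # no leading heading found: return the text unchanged


# Made-up sample text shaped like the documents in the question.
sample = (
    "transcript\n"
    "to the senate and house of representatives\n"
    "the transcript of the debate"
)
cleaned = strip_leading_word(sample)
print(cleaned)
```

The second "transcript" in the sample survives, which is the difference from calling replace on the whole string. On Python 3.9 or later, str.removeprefix does much the same job for the simple case where the string has no leading whitespace.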

Webscraping: delete a word if it appears in the first 20 characters of the document?

2 answers: