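How to remove random line breaks from HTML text

Time: 2016-06-07 19:09:43

Tags: python html regex python-3.x web-scraping

I want to strip some text out of a few HTML documents, but I can't get rid of some line breaks. At the moment I have Beautiful Soup parse the page, then I read all the lines and try to remove every newline from the text, but I can't get rid of the ones in the middle of a string. For example,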

<font face="ARIAL" size="2">Thomas
H. Lenagh </font>

I want to get this person's name on a single line, but there is some kind of line break in the middle. Here is what I have tried so far:

line=line.replace("\n"," ")
line=line.replace("\\n"," ")
line=line.replace("\r\n", " ")
line=line.replace("\t", " ")
line=line.replace("\\r\\n"," ")

I have also tried the following regular expressions:

line=re.sub("\n"," ",line)
line=re.sub("\\n", " ",line)
line=re.sub("\s\s+", " ",line)

So far none of these has worked, and I'm not sure which character I'm missing. Any ideas?

Edit: here is the full code I'm using (minus error checking):

soup=BeautifulSoup(threePage) #make the soup
paragraph=soup.stripped_strings
if paragraph is not None: 
    for i in range (len(data)): #for all rows...
        lineCounter=lineCounter+1
        row =data[i]
        row=row.replace("\n"," ") #remove newline (<enter>) characters
        row = re.sub("---+"," ",row) #remove dashed lines
        row =re.sub(","," ",row) #replace commas with spaces
        row=re.sub("\s\s+", " ",row) #remove 
        if ("/s/" in row): #if /s/ is in the row, remove it
             row=re.sub(".*/s/"," ",row)
        if ("/S/" in row): #upper case of the last removal
             row=re.sub(".*/S/"," ",row)
        row = row.replace("\n"," ")
        row=row.strip()#remove any weird characters

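1 Answer:

Answer 0: (score: 0)

You haven't shown what the rest of your code looks like after the for loop, but I'm guessing a very simplified version is something like this: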

data = ["a\nb", "c\nd", "e\nf"]

for i in range(len(data)):
    row = data[i]
    row = row.replace("\n", "")

#let's see if that fixed it...    
print(data)
#output: ['a\nb', 'c\nd', 'e\nf']
#hey, the newlines are still there! What gives?

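That's because calling replace on a string doesn't change it in place, and assigning a new value to row doesn't change the value stored in data. If you want data to change as well, you have to assign the new value back into it.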

data = ["a\nb", "c\nd", "e\nf"]

for i in range(len(data)):
    row = data[i]
    row = row.replace("\n", "")
    data[i] = row

#let's see if that fixed it...    
print(data)
#output: ['ab', 'cd', 'ef']
#looking good!
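
To see the two points above in isolation, here is a minimal sketch. The variable names (original, cleaned, names) are made up for illustration and do not come from the question's code. It shows that replace returns a new string and leaves the original untouched, and that enumerate is one way to write the result back into the list without range(len(...)).

```python
# Illustrative names only; none of these come from the question's code.
original = "Thomas\nH. Lenagh"
cleaned = original.replace("\n", " ")

print(repr(original))  # the original string still contains the newline
print(repr(cleaned))   # the new string has a space where the newline was

# enumerate yields the index and the value together, so the cleaned
# string can be stored back into the list at the same position.
names = ["Thomas\nH. Lenagh", "a\nb"]
for i, name in enumerate(names):
    names[i] = name.replace("\n", " ")

print(names)
```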

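Bonus style tip: if your replacement logic is simple enough to fit in a single expression, you can do the whole thing in one line and avoid messing around with range, indices and so on: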

data = [row.replace("\n", "") for row in data]
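
If the cleanup needs several steps, as in the question, the same comprehension still works once the steps are wrapped in a function. The sketch below is only one way to do it: clean_row is a hypothetical helper name, the sample data is invented, and the steps are a guess at what the question's loop is meant to do. It also uses split() with no arguments, which splits on any run of whitespace (spaces, tabs, \n, \r), so joining the pieces with a single space collapses line breaks inside a string as well.

```python
import re

def clean_row(row):
    """Hypothetical helper that gathers the question's cleanup steps."""
    # Collapse every run of whitespace (including \r and \n) to one space.
    row = " ".join(row.split())
    row = re.sub(r"---+", " ", row)   # remove dashed lines
    row = row.replace(",", " ")       # replace commas with spaces
    # Drop everything up to and including /s/ or /S/.
    row = re.sub(r".*/s/", " ", row, flags=re.IGNORECASE)
    # Collapse again, which also trims leading and trailing whitespace.
    return " ".join(row.split())

# Invented sample rows, not taken from the question's documents.
data = ["Thomas\r\nH. Lenagh", "/s/ John,  Doe\t", "-----"]
data = [clean_row(row) for row in data]
print(data)
```

Because the comprehension builds a new list and assigns it back to data, the cleaned values are the ones that are kept, which sidesteps the reassignment problem described above.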