Question

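I want to strip some text out of a number of HTML documents, but I can't get rid of certain line breaks. Currently I parse the page with Beautiful Soup, then read all the lines and try to remove every newline character from the text, but I can't get rid of the ones in the middle of a string. For example: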

<font face="ARIAL" size="2">Thomas
H. Lenagh </font>

I would like to get this person's name on one line, but there is some kind of line break in the middle. Here is what I have tried so far:

line=line.replace("\n"," ")
line=line.replace("\\n"," ")
line=line.replace("\r\n", " ")
line=line.replace("\t", " ")
line=line.replace("\\r\\n"," ")

I have also tried the following regular expressions:

line=re.sub("\n"," ",line)
line=re.sub("\\n", " ",line)
line=re.sub("\s\s+", " ",line)

None of these has worked so far, and I'm not sure which character I'm missing. Any ideas?

Edit: here is the full code I am using (minus error checking):

soup=BeautifulSoup(threePage) #make the soup
paragraph=soup.stripped_strings
if paragraph is not None:
    for i in range(len(data)): #for all rows...
        lineCounter=lineCounter+1
        row=data[i]
        row=row.replace("\n"," ") #remove newline (<enter>) characters
        row=re.sub("---+"," ",row) #remove dashed lines
        row=re.sub(","," ",row) #replace commas with spaces
        row=re.sub(r"\s\s+"," ",row) #collapse repeated whitespace
        if ("/s/" in row): #if /s/ is in the row, remove it
            row=re.sub(".*/s/"," ",row)
        if ("/S/" in row): #upper case of the last removal
            row=re.sub(".*/S/"," ",row)
        row=row.replace("\n"," ")
        row=row.strip() #remove leading and trailing whitespace

Answer 1

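You haven't shared what the rest of your code looks like after the for loop, but I'm guessing a heavily simplified version would be something like this: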

data = ["a\nb", "c\nd", "e\nf"]

for i in range(len(data)):
    row = data[i]
    row = row.replace("\n", "")

#let's see if that fixed it...    
print(data)
#output: ['a\nb', 'c\nd', 'e\nf']
#hey, the newlines are still there! What gives?

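This is because calling replace on a string does not modify it in place, and assigning a new value to row does not change the value stored in data. If you want data to change as well, you have to assign the value back.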

data = ["a\nb", "c\nd", "e\nf"]

for i in range(len(data)):
    row = data[i]
    row = row.replace("\n", "")
    data[i] = row

#let's see if that fixed it...    
print(data)
#output: ['ab', 'cd', 'ef']
#looking good!
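
To see the underlying behaviour in isolation, here is a small sketch (the variable names are only illustrative): str.replace returns a new string and leaves the original untouched, so the result has to be bound to something you keep.

```python
original = "Thomas\nH. Lenagh"

# replace() builds and returns a new string; `original` is not modified
cleaned = original.replace("\n", " ")

print(repr(original))  # 'Thomas\nH. Lenagh'
print(repr(cleaned))   # 'Thomas H. Lenagh'
```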

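Bonus style tip: if your replacement logic is simple enough to express in a single expression, you can do it all in one line and avoid messing with range and indexes: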

data = [row.replace("\n", "") for row in data]
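
If the cleanup takes several steps, as in the question, one option is to move it into a small helper and keep the comprehension. This is only a sketch under my own assumptions about what you want: clean_row is a hypothetical name, and the sample rows are made up. It relies on str.split() with no arguments splitting on any run of whitespace, which covers "\n", "\r\n" and tabs in one go.

```python
import re

def clean_row(row):
    """Hypothetical helper: put a row on a single line with single spaces."""
    row = re.sub(r"-{3,}", " ", row)  # drop dashed lines (three or more hyphens)
    row = row.replace(",", " ")       # replace commas with spaces
    # split() with no arguments splits on any whitespace run (spaces, \n, \r, \t)
    # and discards empty pieces, so joining with " " leaves single spaces only
    return " ".join(row.split())

# made-up sample rows, including the name from the question
data = ["Thomas\nH. Lenagh ", "a\r\n\tb", "x ---- y, z"]
data = [clean_row(row) for row in data]
print(data)
# output: ['Thomas H. Lenagh', 'a b', 'x y z']
```

The comprehension rebinds data to the new list, so the cleaned strings are actually kept.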
