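I want to strip some text out of a set of HTML documents, but I can't get rid of some of the line breaks. Currently I parse the page with Beautiful Soup, then read through all the lines and try to remove every newline from the text, but I can't remove the ones in the middle of a string. For example: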
<font face="ARIAL" size="2">Thomas
H. Lenagh </font>
I want to get this person's name on a single line, but there is some kind of line break in the middle. Here is what I have tried so far:
line=line.replace("\n"," ")
line=line.replace("\\n"," ")
line=line.replace("\r\n", " ")
line=line.replace("\t", " ")
line=line.replace("\\r\\n"," ")
I have also tried the following regular expressions:
line=re.sub("\n"," ",line)
line=re.sub("\\n", " ",line)
line=re.sub("\s\s+", " ",line)
So far none of these have worked, and I'm not sure which character I'm missing. Any ideas?
Edit: here is the full code I'm using (minus error checking):
soup=BeautifulSoup(threePage) #make the soup
paragraph=soup.stripped_strings
if paragraph is not None:
    for i in range (len(data)): #for all rows...
        lineCounter=lineCounter+1
        row =data[i]
        row=row.replace("\n"," ") #remove newline (<enter>) characters
        row = re.sub("---+"," ",row) #remove dashed lines
        row =re.sub(","," ",row) #replace commas with spaces
        row=re.sub("\s\s+", " ",row) #remove
        if ("/s/" in row): #if /s/ is in the row, remove it
            row=re.sub(".*/s/"," ",row)
        if ("/S/" in row): #upper case of the last removal
            row=re.sub(".*/S/"," ",row)
        row = row.replace("\n"," ")
        row=row.strip()#remove any weird characters
Answer 0 (score: 0)
You haven't shared what the rest of your code looks like after the for loop, but I'm guessing a very simplified version looks like this:
data = ["a\nb", "c\nd", "e\nf"]
for i in range(len(data)):
    row = data[i]
    row = row.replace("\n", "")
#let's see if that fixed it...
print(data)
#output: ['a\nb', 'c\nd', 'e\nf']
#hey, the newlines are still there! What gives?
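This happens because calling replace on a string does not change it in place, and assigning a new value to row does not change the value stored in data. If you want data to change as well, you have to assign the new value back into it.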
data = ["a\nb", "c\nd", "e\nf"]
for i in range(len(data)):
    row = data[i]
    row = row.replace("\n", "")
    data[i] = row
#let's see if that fixed it...
print(data)
#output: ['ab', 'cd', 'ef']
#looking good!
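To see the two effects described above separately, here is a minimal sketch. The names original and cleaned are made up for illustration and do not come from the question.

```python
# Strings are immutable: replace() returns a new string
# and leaves the one it was called on untouched.
original = "a\nb"
cleaned = original.replace("\n", "")
print(repr(original))  # still contains the newline
print(repr(cleaned))   # newline removed

# Rebinding a loop variable does not write back into the list.
data = ["a\nb", "c\nd"]
for row in data:
    row = row.replace("\n", "")  # only the local name `row` changes
print(data)  # the list still holds the original strings
```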
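Bonus style tip: if your replacement logic is simple enough to fit in a single expression, you can do everything in one line and avoid messing with range and indexes: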
data = [row.replace("\n", "") for row in data]
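If the cleanup takes several steps, as it does in the question, one option is to put the steps in a small function and call it from the comprehension. This is only a sketch: clean_row is a hypothetical helper name, the sample strings are made up, and the steps are a guess at what the question's loop is meant to do. It also uses `\s+` instead of a literal "\n", on the assumption that the stray line break may be "\r\n" or some other whitespace.

```python
import re

def clean_row(row):
    """Hypothetical helper: apply several cleanup steps to one string."""
    row = re.sub(r"-{3,}", " ", row)  # replace dashed lines with a space
    row = row.replace(",", " ")       # replace commas with spaces
    row = re.sub(r"\s+", " ", row)    # collapse any whitespace run, including \r and \n
    return row.strip()                # trim leading and trailing whitespace

# Made-up sample data, not taken from the question's documents.
data = ["Thomas\r\nH. Lenagh ", "a,  b\n---\nc"]
data = [clean_row(row) for row in data]
print(data)
# output: ['Thomas H. Lenagh', 'a b c']
```

Because the comprehension builds a new list and rebinds data to it, the cleaned values are actually stored, which is the step the original loop was missing.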