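I have the following code snippet. It takes a URL, opens it, parses out JUST the text, and then searches for widgets. It detects a widget by looking for the word widget1 and then for endwidget, which marks the end of the widget.
Basically, as soon as the code finds the word widget1 it writes every line of text to a file, and it stops when it reads endwidget. However, my code indents every line after the first widget1 line.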
This is my output:
widget1 this is a really cool widget
    it does x, y and z
    and also a, b and c
    endwidget
What I want is:
widget1 this is a really cool widget
it does x, y and z
and also a, b and c
endwidget
Why am I getting this indentation? Here is my code:
for url in urls:
    page = mech.open(url)
    html = page.read()
    soup = BeautifulSoup(html)
    text = soup.prettify()
    texts = soup.findAll(text=True)

    def visible(element):
        if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
            # If the parent of your element is any of those ignore it
            return False
        elif re.match('<!--.*-->', str(element)):
            # If the element matches an html tag, ignore it
            return False
        else:
            # Otherwise, return True as these are the elements we need
            return True

    visible_texts = filter(visible, texts)

    inwidget = 0
    # open a file for write
    for line in visible_texts:
        # if line doesn't contain .widget1 then ignore it
        if ".widget1" in line and inwidget == 0:
            match = re.search(r'\.widget1 (\w+)', line)
            line = line.split(".widget1")[1]
            # make the next word after .widget1 the name of the file
            filename = "%s" % match.group(1) + ".txt"
            textfile = open(filename, 'w+b')
            textfile.write("source:" + url + "\n\n")
            textfile.write(".widget1" + line)
            inwidget = 1
        elif inwidget == 1 and ".endwidget" not in line:
            print line
            textfile.write(line)
        elif ".endwidget" in line and inwidget == 1:
            textfile.write(line)
            inwidget = 0
        else:
            pass
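Answer 0 (score: 1)
Every line except the first is indented because you rebuild the first line yourself with textfile.write(".widget1" + line), while the remaining lines are written exactly as they come from the HTML file, leading whitespace included. You can remove the unwanted whitespace with str.strip(): change textfile.write(line) to textfile.write(line.strip()). Note that strip() also removes any trailing newline, so write line.strip() + "\n" if each line should stay on its own line, or use line.lstrip() to remove only the leading whitespace.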
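A minimal sketch of that fix, assuming the text lines are already available as a list of strings. The function name extract_widget and the sample input are hypothetical and not part of the question's code. Every line inside the widget is stripped, and the newline is added back when the lines are joined.

```python
def extract_widget(lines, start=".widget1", end=".endwidget"):
    """Return the lines from the start marker through the end marker,
    with leading and trailing whitespace removed from each line."""
    collected = []
    in_widget = False
    for line in lines:
        if not in_widget and start in line:
            in_widget = True
        if in_widget:
            # strip() drops the indentation and the trailing newline
            collected.append(line.strip())
            if end in line:
                break
    return collected


# Hypothetical input shaped like the text described in the question
sample = [
    "ignored text\n",
    ".widget1 this is a really cool widget\n",
    "    it does x, y and z\n",
    "    and also a, b and c\n",
    "    .endwidget\n",
    "trailing text\n",
]

# Re-add one newline per line when writing the result out
result = "\n".join(extract_widget(sample)) + "\n"
print(result)
```

In the question's loop the same effect comes from stripping each line before the write call and appending "\n" explicitly.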
Answer 1 (score: 0)
To get from your output to the output you want, do the following:
# a is your output
a = '\n'.join(map(lambda x: x.strip(), a.split('\n')))
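To see the effect, here is the same idea applied to a sample string shaped like the output in the question. The variable name cleaned and the exact indentation width are assumptions made for illustration. A generator expression is used in place of map with a lambda, which gives the same result.

```python
# Hypothetical output text with indented lines after the first one
a = (
    "widget1 this is a really cool widget\n"
    "    it does x, y and z\n"
    "    and also a, b and c\n"
    "    endwidget"
)

# Split into lines, strip each one, and join them back together
cleaned = "\n".join(line.strip() for line in a.split("\n"))
print(cleaned)
```

This post-processes text that has already been produced, so it suits cases where the writing code cannot easily be changed.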