Question

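I have the following code snippet. It takes a URL, opens it, parses out JUST the text, and then searches for widgets. It detects a widget by looking for the word widget1 and then for endwidget, which marks the end of the widget.

Basically, once the code finds the word widget1 it writes every line of text to a file, and it stops when it reads endwidget. However, my code indents every line after the first widget1 line.

This is my output: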

widget1 this is a really cool widget
       it does x, y and z 
       and also a, b and c
       endwidget

What I want is:

widget1 this is a really cool widget
it does x, y and z 
and also a, b and c
endwidget

Why am I getting this indentation? Here is my code:

 for url in urls:
        page = mech.open(url)
        html = page.read()
        soup = BeautifulSoup(html)
        text= soup.prettify()
        texts = soup.findAll(text=True) 

        def visible(element):
            if element.parent.name in ['style', 'script', '[document]', 'head', 'title']: 
            # If the parent of your element is any of those ignore it

                return False

            elif re.match('<!--.*-->', str(element)):
            # If the element is an HTML comment, ignore it

                return False

            else:
            # Otherwise, return True as these are the elements we need

              return True

        visible_texts = filter(visible, texts)

        inwidget=0
        # open a file for write
        for line in visible_texts:
        # if line doesn't contain .widget1 then ignore it
            if ".widget1" in line and inwidget==0:
                match = re.search(r'\.widget1 (\w+)', line)
                line = line.split (".widget1")[1]   
                # make the next word after .widget1 the name of the file
                filename = "%s" % match.group(1) + ".txt"
                textfile = open (filename, 'w+b')
                textfile.write("source:" + url + "\n\n")
                textfile.write(".widget1" + line)
                inwidget = 1
            elif inwidget == 1 and ".endwidget" not in line:
                print line
                textfile.write(line)
            elif ".endwidget" in line and inwidget == 1:
                textfile.write(line)
                inwidget= 0
            else:
                pass

Answer 1

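The reason every line except the first is indented is that you rebuild the first line yourself with textfile.write(".widget1" + line), whereas the remaining lines are taken straight from the HTML, which contains the indentation. You can remove the unwanted whitespace with str.strip(): change textfile.write(line) to textfile.write(line.strip()).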
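One thing to watch for: str.strip() also removes the trailing newline, so the lines may run together unless a newline is written back. Below is a minimal sketch of that idea in Python 3 (the question's code is Python 2). The helper name write_widget and the sample lines are made up for illustration, and an in-memory io.StringIO buffer stands in for the real text file.

```python
import io


def write_widget(lines, out):
    """Write each line to `out` without its leading/trailing whitespace."""
    for line in lines:
        stripped = line.strip()  # drops the indentation AND the newline
        if stripped:             # skip lines that were only whitespace
            out.write(stripped + "\n")  # put exactly one newline back


# Sample input that mimics the indented text nodes from the question.
raw_lines = [
    "widget1 this is a really cool widget\n",
    "       it does x, y and z \n",
    "       and also a, b and c\n",
    "       endwidget\n",
]

buffer = io.StringIO()  # stand-in for open(filename, "w")
write_widget(raw_lines, buffer)
result = buffer.getvalue()
print(result)
```

If only the leading indentation should go and trailing whitespace must be kept, line.lstrip() would be the narrower choice.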

Answer 2

To get from your current output to the output you want, do the following:

#a is your output
a= '\n'.join(map(lambda x: x.strip(),a.split('\n')))
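The same one-liner can be wrapped in a small function, which makes it easier to try on sample text. This is only a sketch in Python 3: the name strip_lines is hypothetical, and the sample string is copied from the output shown in the question.

```python
def strip_lines(text):
    """Strip leading/trailing whitespace from every line of `text`."""
    return "\n".join(line.strip() for line in text.split("\n"))


# `a` plays the role of the already-written, indented output.
a = (
    "widget1 this is a really cool widget\n"
    "       it does x, y and z \n"
    "       and also a, b and c\n"
    "       endwidget"
)

cleaned = strip_lines(a)
print(cleaned)
```

This cleans the text after the fact. It does not change how the lines are written in the first place, so it fits best when the output already exists as a single string.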

Writing to a file and getting strange indentation
