How do I get rid of the list when writing a parsed XML file to a text file with BeautifulSoup?

Time: 2017-06-24 20:49:48

Tags: python xml python-3.x xml-parsing beautifulsoup

I have an XML file in the following format:

<date>31,March,2001</date>
<post>



       urlLink The Register reports on "war driving"  - the wireless equivalent of war dialing.  Instead of having your modem dial into thousands of networks until you get in, you just drive within range of a wireless net with your wireless-equipped laptop and hack away.    related:   urlLink The latest issue of CIO  has a great feature on wireless.



</post>

I want to extract the content of each post and write it on a new line of an output text file. This is my code for parsing it:

import re
from bs4 import BeautifulSoup as Soup

def parseLog(file):
    with open(file, 'rb') as handler:
        soup = Soup(handler, "html.parser")
        for message in soup.findAll('post'):
            #print(len(str(message).strip()))
            content = message.contents
            if(len(str(content).strip()) > 300):
                re.sub("[^a-zA-Z0-9]", "", str(content))
                with open(dest, 'a', encoding="utf-8") as f:
                    f.write(str(message.contents) + "\n")

However, the output file now contains each post's content as a list. There are also unwanted "\r" and "\n" characters everywhere (I used re.sub() to get rid of these, but it didn't work):

  

['\r\n\r\n\r\n\r\n\n Quotable Mindjack! From Mike Sugarbaker's urlLink review of Lemon: "If you don't have the patience for the lengthy ruminations of brilliant madmen, Lemon isn't for you. But you read Mindjack, so you're probably into that sort of thing, right?"\r\n\r\n\r\n\r\n'] ['\r\n\r\n\r\n\n\r\n I'm not sure I like the direction urlLink FEED is heading. "The Filter", a new blog linking to outside content, is now featured more prominently than FEED's original content. It's also not clear what the difference is between the Filter and urlLink Plastic.\r\n
  \r\n\r\n\r\n']

How do I get rid of these?

1 Answer:

Answer 0: (score: 0)

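message.contents is a list. You should use get_text() instead:

import re
from bs4 import BeautifulSoup as Soup

def parseLog(file):
    with open(file, 'rb') as handler:
        soup = Soup(handler, "html.parser")
        for message in soup.findAll('post'):
            content = message.get_text()  # a string, not a list
            if(len(content.strip()) > 300):
                # Note: re.sub() returns a new string; the result is not
                # assigned here, so this call leaves content unchanged.
                re.sub("[^a-zA-Z0-9]", "", str(content))
                with open(dest, 'a', encoding="utf-8") as f:
                    f.write(content + "\n")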
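
To also remove the stray "\r" and "\n" characters, one option is to collapse every run of whitespace into a single space and keep the string that re.sub() returns. The following is only a minimal sketch under stated assumptions: the SAMPLE string and the helper name extract_posts are made up for illustration and are not part of the original question, and the min_length parameter stands in for the 300-character threshold used above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample data shaped like the file in the question.
SAMPLE = (
    "<date>31,March,2001</date>\n"
    "<post>\r\n\r\n   urlLink The Register reports on \"war driving\".\r\n"
    "   related:   urlLink The latest issue of CIO.\r\n\r\n</post>\n"
)


def extract_posts(markup, min_length=0):
    """Return one whitespace-normalised string per <post> element."""
    soup = BeautifulSoup(markup, "html.parser")
    posts = []
    for message in soup.find_all("post"):
        text = message.get_text()  # a str, whereas .contents is a list
        # re.sub() does not modify its argument; keep the returned string.
        text = re.sub(r"\s+", " ", text).strip()
        if len(text) > min_length:
            posts.append(text)
    return posts


if __name__ == "__main__":
    # Print each post on its own line; writing to a file works the same way.
    for post in extract_posts(SAMPLE):
        print(post)
```

Calling extract_posts(markup, min_length=300) would mirror the length check in the code above. Because each returned string no longer contains line breaks, writing post + "\n" for every item puts exactly one post on each line of the output file.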