Question

I have an XML file in the following format:

<date>31,March,2001</date>
<post>



       urlLink The Register reports on "war driving"  - the wireless equivalent of war dialing.  Instead of having your modem dial into thousands of networks until you get in, you just drive within range of a wireless net with your wireless-equipped laptop and hack away.    related:   urlLink The latest issue of CIO  has a great feature on wireless.



</post>

I want to extract the content of each post and write it on a new line of an output text file. Here is my code for parsing it:

from bs4 import BeautifulSoup as Soup
def parseLog(file):
        with open(file, 'rb') as handler:
            soup = Soup(handler, "html.parser")
            for message in soup.findAll('post'):
                #print(len(str(message).strip()))
                content = message.contents
                if(len(str(content).strip()) > 300):
                    re.sub("[^a-zA-Z0-9]", "", str(content))
                    with open(dest, 'a', encoding="utf-8") as f:
                        f.write(str(message.contents) + "\n")

However, the output file now contains each post's content as a list. There are also unwanted "\r" and "\n" characters everywhere (I used re.sub() to get rid of these, but it didn't work):

['\r\n\r\n\r\n\r\n\n Quotable Mindjack! From Mike Sugarbaker's urlLink review of Lemon: "If you don't have the patience for the lengthy musings of a brilliant madman, Lemon is not for you. But you read Mindjack, so you're probably into that kind of thing, right?" \r\n\r\n\r\n\r\n'] ['\r\n\r\n\r\n\n\r\n I'm not sure I like the direction urlLink FEED is heading. "The Filter", a new blog linking to outside content, is now featured more prominently than FEED's original content. It's also unclear what the difference is between The Filter and urlLink Plastic. \r\n \r\n\r\n\r\n']

How do I get rid of these?

Answer 1

message.contents is a list. You should use get_text() instead:

def parseLog(file):
    with open(file, 'rb') as handler:
        soup = Soup(handler, "html.parser")
        for message in soup.findAll('post'):
            content = message.get_text()  # a string!
            if(len(content.strip()) > 300):
                re.sub("[^a-zA-Z0-9]", "", str(content))
                with open(dest, 'a', encoding="utf-8") as f:
                    f.write(content + "\n")
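
The "\r" and "\n" characters from the question are a separate issue from the list: re.sub() returns a new string rather than modifying its argument, so a call whose result is not assigned has no effect. Below is a minimal sketch of one way to handle both points. The helper name clean_post_text and the sample data are assumptions made for illustration, and the list stands in for what message.contents holds for one post, so the snippet runs without BeautifulSoup.

```python
import re


def clean_post_text(text):
    """Collapse every run of whitespace, including CR and LF, to one space."""
    # re.sub returns a new string; the input is unchanged, so return the result.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical stand-in for message.contents: a list holding one string.
contents = ["\r\n\r\n   first   post \r\n"]

# str() on a list gives its repr: brackets, quotes and escaped \r\n sequences.
as_list_repr = str(contents)

# Hypothetical stand-in for message.get_text(): the plain string itself.
as_text = "".join(contents)

print(as_list_repr.startswith("["))
print(clean_post_text(as_text))
```

In the parsing loop, this would mean writing clean_post_text(message.get_text()) + "\n" to the file. Note that the pattern "[^a-zA-Z0-9]" from the question would also strip spaces and punctuation if its result were assigned, which is probably not what is wanted here.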

How to get rid of lists when writing a parsed XML file to a text file using BeautifulSoup?
