I have XML files in the following format:
<date>31,March,2001</date>
<post>
urlLink The Register reports on "war driving" - the wireless equivalent of war dialing. Instead of having your modem dial into thousands of networks until you get in, you just drive within range of a wireless net with your wireless-equipped laptop and hack away. related: urlLink The latest issue of CIO has a great feature on wireless.
</post>
I want to extract the content of each post and write it on a new line of an output text file. Here is my code for parsing it:
import re
from bs4 import BeautifulSoup as Soup

def parseLog(file):
    with open(file, 'rb') as handler:
        soup = Soup(handler, "html.parser")
        for message in soup.findAll('post'):
            #print(len(str(message).strip()))
            content = message.contents
            if(len(str(content).strip()) > 300):
                re.sub("[^a-zA-Z0-9]", "", str(content))
                with open(dest, 'a', encoding="utf-8") as f:
                    f.write(str(message.contents) + "\n")
However, the output file now contains each post's content as a list. There are also unwanted "\r" and "\n" characters everywhere (I used re.sub() to get rid of them, but it didn't work):
['\r\n\r\n\r\n\r\n\n Quotable Mindjack! From Mike Sugarbaker's urlLink review of Lemon: "If you don't have the patience for the lengthy ruminations of a brilliant madman, Lemon is not for you. But you read Mindjack, so you're probably into that kind of thing, right?" \r\n\r\n\r\n\r\n']
['\r\n\r\n\r\n\n\r\nI'm not sure I like the direction urlLink FEED is headed. "The Filter", a new blog linking to outside content, is now featured more prominently than FEED's original content. It's not yet clear what the difference is between The Filter and urlLink Plastic.\r\n \r\n\r\n\r\n']
How can I get rid of these?
Answer 0 (score: 0)
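message.contents is a list, which is why str() writes the brackets, the quotes, and the escaped "\r\n" sequences into the file. Use get_text() instead, which returns a string. Note also that re.sub() returns a new string rather than modifying its argument, so the result has to be assigned; the pattern below collapses whitespace instead of stripping every non-alphanumeric character, which would also remove the spaces between words:

import re
from bs4 import BeautifulSoup as Soup

def parseLog(file):
    with open(file, 'rb') as handler:
        soup = Soup(handler, "html.parser")
        for message in soup.findAll('post'):
            content = message.get_text()  # a string, not a list
            if len(content.strip()) > 300:
                # collapse whitespace runs (including \r and \n) into single spaces
                content = re.sub(r"\s+", " ", content).strip()
                # dest is the output path, assumed to be defined elsewhere as in the question
                with open(dest, 'a', encoding="utf-8") as f:
                    f.write(content + "\n")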
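The difference between writing a list and writing its text can be seen without parsing anything. The following is only a minimal sketch: `contents` is a hypothetical stand-in for what `message.contents` might hold for a post with a single text node, and the sample text is shortened from the output in the question.

```python
import re

# Hypothetical stand-in for message.contents: a list with one raw text node.
contents = ["\r\n\r\n\n  Quotable Mindjack!\r\n  From Mike\tSugarbaker \r\n\r\n"]

# str() on a list returns its repr, so the brackets, quotes and
# escaped "\r\n" sequences end up in the output verbatim.
as_repr = str(contents)

# Joining the nodes yields the actual text, which is roughly what
# get_text() gives when the children are plain strings.
text = "".join(contents)

# re.sub() returns a new string; without the assignment nothing changes.
one_line = re.sub(r"\s+", " ", text).strip()

print(as_repr)
print(one_line)
```

Collapsing on `\s+` rather than deleting `\r` and `\n` outright keeps a single space where a line break separated two words, so each post still ends up on exactly one line.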