Using BeautifulSoup

Time: 2018-12-11 01:37:46

Tags: python-2.7 beautifulsoup

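This is my first time using Python and BeautifulSoup. I am migrating all the articles of a blog from one website to another, and to do that I am extracting certain information from an XML file. The last part of my code extracts only the text between positions 0 and 164 for the meta tags, so that it shows up in the Google SERPs the way Google wants it.

The problem is that some articles in the blog have an img tag in the first line of their description tag. I want to remove those img tags, including the src attribute, so that the code captures only the text that comes after them.

I have tried several approaches, without success.

Here is my code: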

from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv
import sys
import re

reload(sys)
sys.setdefaultencoding('utf8')

base_url = ("http://pimacleanpro.com/blog?rss=true")
soup = BeautifulSoup(urlopen(base_url).read(),"xml")

titles = soup("title")
slugs = soup("link")
bodies = soup("description")

with open("blog-data.csv", "w") as f:
    fieldnames = ("title", "content", "slug", "seo_title", "seo_description","site_id", "page_path", "category")
    output = csv.writer(f, delimiter=",")
    output.writerow(fieldnames)

    for i in xrange(len(titles)):
        output.writerow([titles[i].encode_contents(),bodies[i].encode_contents(formatter=None),slugs[i].get_text(),titles[i].encode_contents(),bodies[i].encode_contents(formatter=None)[4:164]])

print "Done writing file"

Any help would be much appreciated.

1 Answer:

Answer 0: (score: 0)

Here is a Python 2.7 example that I think does what you need:

from bs4 import BeautifulSoup
from urllib2 import urlopen
from xml.sax.saxutils import unescape

base_url = ("http://pimacleanpro.com/blog?rss=true")

# Unescape to allow BS to parse the <img> tags
soup = BeautifulSoup(unescape(urlopen(base_url).read()))

titles = soup("title")
slugs = soup("link")
bodies = soup("description")

print bodies[2].encode_contents(formatter=None)[4:164]

# Remove all 'img' tags in all the 'description' tags in bodies
for body in bodies:
    for img in body("img"):
        img.decompose()

print bodies[2].encode_contents(formatter=None)[4:164]

# Proceed to writing to CSV, etc.

The first print statement outputs the following:

<img src='"http://ekblog.s3.amazonaws.com/contentp/wp-content/uploads/2018/09/03082910/decoration-design-detail-691710-300x221.jpg"'><br>
<em>Whether you are up

The second one, after the <img> tags have been removed, outputs:

<em>Whether you are upgrading just one room or giving your home a complete renovation, it’s likely that your first thought is to choose carpet for all of
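
A side note on the unescape call: it is presumably needed because the feed stores the article HTML inside <description> as escaped entities (&lt;img ...&gt;), so without it the parser sees plain text rather than tags. Here is a minimal sketch using only the standard library. The fragment is made up for illustration and is not taken from the feed:

```python
from xml.sax.saxutils import unescape

# Hypothetical fragment, shaped like an escaped RSS <description> body.
escaped = '&lt;p&gt;&lt;img src="a.jpg"&gt;&lt;em&gt;Whether you are&lt;/em&gt;&lt;/p&gt;'

# unescape() turns &lt;, &gt; and &amp; back into <, > and &.
unescaped = unescape(escaped)

print(escaped)    # no real tags here, only entities
print(unescaped)  # now contains an actual <img> tag that a parser can find
```

Only after this step can soup("img") find anything to decompose.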

Of course, you can also remove all the image tags from the soup object before creating titles, slugs, and bodies (if you are not interested in them):

for tag in soup("img"):
    tag.decompose()
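
Since the goal is a plain-text SEO description, one possible follow-up is to drop the remaining markup (such as <br> and <em>) with get_text() before slicing. This is only a sketch under some assumptions: the HTML fragment is invented, the 160-character limit simply mirrors the length of the [4:164] slice in the question, and "html.parser" is chosen only to avoid an extra dependency:

```python
from bs4 import BeautifulSoup

# Hypothetical article body; the real one would come from a <description> tag.
html = ('<p><img src="a.jpg"><br>'
        '<em>Whether you are upgrading just one room</em> or more.</p>')

soup = BeautifulSoup(html, "html.parser")

# Remove every <img> tag (and its attributes) from the tree.
for tag in soup("img"):
    tag.decompose()

# Collapse what is left into plain text, then cut it to the assumed length.
text = soup.get_text(" ", strip=True)
seo_description = text[:160]

print(seo_description)
```

Slicing the plain text instead of the raw markup avoids cutting a tag in half, which the [4:164] slice on encode_contents() can do.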