Question

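I'm trying to write a scraper in Python using urllib and Beautiful Soup. I have a CSV of URLs for news stories, and the scraper works for about 80% of the pages, but when there is a picture at the top of a story, the script no longer pulls the time or the body text. I'm confused because soup.find and soup.find_all don't seem to produce different results. I've tried a variety of different tags that should capture the text, as well as both 'lxml' and 'html.parser'.

Here is the code: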

import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

testcount = 0
titles1 = []
bodies1 = []
times1 = []

data = pd.read_csv('URLsALLjun27.csv', header=None)
for url in data[0]:
    try:
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, "lxml")

        titlemess = soup.find(id="title").get_text()  # get the title
        titlestring = str(titlemess)  # make it a string
        title = titlestring.replace("\n", "").replace("\r", "")
        titles1.append(title)

        bodymess = soup.find(class_="article").get_text()  # get the body text
        bodystring = str(bodymess)  # make the body a string
        body = bodystring.replace("\n", "").replace("\u3000", "")  # scrub newlines and full-width spaces
        bodies1.append(body)  # add to list for export

        timemess = soup.find('span', {"class": "time"}).get_text()  # get the timestamp
        timestring = str(timemess)
        # turn e.g. 2016年06月27日 into 2016-06-27
        time = timestring.replace("\n", "").replace("\r", "").replace("年", "-").replace("月", "-").replace("日", "")
        times1.append(time)

        testcount = testcount + 1  # counter
        print(testcount)
    except Exception as e:
        print(testcount, e)

Here are some of the results I get (the ones marked 'NoneType' are those where the title was pulled successfully but the body/time came back empty):

1 http://news.xinhuanet.com/politics/2016-06/27/c_1119122255.htm

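2 http://news.xinhuanet.com/politics/2016-05/22/c_129004569.htm 'NoneType' object has no attribute 'get_text'

Any help would be greatly appreciated! Thanks.

Edit: I don't have 10 reputation points, so I can't post more test links, but if you need more example pages, leave a comment and I'll add them.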

Answer 1

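The problem is that pages with a picture contain no element with class="article", and the same goes for "class":"time". So it seems you'll have to detect whether a page has a picture and, if it does, search for the date and text as follows:

For the date, try: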

timemess = soup.find(id="pubtime").get_text()

For the body, it seems that the article is just the caption for the picture. So you could try the following:

bodymess = soup.find('img').findNext().get_text()

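In short, soup.find('img') finds the image, and findNext() (the older spelling of find_next()) moves to the next block, which happens to contain the text.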
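
To see that behavior in isolation, here is a minimal sketch. The HTML string is made up for illustration (it is not copied from the actual pages), and it uses the built-in "html.parser" so that lxml is not required:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on a photo story: an image
# followed directly by its caption paragraph.
html = '<div id="content"><img src="photo.jpg"><p>Caption text.</p></div>'
soup = BeautifulSoup(html, "html.parser")

img = soup.find("img")

# With no arguments, find_next() returns the next element in document
# order after the image; in this markup that is the <p>.
caption = img.find_next().get_text()

# Naming the tag is sturdier if other markup sits between image and text.
caption_p = img.find_next("p").get_text()

print(caption)
print(caption_p)
```

If the real pages put something else between the image and the caption, passing a tag name to find_next() is probably the safer choice.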

So in your code, I would do something like the following:

try:
    bodymess = soup.find(class_="article").get_text()

except AttributeError:
    bodymess = soup.find('img').findNext().get_text()

try:
    timemess = soup.find('span',{"class":"time"}).get_text()

except AttributeError:
    timemess = soup.find(id="pubtime").get_text()
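
If more page layouts turn up, the same fallback idea can be written without nested try/except blocks. The following is only a sketch: first_text and TIME_LOOKUPS are hypothetical names, and the three tiny HTML strings are invented stand-ins for the real pages.

```python
from bs4 import BeautifulSoup


def first_text(soup, lookups):
    """Return the text of the first lookup that finds a tag, else None."""
    for lookup in lookups:
        tag = lookup(soup)
        if tag is not None:
            return tag.get_text(strip=True)
    return None


# Lookups are tried in order: the regular layout first, then the photo layout.
TIME_LOOKUPS = [
    lambda s: s.find("span", class_="time"),
    lambda s: s.find(id="pubtime"),
]

# Invented sample pages, one per layout, plus one with no timestamp at all.
plain_page = BeautifulSoup('<span class="time">2016年06月27日</span>', "html.parser")
photo_page = BeautifulSoup('<span id="pubtime">2016年05月22日</span>', "html.parser")
empty_page = BeautifulSoup("<p>no timestamp here</p>", "html.parser")

print(first_text(plain_page, TIME_LOOKUPS))
print(first_text(photo_page, TIME_LOOKUPS))
print(first_text(empty_page, TIME_LOOKUPS))
```

Returning None when nothing matches lets the calling loop decide whether to skip the page or record an empty value, instead of failing with the 'NoneType' error.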

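As a general workflow for web scraping, I usually visit the website itself in a browser and first locate the elements in the page's underlying HTML there (for example, with the browser's inspector).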
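
The same kind of inspection can be roughly approximated from Python by listing the id and class values that a page actually contains. This is a sketch only: list_hooks is a hypothetical helper, and the sample markup is invented.

```python
from bs4 import BeautifulSoup


def list_hooks(html):
    """Return (ids, classes) found in the markup, each as a sorted list."""
    soup = BeautifulSoup(html, "html.parser")
    # id=True / class_=True match any tag that carries that attribute.
    ids = sorted({tag["id"] for tag in soup.find_all(id=True)})
    classes = sorted(
        {name for tag in soup.find_all(class_=True) for name in tag["class"]}
    )
    return ids, classes


# Invented markup standing in for a downloaded page.
sample = (
    '<div id="title">Headline</div>'
    '<span id="pubtime">2016-05-22</span>'
    '<div class="pic caption"><img src="a.jpg"><p>Caption</p></div>'
)

ids, classes = list_hooks(sample)
print(ids)
print(classes)
```

Running this on one page that works and one that fails should show which ids or classes are missing from the failing layout.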
