Question

I am trying to scrape some JSON data from a website. I am using BeautifulSoup (bs4), as shown in the code below.

import re
import csv
import json
import urllib2
from bs4 import BeautifulSoup as BS

city = 'Helsinki';

csvFile = csv.writer(open( city + ".csv", "wb+"))
csvFile.writerow(["tempid","latitude", "longitude"])

pageID = 0

locPage = urllib2.urlopen("http://runkeeper.com/user/maxspowers79/route/2481336")
soup = BS(locPage, "lxml").findAll('script',{"src":False})
print soup
pageID += 1
print pageID
for s in soup:
    if 'routePoints' in s.string:
        value = "[{" + s.string.split("}];")[0].split("[{")[1] + "}]"
        #print value
        jsonObj = json.loads(value)
        for x in jsonObj:
            csvFile.writerow([pageID,x["latitude"],x["longitude"]])

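As an example, this is the runkeeper page for a random city and a random route that I tested with. The code works for other similar pages, but not for longer routes (those with a larger GPS JSON blob, which you can see by viewing the page source in a browser).

As the output of the print statement shows, the soup variable is truncated. The JSON is therefore invalid and I cannot parse it.

I tried a different parser (html5lib), but that was even worse. Is there a limit on how large a string the soup variable can hold?

If not, why is it being truncated?

How can I deal with this?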

Answer 1

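I tested your code and it does seem that, yes, BeautifulSoup has some limit on the size of tag contents.

Consider using dumb, direct string manipulation instead: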

import re
import csv
import json
import urllib2

city = 'Helsinki';

csvFile = csv.writer(open( city + ".csv", "wb+"))
csvFile.writerow(["tempid","latitude", "longitude"])

pageID = 0

locPage = urllib2.urlopen("http://runkeeper.com/user/maxspowers79/route/2481336")
content = locPage.read()

start_at_s, end_at_s = 'var routePoints = ', 'mapController.initialize'

start_at_p = content.index(start_at_s) + len(start_at_s)
end_at_p = content.index(end_at_s)
raw_json = content[start_at_p:end_at_p].strip().strip(';')

jsonObj = json.loads(raw_json)

pageID += 1
print pageID


for x in jsonObj:
    print x
    csvFile.writerow([pageID,x["latitude"],x["longitude"]])
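
For anyone on Python 3, the same slicing idea can be sketched roughly as follows. This is only an illustration: the two markers are the ones used above and assume the page still embeds the route as `var routePoints = [...]` followed by a `mapController.initialize` call. The helper names and the sample HTML are made up for the example.

```python
import csv
import io
import json

# Markers taken from the answer above; they assume the page embeds the route as
# "var routePoints = [...];" followed by a call to mapController.initialize.
START_MARKER = "var routePoints = "
END_MARKER = "mapController.initialize"


def extract_route_points(html):
    """Return the list of route points embedded in the page source."""
    start = html.index(START_MARKER) + len(START_MARKER)
    end = html.index(END_MARKER, start)
    # Drop surrounding whitespace and the trailing JavaScript semicolon.
    raw_json = html[start:end].strip().rstrip(";")
    return json.loads(raw_json)


def write_points(points, page_id, out):
    """Write a header plus one CSV row per point to the text stream `out`."""
    writer = csv.writer(out)
    writer.writerow(["tempid", "latitude", "longitude"])
    for point in points:
        writer.writerow([page_id, point["latitude"], point["longitude"]])


# Hypothetical stand-in for the downloaded page source.
sample_html = (
    "<script>var routePoints = "
    '[{"latitude": 60.17, "longitude": 24.94}, '
    '{"latitude": 60.18, "longitude": 24.95}];\n'
    "mapController.initialize();</script>"
)
points = extract_route_points(sample_html)
buffer = io.StringIO()
write_points(points, 1, buffer)
```

With a real page you would pass in the decoded response body instead of `sample_html`, and an open file (opened with `newline=""`) instead of the `StringIO` buffer. If either marker is missing, `str.index` raises `ValueError`, which is probably what you want when the page layout changes.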

Answer 2

Try rewriting your code with lxml. It should be more robust than BeautifulSoup.
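
A rough sketch of what that could look like in Python 3, using `lxml.html` directly. I have not verified that this avoids the truncation on the page from the question. The `var routePoints = ` marker is borrowed from Answer 1 and is an assumption about the page, and the function name and sample HTML are made up for illustration.

```python
import json

from lxml import html as lxml_html  # third-party package: pip install lxml

# Marker borrowed from Answer 1; an assumption about how the page embeds the data.
MARKER = "var routePoints = "


def route_points_from_page(page_source):
    """Find the inline script that defines routePoints and decode its JSON array."""
    tree = lxml_html.fromstring(page_source)
    # Inline scripts only: skip <script src="..."> tags.
    for script_text in tree.xpath("//script[not(@src)]/text()"):
        position = script_text.find(MARKER)
        if position == -1:
            continue
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever follows it (the trailing ";" and more JavaScript).
        decoder = json.JSONDecoder()
        points, _end = decoder.raw_decode(script_text, position + len(MARKER))
        return points
    return []


# Hypothetical stand-in for the downloaded page source.
sample_page = (
    "<html><head>"
    '<script src="app.js"></script>'
    "<script>var routePoints = "
    '[{"latitude": 60.17, "longitude": 24.94}];\n'
    "mapController.initialize();</script>"
    "</head><body></body></html>"
)
found = route_points_from_page(sample_page)
```

Using `raw_decode` means no end marker is needed, since decoding stops at the end of the JSON array.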

BeautifulSoup parser truncates larger JSON
