Parsing lxml output fetched from a URL into JSON

Asked: 2017-01-03 17:45:45

Tags: python json parsing url lxml

I want to parse the table of all Beatles songs into JSON, with the songs grouped by whether they were written by McCartney or Lennon...

When I run the following code, the data I get back is a set of rows parsed with lxml:

from bs4 import BeautifulSoup
import urllib
import requests
import pandas as pd
import json
import collections
from collections import OrderedDict

url = 'https://en.wikipedia.org/wiki/List_of_songs_recorded_by_the_Beatles'
r = requests.get(url)
data = r.text
table_data = [[[cell.text for cell in row("td")],[cell.text for cell in row("th")]] for row in BeautifulSoup(data,"lxml").find_all('table')[4]("tr")]
for row in table_data:
    for i in row:
        if len(i) > 0:
            print(i)

Now, when I try to use urllib instead, it does not work.

For example, this code fails with the following error:

from bs4 import BeautifulSoup
import urllib
import requests
import pandas as pd
import json
import collections
from collections import OrderedDict

url = 'https://en.wikipedia.org/wiki/List_of_songs_recorded_by_the_Beatles'
response = urllib.request.urlopen(url)
r = json.loads(response)
data = r.text
print (data)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-217-b9bf4e8bed5c> in <module>()
      9 url = 'https://en.wikipedia.org/wiki/List_of_songs_recorded_by_the_Beatles'
     10 response = urllib.request.urlopen(url)
---> 11 r = json.loads(response)
     12 data = r.text
     13 print (data)

C:\Users\Mark\Anaconda3\lib\json\__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    310     if not isinstance(s, str):
    311         raise TypeError('the JSON object must be str, not {!r}'.format(
--> 312                             s.__class__.__name__))
    313     if s.startswith(u'\ufeff'):
    314         raise JSONDecodeError("Unexpected UTF-8 BOM (decode using utf-8-sig)",

TypeError: the JSON object must be str, not 'HTTPResponse'

What could the solution be? I did not find anything useful in the API documentation, nor any help on Google or Stack Overflow.

2 Answers:

Answer 0: (score: 0)

Read the response body and decode it to a string before passing it to json.loads:

  js_response = response.read().decode('utf-8')
  obj = json.loads(js_response)
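
A minimal sketch of the decode-then-parse step, assuming the server actually returns JSON. Here `io.BytesIO` stands in for the object returned by `urllib.request.urlopen` (both expose `read()` returning bytes), and `load_json_from_response` is a hypothetical helper name, not part of any library. The second call illustrates that an HTML body, such as the Wikipedia page in the question, is not valid JSON and raises `json.JSONDecodeError`, so the table still has to be extracted with an HTML parser first.

```python
import io
import json


def load_json_from_response(response, encoding="utf-8"):
    """Read a file-like HTTP response, decode its bytes, and parse the text as JSON."""
    raw_bytes = response.read()        # bytes, not str
    text = raw_bytes.decode(encoding)  # bytes -> str
    return json.loads(text)


# Stand-ins for urllib.request.urlopen(url); the payloads are made up for illustration.
json_response = io.BytesIO(b'{"title": "Across the Universe", "year": 1968}')
html_response = io.BytesIO(b"<html><body><table></table></body></html>")

song = load_json_from_response(json_response)
print(song["title"])

try:
    load_json_from_response(html_response)
except json.JSONDecodeError as exc:
    # An HTML document is not JSON, so decoding the bytes alone is not enough.
    print("Response is not JSON:", exc.msg)
```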

Answer 1: (score: 0)

CSV is the right format for a simple table like this.

import csv

import bs4
import requests

r = requests.get('https://en.wikipedia.org/wiki/List_of_songs_recorded_by_the_Beatles')
soup = bs4.BeautifulSoup(r.text, 'lxml')

# Select the song table by its CSS classes
table = soup.find('table', class_="wikitable collapsible sortable")
with open('table.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for tr in table('tr'):
        # Collect header and data cells, skipping cells that contain the '♠' marker
        row = [t.text.replace('\n', '').strip('"') for t in tr(name=['td','th']) if '♠' not in t.text]
        writer.writerow(row)

Output:

Title,Year,Album debut,Songwriter(s),Lead vocal(s),Chart position UK,Chart position US,Notes
12-Bar Original,1965,Anthology 2,"Lennon, McCartney, Harrison and Starkey",,—,—,
Across the Universe,1968,Let It Be,Lennon,Lennon,—,—,
Act Naturally,1965,UK: Help!US: Yesterday and Today,"Russell, Morrison",Starkey,—,"Cover, B-side"
Ain't She Sweet,1961,Anthology 1,"Yellen, Ager",Lennon,—,Cover. A 1969 recording appears on Anthology 3
All I've Got to Do,1963,UK: With the BeatlesUS: Meet The Beatles!,Lennon,Lennon,—,—,
All My Loving,1963,UK: With the BeatlesUS: Meet The Beatles!,McCartney,McCartney,—,
All Things Must Pass,1969,Anthology 3,Harrison,Harrison,—,—,
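
If JSON grouped by songwriter is still the goal, the rows can be regrouped once they are in tabular form. The sketch below is only one possible approach: it uses a few rows from the output above, trimmed to three columns for brevity, and `group_by_songwriter` is a hypothetical helper. It assumes that a plain substring match on the `Songwriter(s)` column is good enough, so a jointly credited song is listed under each writer.

```python
import csv
import io
import json
from collections import OrderedDict

# Sample rows taken from the output above, reduced to three columns.
CSV_TEXT = """Title,Year,Songwriter(s)
12-Bar Original,1965,"Lennon, McCartney, Harrison and Starkey"
Across the Universe,1968,Lennon
All My Loving,1963,McCartney
All Things Must Pass,1969,Harrison
"""


def group_by_songwriter(csv_text, writers=("Lennon", "McCartney")):
    """Map each writer to the titles whose songwriter credit mentions them."""
    grouped = OrderedDict((writer, []) for writer in writers)
    for row in csv.DictReader(io.StringIO(csv_text)):
        for writer in writers:
            # Assumption: substring matching is sufficient for these credits.
            if writer in row["Songwriter(s)"]:
                grouped[writer].append(row["Title"])
    return grouped


grouped = group_by_songwriter(CSV_TEXT)
print(json.dumps(grouped, indent=2, ensure_ascii=False))
```

To run this against the real data, the same function could be given the contents of `table.csv` instead of the inline sample, although rows with missing cells would need checking first.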