I'm trying to update the album_title column based on the already-populated artist_name column.
I can either fill every row of the album_title column with the last album_title from the loop:
for tag in albums:
    for album in tag:
        cur.execute('INSERT OR IGNORE INTO Albums (album_title) VALUES (?)', (album, ))
for artist in artists:
    artist = artist.string
    cur.execute('INSERT OR IGNORE INTO Artists(artist_name) VALUES (?)', (artist, ))
    cur.execute('UPDATE Artists SET album_title=? WHERE artist_name=?', (album, artist))
Or I can update only the last row with the correct album_title:
for tag in albums:
    for album in tag:
        cur.execute('INSERT OR IGNORE INTO Albums (album_title) VALUES (?)', (album, ))
for artist in artists:
    artist = artist.string
    cur.execute('INSERT OR IGNORE INTO Artists(artist_name) VALUES (?)', (artist, ))
cur.execute('UPDATE Artists SET album_title=? WHERE artist_name=?', (album, artist))
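I understand why these problems occur, but I can't figure out how to get what I want: every row updated with the correct album title. The album_title values will always be in the same order as the artist_name values.
I've seen updating columns discussed extensively here, but because of the peculiar tangle of for loops I've gotten myself into, I can't work this out. If my problem comes from poorly structured data retrieval, I'd be happy to hear how to fix it.
Full code: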
from urllib.request import Request, urlopen
from urllib.parse import urlparse
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import urllib.error
import sqlite3
import json
import time
import ssl

# connect to (or create) the database
conn = sqlite3.connect('pitchscraper.sqlite')
# create a cursor to talk to the database
cur = conn.cursor()

# create the tables
cur.execute('''
CREATE TABLE IF NOT EXISTS Master (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE, album_title TEXT UNIQUE, artist_name TEXT UNIQUE)''')
cur.execute('''
CREATE TABLE IF NOT EXISTS Albums (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE, album_title TEXT UNIQUE)''')
cur.execute('''
CREATE TABLE IF NOT EXISTS Artists (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE, artist_name TEXT UNIQUE, album_title TEXT, FOREIGN KEY(album_title) REFERENCES Albums(album_title))''')

# open and read the page
req = Request('http://pitchfork.com/reviews/albums/?page=1', headers={'User-Agent': 'Mozilla/5.0'})
pitchpage = urlopen(req).read()

# parse with Beautiful Soup
soup = BeautifulSoup(pitchpage, "lxml")
albums = soup('h2')
artists = soup.find_all(attrs={"class" : "artist-list"})

for tag in albums:
    for album in tag:
        cur.execute('INSERT OR IGNORE INTO Albums (album_title) VALUES (?)', (album, ))
for artist in artists:
    artist = artist.string
    cur.execute('INSERT OR IGNORE INTO Artists(artist_name) VALUES (?)', (artist, ))
    cur.execute('UPDATE Artists SET album_title=? WHERE artist_name=?', (album, artist))
print()
conn.commit()
Failing output:
+------+-------------------------------------------+-------------+
| id   | artist_name                               | album_title |
+------+-------------------------------------------+-------------+
| "1"  | "Sylvan Esso"                             | "Odd Hours" |
| "2"  | "Mew"                                     | "Odd Hours" |
| "3"  | "Tara Jane O’Neil"                        | "Odd Hours" |
| "4"  | "Real Life Buildings"                     | "Odd Hours" |
| "5"  | "Bruce Springsteen and the E Street Band" | "Odd Hours" |
| "6"  | "Ravyn Lenae"                             | "Odd Hours" |
| "7"  | "Tee Grizzley"                            | "Odd Hours" |
| "8"  | "Shugo Tokumaru"                          | "Odd Hours" |
| "9"  | "Woods"                                   | "Odd Hours" |
| "10" | "Formation"                               | "Odd Hours" |
| "11" | "Valgeir Sigurðsson"                      | "Odd Hours" |
| "12" | "Caddywhompus"                            | "Odd Hours" |
+------+-------------------------------------------+-------------+
Desired output:
+------+-------------------------------------------+-------------------------------+
| id   | artist_name                               | album_title                   |
+------+-------------------------------------------+-------------------------------+
| "1"  | "Sylvan Esso"                             | "What Now"                    |
| "2"  | "Mew"                                     | "Visuals"                     |
| "3"  | "Tara Jane O’Neil"                        | "Tara Jane O'Neil"            |
| "4"  | "Real Life Buildings"                     | "Significant Weather"         |
| "5"  | "Bruce Springsteen and the E Street Band" | "Hammersmirth Odeon, London"  |
| "6"  | "Ravyn Lenae"                             | "Midnight Moonlight EP"       |
| "7"  | "Tee Grizzley"                            | "My Moment"                   |
| "8"  | "Shugo Tokumaru"                          | "TOSS"                        |
| "9"  | "Woods"                                   | "Love is Love"                |
| "10" | "Formation"                               | "Look at the Powerful People" |
| "11" | "Valgeir Sigurðsson"                      | "Dissonance"                  |
| "12" | "Caddywhompus"                            | "Odd Hours"                   |
+------+-------------------------------------------+-------------------------------+
Answer 0 (score: 0)
albums = soup('h2')
artists = soup.find_all(attrs={"class" : "artist-list"})
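The problem is that the artists list contains every artist on the page, so by the time the artist loop runs, album only holds the last title.
You have to extract the artist list inside the loop, once for each album.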
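A minimal sketch of pairing each artist with its own album, using only the standard library. Note that it swaps in a different technique from the one described above: instead of extracting the artists inside the album loop, it relies on the question's statement that album titles and artist names always appear in the same order, and walks both lists together with zip. The helper name save_pairs, the in-memory database, the simplified table definitions, and the three sample rows (copied from the desired output) are assumptions for illustration; in the real scraper the two lists would be built from the albums and artists results.

```python
import sqlite3


def save_pairs(cur, artist_names, album_titles):
    """Store each artist together with the album at the same list position."""
    # zip pairs the first artist with the first album, the second with the
    # second, and so on; it stops at the end of the shorter list.
    for artist, album in zip(artist_names, album_titles):
        cur.execute('INSERT OR IGNORE INTO Albums (album_title) VALUES (?)', (album,))
        cur.execute('INSERT OR IGNORE INTO Artists (artist_name) VALUES (?)', (artist,))
        cur.execute('UPDATE Artists SET album_title=? WHERE artist_name=?', (album, artist))


# In-memory database with simplified versions of the question's tables
# (assumption: the real script would keep using pitchscraper.sqlite).
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Albums (id INTEGER PRIMARY KEY, album_title TEXT UNIQUE)')
cur.execute('CREATE TABLE Artists (id INTEGER PRIMARY KEY, artist_name TEXT UNIQUE, album_title TEXT)')

# Sample values standing in for the scraped lists (hypothetical stand-ins).
artist_names = ['Sylvan Esso', 'Mew', 'Caddywhompus']
album_titles = ['What Now', 'Visuals', 'Odd Hours']

save_pairs(cur, artist_names, album_titles)
conn.commit()

rows = cur.execute('SELECT artist_name, album_title FROM Artists ORDER BY id').fetchall()
print(rows)
```

Because the UPDATE runs in the same iteration that supplies the album, each row receives its own title rather than the last one. If the same-order guarantee ever fails (for example, an album with no artist entry), zip would silently misalign the pairs, and looking up the artist from within each album's own element, as the answer suggests, would be the safer route.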