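I have successfully used BeautifulSoup to iterate over several hundred pages of the bandsintown site, for example: https://www.bandsintown.com/?came_from=257&page=102
I loop through every page to build an array of all event dates, called "uniqueDatesBucket". Printing the array shows the following (there are many results, so only a sample is given below).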
print uniqueDatesBucket
Result:
[[<div class="event-b58f7990"><div class="event-ad736269">JAN</div><div class="event-d7a00339">08</div></div>, <div class="event-b58f7990"><div class="event-ad736269">JAN</div><div class="event-d7a00339">08</div></div>, ............................<div class="event-b58f7990"><div class="event-ad736269">JAN</div><div class="event-d7a00339">31</div></div>]]
This is as expected. Next, I want to put the month and day into separate arrays so that I can start building a database of dates. Here is the code:
#Build empty array for month/date
uniqueMonth = []
uniqueDay = []
for i in uniqueDatesBucket[0]:
    uniqueMonthDay = i.find_all('div')
    uniqueMonth.append(uniqueMonthDay[0].text)
    uniqueDay.append(uniqueMonthDay[1].text)
print uniqueDay
The result is:
[u'08', u'08', u'08', u'08', u'08', u'08', u'08', u'08', u'08', u'09', u'09', u'09', u'09', u'09', u'09', u'09', u'09', u'09']
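My question is: why does this return only 18 results? (There are 18 events on the target bandsintown page, but I thought I had handled that with the page iterator described above.) uniqueDatesBucket, which the uniqueMonth array is built from, clearly contains far more than 18 results.
Also, what is the "u" in front of each date in the results?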
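Answer 0 (score: 0)
I tried my best to reproduce your code, but I could not get very close. The link you provided does not give me the same output, so I cannot reproduce it exactly.
Using the list you provided, I ran into no problems when running it myself: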
x = '<div class="event-b58f7990"><div class="event-ad736269">JAN</div><div class="event-d7a00339">08</div></div>, <div class="event-b58f7990"><div class="event-ad736269">JAN</div><div class="event-d7a00339">08</div></div>, <div class="event-b58f7990"><div class="event-ad736269">JAN</div><div class="event-d7a00339">31</div></div>'.split(', ')
x
This gave me the following:
['<div class="event-b58f7990"><div class="event-ad736269">JAN</div><div class="event-d7a00339">08</div></div>',
'<div class="event-b58f7990"><div class="event-ad736269">JAN</div><div class="event-d7a00339">08</div></div>',
'<div class="event-b58f7990"><div class="event-ad736269">JAN</div><div class="event-d7a00339">31</div></div>']
Here is my attempt at reproducing it:
from bs4 import BeautifulSoup
uniqueDatesBucket = []
uniqueMonth = []
uniqueDay = []
for item in x:
    uniqueDatesBucket.append(BeautifulSoup(item, 'html.parser'))
for i in uniqueDatesBucket:
    uniqueMonthDay = i.find_all('div')
    print('Day:\t' + uniqueMonthDay[2].text + '\tMonth:\t', uniqueMonthDay[1].text)
This is my output:
Day: 08 Month: JAN
Day: 08 Month: JAN
Day: 31 Month: JAN
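Note that the indices differ from the ones you used to get the same values, which may be causing the confusion. Here each string is parsed into its own soup, so find_all('div') also returns the outer div at index 0, and the month and day move to indices 1 and 2.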
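However, if you are scraping the site you linked, all of the content is embedded in a JavaScript section, which makes it easier to parse and to get the correct values. Here is my code, which pulls the data out of the JSON embedded in the script: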
import requests
from bs4 import BeautifulSoup
import json
import re # regular expression, I just use it to extract the JSON from the JavaScript
x = requests.get('https://www.bandsintown.com/?came_from=257&page=102')
soup = BeautifulSoup(x.content, 'html.parser')
json_text = soup.find_all('script')[2].text # Gives you a JSON set to the variable window.__data
json_extracted = re.search(r'^window.__data=(.+)', json_text).group(1) # Collect the JSON without the variable assignment
json_parsed = json.loads(json_extracted)
# The dates are stored in json.homeView.body.popularEvents.events
for item in json_parsed['homeView']['body']['popularEvents']['events']:
    print(item['artistName'])
    print('Playing on', item['dayOfWeek'], item['dayOfMonth'], item['month'], '\n')
Here is the output:
Florence and The Machine
Playing on FRI 18 JAN
Maroon 5
Playing on FRI 22 FEB
Shawn Mendes
Playing on TUE 29 OCT
John Mayer
Playing on WED 27 MAR
Amy Shark
Playing on SAT 11 MAY
Post Malone
Playing on TUE 30 APR
John Butler Trio
Playing on THU 07 FEB
Florence and The Machine
Playing on SAT 19 JAN
Ocean Alley
Playing on THU 14 MAR
Bring Me the Horizon
Playing on SAT 13 APR
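The extraction step can be tried offline without hitting the site. The following is only a sketch: the script bodies and the event values are made up, the helper name extract_window_data is hypothetical, and the real page may lay out its scripts differently. It searches every script body for the window.__data assignment instead of relying on a fixed index such as [2], which would break if the page adds or removes a script tag.

```python
import json
import re

# Hypothetical script bodies standing in for the text of a page's <script> tags.
# The structure mirrors the keys used above; the values are invented.
scripts = [
    "var tracking = {};",
    'window.__data={"homeView": {"body": {"popularEvents": {"events": ['
    '{"artistName": "Example Artist", "dayOfWeek": "FRI", '
    '"dayOfMonth": "18", "month": "JAN"}]}}}}',
]


def extract_window_data(script_texts):
    """Return the parsed JSON assigned to window.__data, or None if absent."""
    # Capture from the first "{" to the last "}" after the assignment.
    pattern = re.compile(r"window\.__data\s*=\s*(\{.*\})", re.DOTALL)
    for text in script_texts:
        match = pattern.search(text)
        if match:
            return json.loads(match.group(1))
    return None


data = extract_window_data(scripts)
events = data["homeView"]["body"]["popularEvents"]["events"]
dates = [(e["month"], e["dayOfMonth"]) for e in events]
print(dates)
```

With a real page, the list of script bodies could be built from soup.find_all('script') and the rest would stay the same.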
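As for the u'xyz' strings: BeautifulSoup returns text as Unicode strings, and in Python 2 that is what the u prefix stands for. The prefix only appears when a list is printed, because the list shows the repr of each item; printing a single item shows just 08. If you need byte strings instead, use u'xyz'.encode('utf-8').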
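The prefix is part of how a list displays its items, not part of the data. A small Python 3 sketch of the same effect follows, using made-up sample values; Python 3 no longer prints the u, but the quotes around each item come from the same mechanism.

```python
days = ["08", "09"]  # made-up sample values

# Converting a list to text uses the repr of each item, quotes included.
as_list = str(days)

# Joining (or printing) the items themselves gives only the text.
as_items = ", ".join(days)

print(as_list)
print(as_items)
```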
Answer 1 (score: 0)
As I understand it, your problem is not parsing the HTML but handling the data, that is, the lists.
From your code:
for i in uniqueDatesBucket[0]:
It looks like you are looping over the first index only. Did you mean to loop over all of them?
for udb in uniqueDatesBucket:
    for i in udb:
        uniqueMonthDay = i.find_all('div')
        uniqueMonth.append(uniqueMonthDay[0].text)
        uniqueDay.append(uniqueMonthDay[1].text)
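To see the difference between indexing [0] and looping over everything, here is a minimal sketch. It assumes the bucket holds one inner list per scraped page, which may not match how your bucket was actually built, and it uses made-up (month, day) tuples in place of BeautifulSoup tags so that it runs on its own.

```python
from itertools import chain

# Hypothetical stand-in for uniqueDatesBucket: one inner list per page.
# Real entries would be tags; tuples keep the example self-contained.
pages = [
    [("JAN", "08"), ("JAN", "08"), ("JAN", "09")],  # page 1
    [("JAN", "15"), ("JAN", "31")],                 # page 2
]

# Indexing [0] only ever sees the first page.
first_page_only = [day for _month, day in pages[0]]

# chain.from_iterable walks every inner list in order,
# equivalent to the nested for loops.
all_pages = [day for _month, day in chain.from_iterable(pages)]

print(len(first_page_only), len(all_pages))
```

If len(uniqueDatesBucket) is greater than 1, the same pattern applies to your data; if it is 1, the missing results are being lost somewhere else.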