Starting from a URL of this form:
url = 'https://www.example.com/MOVxxxx/YYYY-MM-DD'
I loop over the pages using two ranges, one for the movie ID (the xxxx in MOVxxxx) and one for the date:
from requests import get
from bs4 import BeautifulSoup

for page1 in mov_pages:
    for page2 in date_pages:
        # Build the page URL from the movie ID and the date
        response = get('https://www.example.com/' + page1 + '/' + page2)
        page_html = BeautifulSoup(response.text, 'html.parser')
        containers_1 = page_html.find_all('ul', class_='showtime-lists')
        containers_2 = page_html.find_all('div', class_='day')
The HTML structure looks like this:
<ul class="showtime-lists">
<li>...</li>
<format="2D" movie="3247" time="12.00"
<li>...</li>
<format="2D" movie="3247" time="13.30"
<li>...</li>
...
And another structure (on the same page):
<div class="day">
<a href date="2017-10-15">..</a>
<div class="day">
<a href date="2017-10-15">..</a>
...
My goal is to build a DataFrame from these lists, in the following format:
date time movie
2017-10-14 12.00 3247
2017-10-14 13.30 3247
2017-10-14 12.00 3252
...
2017-10-15
2017-10-15
... ... ...
The problem:
* the two structures give lists of different lengths
My best attempt so far:
* I can create the DataFrame with time and movie correctly, but not with the date (because the date list doesn't have the same length)
My code:
# extract movie (time is extracted in a similar way)
movie = []
for container in containers_1:
    idx = container['movie']
    movie.append(idx)

# extract date
date_id = []
for each in containers_2:
    date_idx = each.a['date']
    date_id.append(date_idx)
Output:
The movie and time lists have the same length, but the date list does not.
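
One idea that might sidestep the mismatch, sketched under assumptions rather than tested against the real site: the date is already part of each request URL (page2 in the loop), so it could be attached to every showtime scraped from that page instead of being collected from the div.day links. The helper name rows_for_page is hypothetical, and the sample values are copied from the desired output above.

```python
def rows_for_page(date, movies, times):
    """Pair each showtime found on one page with the date that page was requested for."""
    if len(movies) != len(times):
        raise ValueError("movie and time lists should have the same length")
    return [{"date": date, "time": t, "movie": m} for m, t in zip(movies, times)]


# Sample values copied from the desired output above. In the real loop,
# `movies` and `times` would be the lists scraped from a single page, and
# `date` would be the `page2` value used to build that page's URL.
rows = []
rows += rows_for_page("2017-10-14", ["3247", "3247"], ["12.00", "13.30"])
rows += rows_for_page("2017-10-14", ["3252"], ["12.00"])

for row in rows:
    print(row["date"], row["time"], row["movie"])
```

Because every row carries its own date, the three columns always have the same length. Passing rows to pandas.DataFrame(rows, columns=['date', 'time', 'movie']) should then give a table in the layout shown above.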