BeautifulSoup loop trouble

Date: 2018-04-18 12:38:14

Tags: python beautifulsoup

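I have two loops I'd like to integrate into my script, but I don't know how. I'm still learning all this.

Loop 1) An external list of book authors, moved out of the code. The list will get longer over time, so keeping it in an external file would be helpful. How do I reference the list and add a variable in the code so the search runs for each author?

Loop 2) At this stage I have to repeat my code for each author. How can I write it once and have it repeat for the next author once a search has finished?

*The end goal here is to export the searches in HTML format so they are easy to read.

Thanks!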

from bs4 import BeautifulSoup
import urllib.request
import time

#Loop 1
var1 = 'Stephen%20King'
var2 = 'J.%20K.%20Rowling'
var3 = 'James%20Patterson'
var4 = 'John%20Grisham'

timestr = time.strftime("%m-%d-%Y")

#Loop 2
file = open('/var/script/exp/exp_' + timestr + '.html', 'a+')

with open('/var/script/exp/exp_' + timestr + '.html', 'a') as file_1, open('/var/script/src/header.html', 'r') as file_2:
    for line in file_2:
         file_1.write(line)

with open('/var/script/exp/exp_' + timestr + '.html', 'a') as file_1, open('/var/script/src/subheader.html', 'r') as file_2:
    for line in file_2:
         file_1.write(line)

for i in range(5):
    url = 'https://www.example.com/Listings?st=' + var1 + '&sg=&c=&s=&lp=0&hp=999999&p={}'.format(i)
    source = urllib.request.urlopen(url)
    soup = BeautifulSoup(source, 'html.parser')

    for products in soup.find_all('li', class_='widget'):
        image = products.find('img', class_='lazy-load')
        itemurl = products.find('a', class_='product')
        title = products.find('div', class_='title').text
        countdown = products.find(class_='product-countdown')
        price = products.find(class_='product-price').find(class_="price").text
        file = open('/var/script/exp/exp_' + timestr + '.html', 'a+')
        file.write('<div class="col-md-15 col-xs-3">')
        file.write('<div class="card mb-4 box-shadow">')
        file.write('<img class="card-img-top" src="')
        file.write(image.get('data-src'))
        file.write('" alt="Card image cap" height="200px">')
        file.write('<div class="card-body">')
        file.write('<div><p class="card-text"><a href="https://www.example.com' + itemurl.get('href')+'" target="_blank">' +  title + '</a>' + price +  '</p>')
        file.write('<div class="d-flex justify-content-between align-items-center">')
        file.write('<div class="btn-group">')
        file.write('<button type="button" class="btn btn-sm btn-outline-secondary"><a href="https://www.example.com' + itemurl.get('href')+'" target="_blank">View</a></button>')
        file.write('</div><small class="text-muted">')
        file.write(countdown.get('data-countdown'))
        file.write('</small></div></div></div></div></div>')
        print

    file.close()

print(var1)

#Repeated Code
file = open('/var/script/exp/exp_' + timestr + '.html', 'a+')

with open('/var/script/exp/exp_' + timestr + '.html', 'a') as file_1, open('/var/script/src/subheader.html', 'r') as file_2:
    for line in file_2:
         file_1.write(line)


for i in range(5):

    url = 'https://www.example.com/Listings?st=' + var2 + '&sg=&c=&s=&lp=0&hp=999999&p={}'.format(i)
    source = urllib.request.urlopen(url)
    soup = BeautifulSoup(source, 'html.parser')

    for products in soup.find_all('li', class_='widget'):
        image = products.find('img', class_='lazy-load')
        itemurl = products.find('a', class_='product')
        title = products.find('div', class_='title').text
        countdown = products.find(class_='product-countdown')
        price = products.find(class_='product-price').find(class_="price").text
        #print(image.get('data-src'))
        #file.write('<img src="', + image.get('data-src'), + '">')
        file = open('/var/script/exp/exp_' + timestr + '.html', 'a+')
        file.write('<div class="col-md-15 col-xs-3">')
        file.write('<div class="card mb-4 box-shadow">')
        file.write('<img class="card-img-top" src="')
        file.write(image.get('data-src'))
        file.write('" alt="Card image cap" height="200px">')
        file.write('<div class="card-body">')
        file.write('<div><p class="card-text"><a href="https://www.example.com' + itemurl.get('href')+'" target="_blank">' +  title + '</a>' + price +  '</p>')
        file.write('<div class="d-flex justify-content-between align-items-center">')
        file.write('<div class="btn-group">')
        file.write('<button type="button" class="btn btn-sm btn-outline-secondary"><a href="https://www.example.com' + itemurl.get('href')+'" target="_blank">View</a></button>')
        file.write('</div><small class="text-muted">')
        file.write(countdown.get('data-countdown'))
        file.write('</small></div></div></div></div></div>')
        print

    file.close()

print(var2)

Many thanks to DoubleDouble for the response. Here is the updated code; I can now use the author list. Man, I'm amazed it works!! LOL

from bs4 import BeautifulSoup
import urllib.request
import time

lines = open('C:\\Users\\ataylor_dev\\Documents\\VSCODE\\Python\\BeautifulSoup\\Training\\authors.txt').read().splitlines()
for i in range(5): #searches through pages
    for author in lines:
        url = 'https://www.example.com/Listings?st=' + author + '&sg=&p={}'.format(i) #adds author and page to the URL
        print(url)

#how to repeat code with next author


Output:

https://www.example.com/Listings?st=Stephen%20King&sg=&c=&s=&lp=0&hp=999999&p=0
https://www.example.com/Listings?st=J.%20K.%20Rowling&sg=&c=&s=&lp=0&hp=999999&p=0
https://www.example.com/Listings?st=James%20Patterson&sg=&c=&s=&lp=0&hp=999999&p=0
https://www.example.com/Listings?st=John%20Grisham&sg=&c=&s=&lp=0&hp=999999&p=0
John%20Grisham
https://www.example.com/Listings?st=Stephen%20King&sg=&c=&s=&lp=0&hp=999999&p=1
https://www.example.com/Listings?st=J.%20K.%20Rowling&sg=&c=&s=&lp=0&hp=999999&p=1
https://www.example.com/Listings?st=James%20Patterson&sg=&c=&s=&lp=0&hp=999999&p=1
https://www.example.com/Listings?st=John%20Grisham&sg=&c=&s=&lp=0&hp=999999&p=1
John%20Grisham
https://www.example.com/Listings?st=Stephen%20King&sg=&c=&s=&lp=0&hp=999999&p=2
https://www.example.com/Listings?st=J.%20K.%20Rowling&sg=&c=&s=&lp=0&hp=999999&p=2
https://www.example.com/Listings?st=James%20Patterson&sg=&c=&s=&lp=0&hp=999999&p=2
https://www.example.com/Listings?st=John%20Grisham&sg=&c=&s=&lp=0&hp=999999&p=2
John%20Grisham
https://www.example.com/Listings?st=Stephen%20King&sg=&c=&s=&lp=0&hp=999999&p=3
https://www.example.com/Listings?st=J.%20K.%20Rowling&sg=&c=&s=&lp=0&hp=999999&p=3
https://www.example.com/Listings?st=James%20Patterson&sg=&c=&s=&lp=0&hp=999999&p=3
https://www.example.com/Listings?st=John%20Grisham&sg=&c=&s=&lp=0&hp=999999&p=3
John%20Grisham
https://www.example.com/Listings?st=Stephen%20King&sg=&c=&s=&lp=0&hp=999999&p=4
https://www.example.com/Listings?st=J.%20K.%20Rowling&sg=&c=&s=&lp=0&hp=999999&p=4
https://www.example.com/Listings?st=James%20Patterson&sg=&c=&s=&lp=0&hp=999999&p=4
https://www.example.com/Listings?st=John%20Grisham&sg=&c=&s=&lp=0&hp=999999&p=4
John%20Grisham

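How do I now repeat the code in the right order? Stephen%20King pages 1 - 5, then the next author, pages 1 - 5, and so on.

I feel like I'm getting closer. Thanks again!

1 Answer:

Answer 0 (score: 0)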

If I understand correctly, in Loop 1 you want to read the author names from an external file. See the following question (and its summary):

In Python, how do I read a file line-by-line into a list?

with open('filename') as f:
    lines = f.readlines()
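
One thing to watch for: `readlines()` keeps the trailing newline on every line, and that newline would end up inside the search URL. Here is a minimal sketch of reading the author list with the newlines stripped and blank lines skipped. The helper name `read_authors` and the file name `authors.txt` are made up for illustration, and the demo writes a throwaway file so it can run on its own.

```python
import os
import tempfile


def read_authors(path):
    """Return the non-empty lines of a text file with surrounding whitespace removed."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Demo: a temporary file stands in for the real authors.txt (one author per line).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "authors.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("Stephen%20King\nJ.%20K.%20Rowling\n\nJames%20Patterson\n")
    authors = read_authors(demo_path)

print(authors)
```

In your script you would call `read_authors` with the path to your own file instead of the temporary one.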

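Then, instead of copy-pasting the code for each variable, you want to loop through the list you read in.

This link explains it better and more thoroughly than I can, but here is a small snippet.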

>>> li = ['a', 'b', 'c', 'd', 'e']
>>> for i in range(len(li)):
...     print(li[i])
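
Looping by index works, but in Python it is usually simpler to loop over the items directly, and `enumerate` gives you the index as well when you need it. A small sketch using the same list `li`:

```python
li = ['a', 'b', 'c', 'd', 'e']

# Loop over the items directly; no index bookkeeping is needed.
for item in li:
    print(item)

# enumerate() yields (index, item) pairs when the position matters.
pairs = list(enumerate(li))
for index, item in pairs:
    print(index, item)
```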

To help you see how this would work for you, compare the following:

var1 = 'Stephen%20King'
var2 = 'J.%20K.%20Rowling'
var3 = 'James%20Patterson'
var4 = 'John%20Grisham'

print(var1, var2, var3, var4)

#VS

vars = []

vars.append('Stephen%20King')
vars.append('J.%20K.%20Rowling')
vars.append('James%20Patterson')
vars.append('John%20Grisham')

print(vars)
print(vars[0], vars[1], vars[2], vars[3])

for author in vars:
    print(author)
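
To get the order you describe (all pages for one author, then all pages for the next), the author loop needs to be the outer loop and the page loop the inner one. Below is a rough sketch of that idea, not a drop-in replacement for your script. The helper `build_urls` and the constant `BASE` are names I made up, the URL pattern is copied from your question, and I am assuming the author names are stored as plain text so that `urllib.parse.quote` can do the `%20` encoding.

```python
from urllib.parse import quote

# URL pattern taken from the question; {author} and {page} are filled in per request.
BASE = 'https://www.example.com/Listings?st={author}&sg=&c=&s=&lp=0&hp=999999&p={page}'


def build_urls(authors, pages=5):
    """Return search URLs grouped by author: every page of one author, then the next."""
    urls = []
    for author in authors:          # outer loop: one author at a time
        for page in range(pages):   # inner loop: pages 0..pages-1 for that author
            urls.append(BASE.format(author=quote(author), page=page))
    return urls


# Plain-text names (an assumption); quote() turns the spaces into %20.
authors = ['Stephen King', 'J. K. Rowling']
urls = build_urls(authors, pages=2)
for url in urls:
    print(url)
```

In the real script, the body of the inner loop is where the `urlopen`, `BeautifulSoup` and `file.write` lines from your question would go, written once. If your file already contains `%20`-encoded names, skip `quote`, otherwise the `%` signs get encoded a second time.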