How do I get URLs from a .txt file with BeautifulSoup?
I'm new to web scraping. I want to scrape multiple pages, and I need to read those pages from a txt file.
import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
chrome_driver_path = r'C:\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(executable_path=chrome_driver_path)
urls = r'C:\chromedriver_win32\asin.txt'
url = ('https://www.amazon.com/dp/'+urls)
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
stock = soup.find(id='availability').get_text()
stok_kontrol = pd.DataFrame( { 'Url': [url], 'Stok Durumu': [stock] })
stok_kontrol.to_csv('stok-kontrol.csv', encoding='utf-8-sig')
print(stok_kontrol)
This text file contains the Amazon ASIN numbers:
C:\chromedriver_win32\asin.txt
Contents of the file:
B00004SU18
B07L9178GQ
B01M35N6CZ
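Answer 0 (score: 0)
If I understand the question correctly, you only need to read the ASIN numbers and pass each one into the URL, which tells BeautifulSoup what to scrape. That is a simple file operation: open the file, loop over its lines, and pass each number along one at a time.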
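urls = r'C:\chromedriver_win32\asin.txt'
rows = []
with open(urls, 'r') as f:
    for line in f:
        asin = line.strip()  # drop the trailing newline so the URL stays valid
        if not asin:
            continue  # skip blank lines
        url = 'https://www.amazon.com/dp/' + asin
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'lxml')
        stock = soup.find(id='availability').get_text()
        rows.append({'Url': url, 'Stok Durumu': stock})
# Build and write the table once, after the loop, so earlier rows are not overwritten
stok_kontrol = pd.DataFrame(rows)
stok_kontrol.to_csv('stok-kontrol.csv', encoding='utf-8-sig')
print(stok_kontrol)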
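The file-reading step can be separated from the browser work, which makes it easy to check on its own. The following is only a minimal sketch, assuming the file holds one ASIN per line; the helper names `read_asins` and `build_urls` are hypothetical and do not come from the answer above.

```python
import os
import tempfile
from pathlib import Path

BASE_URL = 'https://www.amazon.com/dp/'


def read_asins(path):
    """Return the non-empty lines of the file at `path`, stripped of whitespace."""
    text = Path(path).read_text(encoding='utf-8')
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_urls(asins):
    """Prefix each ASIN with the product-page base URL."""
    return [BASE_URL + asin for asin in asins]


if __name__ == '__main__':
    # Demo against a temporary file, so no real path is needed (assumption:
    # in practice this would be the asin.txt file from the question).
    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                     encoding='utf-8') as tmp:
        tmp.write('B00004SU18\nB07L9178GQ\nB01M35N6CZ\n')
    for url in build_urls(read_asins(tmp.name)):
        print(url)
    os.remove(tmp.name)
```

Each URL returned by `build_urls` could then be passed to `driver.get(url)` inside the loop.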
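Answer 1 (score: 0)
This gets each product's URL and whether the product is in stock,
prints that information to the console, and then
saves it to the file 'stok-kontrol.csv'.
Tested with: Python 3.7.4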
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import re

chrome_driver_path = r'C:\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(executable_path=chrome_driver_path)

# Gets whether the products in the array, are in stock, from www.amazon.com
# Returns an Array of Dictionaries, with keys ['asin','instock','url']
def IsProductsInStock(array_of_ASINs):
    results = []
    for asin in array_of_ASINs:
        url = 'https://www.amazon.com/dp/' + str(asin)
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'lxml')
        stock = soup.find(id='availability').get_text().strip()
        isInStock = False
        if 'In Stock' in stock:
            # If 'In Stock' is the text of 'availability' element
            isInStock = True
        else:
            # If Not, extract the number from it, if any, and see if it's in stock.
            tmp = re.search(re.compile('[0-9]+'), stock)
            if tmp is not None and int(tmp[0]) > 0:
                isInStock = True
        results.append({"asin": asin, "instock": isInStock, "url": url})
    return results

# Saves the product information to 'toFile'
# Returns a pandas.core.frame.DataFrame object, with the product info ['url', 'instock'] as columns
# inStockDict MUST be either a Dictionary, or a 'list' of Dictionaries with, ['asin','instock','url'] keys
def SaveProductInStockInformation(inStockDict, toFile):
    if isinstance(inStockDict, dict):
        stok_kontrol = pd.DataFrame({'Url': [inStockDict['url']], 'Stok Durumu': [inStockDict['instock']]})
    elif isinstance(inStockDict, list):
        stocksSimple = []
        for stock in inStockDict:
            stocksSimple.append([stock['url'], stock['instock']])
        stok_kontrol = pd.DataFrame(stocksSimple, columns=['Url', 'Stok Durumu'])
    else:
        raise Exception("inStockDict parm, Must be Either a dictionary, or a 'list' of dictionaries with, ['asin','instock','url'] keys!")
    stok_kontrol.to_csv(toFile, encoding='utf-8-sig')
    return stok_kontrol

# Get ASINs From File
f = open(r'C:\chromedriver_win32\asin.txt', 'r')
urls = f.read().split()
f.close()

# Get a list of Dictionaries containing all the products information
stocks = IsProductsInStock(urls)

# Save and Print the ['url', 'instock'] information
print(SaveProductInStockInformation(stocks, 'stok-kontrol.csv'))

# Remove if you need to use the driver later on in the program
driver.close()
Result (file 'stok-kontrol.csv'):
,Url,Stok Durumu
0,https://www.amazon.com/dp/B00004SU18,True
1,https://www.amazon.com/dp/B07L9178GQ,True
2,https://www.amazon.com/dp/B01M35N6CZ,True
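The in-stock decision in this answer can also be pulled out into a small function that takes only the text of the 'availability' element, so it can be tried without a browser. This is a sketch of the same rule, not a definitive check; the function name `is_in_stock` is hypothetical, and the sample strings below are illustrative assumptions rather than text captured from Amazon.

```python
import re


def is_in_stock(availability_text):
    """Apply the answer's rule: 'In Stock' in the text, or any positive number."""
    text = availability_text.strip()
    if 'In Stock' in text:
        return True
    # Otherwise look for the first run of digits and treat a positive count as stock
    match = re.search(r'[0-9]+', text)
    return match is not None and int(match.group(0)) > 0


if __name__ == '__main__':
    # Illustrative inputs only (assumed wording)
    for sample in ['In Stock.', 'Only 3 left in stock.', 'Currently unavailable.']:
        print(sample, is_in_stock(sample))
```

Note that under this rule any positive number in the text counts as in stock, so a message that mentions a number for another reason (for example a shipping estimate in days) would also return True.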