I have a folder of HTML files, and I am trying to scrape all the URLs that point to the different pages and save them to a CSV file.
I have read through Stack Overflow and tried modifying code I used before, but without success. Python goes through the files, but it cannot pull out the data I need.
I wrote my first Python code a month ago, so I am still a novice, and I hope someone out there can help!
The code I have been using:
from bs4 import BeautifulSoup
import csv
import urllib2
import os

def processData( pageFile ):
    f = open(pageFile, "r")
    page = f.read()
    f.close()
    soup = BeautifulSoup(page)
    urldata = soup.findAll('a', {'href': True})
    urls = []

    for html in urldata:
        html = soup('<body><a href="123">qwe</a><a href="456">asd</a></body>')

    csvfile = open('url.csv', 'ab')
    writer = csv.writer(csvfile)
    for url in zip(urls):
        writer.writerow([url])
    csvfile.close()

dir = "myurlfiles"
csvFile = "url.csv"

csvfile = open(csvFile, 'wb')
writer = csv.writer(csvfile)
writer.writerow(["URLS"])
csvfile.close()

fileList = os.listdir(dir)
totalLen = len(fileList)
count = 1

for htmlFile in fileList:
    path = os.path.join(dir, htmlFile) # get the file path
    processData(path) # process the data in the file
    print "Processed '" + path + "'(" + str(count) + "/" + str(totalLen) + ")..."
    count = count + 1
The URLs are stored in the HTML code in the following way:
<div class="item" style="overflow: hidden;">
<div class="item_image" style="width: 180px; height: 125px;" id="image_255"><a href="https://silkroad6ownowfk.onion.to/items/200mg-high-quality-dmt" style="display: block; width: 180px; height: 125px;"></a></div>
<div class="item_body">
<div class="item_title"><a href="https://silkroad6ownowfk.onion.to/items/200mg-high-quality-dmt">200mg High Quality DMT</a></div>
<div class="item_details">
vendor: <a href="https://silkroad6ownowfk.onion.to/users/ringo-deathstarr">ringo deathstarr</a><br>
ships from: United States<br>
ships to: Worldwide
</div>
</div>
<div class="item_price">
<div class="price_big">฿0.031052</div>
<a href="https://silkroad6ownowfk.onion.to/items/200mg-high-quality-dmt#shipping">add to cart</a>
</div>
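To make the goal concrete, here is roughly what I expect to pull out of a block like that (a quick interpreter sketch, assuming BeautifulSoup 4 and a trimmed-down copy of the markup above pasted in as a string):

from bs4 import BeautifulSoup

# one trimmed-down item block from the pages above
snippet = ('<div class="item_title"><a href="https://silkroad6ownowfk.onion.to/items/200mg-high-quality-dmt">'
           '200mg High Quality DMT</a></div>'
           'vendor: <a href="https://silkroad6ownowfk.onion.to/users/ringo-deathstarr">ringo deathstarr</a>')

soup = BeautifulSoup(snippet)
# collect the href attribute of every anchor that has one
print [a['href'] for a in soup.find_all('a', href=True)]
# expected:
# ['https://silkroad6ownowfk.onion.to/items/200mg-high-quality-dmt',
#  'https://silkroad6ownowfk.onion.to/users/ringo-deathstarr']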
Answer 0 (score: 2)
You can use glob to find all the *.html files in the directory, find all the links via BeautifulSoup's find_all(), and write them to a file (it looks like you don't need the csv module at all):

import glob
from bs4 import BeautifulSoup

path = 'myurlfiles/*.html'
urls = []

for file_name in glob.iglob(path):
    with open(file_name) as f:
        soup = BeautifulSoup(f)
        urls += [link['href'] for link in soup.find_all('a', {'href': True})]

with open("url.csv", "wb") as f:
    f.write("\n".join(urls))

Note that you don't need to read the file before passing it to the BeautifulSoup constructor - it supports file-like objects too. Also, follow best practice and use the with context manager when working with files.
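And if you do want a real CSV with the "URLS" header row from your original attempt, a minimal variation using the csv module (a sketch under the same Python 2 assumptions as the code above) could be:

import csv
import glob
from bs4 import BeautifulSoup

urls = []
for file_name in glob.iglob('myurlfiles/*.html'):
    with open(file_name) as f:
        soup = BeautifulSoup(f)
        urls += [link['href'] for link in soup.find_all('a', href=True)]

with open('url.csv', 'wb') as csvfile:  # 'wb' is the Python 2 idiom for csv files
    writer = csv.writer(csvfile)
    writer.writerow(["URLS"])           # header row, as in the question
    for url in urls:
        writer.writerow([url])          # one URL per row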
Hope that helps.