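I have about 500 HTML files in a directory, and I want to extract data from them and save the results as CSV.
The code I am using raises no error messages and appears to scan every file, but the resulting CSV is empty apart from the header row.
I am very new to Python and am clearly doing something wrong. I hope someone can help!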
    from bs4 import BeautifulSoup
    import csv
    import urllib2
    import os

    def processData( pageFile ):
        f = open(pageFile, "r")
        page = f.read()
        f.close()
        soup = BeautifulSoup(page)
        metaData = soup.find_all('div class="item_details"')
        priceData = soup.find_all('div class="price_big"')
        # define where we will store info
        vendors = []
        shipsfroms = []
        shipstos = []
        prices = []
        for html in metaData:
            text = BeautifulSoup(str(html).strip()).get_text().encode("utf-8").replace("\n", "")
            vendors.append(text.split("vendor:")[1].split("ships from:")[0].strip())
            shipsfroms.append(text.split("ships from:")[1].split("ships to:")[0].strip())
            shipstos.append(text.split("ships to:")[1].strip())
        for price in priceData:
            prices.append(BeautifulSoup(str(price)).get_text().encode("utf-8").strip())
        csvfile = open('drugs.csv', 'ab')
        writer = csv.writer(csvfile)
        for shipsto, shipsfrom, vendor, price in zip(shipstos, shipsfroms, vendors, prices):
            writer.writerow([shipsto, shipsfrom, vendor, price])
        csvfile.close()

    dir = "drugs"
    csvFile = "drugs.csv"
    csvfile = open(csvFile, 'wb')
    writer = csv.writer(csvfile)
    writer.writerow(["Vendors", "ShipsTo", "ShipsFrom", "Prices"])
    csvfile.close()
    fileList = os.listdir(dir)
    totalLen = len(fileList)
    count = 1
    for htmlFile in fileList:
        path = os.path.join(dir, htmlFile)
        processData(path)
        print "Processed '" + path + "'(" + str(count) + "/" + str(totalLen) + ")..."
        count = count + 1
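I suspect I am pointing BS at the wrong part of the HTML, but I can't see what it should be instead. Here is an excerpt of the HTML that contains the information I need: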
</div>
<div class="item" style="overflow: hidden;">
<div class="item_image" style="width: 180px; height: 125px;" id="image_255"><a href="https://silkroad6ownowfk.onion.to/items/200mg-high-quality-dmt" style="display: block; width: 180px; height: 125px;"></a></div>
<div class="item_body">
<div class="item_title"><a href="https://silkroad6ownowfk.onion.to/items/200mg-high-quality-dmt">200mg High Quality DMT</a></div>
<div class="item_details">
vendor: <a href="https://silkroad6ownowfk.onion.to/users/ringo-deathstarr">ringo deathstarr</a><br>
ships from: United States<br>
ships to: Worldwide
</div>
</div>
<div class="item_price">
<div class="price_big">฿0.031052</div>
<a href="https://silkroad6ownowfk.onion.to/items/200mg-high-quality-dmt#shipping">add to cart</a>
</div>
</div>
Disclaimer: this information is for a research project on the online drug trade.
Answer 0 (score: 1)
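The way you are calling find_all is wrong. A string such as 'div class="item_details"' is treated as a tag name, so nothing matches, both lists stay empty, and no rows are written. Pass the tag name and the attribute filter separately. Here is a working example:

    metaData = soup.find_all("div", {"class":"item_details"})
    priceData = soup.find_all("div", {"class":"price_big"})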
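To see the difference in isolation, here is a minimal sketch that runs both forms of the call against a trimmed copy of the excerpt from the question. It assumes Python 3 with bs4 installed and names the built-in "html.parser" explicitly; the link target is shortened to "#" for brevity.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the excerpt from the question (link target shortened).
html = """
<div class="item">
  <div class="item_details">
    vendor: <a href="#">ringo deathstarr</a><br>
    ships from: United States<br>
    ships to: Worldwide
  </div>
  <div class="item_price"><div class="price_big">฿0.031052</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The whole string is taken as a tag name, so nothing matches.
wrong = soup.find_all('div class="item_details"')

# Tag name and attribute filter passed separately.
right = soup.find_all("div", {"class": "item_details"})

# The class_ keyword is an equivalent spelling of the same filter.
also_right = soup.find_all("div", class_="item_details")

print(len(wrong), len(right), len(also_right))
```

The first call returns an empty result, which is why the loops in the question never run and only the header row reaches the CSV.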
You can find more information about how find_all is used in the Beautiful Soup documentation.
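Putting the fix into the full script, one possible shape is sketched below. It is not the asker's code: it assumes Python 3, UTF-8 encoded files, and that every listing sits in a div with class "item" as in the excerpt. The function names parse_details, extract_rows and write_csv are made up for illustration. Reading details and price from the same item keeps them paired even if one listing lacks a price, which zipping two separate lists would not. The header is also written in the same column order as the rows, whereas the question's header (Vendors, ShipsTo, ShipsFrom, Prices) does not match the order in which its rows are written.

```python
import csv
import os

from bs4 import BeautifulSoup


def parse_details(text):
    """Split 'vendor: X ships from: Y ships to: Z' into three fields."""
    vendor, rest = text.split("vendor:", 1)[1].split("ships from:", 1)
    ships_from, ships_to = rest.split("ships to:", 1)
    return vendor.strip(), ships_from.strip(), ships_to.strip()


def extract_rows(page):
    """Return one (vendor, ships_from, ships_to, price) tuple per listing."""
    soup = BeautifulSoup(page, "html.parser")
    rows = []
    for item in soup.find_all("div", {"class": "item"}):
        details = item.find("div", {"class": "item_details"})
        price = item.find("div", {"class": "price_big"})
        if details is None or price is None:
            continue  # skip listings missing either block
        # Collapse newlines and runs of spaces before splitting on labels.
        text = " ".join(details.get_text().split())
        rows.append(parse_details(text) + (price.get_text(strip=True),))
    return rows


def write_csv(html_dir, csv_path):
    """Scan html_dir, write all rows to csv_path, return the row count."""
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        # Header order matches the tuple order produced by extract_rows.
        writer.writerow(["Vendor", "ShipsFrom", "ShipsTo", "Price"])
        for name in sorted(os.listdir(html_dir)):
            path = os.path.join(html_dir, name)
            with open(path, encoding="utf-8") as page:
                rows = extract_rows(page.read())
            writer.writerows(rows)
            count += len(rows)
    return count
```

With the directory and file names from the question, the call would be write_csv("drugs", "drugs.csv"). A details block that lacks one of the three labels will make parse_details raise, so real data may need a try/except around that call.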