CSV formatting problem when web scraping

Time: 2013-11-01 20:46:18

Tags: python loops csv beautifulsoup

I wrote the following script to scrape some data from download.com:

from bs4 import BeautifulSoup
import urllib2
import csv

query = raw_input("Please enter the query value: ")
limit = raw_input("Please enter the results limit (1-100): ")
minreview = raw_input("Please enter the min user rating (1-5): ")
maxreview = raw_input("Please enter the max user rating (1-5): ")
csvname = raw_input("Enter a filename.csv for CSV output: ")

cnetFile = urllib2.urlopen("http://developer.api.cnet.com/rest/v1.0/softwareProductSearch?partKey=APIKEYGOESHERE&partTag=APIKEYGOESHERE&query=" + query + "$
cnetXml = cnetFile.read()
cnetFile.close()

soup = BeautifulSoup(cnetXml, features="xml")
#print soup.prettify()

f = csv.writer(open(csvname, "w"))
f.writerow(["Name", "Link", "Mfg", "Mfg Link", "Price", "Downloads", "User Rating Summary", "User Rating Product"])

data = soup.find_all(['Name', 'Price', 'TotalDownloads', 'LinkURL', 'Rating'])
#print data

for x in data:
        strip1 = x.contents
        print strip1
        f.writerow(strip1)

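With 2 products returned, the CSV output looks like this. (Each product should return 8 fields, matching the header row in the code, but occasionally one is missing, such as the 8th field of the 1st product.)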

Name,Link,Mfg,Mfg Link,Price,Downloads,User Rating Summary,User Rating Product

Firegraphic

http://www.download.com/firegraphic/3000-2192_4-10367545.html?tag=api

Firegraphic

http://www.firegraphic.com

$49.95

2546868

2.0



MP3 CD Maker

http://www.download.com/mp3-cd-maker/3000-2140_4-10065486.html?tag=api

ZY Computing

http://www.dvdsanta.com

$24.95

1653394

2.0

2.0

Here is a sample of the data in the soup variable:

<?xml version="1.0" encoding="utf-8"?>
<CNETResponse realm="cnet" version="1.0" xmlns="http://developer.api.cnet.com/re
st/v1.0/ns" xmlns:xlink="http://www.w3.org/1999/xlink">
<SoftwareProducts numFound="898" numReturned="2" start="0">
<SoftwareProduct id="11889531" setId="10367545" xlink:href="http://developer.api
.cnet.com/rest/v1.0/softwareProduct?productSetId=10367545&amp;iod=userRatings&am
p;partKey=572kjgq8h2mqbsup36cubkys&amp;partTag=572kjgq8h2mqbsup36cubkys">
<Name>Firegraphic</Name>
<Version>11.0</Version>
<LinkURL>http://www.download.com/firegraphic/3000-2192_4-10367545.html?tag=api</
LinkURL>
<Publisher id="6268727">
<Name>Firegraphic</Name>
<LinkURL>http://www.firegraphic.com</LinkURL>
<UrsRegId/>
</Publisher>
<License>Free to try</License>
<BetaRelease>false</BetaRelease>
<Price currency="USD">$49.95</Price>
<Summary>Import, organize, view, edit, print, and share your digital images.</Su
mmary>
<Description>&lt;p&gt;Firegraphic is an image viewer for photography professiona
ls, Web, and graphic designers to import, organize, view, edit, print, and share
 their digital images. The new Firegraphic has improved its memory usage and con
sumes very low memory, which leaves more memory for you to edit your photos in t
he image editor. Firegraphic now supports the RAW file formats from digital came
ras. Firegraphic gives you the ability to open multiple photos in the Viewer and
 compare photos side-by-side to choose your best shot. You also can customize th
e tools in your toolbar and the Context menu in the Viewer. The Firegraphic user
 interface lets you change the skin color and edit photos with a third-party ima
ge editor.&lt;/p&gt;</Description>
<WhatsNew/>
<Requirements> </Requirements>
<Platform>Windows</Platform>
<OperatingSystems>
<OperatingSystem id="3">Windows</OperatingSystem>
<OperatingSystem id="17">Windows 2000</OperatingSystem>
<OperatingSystem id="25">Windows XP</OperatingSystem>
<OperatingSystem id="43">Windows 2003</OperatingSystem>
<OperatingSystem id="52">Windows Vista</OperatingSystem>
<OperatingSystem id="133">Windows 7</OperatingSystem>
</OperatingSystems>
<EditorsRating outOf="5">3.0</EditorsRating>
<EditorsNote/>
<PreferredNode id="2192"/>
<WeeklyDownloads>8</WeeklyDownloads>
<TotalDownloads>2546868</TotalDownloads>
<CreatedDate>2011-04-21 17:41:19.0</CreatedDate>
<ReleaseDate>2011-04-21 00:00:00.0</ReleaseDate>
<ReviewDate>2008-11-09 00:00:00.0</ReviewDate>
<Limitations>30-day trial</Limitations>
<BuyNowUrl type=""> </BuyNowUrl>
<TrialPayUrl/>
<CleverBridgeUrl/>
<UpsellUnit/>
<ButtonPartner/>
<CNETContentIds/>
<FileSize>8358576</FileSize>
<Category id="2192" xlink:href="http://developer.api.cnet.com/rest/v1.0/category
?categoryId=2192&amp;siteId=4&amp;partKey=572kjgq8h2mqbsup36cubkys&amp;partTag=5
72kjgq8h2mqbsup36cubkys"/>
<UserRatingSummary>
<Rating outOf="5">2.0</Rating>
<TotalVotes>7</TotalVotes>
</UserRatingSummary>
<UserRatingProduct>
<Rating outOf="5"/>
<TotalVotes>0</TotalVotes>
</UserRatingProduct>
<EditorsPick/>
<ListingType>STANDARD</ListingType>
</SoftwareProduct>
<SoftwareProduct id="10296367" setId="10065486" xlink:href="http://developer.api
.cnet.com/rest/v1.0/softwareProduct?productSetId=10065486&amp;iod=userRatings&am
p;partKey=572kjgq8h2mqbsup36cubkys&amp;partTag=572kjgq8h2mqbsup36cubkys">
<Name>MP3 CD Maker</Name>
<Version>2.0</Version>
<LinkURL>http://www.download.com/mp3-cd-maker/3000-2140_4-10065486.html?tag=api<
/LinkURL>
<Publisher id="83016">
<Name>ZY Computing</Name>
<LinkURL>http://www.dvdsanta.com</LinkURL>
<UrsRegId/>
</Publisher>
<License>Free to try</License>
<BetaRelease>false</BetaRelease>
<Price currency="USD">$24.95</Price>
<Summary>Create audio CDs from your MP3 collection.</Summary>
<Description>&lt;p&gt;MP3 CD Maker works with a CD recorder to create audio CDs
from collections of MP3 audio files. It directly converts MP3 files into the CD
audio format and can decode any MP3 file into WAV or raw audio. A normalization
feature lets you ensure that all MP3s in a set have the same volume level. &lt;/
p&gt;&lt;p&gt;Version 2.0 adds support for 200 more CD-R/RW drives.&lt;/p&gt;</D
escription>
<WhatsNew/>
<Requirements>Windows 95/98/Me/NT/2000/XP</Requirements>
<Platform>Windows</Platform>
<OperatingSystems>
<OperatingSystem id="3">Windows</OperatingSystem>
<OperatingSystem id="5">Windows 95</OperatingSystem>
<OperatingSystem id="8">Windows NT</OperatingSystem>
<OperatingSystem id="6">Windows 98</OperatingSystem>
<OperatingSystem id="7">Windows Me</OperatingSystem>
<OperatingSystem id="17">Windows 2000</OperatingSystem>
<OperatingSystem id="25">Windows XP</OperatingSystem>
</OperatingSystems>
<EditorsRating outOf="5">4.0</EditorsRating>
<EditorsNote/>
<PreferredNode id="2140"/>
<WeeklyDownloads>103</WeeklyDownloads>
<TotalDownloads>1653394</TotalDownloads>
<CreatedDate>2004-06-16 19:07:46.0</CreatedDate>
<ReleaseDate>2004-06-16 00:00:00.0</ReleaseDate>
<ReviewDate>2009-02-27 00:00:00.0</ReviewDate>
<Limitations>limited to 4 songs on a CD</Limitations>
<BuyNowUrl type="dl_buy_ond">http://send.onenetworkdirect.net/z/126524/CD103284/
</BuyNowUrl>
<TrialPayUrl/>
<CleverBridgeUrl/>
<UpsellUnit/>
<ButtonPartner/>
<CNETContentIds/>
<FileSize>1283187</FileSize>
<Category id="2140" xlink:href="http://developer.api.cnet.com/rest/v1.0/category
?categoryId=2140&amp;siteId=4&amp;partKey=572kjgq8h2mqbsup36cubkys&amp;partTag=5
72kjgq8h2mqbsup36cubkys"/>
<UserRatingSummary>
<Rating outOf="5">2.0</Rating>
<TotalVotes>3</TotalVotes>
</UserRatingSummary>
<UserRatingProduct>
<Rating outOf="5">2.0</Rating>
<TotalVotes>3</TotalVotes>
</UserRatingProduct>
<EditorsPick/>
<ListingType>STANDARD</ListingType>
</SoftwareProduct>
</SoftwareProducts>
</CNETResponse>

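How can I fix my loop so that the data for the first product returned goes into the 8 columns, and each subsequent product then starts on a new row with its own data?

Thanks!


With Birei's help I was able to get the data, and I worked out how to start a new row after every 8 items returned, using this code: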

strip1 = []
for y in data:
    strip1.extend(y.contents)
    print strip1
for x in xrange(0,len(strip1),8):
    f.writerow(strip1[x:x+8])

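The only problem I have left is that find_all for 'Rating' sometimes yields 2 ratings and sometimes only 1. Because only 7 items are sometimes returned for a product, starting a new row after every 8 items breaks the alignment. If only 1 rating is returned, how can I print "None" for the second 'Rating'?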
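One possible way to keep the rows aligned is to emit exactly one value per matched tag, substituting a placeholder when a tag is empty. The sketch below is written for Python 3 and uses plain lists as a stand-in for each tag's contents, on the assumption that an empty element such as <Rating outOf="5"/> has an empty contents list. The names flatten_with_placeholder and chunk are hypothetical helpers, not part of BeautifulSoup or csv.

```python
# Stand-in for [tag.contents for tag in data]: one list per matched tag.
# Assumption: an empty element such as <Rating outOf="5"/> has contents == [].
contents_per_tag = [
    ["Firegraphic"],
    ["http://www.download.com/firegraphic/3000-2192_4-10367545.html?tag=api"],
    ["Firegraphic"],
    ["http://www.firegraphic.com"],
    ["$49.95"],
    ["2546868"],
    ["2.0"],
    [],  # the empty rating under UserRatingProduct
]

FIELDS_PER_PRODUCT = 8


def flatten_with_placeholder(contents_lists, placeholder="None"):
    """Return one value per tag, using the placeholder for empty tags."""
    return [c[0] if c else placeholder for c in contents_lists]


def chunk(values, size):
    """Split a flat list into consecutive rows of the given size."""
    return [values[i:i + size] for i in range(0, len(values), size)]


values = flatten_with_placeholder(contents_per_tag)
rows = chunk(values, FIELDS_PER_PRODUCT)
```

Because every tag now contributes exactly one value, the every-8-items slicing stays in step with the products even when a rating is missing.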

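1 Answer:

Answer 0: (score: 1)

Use writerow() for the data, just as you do for the header. You don't need to convert anything, because the contents attribute returns a list: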

for x in data:
    strip1 = x.contents
    f.writerow(strip1)

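EDIT: If the solution above doesn't work because contents returns a single element each time, try collecting the values in a list and writing it once at the end: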

strip1 = []
for x in data:
    strip1.extend(x.contents)
f.writerow(strip1)
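The difference between the two variants can be seen without any scraping. The following is a small Python 3 sketch (the original script is Python 2) that writes the same three sample values both ways into in-memory buffers; the variable names are illustrative only.

```python
import csv
import io

values = ["Firegraphic", "http://www.firegraphic.com", "$49.95"]

# One writerow() call per value: each value becomes its own one-cell row.
per_item = io.StringIO()
writer = csv.writer(per_item, lineterminator="\n")
for value in values:
    writer.writerow([value])

# Collect first, then call writerow() once: one row with one cell per value.
collected = io.StringIO()
csv.writer(collected, lineterminator="\n").writerow(values)

per_item_text = per_item.getvalue()
collected_text = collected.getvalue()
```

Here per_item_text holds three lines with one value each, while collected_text holds a single comma-separated line. With several products, collecting everything into one list would put all of them on that single line, so the list still has to be split per product.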

New EDIT: After looking at the XML file, my approach would be to loop over each <SoftwareProduct> element and extract the fields you want from it, for example:

for product in soup.find_all('SoftwareProduct'):
    strip1 = []
    strip1.extend(product.Name.contents)
    strip1.extend(product.Price.contents)
    ...
    f.writerow(strip1)
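A fuller version of the per-product idea might look like the sketch below. It is not the code from the answer: it is written for Python 3 and swaps BeautifulSoup for the standard library's xml.etree.ElementTree so that it runs without extra packages. SAMPLE_XML is a trimmed-down copy of the response posted in the question, and FIELD_PATHS and product_rows are hypothetical names chosen for illustration. Empty or missing fields are written as "None" so that every row has 8 columns.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Default namespace declared on <CNETResponse> in the posted response.
NS = {"c": "http://developer.api.cnet.com/rest/v1.0/ns"}

# Trimmed sample based on the XML in the question (most fields omitted).
SAMPLE_XML = """<CNETResponse xmlns="http://developer.api.cnet.com/rest/v1.0/ns">
<SoftwareProducts>
<SoftwareProduct>
<Name>Firegraphic</Name>
<LinkURL>http://www.download.com/firegraphic/3000-2192_4-10367545.html?tag=api</LinkURL>
<Publisher><Name>Firegraphic</Name><LinkURL>http://www.firegraphic.com</LinkURL></Publisher>
<Price currency="USD">$49.95</Price>
<TotalDownloads>2546868</TotalDownloads>
<UserRatingSummary><Rating outOf="5">2.0</Rating></UserRatingSummary>
<UserRatingProduct><Rating outOf="5"/></UserRatingProduct>
</SoftwareProduct>
<SoftwareProduct>
<Name>MP3 CD Maker</Name>
<LinkURL>http://www.download.com/mp3-cd-maker/3000-2140_4-10065486.html?tag=api</LinkURL>
<Publisher><Name>ZY Computing</Name><LinkURL>http://www.dvdsanta.com</LinkURL></Publisher>
<Price currency="USD">$24.95</Price>
<TotalDownloads>1653394</TotalDownloads>
<UserRatingSummary><Rating outOf="5">2.0</Rating></UserRatingSummary>
<UserRatingProduct><Rating outOf="5">2.0</Rating></UserRatingProduct>
</SoftwareProduct>
</SoftwareProducts>
</CNETResponse>"""

HEADER = ["Name", "Link", "Mfg", "Mfg Link", "Price", "Downloads",
          "User Rating Summary", "User Rating Product"]

# One path per header column, relative to each <SoftwareProduct>.
FIELD_PATHS = [
    "c:Name",
    "c:LinkURL",
    "c:Publisher/c:Name",
    "c:Publisher/c:LinkURL",
    "c:Price",
    "c:TotalDownloads",
    "c:UserRatingSummary/c:Rating",
    "c:UserRatingProduct/c:Rating",
]


def product_rows(xml_text, placeholder="None"):
    """Build one 8-column row per product, filling gaps with the placeholder."""
    root = ET.fromstring(xml_text)
    rows = []
    for product in root.findall(".//c:SoftwareProduct", NS):
        row = []
        for path in FIELD_PATHS:
            # findtext gives "" for an empty element and the default if absent.
            text = product.findtext(path, default="", namespaces=NS)
            row.append(text.strip() or placeholder)
        rows.append(row)
    return rows


rows = product_rows(SAMPLE_XML)

# Write to an in-memory buffer; a real script would open the output file.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(HEADER)
writer.writerows(rows)
csv_text = out.getvalue()
```

Looking fields up by path inside each product, rather than flattening every match in the document, means a missing or empty rating only affects its own cell and never shifts the values of the following product.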