I have written a parser in Python and it is doing its job perfectly, except for some duplicates. Also, when I open the csv file I can see that every result is surrounded by square brackets. Is there any workaround to remove the duplicates and the brackets on the fly? Here is what I tried:
import csv
import requests
from lxml import html

def parsingdata(mpg):
    data = set()
    outfile = open('RealYP.csv', 'w', newline='')
    writer = csv.writer(outfile)
    writer.writerow(["Name", "Address", "Phone"])
    pg = 1
    while pg <= mpg:
        url = "https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page=" + str(pg)
        page = requests.get(url)
        tree = html.fromstring(page.text)
        titles = tree.xpath('//div[@class="info"]')
        items = []
        for title in titles:
            comb = []
            Name = title.xpath('.//span[@itemprop="name"]/text()')
            Address = title.xpath('.//span[@itemprop="streetAddress" and @class="street-address"]/text()')
            Phone = title.xpath('.//div[@itemprop="telephone" and @class="phones phone primary"]/text()')
            try:
                comb.append(Name[0])
                comb.append(Address[0])
                comb.append(Phone[0])
            except:
                continue
            items.append(comb)
        pg += 1
        for item in items:
            writer.writerow(item)

parsingdata(3)
It is working fine now. Edit: the cleanup part was taken from bjpreisler.
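One way to drop duplicates on the fly, sketched independently of the scraper above (the file name and sample rows here are illustrative), is to keep a set of row tuples already written and skip any row seen before:

```python
import csv

# Hypothetical rows as a parser might collect them; one duplicate included.
rows = [
    ("Coffee Corp", "123 Main St", "555-0100"),
    ("Latte Land", "456 Oak Ave", "555-0101"),
    ("Coffee Corp", "123 Main St", "555-0100"),  # duplicate
]

with open('RealYP.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["Name", "Address", "Phone"])
    seen = set()
    for row in rows:
        if row in seen:
            continue  # skip rows already written
        seen.add(row)
        writer.writerow(row)
```

Tuples are hashable, so they can go straight into a set; lists cannot.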
Answer 0 (score: 3)
This script removes duplicates when I am working with .csv files. Check whether this works for you :)
with open(file_out, 'w') as f_out, open(file_in, 'r') as f_in:
    # write rows from in-file to out-file until all the data is written
    checkDups = set()  # set for removing duplicates
    for line in f_in:
        if line in checkDups: continue  # skip duplicate
        checkDups.add(line)
        f_out.write(line)
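For completeness, `file_in` and `file_out` are just the paths of the raw and the cleaned CSV; a minimal self-contained run (the paths and sample lines here are made up) looks like:

```python
# Illustrative paths; substitute your own files.
file_in = 'RealYP.csv'
file_out = 'RealYP_clean.csv'

# Create a small input file with one duplicate line to demonstrate.
with open(file_in, 'w', newline='') as f:
    f.write("Name,Address,Phone\n")
    f.write("Coffee Corp,123 Main St,555-0100\n")
    f.write("Coffee Corp,123 Main St,555-0100\n")

with open(file_out, 'w') as f_out, open(file_in, 'r') as f_in:
    checkDups = set()  # lines already written
    for line in f_in:
        if line in checkDups:
            continue  # skip exact duplicate lines
        checkDups.add(line)
        f_out.write(line)
```

Note this compares whole lines as strings, so it only removes exact duplicates.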
Answer 1 (score: 1)
You are currently writing a list (items) to the csv, which is why it appears in brackets. To avoid this, use another for loop that could look like this:
for title in titles:
    comb = []
    Name = title.xpath('.//span[@itemprop="name"]/text()')
    Address = title.xpath('.//span[@itemprop="streetAddress" and @class="street-address"]/text()')
    Phone = title.xpath('.//div[@itemprop="telephone" and @class="phones phone primary"]/text()')
    if Name:
        Name = Name[0]
    if Address:
        Address = Address[0]
    if Phone:
        Phone = Phone[0]
    comb.append(Name)
    comb.append(Address)
    comb.append(Phone)
    print(comb)
    items.append(comb)
pg += 1
for item in items:
    writer.writerow(item)

parsingdata(3)
This should write each item to your csv separately. It turns out the items you were appending to comb were lists themselves, so this extracts them.
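The bracket behavior can be reproduced in isolation: when a field passed to `writerow` is itself a list, `csv` stringifies it, brackets and all. A small demonstration (writing to an in-memory buffer instead of a file):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)

# Fields that are lists get stringified, which is where the
# square brackets in the output come from.
writer.writerow([["Coffee Corp"], ["123 Main St"]])

# Plain string fields produce a normal CSV row.
writer.writerow(["Coffee Corp", "123 Main St"])

print(buf.getvalue())
```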
Answer 2 (score: 1)
A concise version of this scraper that I came up with recently is:
import csv
import requests
from lxml import html
url = "https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page={0}"

def parsingdata(link):
    outfile = open('YellowPage.csv', 'w', newline='')
    writer = csv.writer(outfile)
    writer.writerow(["Name", "Address", "Phone"])
    for page_link in [link.format(i) for i in range(1, 4)]:
        page = requests.get(page_link).text
        tree = html.fromstring(page)
        for title in tree.xpath('//div[@class="info"]'):
            Name = title.findtext('.//span[@itemprop="name"]')
            Address = title.findtext('.//span[@itemprop="streetAddress"]')
            Phone = title.findtext('.//div[@itemprop="telephone"]')
            print([Name, Address, Phone])
            writer.writerow([Name, Address, Phone])

parsingdata(url)
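The reason this version needs no `[0]` indexing is that `findtext` returns the text of the first matching element directly (or `None` if nothing matches). This can be shown offline on a made-up fragment; lxml's `findtext` follows the same ElementPath rules as the stdlib `xml.etree.ElementTree` used here:

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for one YellowPages result block.
snippet = (
    '<div class="info">'
    '<span itemprop="name">Coffee Corp</span>'
    '<span itemprop="streetAddress" class="street-address">123 Main St</span>'
    '<div itemprop="telephone" class="phones phone primary">555-0100</div>'
    '</div>'
)

title = ET.fromstring(snippet)
# findtext returns the element's text directly, not a list.
Name = title.findtext('.//span[@itemprop="name"]')
Address = title.findtext('.//span[@itemprop="streetAddress"]')
Phone = title.findtext('.//div[@itemprop="telephone"]')
print([Name, Address, Phone])
```

A missing field simply comes back as `None` instead of raising an IndexError, which is why the try/except from the original version is no longer needed.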