I'm building a web scraper for internal use on a site I manage, and I'm running into a problem with the output of xpath lists that happen to contain commas in the output strings. I know I need to handle the commas inside the strings in the list differently from the commas that separate the values into columns.
# -*- coding: utf-8 -*-
import requests
from lxml import html
import urlparse
import collections
import csv
import time
# Settings
statingurl = 'http://www.somdomain.com'
domain = 'somedomain'
# filename
timestr = time.strftime("%m-%d-%Y-%H-%M-%S")
f = open('scrape-output\\'+domain+'-metadata-'+timestr+'.csv', 'a+')
# Create URL Queue, Set Start, Crawl
urls_queue = collections.deque()
urls_queue.append(statingurl)
found_urls = set()
found_urls.add(statingurl)
# Set Column Headers for the file
colheader = "URL Crawled, Title Tag, Meta Description, H1, H2, H3, H4, H5, H6, Image Source, Image Alt"
f.write(colheader)
f.write("\n")
while len(urls_queue):
    url = urls_queue.popleft()
    page_url = url
    print "\n"
    print "************************************************************"
    print "\n"
    # Use Requests to get Metadata
    if url.startswith(statingurl):
        print "Connecting to %s" % (url,)
        page = requests.get(url)
        tree = html.fromstring(page.content)
        print "\n"
        # Extract Metadata elements from the html tree
        title = tree.xpath('//title/text()')
        description = tree.xpath("//head/meta[@name='description']/@content")
        h1 = tree.xpath('//h1/text()')
        h2 = tree.xpath('//h2/text()')
        h3 = tree.xpath('//h3/text()')
        h4 = tree.xpath('//h4/text()')
        h5 = tree.xpath('//h5/text()')
        h6 = tree.xpath('//h6/text()')
        imgsrc = tree.xpath('//img/@src')
        imgalt = tree.xpath('//img/@alt')
        # Output Metadata
        print 'Found %s Title' % len(title)
        print title, "\n"
        print 'Found %s Description' % len(description)
        print description, "\n"
        print 'Found %s H1' % len(h1)
        print h1
        print 'Found %s H2' % len(h2)
        print h2
        print 'Found %s H3' % len(h3)
        print h3
        print 'Found %s H4' % len(h4)
        print h4
        print 'Found %s H5' % len(h5)
        print h5
        print 'Found %s H6' % len(h6)
        print h6
        print '\n'
        print 'Found %s Image Paths' % len(imgsrc)
        print 'Images Src:'
        print imgsrc
        print "\n"
        print 'Found %s Image Alt Tags' % len(imgalt)
        print 'Image Alt:'
        print imgalt
        print "\n"
        # Find links on page; add new URLs to the queue
        print "Looking for links"
        links = {urlparse.urljoin(page.url, href) for href in tree.xpath('//a/@href') if urlparse.urljoin(page.url, href).startswith('http')}
        # Set difference to find new URLs
        print "Set difference to find new URLs"
        for link in (links - found_urls):
            found_urls.add(link)
            urls_queue.append(link)
        print '\n %s URLs in Queue' % len(urls_queue)
        # Write Output to file and repeat loop
        output = "%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s" % (page_url.encode('utf-8'), title, description, h1, h2, h3, h4, h5, h6, imgsrc, imgalt)
        f.write(output)
        f.write('\n')
If someone can help me understand how to make sure the commas inside the description values are written to the csv file as part of the string, commas and all, I would greatly appreciate it. There is more work to do further along in this process, but this is my immediate problem.
Thanks.
Answer 0 (score: 0)
Use this in your code:
import csv
headers = ["URL Crawled", "Title Tag", "Meta Description", "H1",
"H2", "H3", "H4", "H5", "H6", "Image Source", "Image Alt"]
f = open('file.csv', 'ab')
writer = csv.writer(f)
writer.writerow(headers)
writer.writerow(["some", "data, is here"])
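To see what csv.writer actually does with a comma-bearing field, here is a minimal sketch (Python 3 syntax; the URL and description are made-up sample data, not from the question's site):

```python
import csv
import io

# Write two rows into an in-memory buffer so the raw csv text
# can be inspected directly.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["URL Crawled", "Meta Description"])
writer.writerow(["http://www.somdomain.com", "Great products, great prices"])

# The field containing a comma comes back wrapped in double quotes,
# so it stays in one column when the file is read back.
print(buf.getvalue())
```

No manual escaping is needed: the writer quotes only the fields that contain the delimiter.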
Also, for scraping the web it is better to use the unicodecsv module so that unicode content is handled correctly. You have to install it with pip: pip install unicodecsv. unicodecsv has the same interface as the csv module, so once it is installed you only need to replace import csv with import unicodecsv, and everything should work better.
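On Python 3 the standard library csv module already handles unicode natively, so a guarded import (a sketch of my own, not part of the answer above) keeps the script working whether or not unicodecsv is installed:

```python
# Prefer the third-party unicodecsv module when it is available
# (mainly useful on Python 2); otherwise fall back to the standard
# library csv, which handles unicode natively on Python 3.
try:
    import unicodecsv as csv
except ImportError:
    import csv

# Either module exposes the same writer/reader interface,
# so the rest of the script does not need to change.
```

The rest of the code then calls csv.writer() as usual, regardless of which module was imported.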
Answer 1 (score: 0)
Consider using the join() method to concatenate the xpath strings with double quotes, so that commas inside the strings are escaped:
fields = [page_url.encode('utf-8'), title, description,
          h1, h2, h3, h4, h5, h6, imgsrc, imgalt]
# wrap the whole row in quotes as well, otherwise the first and last
# fields are left unquoted; non-string fields must be converted first
output = '"' + '","'.join(str(field) for field in fields) + '"'
f.write(output)
f.write('\n')
Or, as others have suggested, use the csv module, together with a with() context manager:
colheader = ['URL Crawled', 'Title Tag', 'Meta Description',
             'H1', 'H2', 'H3', 'H4', 'H5', 'H6', 'Image Source', 'Image Alt']
# binary mode for the csv module on Python 2
with open('scrape-output\\'+domain+'-metadata-'+timestr+'.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(colheader)
    while len(urls_queue):
        ...
        ...rest of loop code...
        ...
        writer.writerow([page_url.encode('utf-8'), title, description,
                         h1, h2, h3, h4, h5, h6, imgsrc, imgalt])
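One remaining wrinkle either way: title, h1, imgsrc, and the rest are lists returned by xpath(), so passing them to writerow() directly stores Python list reprs in the cells. A sketch of one fix (flatten() and the ' | ' separator are my own choices, not from the answers above) joins each list into a single cell value first:

```python
import csv
import io

def flatten(values, sep=' | '):
    # Collapse a list of xpath results into a single cell value;
    # csv.writer will still quote the cell if it contains a comma.
    return sep.join(v.strip() for v in values)

# made-up xpath-style results for illustration
title = ['Home, Sweet Home']
h1 = ['Welcome', 'Latest News']

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([flatten(title), flatten(h1)])
print(buf.getvalue())
```

The comma-bearing title cell is quoted automatically, and multi-element results collapse to one readable column instead of a list repr.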