Writing xpath objects containing commas to a CSV in Python

Date: 2016-01-24 15:29:25

Tags: python csv xpath

While building a web scraper for internal use on a site I manage, I have run into a problem outputting lists of xpath results when the output strings happen to contain commas. I know I need to handle the commas inside the strings in those lists differently from the commas that delimit the values into columns.

# -*- coding: utf-8 -*-
import requests
from lxml import html
import urlparse
import collections
import csv
import time

# Settings
statingurl = 'http://www.somdomain.com'
domain = 'somedomain'

# filename
timestr = time.strftime("%m-%d-%Y-%H-%M-%S")
f = open('scrape-output\\'+domain+'-metadata-'+timestr+'.csv', 'a+')

# Create URL Queue, Set Start, Crawl
urls_queue = collections.deque()
urls_queue.append(statingurl)
found_urls = set()
found_urls.add(statingurl)

# Set Column Headers for the file
colheader = "URL Crawled, Title Tag, Meta Description, H1, H2, H3, H4, H5, H6, Image Source, Image Alt"
f.write(colheader)
f.write("\n")

while len(urls_queue):
    url = urls_queue.popleft()
    page_url = url
    print "\n"
    print "************************************************************"
    print "\n"

    # Use Requests to get Metadata
    if url.startswith(statingurl):
        print "Connecting to %s" % (url,)
        page = requests.get(url)
        tree = html.fromstring(page.content)
        print "\n"

    # Extract Metadata elements from the html tree
    title = tree.xpath('//title/text()')
    description = tree.xpath("//head/meta[@name='description']/@content")
    h1 = tree.xpath('//h1/text()')
    h2 = tree.xpath('//h2/text()')
    h3 = tree.xpath('//h3/text()')
    h4 = tree.xpath('//h4/text()')
    h5 = tree.xpath('//h5/text()')
    h6 = tree.xpath('//h6/text()')
    imgsrc = tree.xpath('//img/@src')
    imgalt = tree.xpath('//img/@alt')

    # Output Metadata
    print 'Found %s Title' % len(title) 
    print title,"\n"
    print 'Found %s Description' % len(description)
    print description,"\n"  
    print 'Found %s H1' % len(h1)   
    print h1
    print 'Found %s H2' % len(h2)   
    print h2
    print 'Found %s H3' % len(h3)   
    print h3
    print 'Found %s H4' % len(h4)   
    print h4
    print 'Found %s H5' % len(h5)   
    print h5
    print 'Found %s H6' % len(h6)   
    print h6    
    print '\n'
    print 'Found %s Image Paths' % len(imgsrc)
    print 'Images Src:'
    print imgsrc 
    print "\n"
    print 'Found %s Image Alt Tags' % len(imgalt)
    print 'Image Alt:'
    print imgalt
    print "\n"

    # Finds links on page; Add URL to Queue
    print "Looking for links"
    links = {urlparse.urljoin(page.url, url) for url in tree.xpath('//a/@href') if urlparse.urljoin(page.url, url).startswith('http')}

    print "Set difference to find new URLs"
    # Set difference to find new URLs
    for link in (links - found_urls):
        found_urls.add(link)
        urls_queue.append(link) 
    print '\n %s URLs in Queue' % len(urls_queue)

    # Write Output to file and repeat loop
    output = "%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s" % (page_url.encode('utf-8'), title, description, h1, h2, h3, h4, h5, h6, imgsrc, imgalt)
    f.write(output)
    f.write('\n')

If someone could help me understand how to make sure the commas inside the description values (and the other fields) end up in the csv file as part of the string, rather than being treated as column delimiters, I would really appreciate it. There is more work to do on this, but that is my immediate problem.

Thanks.

2 answers:

Answer 0 (score: 0)

Use this in your code:

import csv
headers = ["URL Crawled", "Title Tag", "Meta Description", "H1",
           "H2", "H3", "H4", "H5", "H6", "Image Source", "Image Alt"]
f = open('file.csv', 'ab')  # binary append mode, as the Python 2 csv module expects
writer = csv.writer(f)
writer.writerow(headers)
writer.writerow(["some", "data, is here"])  # the embedded comma is quoted automatically

Also, for web scraping it is better to use the unicodecsv module to handle unicode content. You have to install it with pip: pip install unicodecsv.

unicodecsv has the same functionality as the csv module.

After installing the unicodecsv module, you only need to replace

import csv

with

import unicodecsv

and everything should work better.
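
A minimal sketch of the drop-in replacement, assuming unicodecsv is installed (it mirrors the csv API and accepts an encoding argument):

# -*- coding: utf-8 -*-
import unicodecsv  # pip install unicodecsv

with open('example.csv', 'wb') as f:  # binary mode, as with csv on Python 2
    writer = unicodecsv.writer(f, encoding='utf-8')
    # unicode fields containing commas are encoded and quoted correctly
    writer.writerow([u'café, bistro', u'description with, commas'])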

Answer 1 (score: 0)

Consider using the join() method to stitch the xpath strings together with double quotes, so that the commas inside the strings are escaped:

output = '","'.join([page_url.encode('utf-8'), title, description, 
                     h1, h2, h3, h4, h5, h6, imgsrc, imgalt])
f.write(output)
f.write('\n')
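
With the outer quotes added, each row is written as a fully quoted line, so commas inside a field stay inside its cell. Note that str(field) writes each list's Python repr (e.g. "['My Title']"); the csv-module version below handles the quoting for you instead.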

Or use the csv module as others have suggested. Note the extra indentation required by the with() block:
colheader = ['URL Crawled', 'Title Tag', 'Meta Description', 
             'H1', 'H2', 'H3', 'H4', 'H5', 'H6', 'Image Source', 'Image Alt']

with open('scrape-output\\'+domain+'-metadata-'+timestr+'.csv', 'wb') as f:  # 'wb' for the Python 2 csv module
    writer = csv.writer(f)
    writer.writerow(colheader)

    while len(urls_queue):
       ...
       ...rest of loop code...
       ...
       writer.writerow([page_url.encode('utf-8'), title, description, 
                        h1, h2, h3, h4, h5, h6, imgsrc, imgalt])
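
One detail neither approach addresses: every tree.xpath(...) call returns a list, so the cells above will contain Python list reprs like ['My Title']. A minimal sketch of a hypothetical flatten() helper (an assumption about the output you want, not part of the original answer) that collapses each list into one readable cell before writing:

def flatten(values, sep=' | '):
    # join a list of xpath results into a single csv cell,
    # stripping stray whitespace from each entry
    return sep.join(v.strip() for v in values)

# inside the crawl loop, in place of the writerow call above:
writer.writerow([page_url, flatten(title), flatten(description),
                 flatten(h1), flatten(h2), flatten(h3), flatten(h4),
                 flatten(h5), flatten(h6), flatten(imgsrc), flatten(imgalt)])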