Getting an inconsistent number of columns when scraping with lxml

Date: 2011-04-26 11:14:45

Tags: python lxml

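I'm learning Python and trying to build a scraper to collect parts data from a vendor's website. My problem is that I get a different number of columns from the parsed table rows, even though I know every row has the same number of columns. It must be something I'm overlooking, and after two days of trying different things I'm asking for more eyes on my code to spot my mistake. Not having much Python experience is undoubtedly my biggest obstacle.

First, the data. Rather than pasting the HTML stored in my database, I'll give you a link to the live site that I have already crawled and stored. The first link is this one.

The problem is that the results I get are mostly correct. Every so often, however, the values end up offset by a column, and I can't seem to find out why.

Here is an example of a flawed result: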

----------------------------------------------------------------------------------
Record: 1 Section:Passenger  /  Light Truck Make: ACURA SubMake: 
Model: CL SubModel:  Year: 1997 Engine: L4 1.6L 1590cc
----------------------------------------------------------------------------------
Rec:1 Row 6 Col 1 part Air Filter
Rec:1 Row 6 Col 2 2 
Rec:1 Row 6 Col 3 part_no 46395
Rec:1 Row 6 Col 4 filter_loc 
Rec:1 Row 6 Col 5 engine 
Rec:1 Row 6 Col 6 vin_code V6 3.0L 2997cc
Rec:1 Row 6 Col 7 comment Engine Code J30A1

**Note that the engine value has shifted into the vin_code field.

And to prove that it works some of the time:


Record: 2 Section:Passenger  /  Light Truck Make: ACURA SubMake: 
Model: CL SubModel:  Year: 1998 Engine: L4 1.6L 1590cc
----------------------------------------------------------------------------------
Rec:3 Row 4 Col 1 part Oil Filter
Rec:3 Row 4 Col 2 2 
Rec:3 Row 4 Col 3 part_no 51334
Rec:3 Row 4 Col 4 filter_loc 
Rec:3 Row 4 Col 5 engine L4 2.3L 2254cc
Rec:3 Row 4 Col 6 vin_code
Rec:3 Row 4 Col 7 comment Engine Code F23A1

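**Note how the fields line up correctly in this record...

I suspect that either there is something in the table cells that my parser isn't looking for, or I'm missing something trivial.

Here is the important part of my code: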

# Per Query
while records:
    # Per Query Loop
    #print str(records)
    for record in records:
        print 'Record Count:'+str(rec_cnt)
        items = ()
        item = {}
        source = record['doc']
        page = html.fromstring(source)

        for rows in page.xpath('//div/table'):
            #records = []
            item = {}
            cntx = 0
            for row in list(rows):
                cnty = 0 # Column Counter
                found_oil = 0 # Found oil filter record flag
                data = {} # Data
                # Data fields
                field_data = {'part':'',   'part_no':'', 'filter_loc':'',  'engine':'',  'vin_code':'',  'comment':'', 'year':''}
                print
                print '----------------------------------------------------------------------------------'
                print 'Record: '+str(record['id']), 'Section:'+str(record['section']),  'Make: '+str(record['make']),   'SubMake: '+str(record['submake'])
                print  'Model: '+str(record['model']),  'SubModel: '+str(record['submodel']),  'Year: '+str(record['year']),  'Engine: '+str(record['engine'])
                print '----------------------------------------------------------------------------------'

                #
                # Rules for extracting data columns
                # 1. First column always has a link to the bullet image
                # 2. Second column is part name
                # 3. Third column always empty
                # 4. Fourth column is  part number
                # 5. Fifth column is empty
                # 6. Sixth column is part location
                # 7. Seventh column is always empty
                # 8. Eighth column is engine size
                # 9. Ninth column is vin code
                # 10. Tenth column is comment
                # 11. Eleventh column does not exist.
                #
                for column in row.xpath('./td[@class="blackmedium"][text()="0xa0"] | ./td[@class="blackmedium"][text()="\n"]/text() | ./td[@class="blackmeduim"]/img[@src]/text()  | ./td[@class="blackmedium"][text()=""]/text() | ./td[@class="blackmedium"]/b/text() | ./td[@class="blackmedium"]/a/text() |./td[@class="blackmedium"]/text() | ./td[@class="blackmedium"][text()=" "]/text() | ./td[@class="blackmedium"][text()="&#160"]/text() | ./td[@class="blackmedium"][text()=None]/text()'): 
                    #' | ./td[position()>1]/a/text() | ./td[position()>1]/text() | self::node()[position()=1]/td/text()'):
                    cnty+=1
                    if ('Oil Filter' == column.strip() or 'Air Filter' == column.strip()) and found_oil == 0:
                        found_oil = 1

                    if found_oil == 1:
                        print 'Rec:'+str(rec_cnt), 'Row '+str(cntx),  'Col '+str(cnty),  _fields[cnty],  column.strip()
                        #cnty+= 1
                        #print
                    else:
                        print 'Rec: '+str(rec_cnt),  'Col: '+str(cnty)

                    field_data[ str(_fields[cnty]) ] = str(column.strip())
                    #cnty = cnty+1

                # Save data to db dest table
                if found_oil == 1:
                    data['source_id'] = record['id']
                    data['section_id'] = record['section_id']
                    data['section'] = record['section']
                    data['make_id'] = record['make_id']
                    data['make'] = record['make']
                    data['submake_id'] = record['submake_id']
                    data['submake'] = record['submake']
                    data['model_id'] = record['model_id']
                    data['model'] = record['model']
                    data['submodel_id'] = record['submodel_id']
                    data['submodel'] = record['submodel']
                    data['year_id'] = record['year_id']
                    data['year'] = record['year']
                    data['engine_id'] = record['engine_id']
                    data['engine'] = record['engine']
                    data['part'] = field_data['part']
                    data['part_no'] = field_data['part_no']
                    data['filter_loc'] = field_data['filter_loc']
                    data['vin_code'] = field_data['vin_code']
                    data['comment'] = conn.escape_string(field_data['comment'])

                    data['url'] = record['url']
                    save_data(data)
                    print 'Field Data:'
                    print field_data

                cntx+=1
            rec_cnt+=1
    #End main per query loop 
    delay() # delay if wait was passed on cmd line
    records = get_data()
    has_offset = 1
    #End Queries

Thanks, everyone, for your help and your eyes...

4 Answers:

Answer 0: (score: 0)

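When I run into a problem like this, I usually do two things:

  1. Break the problem into smaller chunks. Use Python functions or classes to perform subsets of the functionality, so that you can test each one for correctness on its own.
  2. Use the Python Debugger (pdb) to inspect the code at runtime and see where it fails. In this case, for example, I would add import pdb; pdb.set_trace() just before the line that says cnty+=1.
     Then, when the code runs, you'll get an interactive interpreter where you can examine the variables and discover why you aren't getting the results you expect.

    A few tips for using pdb:

    Use c to let the program continue (until the next breakpoint or set_trace); use n to step to the next line of the program; use q to quit, which raises an exception and normally aborts the program.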
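
Both suggestions can be combined in a small way. The following is only a sketch, written for Python 3 with lxml: the helper name cell_texts and the sample markup are made up for illustration and are not taken from the vendor's pages. It isolates the "read one row" step in a function that can be checked on its own, and shows where a pdb breakpoint could go.

```python
from lxml import html


def cell_texts(row):
    """Return the stripped text of every <td> in a row, empty cells included."""
    # To inspect a misbehaving row interactively, uncomment the next line:
    # import pdb; pdb.set_trace()
    return [td.text_content().strip() for td in row.findall('td')]


# A tiny hand-written fixture (hypothetical, not the vendor's real markup).
SAMPLE = '<table><tr><td>Oil Filter</td><td></td><td><a>51334</a></td></tr></table>'

row = html.fromstring(SAMPLE).find('.//tr')
print(cell_texts(row))
```

Because the function takes a row and returns a plain list, it can be exercised against a handful of saved rows before it is wired back into the main loop.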

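Answer 1: (score: 0)

Can you share the details of the scraping process? The intermittent failures may stem from how the HTML data is parsed.

Answer 2: (score: 0)

The problem seems to be that your XPath expression searches for text nodes. An empty cell produces no match, so your code "skips" columns. Try iterating over the td elements themselves and then "looking down" from each element into its contents. To get you started: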

# just iterate over child elements of the row, which are always td
# use enumerate to easily get a counter for the columns
for col_no, td in enumerate(row, start=1):
    # use the xpath function string() to get the string value for the element
    # this will yield an empty string for empty elements
    print col_no, td.xpath('string()')
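
To see the difference between the two approaches side by side, here is a minimal sketch in Python 3. The row markup is invented for the demonstration and only assumes that one cell is completely empty, as the engine column sometimes appears to be.

```python
from lxml import html

# Hypothetical row: the middle cell is empty, like a blank engine column.
SAMPLE = '<table><tr><td>Air Filter</td><td></td><td>46395</td></tr></table>'
row = html.fromstring(SAMPLE).find('.//tr')

# Selecting text nodes: the empty cell has no text node, so it yields nothing
# and every later value moves one position to the left.
by_text = row.xpath('./td/text()')

# Selecting elements: every cell is present, so positions stay stable.
by_cell = [td.xpath('string()') for td in row]

print(len(by_text), by_text)
print(len(by_cell), by_cell)
```

The text-node query returns two values for a three-cell row, which is the kind of shift shown in the flawed record in the question.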

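Note that the string() XPath function may be too simple for your needs in some cases. In your example you will find cells like <td><a>51334</a><sup>53</sup></td> (see the oil filter). My example would give you "5133453", whereas you seem to need "51334". I'm not sure whether that is intentional or whether you simply hadn't noticed the "missing" part; if you only want the hyperlink text, select the a child directly, for example with td.xpath('./a/text()').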
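
As a rough illustration of that caveat (Python 3, using the cell shape quoted above), string() can also be pointed at a single child. This is one possible way to separate the part number from the superscript note, not necessarily the one the answer had in mind.

```python
from lxml import html

# The cell shape quoted in the answer above, wrapped in a minimal table.
CELL = '<table><tr><td><a>51334</a><sup>53</sup></td></tr></table>'
td = html.fromstring(CELL).find('.//td')

everything = td.xpath('string()')      # concatenates all descendant text
link_only = td.xpath('string(./a)')    # string value of the first <a> child
note_only = td.xpath('string(./sup)')  # string value of the first <sup> child

print(everything, link_only, note_only)
```

Unlike ./a/text(), which returns a list, string(./a) returns a single string and gives an empty string when the cell has no a child.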

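Answer 3: (score: 0)

I want to thank everyone who has helped me over the past few days. All of your input resulted in a working application that I am now using. I'd like to post the changes I made to my code so that anyone who lands here can find an answer, or at least some information on how to approach the problem. Below is the rewritten part of my code, which fixes the issue I was having: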

#
# get_column_index()
# returns a dict of column names/column number pairs
#
def get_column_index(row): 
    index = {}
    col_no = 0
    td = None
    name = ''
    for col_no,  td in enumerate(row,  start=0):
        mystr = str(td.xpath('string()').encode('ascii',  'replace'))
        name =  str.lower(mystr).replace(' ', '_')
        idx = name.replace('.', '')
        index[idx] =  col_no

    if int(options.verbose) > 2:
        print 'Field Index:',  str(index)

    return index




def run():
    global has_offset
    records = get_data()

    #print 'Records',  records
    rec_cnt = 0

    # Per Query
    while records:
        # Per Query Loop
        #print str(records)
        for record in records:
            if int(options.verbose) > 0:
                print 'Record Count:'+str(rec_cnt)

            items = ()
            item = {}
            source = record['doc']
            page = html.fromstring(source)
            col_index = {}

            for rows in page.xpath('//div/table'):
                #records = []
                item = {}
                cntx = 0
                for row in list(rows):
                    data = {} # Data
                    found_oil = 0 #found proper part flag
                    # Data fields
                    field_data = {'part':'',   'part_no':'', 'part_note':'',  'filter_loc':'',  'engine':'',  'vin_code':'',  'comment':'', 'year':''}

                    if int(options.verbose) > 0:
                        print
                        print '----------------------------------------------------------------------------------'
                        print 'Row'+str(cntx), 'Record: '+str(record['id']), 'Section:'+str(record['section']),  'Make: '+str(record['make']),   'SubMake: '+str(record['submake'])
                        print  'Model: '+str(record['model']),  'SubModel: '+str(record['submodel']),  'Year: '+str(record['year']),  'Engine: '+str(record['engine'])
                        print '----------------------------------------------------------------------------------'

                    # get column indexes
                    if cntx == 1:
                        col_index = get_column_index(row)

                    if col_index != None and cntx > 1:
                        found_oil = 0

                        for col_no,  td in enumerate(row):

                            if ('part' in col_index) and (col_no == col_index['part']):
                                part = td.xpath('string()').strip()
                                if 'Oil Filter' == part or 'Air Filter' == part or 'Fuel Filter' == part or 'Transmission Filter' == part:
                                    found_oil = 1
                                    field_data['part'] = td.xpath('string()').strip()

                            # Part Number
                            if ('part_no' in col_index) and (col_no == col_index['part_no']):
                                field_data['part_no'] = str(td.xpath('./a/text()')).strip().replace('[', '').replace(']', '').replace("'", '')
                                field_data['part_note'] = str(td.xpath('./sup/text()')).strip().replace('[', '').replace(']', '').replace("'", '')

                            # Filter Location
                            if ('filterloc' in col_index) and (col_no == col_index['filterloc']):
                                field_data['filter_loc'] = td.xpath('string()').strip()

                            # Engine
                            if ('engine' in col_index) and (col_no == col_index['engine']):
                                field_data['engine'] = td.xpath('string()').strip()

                            if ('vin_code' in col_index) and (col_no == col_index['vin_code']):
                                field_data['vin_code'] = td.xpath('string()').strip()

                            if ('comment' in col_index) and (col_no == col_index['comment']):
                                field_data['comment'] = td.xpath('string()').strip()

                            if int(options.verbose) == 0:
                                print ',' 


                        if int(options.verbose) > 0:
                            print 'Field Data: ',  str(field_data)
                        elif int(options.verbose) == 0:
                            print '.'

                    # Save data to db dest table
                    if found_oil == 1:
                        data['source_id'] = record['id']
                        data['section_id'] = record['section_id']
                        data['section'] = record['section']
                        data['make_id'] = record['make_id']
                        data['make'] = record['make']
                        data['submake_id'] = record['submake_id']
                        data['submake'] = record['submake']
                        data['model_id'] = record['model_id']
                        data['model'] = record['model']
                        data['submodel_id'] = record['submodel_id']
                        data['submodel'] = record['submodel']
                        data['year_id'] = record['year_id']
                        data['year'] = record['year']
                        data['engine_id'] = record['engine_id']
                        data['engine'] = field_data['engine'] #record['engine']
                        data['part'] = field_data['part']
                        data['part_no'] = field_data['part_no']
                        data['part_note'] = field_data['part_note']
                        data['filter_loc'] = field_data['filter_loc']
                        data['vin_code'] = field_data['vin_code']
                        data['comment'] = conn.escape_string(field_data['comment'])

                        data['url'] = record['url']
                        save_data(data)
                        found_oil = 0

                        if int(options.verbose) > 2:
                            print 'Data:', str(data)

                    cntx+=1
                rec_cnt+=1
        #End main per query loop 
        delay() # delay if wait was passed on cmd line
        records = get_data()
        has_offset = 1
        #End Queries
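
For readers who only need the core idea, the header-driven lookup above can be condensed into a short sketch. It is written for Python 3, and the names normalize and rows_as_dicts, as well as the sample table, are hypothetical. It also assumes the header is the first row of the table, whereas the code above reads it from the second row (cntx == 1).

```python
from lxml import html

# Hypothetical table: a header row followed by one data row with an empty cell.
SAMPLE = """
<table>
  <tr><td>Part</td><td>Part No.</td><td>Engine</td><td>VIN Code</td></tr>
  <tr><td>Oil Filter</td><td>51334</td><td></td><td>A</td></tr>
</table>
"""


def normalize(label):
    """Turn a header label such as 'Part No.' into a key such as 'part_no'."""
    return label.strip().lower().replace(' ', '_').replace('.', '')


def rows_as_dicts(table):
    """Yield one dict per data row, keyed by the normalized header labels."""
    rows = table.findall('.//tr')
    header = [normalize(td.text_content()) for td in rows[0].findall('td')]
    for row in rows[1:]:
        # Iterating over the cells themselves keeps empty cells in position.
        values = [td.text_content().strip() for td in row.findall('td')]
        yield dict(zip(header, values))


table = html.fromstring(SAMPLE)
records = list(rows_as_dicts(table))
print(records)
```

Keying each value by its header label means the code no longer depends on fixed column positions, so an empty cell stays attached to its own field.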