Question

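I am trying to scrape all the different variations of this web page. For example, the code that scrapes this page, http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=11849, should be the same code I use to scrape this page: http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=11849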

import re
import requests
from bs4 import BeautifulSoup, NavigableString, Tag

def extract_contact(url):
    r=requests.get(url)
    soup=BeautifulSoup(r.content,'lxml')
    tbl=soup.findAll('table')[2]
    list=[]
    Contact=tbl.findAll('p')[0]

    for br in Contact.findAll('br'):
        next = br.nextSibling
        if not (next and isinstance(next,NavigableString)):
            continue
        next2 = next.nextSibling
        if next2 and isinstance(next2,Tag) and next2.name == 'br':  
            text = re.sub(r'[\n\r\t\xa0]','',next).replace('Phone:','').strip()
            list.append(text)
    print list      

    #Street=list.pop(0)
    #CityStateZip=list.pop(0)
    #Phone=list.pop(0)
    #City,StateZip= CityStateZip.split(',')
    #State,Zip= StateZip.split(' ') 
    #ContactName = Contact.findAll('b')[1]
    #ContactEmail = Contact.findAll('a')[1]
    #Body=tbl.findAll('p')[1]
    #Website = Contact.findAll('a')[2]
    #Email = ContactEmail.text.strip()
    #ContactName = ContactName.text.strip()
    #Website = Website.text.strip()
    #Body = Body.text
    #Body = re.sub(r'[\n\r\t\xa0]','',Body).strip()
    #list.extend([Street,City,State,Zip,ContactName,Phone,Email,Website,Body])
    return list

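I think the way to make this work is to set things up so that print list always returns the same number of values, in the same order. Currently, the script above returns these values: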

[u'2133 Craigs Store Road', u'Afton,VA 22920', u'434-882-3150']
[u'Alexandria,VA 22305']

To account for the missing values, so that I can parse this page consistently, I need the print list command to return something like:

[u'2133 Craigs Store Road', u'Afton,VA 22920', u'434-882-3150']
['',u'Alexandria,VA 22305','']

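That way I can work with the values by position, since they will always be in a consistent order. The problem is that I do not know how to accomplish this, as I am still new to parsing. If anyone has any insight into how to solve this, I would be very grateful.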
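To illustrate why a fixed-length, fixed-order list is convenient, here is a minimal sketch that consumes the two desired rows shown above. The field names in `FIELDS` and the `as_record` helper are hypothetical and only serve the illustration; they are not part of the original script.

```python
FIELDS = ('street', 'city_state_zip', 'phone')  # hypothetical field names


def as_record(row):
    """Map a fixed-order row onto named fields; '' marks a missing value."""
    if len(row) != len(FIELDS):
        raise ValueError('expected %d values, got %d' % (len(FIELDS), len(row)))
    return dict(zip(FIELDS, row))


rows = [
    ['2133 Craigs Store Road', 'Afton,VA 22920', '434-882-3150'],
    ['', 'Alexandria,VA 22305', ''],
]

for row in rows:
    record = as_record(row)
    # Every row has the same shape, so the same lookup works for all pages.
    print(record['city_state_zip'])
```

A row with fewer than three values, such as the current output for the second page, is rejected here rather than silently shifting the fields.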

Answer 1

import re
import requests
from bs4 import BeautifulSoup, NavigableString, Tag

def extract_contact(url):
    r=requests.get(url)
    soup=BeautifulSoup(r.content,'lxml')
    tbl=soup.findAll('table')[2]
    list=[]
    Contact=tbl.findAll('p')[0]

    for br in Contact.findAll('br'):
        next = br.nextSibling
        if not (next and isinstance(next,NavigableString)):
            continue
        next2 = next.nextSibling
        if next2 and isinstance(next2,Tag) and next2.name == 'br':  
            text = re.sub(r'[\n\r\t\xa0]','',next).replace('Phone:','').strip()
            list.append(text)
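    # Classify each value by its distinguishing character rather than by position.
    Street=[s for s in list if ',' not in s and '-' not in s]
    CityStateZip=[s for s in list if ',' in s]
    Phone=[s for s in list if '-' in s]
    Street=Street[0] if Street else ''
    Phone=Phone[0] if Phone else ''
    # Default the address parts so they are defined even when that line is missing.
    City=State=Zip=''
    if CityStateZip:
        City,StateZip=CityStateZip[0].split(',',1)
        # split() with no argument tolerates a space after the comma.
        State,Zip=StateZip.split()
    # Assumption: return the fields in a fixed order so callers can index by position.
    return [Street,City.strip(),State,Zip,Phone]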

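I found an alternative solution using substring checks and if statements. Since the list holds at most 3 values, each with a defining characteristic, I realized I could classify them by looking for special characters rather than relying on each record's position.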
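The same idea can be sketched as a standalone helper that needs no network access. This is only an illustration, assuming (as in the answer) that the city/state/zip line is the one containing a comma and the phone number is the one containing a hyphen. The name `to_fixed_slots` is hypothetical and does not appear in the original code.

```python
def to_fixed_slots(values):
    """Return [street, city_state_zip, phone], using '' for any missing field."""
    # Assumption: a street line contains neither a comma nor a hyphen.
    street = next((v for v in values if ',' not in v and '-' not in v), '')
    # Assumption: only the city/state/zip line contains a comma.
    city_state_zip = next((v for v in values if ',' in v), '')
    # Assumption: a hyphen marks the phone number.
    phone = next((v for v in values if '-' in v), '')
    return [street, city_state_zip, phone]


print(to_fixed_slots(['2133 Craigs Store Road', 'Afton,VA 22920', '434-882-3150']))
print(to_fixed_slots(['Alexandria,VA 22305']))
```

These rules are heuristics: a street address that contains a hyphen, or a ZIP+4 code, would be misclassified, so the checks may need tightening for other pages.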
