Matching rows in a table

Date: 2016-01-21 14:34:19

Tags: python pandas scrapy

I am scraping this Wikipedia page:

https://en.wikipedia.org/wiki/List_of_shopping_malls_in_the_South_Florida_metropolitan_area

and getting the data from the table like this:

Location = response.xpath('//*[@id="mw-content-text"]/table/tr/td[2]/a/text()').extract()

Name = response.xpath('//*[@id="mw-content-text"]/table/tr/td[1]/a/text()').extract()

Once I have them, the plan is to add these lists to a DataFrame. The problem I run into is:

len(Name)
40

len(Location)
47

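This happens because some rows of the Location column contain several elements, for example: Coconut Grove, Miami. For those rows I get two elements instead of one, so the two lists no longer line up row by row.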

3 Answers:

Answer 0: (score: 2)

You can use read_html. It returns a list of DataFrames, and df is the first item in that list:

df
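The call itself is not shown above, so here is a minimal sketch of how it might look. It assumes pandas and an HTML parser such as lxml are installed, and it parses a small hypothetical copy of the table from a string instead of fetching the live page; the `HTML` sample and its two rows are illustrative only.

```python
import io

import pandas as pd

# Hypothetical, cut-down copy of the page's table; the real page has more
# rows and columns.
HTML = """
<table class="wikitable">
  <tr><th>Name</th><th>Location</th></tr>
  <tr><td><a>Aventura Mall</a></td><td><a>Aventura</a></td></tr>
  <tr><td><a>CocoWalk</a></td><td><a>Coconut Grove</a>, <a>Miami</a></td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(io.StringIO(HTML))
df = tables[0]
print(df)
```

Because read_html works cell by cell, a location made of several links stays in a single cell, so the columns keep the same length. read_html also accepts a URL; on the live page you would need to check which index in the returned list holds the mall table.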

Answer 1: (score: 1)

You just need the right XPath:

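rows = response.xpath('//table[@class="wikitable"]//tr[not(./th)]')
for row in rows:
    # Join every text node in the cell so multi-part locations stay in one row
    name = ''.join(row.xpath('.//td[1]//text()').extract())
    location = ''.join(row.xpath('.//td[2]//text()').extract())
    print('{}  |  {}'.format(name, location))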

Aventura Mall  |  Aventura
Bal Harbour Shops  |  Bal Harbour
Bayside Marketplace  |  Downtown Miami
Boynton Beach Mall  |  Boynton Beach
CityPlace  |  West Palm Beach
CocoWalk  |  Coconut Grove, Miami
Coral Square  |  Coral Springs
Dadeland Mall  |  Kendall
Dolphin Mall  |  Sweetwater
Downtown at the Gardens  |  Palm Beach Gardens
The Falls  |  Kendall
Galeria International Mall  |  Downtown Miami
The Galleria at Fort Lauderdale  |  Fort Lauderdale
The Gardens Mall  |  Palm Beach Gardens
The Grand Doubletree Shops  |  Downtown Miami
Las Olas Riverfront  |  Fort Lauderdale
Las Olas Shops  |  Fort Lauderdale
Lincoln Road Mall  |  Miami Beach
Loehmann's Fashion Island  |  Aventura
Mall of the Americas  |  Miami
The Mall at 163rd Street  |  North Miami Beach
The Mall at Wellington Green  |  Wellington
Miami International Mall  |  Doral
Miracle Marketplace  |  Miami
Metrofare Shops & Cafe  |  Government Center, Downtown Miami
Pembroke Lakes Mall  |  Pembroke Pines
Pompano Citi Centre  |  Pompano Beach
Sawgrass Mills  |  Sunrise
Seminole Paradise  |  Hollywood
The Shops at Fontainebleau  |  Miami Beach
The Shops at Mary Brickell Village  |  Brickell, Miami
The Shops at Midtown Miami  |  Midtown Miami
The Shops at Pembroke Gardens  |  Pembroke Pines
The Shops at Sunset Place  |  South Miami
Southland Mall  |  Cutler Bay
Town Center at Boca Raton  |  Boca Raton
The Village at Gulfstream Park  |  Hallandale Beach
Village of Merrick Park  |  Coral Gables
Westfield Broward  |  Plantation
Westland Mall  |  Hialeah
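
The same row-by-row idea can be tried without Scrapy. The sketch below uses only the standard library's xml.etree.ElementTree on a small hypothetical, well-formed sample of the table; `SAMPLE`, `cell_text`, and `parse_rows` are illustrative names, not part of the answer above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the Wikipedia table (well-formed XML).
SAMPLE = """
<table class="wikitable">
  <tr><th>Name</th><th>Location</th></tr>
  <tr><td><a>Aventura Mall</a></td><td><a>Aventura</a></td></tr>
  <tr><td><a>CocoWalk</a></td><td><a>Coconut Grove</a>, <a>Miami</a></td></tr>
</table>
"""


def cell_text(cell):
    """Join every text node inside a cell, including text between links."""
    return "".join(cell.itertext()).strip()


def parse_rows(markup):
    """Return one (name, location) tuple per data row."""
    table = ET.fromstring(markup.strip())
    records = []
    for row in table.iter("tr"):
        cells = row.findall("td")
        if len(cells) < 2:  # header rows hold <th> cells only
            continue
        records.append((cell_text(cells[0]), cell_text(cells[1])))
    return records


records = parse_rows(SAMPLE)
names = [name for name, _ in records]
locations = [location for _, location in records]
print(len(names), len(locations))
```

Since each row yields exactly one name and one location, the two lists always have the same length and can be placed side by side in a DataFrame.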

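Answer 2: (score: 0)

If what you want is to treat the two words as one, you can run a string replacement over every entry, replacing the comma with an empty string: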

location = [loc.replace(',', '') for loc in location]
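
A small sketch of what this replacement does, assuming each entry in the list is already a single string per row; the sample values are hypothetical.

```python
# Hypothetical sample: assumes each row's location is already one string.
location = ["Aventura", "Coconut Grove, Miami", "Brickell, Miami"]

# Remove the commas inside each entry; the number of entries is unchanged.
cleaned = [loc.replace(",", "") for loc in location]
print(cleaned)  # ['Aventura', 'Coconut Grove Miami', 'Brickell Miami']
```

Note that replace only edits the text of each entry and never merges or removes list items, so it helps only once the list already has one entry per row.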