lxml web scraping: parsing a column that has multiple elements

Time: 2017-06-23 04:36:20

Tags: python pandas lxml

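Any help figuring this out would be much appreciated.

I'm trying to scrape data from Expedia with the Python library 'lxml' and move the data into a dataframe.

Some columns, such as the hotel amenities, contain several entries. I am trying to parse the individual entries in the hotel amenities and other such columns and move them into separate columns, so that each amenity gets its own column.

Thanks again for your help.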

from lxml import html
import requests
import pandas as pd
from fake_useragent import UserAgent

ua = UserAgent()
header = {'user-agent':ua.chrome}

Sumisho_url = requests.get('https://www.expedia.com/Tokyo-Hotels-Sumisho-Hotel.h2221301.Hotel-Information?chkin=6%2F22%2F2017&chkout=6%2F23%2F2017&rm1=a2&regionId=179900&hwrqCacheKey=65e880f7-4254-472b-a76c-a9d652938f8cHWRQ1498148578719&vip=false&c=80642461-a7d7-49bb-856e-df5db3b7cec9&', headers=header)
Sumisho_tree = html.fromstring(Sumisho_url.content)

Sumisho_columns = ['Name', 'Address','Telephone','Neighborhood','Star_Rating','Hotel_Features','Hotel_Amenities','Room_Amenities','Check_In','Check_Out']
Sumisho_df = pd.DataFrame(index=range(0,0),columns=Sumisho_columns)

Sumisho_df['Name'] = Sumisho_tree.xpath('//*[@id="hotel-name"]/text()')
Sumisho_df['Address'] = Sumisho_tree.xpath('//*[@id="license-plate"]/div[2]/a/span[2]/text()')
Sumisho_df['Telephone'] = Sumisho_tree.xpath('//*[@id="license-plate"]/div[2]/span/span/text()')
Sumisho_df['Neighborhood'] = ', '.join(Sumisho_tree.xpath('/html/body/div/div/section/div/div/p/text()'))
Sumisho_df['Star_Rating'] = Sumisho_tree.xpath('//*[@id="license-plate"]/div[1]/strong/span/text()')
Sumisho_df['Hotel_Features'] = ', '.join(Sumisho_tree.xpath('/html/body/div/div[7]/section/div[11]/div[2]/p[2]/text()'))
Sumisho_df['Room_Amenities'] = ', '.join(Sumisho_tree.xpath('//*[@id="show-more-room"]/ul/li/text()'))
Sumisho_df['Hotel_Amenities'] = ', '.join(Sumisho_tree.xpath('//*[@id="show-more-general"]/ul/li/text()'))
Sumisho_df['Check_In'] = Sumisho_tree.xpath('//*[@id="policies-and-fees"]/div[1]/p/text()')
Sumisho_df['Check_Out'] = Sumisho_tree.xpath('//*[@id="policies-and-fees"]/div[2]/p/text()')

Sumisho_df


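1 Answer:

Answer 0 (score: 0)

The XPath query for Hotel_Amenities already returns a list, so you can loop over that list and assign each item to the dataframe under a different column name: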

Sumisho_columns = ['Name', 'Address','Telephone','Neighborhood','Star_Rating','Hotel_Features','Hotel_Amenities','Room_Amenities','Check_In','Check_Out']
Sumisho_df = pd.DataFrame(index=range(0,0),columns=Sumisho_columns)

Sumisho_df['Name'] = Sumisho_tree.xpath('//*[@id="hotel-name"]/text()')
Sumisho_df['Address'] = Sumisho_tree.xpath('//*[@id="license-plate"]/div[2]/a/span[2]/text()')
Sumisho_df['Telephone'] = Sumisho_tree.xpath('//*[@id="license-plate"]/div[2]/span/span/text()')
Sumisho_df['Neighborhood'] = ', '.join(Sumisho_tree.xpath('/html/body/div/div/section/div/div/p/text()'))
Sumisho_df['Star_Rating'] = Sumisho_tree.xpath('//*[@id="license-plate"]/div[1]/strong/span/text()')
Sumisho_df['Hotel_Features'] = ', '.join(Sumisho_tree.xpath('/html/body/div/div[7]/section/div[11]/div[2]/p[2]/text()'))
Sumisho_df['Room_Amenities'] = ', '.join(Sumisho_tree.xpath('//*[@id="show-more-room"]/ul/li/text()'))
hotel_amenities = Sumisho_tree.xpath('//*[@id="show-more-general"]/ul/li/text()')
for i, e in enumerate(hotel_amenities, start=1):
    Sumisho_df['Hotel_Amenities'+str(i)] = e.strip()  # one column per amenity: Hotel_Amenities1, Hotel_Amenities2, ...
Sumisho_df['Check_In'] = Sumisho_tree.xpath('//*[@id="policies-and-fees"]/div[1]/p/text()')
Sumisho_df['Check_Out'] = Sumisho_tree.xpath('//*[@id="policies-and-fees"]/div[2]/p/text()')
Sumisho_df

Your dataframe will then contain the separated columns:

Hotel_Amenities1            Hotel_Amenities2  Hotel_Amenities3  Hotel_Amenities4                 Hotel_Amenities5                Hotel_Amenities6
Total number of rooms - 83  Conference space  Free WiFi         Breakfast available (surcharge)  Free wired high-speed Internet  Laundry facilities

You can parse the other columns that have multiple entries in the same way.
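As a minimal sketch of reusing the loop for several columns, the helper below wraps it in a function. The name `spread_entries` is hypothetical, and the lists are made-up stand-ins for what `Sumisho_tree.xpath(...)` would return, so the snippet runs without a network request.

```python
import pandas as pd


def spread_entries(df, prefix, entries):
    """Add one column per entry, named prefix1, prefix2, ... (hypothetical helper)."""
    for i, entry in enumerate(entries, start=1):
        # A scalar assignment is broadcast to every row of the existing frame.
        df[f"{prefix}{i}"] = entry.strip()
    return df


# Stand-in data (assumed); in the real script these lists come from tree.xpath(...).
hotel_amenities = ["Total number of rooms - 83 ", " Conference space", "Free WiFi"]
room_amenities = ["Air conditioning", "Refrigerator"]

demo_df = pd.DataFrame({"Name": ["Sumisho Hotel"]})
spread_entries(demo_df, "Hotel_Amenities", hotel_amenities)
spread_entries(demo_df, "Room_Amenities", room_amenities)

print(demo_df.columns.tolist())
```

The frame needs at least one row before the scalar assignments, which is why `Name` is filled in first, just as in the code above.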

Update

Alternatively, you can try:

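# df is the dataframe from the question, where Hotel_Amenities holds a ', '-joined string
# Split on commas and strip the leftover whitespace around each entry
foo = lambda x: pd.Series([i.strip() for i in x.split(',')])
df1 = df['Hotel_Amenities'].apply(foo)
df.join(df1)  # returns a new dataframe; assign it if you want to keep the result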
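A variation on the same idea, offered as a sketch rather than the only way to do it: pandas' `Series.str.split` with `expand=True` does the splitting without a lambda. The toy dataframe below is an assumption standing in for the scraped one, and the column naming simply mirrors the Hotel_Amenities1, Hotel_Amenities2, ... pattern.

```python
import pandas as pd

# Toy stand-in for the scraped dataframe (assumed values).
df = pd.DataFrame({
    "Name": ["Sumisho Hotel"],
    "Hotel_Amenities": ["Conference space, Free WiFi, Laundry facilities"],
})

# Split the ', '-joined string into one column per entry.
parts = df["Hotel_Amenities"].str.split(", ", expand=True)

# Rename the integer columns 0, 1, 2 to Hotel_Amenities1, Hotel_Amenities2, ...
parts.columns = [f"Hotel_Amenities{i}" for i in range(1, parts.shape[1] + 1)]

# Replace the combined column with the separated ones.
result = df.drop(columns="Hotel_Amenities").join(parts)

print(result.columns.tolist())
```

Either version splits on every comma, so an entry that itself contains a comma would be broken into two columns. Looping over the original list from `xpath` avoids that problem.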