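Any help figuring this out would be greatly appreciated.
I am basically trying to scrape data from Expedia with the Python library 'lxml' and move the data into a data frame.
Some columns, such as the hotel amenities, contain several entries. I am trying to parse the multiple entries in the hotel amenities and other columns and move them into separate columns, so that each amenity gets its own column.
Thanks again for your help.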
import requests
import pandas as pd
from lxml import html
from fake_useragent import UserAgent

# Send a Chrome-like user agent with the request.
ua = UserAgent()
header = {'user-agent': ua.chrome}
Sumisho_url = requests.get('https://www.expedia.com/Tokyo-Hotels-Sumisho-Hotel.h2221301.Hotel-Information?chkin=6%2F22%2F2017&chkout=6%2F23%2F2017&rm1=a2&regionId=179900&hwrqCacheKey=65e880f7-4254-472b-a76c-a9d652938f8cHWRQ1498148578719&vip=false&c=80642461-a7d7-49bb-856e-df5db3b7cec9&', headers=header)
Sumisho_tree = html.fromstring(Sumisho_url.content)
Sumisho_columns = ['Name', 'Address','Telephone','Neighborhood','Star_Rating','Hotel_Features','Hotel_Amenities','Room_Amenities','Check_In','Check_Out']
Sumisho_df = pd.DataFrame(index=range(0,0),columns=Sumisho_columns)
Sumisho_df['Name'] = Sumisho_tree.xpath('//*[@id="hotel-name"]/text()')
Sumisho_df['Address'] = Sumisho_tree.xpath('//*[@id="license-plate"]/div[2]/a/span[2]/text()')
Sumisho_df['Telephone'] = Sumisho_tree.xpath('//*[@id="license-plate"]/div[2]/span/span/text()')
Sumisho_df['Neighborhood'] = ', '.join(Sumisho_tree.xpath('/html/body/div/div/section/div/div/p/text()'))
Sumisho_df['Star_Rating'] = Sumisho_tree.xpath('//*[@id="license-plate"]/div[1]/strong/span/text()')
Sumisho_df['Hotel_Features'] = ', '.join(Sumisho_tree.xpath('/html/body/div/div[7]/section/div[11]/div[2]/p[2]/text()'))
Sumisho_df['Room_Amenities'] = ', '.join(Sumisho_tree.xpath('//*[@id="show-more-room"]/ul/li/text()'))
Sumisho_df['Hotel_Amenities'] = ', '.join(Sumisho_tree.xpath('//*[@id="show-more-general"]/ul/li/text()'))
Sumisho_df['Check_In'] = Sumisho_tree.xpath('//*[@id="policies-and-fees"]/div[1]/p/text()')
Sumisho_df['Check_Out'] = Sumisho_tree.xpath('//*[@id="policies-and-fees"]/div[2]/p/text()')
Sumisho_df
Answer 0 (score: 0)
You have scraped the Hotel_Amenities data as a list, so you can loop over the list and assign each entry to the data frame under a different column name:
Sumisho_columns = ['Name', 'Address','Telephone','Neighborhood','Star_Rating','Hotel_Features','Hotel_Amenities','Room_Amenities','Check_In','Check_Out']
Sumisho_df = pd.DataFrame(index=range(0,0),columns=Sumisho_columns)
Sumisho_df['Name'] = Sumisho_tree.xpath('//*[@id="hotel-name"]/text()')
Sumisho_df['Address'] = Sumisho_tree.xpath('//*[@id="license-plate"]/div[2]/a/span[2]/text()')
Sumisho_df['Telephone'] = Sumisho_tree.xpath('//*[@id="license-plate"]/div[2]/span/span/text()')
Sumisho_df['Neighborhood'] = ', '.join(Sumisho_tree.xpath('/html/body/div/div/section/div/div/p/text()'))
Sumisho_df['Star_Rating'] = Sumisho_tree.xpath('//*[@id="license-plate"]/div[1]/strong/span/text()')
Sumisho_df['Hotel_Features'] = ', '.join(Sumisho_tree.xpath('/html/body/div/div[7]/section/div[11]/div[2]/p[2]/text()'))
Sumisho_df['Room_Amenities'] = ', '.join(Sumisho_tree.xpath('//*[@id="show-more-room"]/ul/li/text()'))
# Keep the amenities as a list instead of joining them into one string.
hotel_amenities = Sumisho_tree.xpath('//*[@id="show-more-general"]/ul/li/text()')
for i, e in enumerate(hotel_amenities):
    Sumisho_df['Hotel_Amenities' + str(i)] = e.strip()  # assign each entry to its own column
Sumisho_df['Check_In'] = Sumisho_tree.xpath('//*[@id="policies-and-fees"]/div[1]/p/text()')
Sumisho_df['Check_Out'] = Sumisho_tree.xpath('//*[@id="policies-and-fees"]/div[2]/p/text()')
Sumisho_df
Your data frame will then contain the separated columns:
Hotel_Amenities0            Hotel_Amenities1  Hotel_Amenities2  Hotel_Amenities3                 Hotel_Amenities4                Hotel_Amenities5
Total number of rooms - 83  Conference space  Free WiFi         Breakfast available (surcharge)  Free wired high-speed Internet  Laundry facilities
You can parse the other columns that have multiple entries in the same way.
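As a minimal sketch of applying the same loop to several multi-entry columns, the snippet below builds one row from plain Python lists. The `expand_entries` helper and the sample values are hypothetical stand-ins for the lists that `xpath()` returns; they are not part of the original code.

```python
import pandas as pd

def expand_entries(row, prefix, entries):
    """Add one key per entry to `row`, named prefix0, prefix1, ... (hypothetical helper)."""
    for i, entry in enumerate(entries):
        row[prefix + str(i)] = entry.strip()
    return row

# Sample data standing in for the lists that xpath() would return (assumed values).
hotel_amenities = [' Free WiFi ', 'Conference space', 'Laundry facilities']
room_amenities = ['Air conditioning', ' Refrigerator']

row = {'Name': 'Sumisho Hotel'}
expand_entries(row, 'Hotel_Amenities', hotel_amenities)
expand_entries(row, 'Room_Amenities', room_amenities)

# Build a one-row data frame; each entry ends up in its own column.
example_df = pd.DataFrame([row])
print(example_df.columns.tolist())
```

Collecting everything in a dict first and creating the data frame once sidesteps the question of how many rows the frame has when a scalar is assigned to a new column.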
Update:
You can try:
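# df is the data frame whose 'Hotel_Amenities' column holds the comma-joined string
# (Sumisho_df in the question).
foo = lambda x: pd.Series([i.strip() for i in x.split(',')])  # strip the space left after each comma
df1 = df['Hotel_Amenities'].apply(foo)
df = df.join(df1)  # join returns a new data frame, so keep the result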
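An alternative sketch that avoids the lambda uses pandas' `Series.str.split` with `expand=True`. The one-row `demo_df` and its values are made up for illustration, and the `Hotel_Amenity_` prefix is an arbitrary choice.

```python
import pandas as pd

# Made-up one-row frame with a comma-joined amenities string (assumed sample data).
demo_df = pd.DataFrame({
    'Name': ['Sumisho Hotel'],
    'Hotel_Amenities': ['Free WiFi, Conference space, Laundry facilities'],
})

# Split on the same ', ' separator used to join the entries; expand=True
# returns a data frame with one column per entry (labelled 0, 1, 2, ...).
parts = demo_df['Hotel_Amenities'].str.split(', ', expand=True)
parts = parts.add_prefix('Hotel_Amenity_')

# Attach the new columns next to the original ones.
result = demo_df.join(parts)
print(result.columns.tolist())
```

Either approach breaks if an individual entry itself contains a comma; in that case it is safer to keep the scraped list and loop over it as shown in the answer.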