I have some data consisting of travel deal titles.
import pandas as pd
import nltk
import numpy as np
datadict = {'deals_raw_text': {0: 'Cheap flights from Moscow to NEW DELHI, INDIA from €255',
1: '5* Qatar Airways flights from the UK to India (Kochi, Chennai, Ahmedabad, Bangalore, Kozhikode) from £340!',
2: "What a corker: 2nts in Portugal's wine country from £155pp incl. design hotel, flights & breakfast",
3: 'WIZZ AIR New Route from Budapest to KAZAN, RUSSIA',
4: 'Weekend in Venice from only 105 € with flights and hotel',
5: 'Oslo, Norway to New York, USA for only €184 roundtrip',
6: 'Wizz Air announces new route between Budapest and Kazan!',
7: 'Backpacking in 2019: 3 weeks by Costa Rica & Panama City for 640 € with accommodation flights',
8: 'Vacaciones en Croacia: 3/7 noches de hotel + vuelos directos desde solo 259€',
9: 'Business Class from Stockholm, Sweden to Singapore for only €1196 roundtrip (lie-flat seats)'}}
testdf = pd.DataFrame(datadict)
The goal is ultimately to parse an origin and a destination out of every string in the dataframe. Ideally I would end up with a dataframe containing something like the following (a hand-built mock-up follows the list):
1: Origin: 'UK', Destination: 'India'
2: Origin: np.nan, Destination: 'Portugal'
3: Origin: 'Budapest', Destination: 'Kazan, Russia'
...
9: Origin: 'Stockholm, Sweden', Destination: 'Singapore'
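For concreteness, this is the target shape built by hand from the examples above (purely illustrative, nothing is parsed yet; I only filled in the rows shown, and pd/np are the imports from the top):

target = pd.DataFrame({
    "Origin": {1: "UK", 2: np.nan, 3: "Budapest", 9: "Stockholm, Sweden"},
    "Destination": {1: "India", 2: "Portugal", 3: "Kazan, Russia", 9: "Singapore"},
})  # index = row index in testdf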
As you can see, the strings are not consistent, but they sometimes follow a structure like f'{Origin} to {Destination} for {Price}'. That is not reliable enough to build a rule-based function on, so I turned to NLTK and tried tokenizing the elements in order to identify the cities. Admittedly, I am not very familiar with regex or NLTK.
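To illustrate why I gave up on pure rules: a naive regex for the "from X to Y" shape (just my own sketch, the pattern and group names are nothing official) only matches a handful of rows and returns None for titles like the Venice, Croatia or "between Budapest and Kazan" ones:

import re

# naive "from <Origin> to <Destination>" rule; the trailing group just stops the
# lazy match at the price part ("from €255", "for only ...") or at the end of the title
pattern = re.compile(
    r"from\s+(?P<origin>[A-Za-z ,]+?)\s+to\s+(?P<destination>[A-Za-z ,]+?)(?:\s+from|\s+for|$)",
    re.IGNORECASE,
)

for title in testdf.deals_raw_text:
    m = pattern.search(title)
    print(m.groupdict() if m else None)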
So I wrote a function to tokenize and POS-tag the text:
def ie_preprocess(text=testdf.deals_raw_text[0]):
    # requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
    sentences = nltk.sent_tokenize(text)                          # split the title into sentences
    sentences = [nltk.word_tokenize(sent) for sent in sentences]  # split each sentence into tokens
    sentences = [nltk.pos_tag(sent) for sent in sentences]        # tag each token with its part of speech
    return sentences
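A quick sanity check of what this returns (just for my own reference):

tagged = ie_preprocess(testdf.deals_raw_text[0])
# a list with one entry per sentence; each entry is a list of (token, POS-tag) tuples,
# and proper nouns such as city names should mostly come back tagged NNP
print(tagged[0][:6])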
Now I apply the function (I am pretty sure the grammar string is wrong, but it is what I came up with):
grammar = "NP: {<NNP>?<TO>*<NNP>}"
cp = nltk.RegexpParser(grammar)            # build the chunker once
results = []
for text in testdf.deals_raw_text:
    print(text)
    r = cp.parse(ie_preprocess(text)[0])   # chunk the POS-tagged first sentence
    results.append(r)
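Each entry in results is an nltk.Tree, so I can at least pull out the NP chunks the grammar produced (get_chunks is just a helper name I made up while experimenting):

def get_chunks(tree, label="NP"):
    # join the tokens of every subtree that the chunker labelled `label`
    return [
        " ".join(token for token, tag in subtree.leaves())
        for subtree in tree.subtrees()
        if subtree.label() == label
    ]

for tree in results:
    print(get_chunks(tree))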
But this is where I am stuck: how do I turn these tagged tuples into the desired Origin/Destination result? Thanks.