Parsing origin/destination from a string (NLTK)

Time: 2019-07-08 13:05:53

Tags: python regex python-3.x pandas nltk

I have some data consisting of travel product titles.

import pandas as pd
import nltk
import numpy as np

datadict = {'deals_raw_text': {0: 'Cheap flights from Moscow to NEW DELHI, INDIA from €255',
  1: '5* Qatar Airways flights from the UK to India (Kochi, Chennai, Ahmedabad, Bangalore, Kozhikode) from £340!',
  2: "What a corker: 2nts in Portugal's wine country from £155pp incl. design hotel, flights & breakfast",
  3: 'WIZZ AIR New Route from Budapest to KAZAN, RUSSIA',
  4: 'Weekend in Venice from only 105 € with flights and hotel',
  5: 'Oslo, Norway to New York, USA for only €184 roundtrip',
  6: 'Wizz Air announces new route between Budapest and Kazan!',
  7: 'Backpacking in 2019: 3 weeks by Costa Rica & Panama City for 640 € with accommodation flights',
  8: 'Vacaciones en Croacia: 3/7 noches de hotel + vuelos directos desde solo 259€',
  9: 'Business Class from Stockholm, Sweden to Singapore for only €1196 roundtrip (lie-flat seats)'}}

testdf = pd.DataFrame(datadict)

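The goal is to eventually parse the origin and destination out of each string in the dataframe. Ideally, I would end up with a dataframe containing the following: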

1: Origin: 'UK', Destination: 'India'
2: Origin: np.nan, Destination: 'Portugal'
3: Origin: 'Budapest', Destination: 'Kazan, Russia'
...
9: Origin: 'Stockholm, Sweden', Destination: 'Singapore'

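As you can see, the strings are inconsistent, but they sometimes follow the structure f'{Origin} to {Destination} for {Price}'. That is not enough to write a rule-based function, so I turned to NLTK and tried tokenizing the elements to identify cities. Admittedly, I'm not very familiar with regex or NLTK.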
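For the rows that do follow that structure, a rule-based baseline could be sketched with the standard `re` module. This is only a minimal sketch: the helper name `parse_route` and both patterns are assumptions made for illustration, and they deliberately cover just the "from X to Y" and "X to Y" shapes seen in the sample titles.

```python
import re

# Lookahead that ends the destination: a price clause ("from"/"for"),
# an opening parenthesis, an exclamation mark, or the end of the title.
_END = r"(?=\s+(?:from|for)\b|\s*[(!]|$)"

# Hypothetical patterns, tried in order: an explicit "from X to Y" first,
# then a title that starts directly with "X to Y".
_PATTERNS = [
    re.compile(r"\bfrom\s+(?:the\s+)?(?P<origin>[A-Z].*?)\s+to\s+(?P<destination>[A-Z].*?)" + _END),
    re.compile(r"^(?P<origin>[A-Z].*?)\s+to\s+(?P<destination>[A-Z].*?)" + _END),
]


def parse_route(title):
    """Return (origin, destination), or (None, None) when no pattern matches."""
    for pattern in _PATTERNS:
        match = pattern.search(title)
        if match:
            return match.group("origin"), match.group("destination")
    return None, None


print(parse_route("Cheap flights from Moscow to NEW DELHI, INDIA from €255"))
print(parse_route("Weekend in Venice from only 105 € with flights and hotel"))
```

The second call returns (None, None) because the title has no "to" clause, so a destination such as 'Venice' is missed entirely. That gap is the reason rules alone fall short here.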

I wrote a function to tokenize and tag the text:

def ie_preprocess(text=testdf.deals_raw_text[0]):
    # Split into sentences, tokenize each sentence, then POS-tag the tokens
    sentences = nltk.sent_tokenize(text)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]

    return sentences

Now we apply the function (I'm fairly sure the grammar string is wrong, but it's what I came up with):

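grammar = "NP: {<NNP>?<TO>*<NNP>}"

# Build the chunk parser once, outside the loop
cp = nltk.RegexpParser(grammar)

results = []

# Iterate over the titles themselves; enumerate(testdf) would yield column names
for text in testdf['deals_raw_text']:
    print(text)
    # Chunk only the first tagged sentence of each title
    r = cp.parse(ie_preprocess(text)[0])
    results.append(r)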

But I'm stuck here. How do I turn these tokenized tuples into the desired origin/destination cities? Thanks.
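One possible direction, sketched under stated assumptions: walk the (word, tag) tuples, group consecutive proper nouns (NNP) into runs, and assign each run to origin or destination depending on whether "from" or "to" came right before it. The helper name `extract_route` is hypothetical, and the tagged input below is written by hand for illustration. Real `nltk.pos_tag` output may tag these tokens differently, especially the all-caps words.

```python
def extract_route(tagged):
    """Assign proper-noun runs to origin/destination based on a preceding 'from'/'to'."""
    route = {"origin": None, "destination": None}
    slot_for_cue = {"from": "origin", "to": "destination"}
    cue, run = None, []

    def flush():
        # Store the current run in the slot named by the cue (first match wins).
        nonlocal cue, run
        slot = slot_for_cue.get(cue)
        if run and slot and route[slot] is None:
            route[slot] = " ".join(run).replace(" ,", ",")
        cue, run = None, []

    for i, (word, tag) in enumerate(tagged):
        next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else None
        if tag == "NNP":
            run.append(word)
        elif word == "," and run and next_tag == "NNP":
            run.append(word)  # keep "City, Country" together
        elif tag == "DT" and cue and not run:
            continue  # skip the article in "from the UK"
        else:
            flush()
            cue = word.lower() if word.lower() in slot_for_cue else None
    flush()
    return route


# Hand-written tags (an assumption, not actual nltk.pos_tag output).
tagged_title = [
    ("Cheap", "JJ"), ("flights", "NNS"), ("from", "IN"), ("Moscow", "NNP"),
    ("to", "TO"), ("NEW", "NNP"), ("DELHI", "NNP"), (",", ","),
    ("INDIA", "NNP"), ("from", "IN"), ("€255", "CD"),
]
print(extract_route(tagged_title))
```

Proper-noun runs with no "from" or "to" directly before them, such as an airline name, are discarded. That keeps the sketch conservative, but it also means titles like 'Weekend in Venice ...' yield no destination.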

0 answers:

No answers yet