Parsing origin city / destination city from a string

Asked: 2020-01-28 20:39:24

Tags: python regex pandas nlp nltk

I have a pandas DataFrame where one column is a bunch of strings containing certain travel details. My goal is to parse each string to extract the origin city and destination city (I would ultimately like to have two new columns titled 'origin' and 'destination').

The data:

df_col = [
    'new york to venice, italy for usd271',
    'return flights from brussels to bangkok with etihad from €407',
    'from los angeles to guadalajara, mexico for usd191',
    'fly to australia new zealand from paris from €422 return including 2 checked bags'
]
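For concreteness, this column can be wrapped in a DataFrame like so (the column name `details` is my own choice, not from the data):

```python
import pandas as pd

df_col = [
    'new york to venice, italy for usd271',
    'return flights from brussels to bangkok with etihad from €407',
    'from los angeles to guadalajara, mexico for usd191',
    'fly to australia new zealand from paris from €422 return including 2 checked bags'
]

# One row per travel-detail string; the goal is to add
# 'origin' and 'destination' columns alongside it.
df = pd.DataFrame({'details': df_col})
print(df.shape)  # (4, 1)
```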

This should result in:

Origin: New York, USA; Destination: Venice, Italy
Origin: Brussels, BEL; Destination: Bangkok, Thailand
Origin: Los Angeles, USA; Destination: Guadalajara, Mexico
Origin: Paris, France; Destination: Australia / New Zealand (this is a complicated case given two countries)

What I've tried so far: a variety of NLTK methods, but what got me closest was using the nltk.pos_tag method to tag each word in the string. The result is a list of tuples with each word and its associated tag. Here's an example...

[('Fly', 'NNP'), ('to', 'TO'), ('Australia', 'NNP'), ('&', 'CC'), ('New', 'NNP'), ('Zealand', 'NNP'), ('from', 'IN'), ('Paris', 'NNP'), ('from', 'IN'), ('€422', 'NNP'), ('return', 'NN'), ('including', 'VBG'), ('2', 'CD'), ('checked', 'VBD'), ('bags', 'NNS'), ('!', '.')]

I'm stuck at this stage and am not sure how best to proceed. Can anyone point me in the right direction? Thanks.

1 Answer:

Answer (score: 144)

TL;DR

At a glance it looks almost impossible, unless you have access to some API that contains fairly sophisticated components.

The long story

At first glance, it seems like you're asking for a natural language problem to be magically solved. But let's break it down and scope it to a point where something can actually be built.

First, to identify countries and cities, you need data that enumerates them, so let's try: https://www.google.com/search?q=list+of+countries+and+cities+in+the+world+json

At the top of the search results, we find https://datahub.io/core/world-cities, which points to the world-cities.json file. Now we load it into sets of countries and cities.

import requests
import json

cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities_json = json.loads(requests.get(cities_url).content.decode('utf8'))

countries = set([city['country'] for city in cities_json])
cities = set([city['name'] for city in cities_json])

Now that we have the data, let's try to build component number one:

  • Task: detect whether any substring in a text matches a city/country.
  • Tool: https://github.com/vi3k6i5/flashtext (a fast string search/match library)
  • Metric: the number of correctly identified cities/countries in the string
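As an aside, here is a toy, pure-Python illustration of what flashtext's KeywordProcessor essentially does: case-insensitive, longest-match-first keyword extraction with word-boundary checks. This is a sketch only; the real library uses a trie and is far faster:

```python
def extract_keywords(text, keywords):
    """Toy longest-match-first keyword extractor (illustrative only)."""
    # Sort keywords longest-first so 'new york' wins over 'york'.
    kws = sorted(keywords, key=len, reverse=True)
    found, lowered, i = [], text.lower(), 0
    while i < len(lowered):
        for kw in kws:
            k = kw.lower()
            if lowered.startswith(k, i):
                # Crude word-boundary check on both sides of the match.
                before_ok = i == 0 or not lowered[i - 1].isalnum()
                after_ok = (i + len(k) == len(lowered)
                            or not lowered[i + len(k)].isalnum())
                if before_ok and after_ok:
                    found.append(kw)
                    i += len(k)
                    break
        else:
            i += 1
    return found

print(extract_keywords('new york to venice, italy for usd271',
                       ['New York', 'York', 'Venice', 'Italy']))
# ['New York', 'Venice', 'Italy']
```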

Let's put them together.

import requests
import json
from flashtext import KeywordProcessor

cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities_json = json.loads(requests.get(cities_url).content.decode('utf8'))

countries = set([city['country'] for city in cities_json])
cities = set([city['name'] for city in cities_json])


keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))


texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']
keyword_processor.extract_keywords(texts[0])

[out]:

['York', 'Venice', 'Italy']

Hey, what went wrong?!

Doing some due diligence, the first hunch is that 'new york' isn't in the data:

>>> "New York" in cities
False

What the?! #$%^&* For sanity's sake, we check these:

>>> len(countries)
244
>>> len(cities)
21940

Yes, you cannot trust a single data source, so let's try to fetch them all.

From https://www.google.com/search?q=list+of+countries+and+cities+in+the+world+json, you find another link, https://github.com/dr5hn/countries-states-cities-database. Let's supplement our data with it...

import requests
import json

cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities1_json = json.loads(requests.get(cities_url).content.decode('utf8'))

countries1 = set([city['country'] for city in cities1_json])
cities1 = set([city['name'] for city in cities1_json])

dr5hn_cities_url = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/cities.json"
dr5hn_countries_url = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/countries.json"

cities2_json = json.loads(requests.get(dr5hn_cities_url).content.decode('utf8'))
countries2_json = json.loads(requests.get(dr5hn_countries_url).content.decode('utf8'))

countries2 = set([c['name'] for c in countries2_json])
cities2 = set([c['name'] for c in cities2_json])

countries = countries2.union(countries1)
cities = cities2.union(cities1)

Now that we're getting neurotic, we do the sanity checks.

>>> len(countries)
282
>>> len(cities)
127793

Whoa, that's a lot more cities than before.

Let's try the flashtext code again.

from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))

texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']

keyword_processor.extract_keywords(texts[0])

[out]:

['York', 'Venice', 'Italy']

Seriously?! No New York?! $%^&*

Okay, for more sanity checks, let's just look for 'york' in the list of cities.

>>> [c for c in cities if 'york' in c.lower()]
['Yorklyn',
 'West York',
 'West New York',
 'Yorktown Heights',
 'East Riding of Yorkshire',
 'Yorke Peninsula',
 'Yorke Hill',
 'Yorktown',
 'Jefferson Valley-Yorktown',
 'New York Mills',
 'City of York',
 'Yorkville',
 'Yorkton',
 'New York County',
 'East York',
 'East New York',
 'York Castle',
 'York County',
 'Yorketown',
 'New York City',
 'York Beach',
 'Yorkshire',
 'North Yorkshire',
 'Yorkeys Knob',
 'York',
 'York Town',
 'York Harbor',
 'North York']

Eureka! It's because it's called 'New York City' and not 'New York'!

You: What kind of prank is this?!

Linguist: Welcome to the world of natural language processing, where natural language is a social construct, subject to communal and idiolectal variants.

You: Cut the crap, tell me how to fix this.

NLP practitioner (a real one that works on noisy user-generated text): You can just add them to the list. But before that, check your metric based on the list you already have.

For each text in your sample 'test set', you should provide some truth labels so that you can 'measure your metric'.

from itertools import zip_longest
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))

texts_labels = [('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy')),
('return flights from brussels to bangkok with etihad from €407', ('Brussels', 'Bangkok')),
('from los angeles to guadalajara, mexico for usd191', ('Los Angeles', 'Guadalajara')),
('fly to australia new zealand from paris from €422 return including 2 checked bags', ('Australia', 'New Zealand', 'Paris'))]

# No. of correctly extracted terms.
true_positives = 0
false_positives = 0
total_truth = 0

for text, label in texts_labels:
    extracted = keyword_processor.extract_keywords(text)

    # We're making some assumptions here that the order of 
    # extracted and the truth must be the same.
    true_positives += sum(1 for e, l in zip_longest(extracted, label) if e == l)
    false_positives += sum(1 for e, l in zip_longest(extracted, label) if e != l)
    total_truth += len(label)

    # Just visualization candies.
    print(text)
    print(extracted)
    print(label)
    print()

Actually, that doesn't look bad. We get an accuracy of 90%:

>>> true_positives / total_truth
0.9
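A caveat worth coding up: the `zip_longest` comparison above assumes the extractor returns terms in exactly the label order. An order-insensitive precision/recall over the two multisets is a more forgiving metric; a minimal sketch (my own addition, not part of the original snippet):

```python
from collections import Counter

def precision_recall(extracted, truth):
    """Order-insensitive overlap between extracted terms and truth labels."""
    # Counter intersection keeps the minimum count of each shared term.
    overlap = sum((Counter(extracted) & Counter(truth)).values())
    precision = overlap / len(extracted) if extracted else 0.0
    recall = overlap / len(truth) if truth else 0.0
    return precision, recall

# The failing first example: 'York' was extracted instead of 'New York'.
p, r = precision_recall(['York', 'Venice', 'Italy'],
                        ['New York', 'Venice', 'Italy'])
print(p, r)  # 0.6666666666666666 0.6666666666666666
```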

But I %^&*(-ing want 100% extraction!!

Okay, okay. So look at the 'only' error the approach above makes; it's simply that 'New York' isn't in the list of cities.

You: Why don't we just add 'New York' to the city list, i.e.

keyword_processor.add_keyword('New York')

print(texts[0])
print(keyword_processor.extract_keywords(texts[0]))

[out]:

new york to venice, italy for usd271
['New York', 'Venice', 'Italy']

You: See, I did it!!! Now I deserve a beer. Linguist: But how about 'I live in Marawi'?

>>> keyword_processor.extract_keywords('I live in Marawi')
[]

NLP practitioner (chiming in): But how about 'I live in Jeju'?

>>> keyword_processor.extract_keywords('I live in Jeju')
[]

A Raymond Hettinger fan (from afar): "There must be a better way!"

Yes, there is. What if we just try something silly, like adding keywords for cities ending in 'City' to our keyword_processor?

for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])
            print(c[:-5])

It works!
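To see what that loop actually adds, here is the same suffix-stripping trick on a self-contained toy list (the real run uses the full `cities` set):

```python
# Strip the trailing ' City' (5 characters) when the bare name
# isn't already a known city in its own right.
toy_cities = {'New York City', 'Quezon City', 'York', 'Ho Chi Minh City'}

added = [c[:-5] for c in toy_cities
         if c.endswith('City') and c[:-5] not in toy_cities and c[:-5].strip()]
print(sorted(added))  # ['Ho Chi Minh', 'New York', 'Quezon']
```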

Now let's re-run our regression-test examples:

from itertools import zip_longest
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))

for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])

texts_labels = [('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy')),
('return flights from brussels to bangkok with etihad from €407', ('Brussels', 'Bangkok')),
('from los angeles to guadalajara, mexico for usd191', ('Los Angeles', 'Guadalajara')),
('fly to australia new zealand from paris from €422 return including 2 checked bags', ('Australia', 'New Zealand', 'Paris')),
('I live in Florida', ('Florida',)),   # single-element labels need a trailing
('I live in Marawi', ('Marawi',)),     # comma to be tuples rather than
('I live in jeju', ('Jeju',))]         # bare strings

# No. of correctly extracted terms.
true_positives = 0
false_positives = 0
total_truth = 0

for text, label in texts_labels:
    extracted = keyword_processor.extract_keywords(text)

    # We're making some assumptions here that the order of 
    # extracted and the truth must be the same.
    true_positives += sum(1 for e, l in zip_longest(extracted, label) if e == l)
    false_positives += sum(1 for e, l in zip_longest(extracted, label) if e != l)
    total_truth += len(label)

    # Just visualization candies.
    print(text)
    print(extracted)
    print(label)
    print()

[out]:

new york to venice, italy for usd271
['New York', 'Venice', 'Italy']
('New York', 'Venice', 'Italy')

return flights from brussels to bangkok with etihad from €407
['Brussels', 'Bangkok']
('Brussels', 'Bangkok')

from los angeles to guadalajara, mexico for usd191
['Los Angeles', 'Guadalajara', 'Mexico']
('Los Angeles', 'Guadalajara')

fly to australia new zealand from paris from €422 return including 2 checked bags
['Australia', 'New Zealand', 'Paris']
('Australia', 'New Zealand', 'Paris')

I live in Florida
['Florida']
('Florida',)

I live in Marawi
['Marawi']
('Marawi',)

I live in jeju
['Jeju']
('Jeju',)

100%! Yes, NLP-bunga!!!

But seriously, this is only the tip of the iceberg. What happens if you have a sentence like this:

>>> keyword_processor.extract_keywords('Adam flew to Bangkok from Singapore and then to China')
['Adam', 'Bangkok', 'Singapore', 'China']

Why is Adam extracted as a city?!

Then you do some more neurotic checks:

>>> 'Adam' in cities
True

Congratulations, you've jumped into another NLP rabbit hole: polysemy, where the same word has different meanings. In this case, Adam most probably refers to a person in the sentence, but it also happens to be the name of a city (according to the data you've pulled).
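One crude, illustrative mitigation (my own sketch, not part of the original approach): filter extracted terms against a list of common given names. The name list below is a made-up toy; a real system would use named-entity recognition instead, and note the trade-off that a genuine city mention of 'Victoria' would get dropped too.

```python
# Toy list of common given names -- purely hypothetical, for illustration.
COMMON_GIVEN_NAMES = {'Adam', 'Victoria', 'Jordan', 'Charlotte'}

def drop_likely_person_names(extracted):
    """Remove extracted terms that are also common given names."""
    return [term for term in extracted if term not in COMMON_GIVEN_NAMES]

print(drop_likely_person_names(['Adam', 'Bangkok', 'Singapore', 'China']))
# ['Bangkok', 'Singapore', 'China']
```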

You: I see what you did there... But even if we ignore this polysemy nonsense, you're still not giving me the desired output:

[in]:

['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags'
]

[out]:

Origin: New York, USA; Destination: Venice, Italy
Origin: Brussels, BEL; Destination: Bangkok, Thailand
Origin: Los Angeles, USA; Destination: Guadalajara, Mexico
Origin: Paris, France; Destination: Australia / New Zealand (this is a complicated case given two countries)

Linguist: Even assuming that the preposition before the city (e.g. `from`, `to`) gives you the 'origin'/'destination' tag, you'll have to handle the case of 'multi-leg' flights, e.g.

>>> keyword_processor.extract_keywords('Adam flew to Bangkok from Singapore and then to China')

What is the desired output for this sentence?

> Adam flew to Bangkok from Singapore and then to China

Perhaps something like this? What exactly is the specification? How (un)structured is your input text?

> Origin: Singapore
> Departure: Bangkok
> Departure: China

Let's try to build component number two to detect prepositions.

Let's start from the assumptions you have and try some modifications of the same flashtext approach.

What if we add `to` and `from` to the list?

from itertools import zip_longest
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))

for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])

keyword_processor.add_keyword('to')
keyword_processor.add_keyword('from')

texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']


for text in texts:
    extracted = keyword_processor.extract_keywords(text)
    print(text)
    print(extracted)
    print()

[out]:

new york to venice, italy for usd271
['New York', 'to', 'Venice', 'Italy']

return flights from brussels to bangkok with etihad from €407
['from', 'Brussels', 'to', 'Bangkok', 'from']

from los angeles to guadalajara, mexico for usd191
['from', 'Los Angeles', 'to', 'Guadalajara', 'Mexico']

fly to australia new zealand from paris from €422 return including 2 checked bags
['to', 'Australia', 'New Zealand', 'from', 'Paris', 'from']

Heh, these are pretty bad rules for how to/from are used:

  1. What if the 'from' refers to the price of the ticket?
  2. What if there's no 'to/from' preceding the country/city?

Okay, let's work with the output above and see what we can do about problem 1. Maybe check whether the term after the `from` is a city or country; if not, remove the to/from?

from itertools import zip_longest
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))

for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])

keyword_processor.add_keyword('to')
keyword_processor.add_keyword('from')

texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']


for text in texts:
    extracted = keyword_processor.extract_keywords(text)
    print(text)

    new_extracted = []
    extracted_next = extracted[1:]
    for e_i, e_iplus1 in zip_longest(extracted, extracted_next):
        if e_i == 'from' and e_iplus1 not in cities and e_iplus1 not in countries:
            print(e_i, e_iplus1)
            continue
        elif e_i == 'from' and e_iplus1 is None:  # last word in the list.
            continue
        else:
            new_extracted.append(e_i)

    print(new_extracted)
    print()

That seems to do the trick, removing the `from` tokens that don't precede a city/country:

[out]:

new york to venice, italy for usd271
['New York', 'to', 'Venice', 'Italy']

return flights from brussels to bangkok with etihad from €407
from None
['from', 'Brussels', 'to', 'Bangkok']

from los angeles to guadalajara, mexico for usd191
['from', 'Los Angeles', 'to', 'Guadalajara', 'Mexico']

fly to australia new zealand from paris from €422 return including 2 checked bags
from None
['to', 'Australia', 'New Zealand', 'from', 'Paris']

But the 'from New York' case still isn't solved!!

Linguist: Think about it carefully: should ambiguity be resolved by making informed decisions so that the ambiguous phrase becomes obvious? If so, what is the 'information' in that informed decision? Should some template be followed first to detect the information, before filling in the ambiguity?

You: I'm losing my patience with you... You're leading me round and round in circles. Where's that AI that can understand human language that I keep hearing about from the news and from Google and Facebook and the rest?

You: Everything you've given me is rule-based. Where's the AI in all of this?

NLP practitioner: Didn't you want 100%? Writing 'business logic' or rule-based systems is the only way to really achieve that '100%', absent a specific dataset that could be used to 'train an AI'.
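To make 'business logic' concrete, here is a minimal sketch of such a rule-based final step for the simple single-leg cases, under the naive assumption that a place right after `from` is the origin and a place right after `to` is the destination. The place set is a stub (the real code would use the merged `cities`/`countries` sets), and no country resolution or multi-leg handling is attempted:

```python
import pandas as pd

# Stub lookup; replace with the merged cities/countries sets in real use.
TOY_PLACES = {'new york', 'venice', 'italy', 'brussels', 'bangkok',
              'los angeles', 'guadalajara', 'mexico', 'paris'}

def origin_destination(extracted):
    """Map an ordered token list like ['from', 'Brussels', 'to', 'Bangkok']
    to an (origin, destination) pair; None where the rule finds nothing."""
    origin = destination = None
    for i, term in enumerate(extracted):
        nxt = extracted[i + 1] if i + 1 < len(extracted) else None
        if term == 'from' and nxt and nxt.lower() in TOY_PLACES:
            origin = origin or nxt
        elif term == 'to' and nxt and nxt.lower() in TOY_PLACES:
            destination = destination or nxt
    return origin, destination

rows = [['from', 'Brussels', 'to', 'Bangkok'],
        ['New York', 'to', 'Venice', 'Italy']]
df = pd.DataFrame({'extracted': rows})
df[['origin', 'destination']] = df['extracted'].apply(origin_destination).tolist()
print(df[['origin', 'destination']])
```

Note that the second row comes out with no origin, because 'New York' has no `from` in front of it: exactly the gap the dialogue above complains about.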

You: What do you mean, train an AI? Why can't I just use the AI from Google or Facebook or Amazon or Microsoft or even IBM?

NLP practitioner: Let me introduce you to...

Welcome to the world of computational linguistics and natural language processing!

In short

Yes, there's no real ready-made magical solution. If you want to use 'AI' or a machine learning algorithm, you will most probably need a lot more training data, like the texts_labels pairs shown in the example above.
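For a sense of what that training data might have to look like, here is a hypothetical annotation format with character-offset entity spans (spaCy-style; the ORIGIN/DESTINATION role labels are my own invention, not a standard tag set):

```python
# Each example: (text, {'entities': [(start_char, end_char, label), ...]})
train_data = [
    ('new york to venice, italy for usd271',
     {'entities': [(0, 8, 'ORIGIN'), (12, 18, 'DESTINATION')]}),
    ('from los angeles to guadalajara, mexico for usd191',
     {'entities': [(5, 16, 'ORIGIN'), (20, 31, 'DESTINATION')]}),
]

# Sanity-check that the offsets line up with the text they label.
for text, ann in train_data:
    for start, end, label in ann['entities']:
        print(label, repr(text[start:end]))
```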