I have a dataset in which one column is headed "What is your location and time zone?"

This means we have free-form entries like … and even …

Is there a way to extract the city, country, and time zone from these?

I was thinking of building an array of all country names (including short forms) from an open-source dataset, along with city names and time zones. Then, if any word in the dataset matches a city, country, time zone, or short form, I would write it into a new column of the same dataset and count the matches.

Is this practical?
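A minimal sketch of that lookup-table idea, assuming you have open-source name lists on disk (the file names below are hypothetical, and naive word-by-word matching will miss multi-word names like "South Korea"):

import csv

def load_names(path):
    # Read one name per row from a CSV and normalize to lowercase.
    with open(path) as f:
        return {row[0].strip().lower() for row in csv.reader(f) if row}

countries = load_names('countries.csv')  # hypothetical open-source list
cities = load_names('cities.csv')        # hypothetical open-source list

def match_terms(text, vocabulary):
    # Keep the words of a free-text answer that appear in the name list.
    words = text.replace(',', ' ').replace('.', ' ').split()
    return [w for w in words if w.lower() in vocabulary]

answer = 'Location is Devon, England, GMT time zone'
print(match_terms(answer, countries))  # ['England'], if it is in the list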
=========== REPLY based on the NLTK answer ============
Running the same code as Alecxe gives:
Traceback (most recent call last):
  File "E:\SBTF\ntlk_test.py", line 19, in <module>
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\__init__.py", line 110, in pos_tag
    tagger = PerceptronTagger()
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\perceptron.py", line 141, in __init__
    self.load(AP_MODEL_LOC)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\perceptron.py", line 209, in load
    self.model.weights, self.tagdict, self.classes = load(loc)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\data.py", line 924, in _open
    return urlopen(resource_url)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 454, in _open
    'unknown_open', req)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 1265, in unknown_open
    raise URLError('unknown url type: %s' % type)
URLError: <urlopen error unknown url type: c>
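A hedged aside, not part of the original post: this traceback matches a known Windows issue in older NLTK releases, where the local path to the perceptron tagger model ("C:\...") is handed to urlopen and the drive letter "c" is parsed as a URL scheme. Upgrading NLTK (for example with pip install --upgrade nltk) is the commonly suggested remedy; a quick way to check what is installed:

import nltk

# Print the installed NLTK version; releases from roughly 3.2.1 onward
# are generally reported to avoid the "unknown url type: c" path bug.
print(nltk.__version__)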
Answer 0 (score: 6)
I would use natural language processing and the named entities that nltk provides.

Here is an example (heavily based on this gist) that tokenizes every line from a file, splits it into chunks, and recursively looks for the NE (named entity) label in each chunk. More explanation here:
import nltk

def extract_entity_names(t):
    # Recursively collect the leaves of every subtree labeled 'NE'.
    entity_names = []
    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))
    return entity_names

with open('sample.txt', 'r') as f:
    for line in f:
        # Tokenize into sentences, then words; POS-tag; chunk named entities.
        sentences = nltk.sent_tokenize(line)
        tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
        tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
        chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

        entities = []
        for tree in chunked_sentences:
            entities.extend(extract_entity_names(tree))

        print(entities)
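A setup note that is my assumption, not something the answer states: nltk.sent_tokenize, nltk.word_tokenize, nltk.pos_tag, and nltk.ne_chunk_sents each depend on NLTK data packages that must be downloaded once, for example:

import nltk

# One-time download of the NLTK data used by the pipeline above
# (package names as used by classic NLTK releases):
for pkg in ('punkt',                       # sent_tokenize / word_tokenize
            'averaged_perceptron_tagger',  # pos_tag
            'maxent_ne_chunker',           # ne_chunk_sents
            'words'):                      # lexicon used by the NE chunker
    nltk.download(pkg)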
For a sample.txt containing:
Denmark, CET
Location is Devon, England, GMT time zone
Australia. Australian Eastern Standard Time. +10h UTC.
My location is Eugene, Oregon for most of the year or in Seoul, South Korea depending on school holidays. My primary time zone is the Pacific time zone.
For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September, 2015, I will be in Boston, United States (EDT)
it prints:
['Denmark', 'CET']
['Location', 'Devon', 'England', 'GMT']
['Australia', 'Australian Eastern Standard Time']
['Eugene', 'Oregon', 'Seoul', 'South Korea', 'Pacific']
['London', 'United Kingdom', 'Norway', 'Israel', 'London', 'United Kingdom', 'Boston', 'United States', 'EDT']
The output is not ideal, but it might be a good starting point for you.
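If you want to go further, one possibility (my sketch, not part of the answer; it assumes pytz is installed and uses an intentionally incomplete abbreviation list) is to bucket the extracted entities into countries, time zones, and likely cities:

import pytz

COUNTRIES = set(pytz.country_names.values())      # e.g. 'Denmark'
TZ_ABBREVS = {'CET', 'GMT', 'UTC', 'EDT', 'PST'}  # assumed, incomplete

def classify(entities):
    # Split the NE output into rough country / time zone / other buckets.
    buckets = {'country': [], 'timezone': [], 'other': []}
    for name in entities:
        if name in COUNTRIES:
            buckets['country'].append(name)
        elif name in TZ_ABBREVS or name in pytz.all_timezones:
            buckets['timezone'].append(name)
        else:
            buckets['other'].append(name)  # most likely a city or region
    return buckets

print(classify(['Denmark', 'CET']))
# expected: {'country': ['Denmark'], 'timezone': ['CET'], 'other': []}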