Return substrings from text that exist in a DataFrame

Time: 2018-02-01 19:31:45

Tags: python pandas

  

Question (edited following community clarification on this topic)

I am using Python and NumPy, and I have an address like the following:

address = '4835 e. cactus rd suite 445 nightingale drive az 85254 usa' 

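and a DataFrame named roads_dataframe like the following:

ID    road_match_array  road_width
b12   cactus rd         132
dk24  rosemont blvd     93
A93   research drive    843
h3    colorado blvd     328

What I want to do is get the substrings of address that exist in the road_match_array column. In other words, I want to get the parts of the address string that are present in roads_dataframe.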

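The catch is that road_match_array may contain many entries that are part of the address; some may be duplicates and others unique. In either case, both the duplicated and the unique entries should be part of the output DataFrame.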

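I have 1 million roads in total, and I want to identify which of them are present in a given address string. There may be no road, 1 road, or 2 roads; it depends entirely on the input address.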

3 Answers:

Answer 0: (score: 0)

Could you do something like this:

road = [i for i in road_match_array if i in address]

# road = ['cactus rd']
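
One thing to watch with the plain `in` test is that it matches anywhere in the string, so a road name can also match inside a longer word. Here is a minimal sketch, assuming whole-word matches are what you want; the `roads` list is made-up sample data (including the hypothetical entry 'gale drive'), and `whole_word_hits` is just an illustrative helper name:

```python
import re

address = '4835 e. cactus rd suite 445 nightingale drive az 85254 usa'
# Illustrative sample; in the question these come from roads_dataframe['road_match_array'].
roads = ['cactus rd', 'rosemont blvd', 'research drive', 'gale drive']

# Plain substring test: 'gale drive' matches inside 'nightingale drive'.
substring_hits = [road for road in roads if road in address]


def whole_word_hits(address, roads):
    """Keep only roads that appear in the address delimited by word boundaries."""
    return [road for road in roads
            if re.search(r'\b' + re.escape(road) + r'\b', address)]


word_hits = whole_word_hits(address, roads)
print(substring_hits)  # ['cactus rd', 'gale drive']
print(word_hits)       # ['cactus rd']
```

`re.escape` is there because road names such as 'e. main st' contain characters that are special in a regular expression.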

For large operations, consider using the pandas module. Adapted from another answer of mine, and assuming you have a DataFrame object df with a column named Address:

import re
pattern = '(?P<road>{})'.format('|'.join(map(re.escape, road_match_array)))  # escape regex metacharacters such as '.'
new_df = pd.concat([df['Address'], df['Address'].str.extract(pattern, expand=True)], axis=1)

This should return a DataFrame similar to the one below:

                                            Address             road
0  4835 e. cactus rd suite 445 nightingale drive ...       cactus rd
1  4835 e. research drive suite 445 nightingale d...  research drive

This solution assumes that each row matches at most one entry from road_match_array, because str.extract returns only the first match.
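
If a row can contain more than one road, `Series.str.findall` may be a better fit than `str.extract`, since it collects every non-overlapping match. The following is only a sketch on made-up sample data (the `df` and the three road names are assumptions for illustration), and I have not measured how an alternation of a million names performs:

```python
import re

import pandas as pd

# Illustrative sample data, not the asker's real roads.
road_match_array = ['cactus rd', 'research drive', 'nightingale drive']
df = pd.DataFrame({'Address': [
    '4835 e. cactus rd suite 445 nightingale drive az 85254 usa',
    '4835 e. research drive suite 445 az 85254 usa',
    '12 main st az 85254 usa',
]})

# One alternation over all road names, escaped so they are treated literally.
pattern = '|'.join(re.escape(road) for road in road_match_array)

# str.findall returns, for each row, a list of every non-overlapping match.
df['roads'] = df['Address'].str.findall(pattern)
print(df['roads'].tolist())
# [['cactus rd', 'nightingale drive'], ['research drive'], []]
```

Rows without any road get an empty list, which covers the "no road, 1 road, or 2 roads" cases from the question.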

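Answer 1: (score: 0)

This may be a fairly abstract answer, but here is how I would approach the problem.

Since you are loading the substrings from a CSV, it may be worthwhile to stream the records from the CSV file.

In addition, you can use Python's multithreading library or dask to farm out the search jobs. I have a blog post about parallelism in Python.

My sample CSV of "substrings" to search: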

road one
former road
night road
president street
road one
former road
one lane
highway drive
sunset blvd
one lane
highway drive
sunset blvd
one lane
highway drive
sunset blvd
one lane
highway drive
sunset blvd
night road
president street
road one
former road
one lane
highway drive
sunset blvd
night road
president street
road one
former road
night road
president street
road one
former road
night road
president street
one lane
highway drive
sunset blvd
road one
former road
night road
president street
road one
former road
night road
president street
one lane
highway drive
one lane
highway drive
sunset blvd
one lane
highway drive
sunset blvd
one lane
highway drive
sunset blvd
sunset blvd
road one
former road
night road
president street
road one
former road
night road
president street
road one
former road
night road
president street
one lane
highway drive
sunset blvd

The actual code:

import csv
from multiprocessing.dummy import Pool

my_address = "1234 sunset blvd hollywood highway drive, california 91210"


def search_address(my_csv_row):
    if my_csv_row[0] in my_address:  # the 0th index is the column in question
        return my_csv_row[0]


pool = Pool()
with open('sample.csv') as infile:
    reader = csv.reader(infile)
    results = pool.map(search_address, reader)
pool.close()
pool.join()

print([x for x in results if x])

Result:

['highway drive', 'sunset blvd', 'highway drive', 'sunset blvd', 'highway drive', 'sunset blvd', 'highway drive', 'sunset blvd', 'highway drive', 'sunset blvd', 'highway drive', 'sunset blvd', 'highway drive', 'highway drive', 'sunset blvd', 'highway drive', 'sunset blvd', 'highway drive', 'sunset blvd', 'sunset blvd', 'highway drive', 'sunset blvd']

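The benefits of this approach are:

  1. You do not have to load all 1 million records into memory. You just stream them row by row for your Python threads to evaluate.
  2. You search in parallel across threads to reduce search time.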
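
The streaming half of this can also be sketched on its own, without a pool. This is a minimal sketch assuming a one-column CSV; `io.StringIO` stands in for the real file so the example is self-contained, and `stream_matches` is a hypothetical helper name:

```python
import csv
import io
from collections import Counter

my_address = "1234 sunset blvd hollywood highway drive, california 91210"

# Stand-in for open('sample.csv'); a real file object is consumed the same way.
sample_csv = io.StringIO("road one\nsunset blvd\n\nhighway drive\nsunset blvd\n")


def stream_matches(rows, address):
    """Yield the first column of each row that occurs in the address."""
    for row in rows:
        # csv.reader yields [] for a blank line, so guard before indexing.
        if row and row[0] in address:
            yield row[0]


# Counter keeps duplicates visible as counts instead of a long repeated list.
counts = Counter(stream_matches(csv.reader(sample_csv), my_address))
print(dict(counts))  # {'sunset blvd': 2, 'highway drive': 1}
```

Because `stream_matches` is a generator, only one row is held in memory at a time, and duplicated roads still show up in the counts.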

Answer 2: (score: 0)

I ended up using nltk and going with an n-gram approach. Here is the code:

from nltk.util import ngrams
from nltk import word_tokenize


all_grams = sorted([' '.join(t) for i in range(1, 6) for t in ngrams(word_tokenize(address), i)], key=len)
intersections = roads_dataframe.loc[roads_dataframe['road_match_array'].isin(all_grams)]
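
For anyone who wants to try the same idea without nltk, here is a rough pure-Python sketch. It assumes that splitting on whitespace is good enough (nltk's `word_tokenize` may treat punctuation differently), and `word_ngrams` is a hypothetical helper, not part of any library:

```python
def word_ngrams(text, max_n=5):
    """Return the set of all 1..max_n word n-grams of a whitespace-split text."""
    words = text.split()
    return {
        ' '.join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }


address = '4835 e. cactus rd suite 445 nightingale drive az 85254 usa'
# Illustrative sample with one duplicated road, as the question allows.
roads = ['cactus rd', 'rosemont blvd', 'research drive', 'colorado blvd', 'cactus rd']

grams = word_ngrams(address)
# Each road is a single set lookup, and duplicates in `roads` are all kept.
matches = [road for road in roads if road in grams]
print(matches)  # ['cactus rd', 'cactus rd']
```

The point of the n-gram approach is that the address only produces a few dozen candidate phrases, so the work per address depends on the address length rather than on scanning every road name against the string.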