熊猫,搜索真的很难吗?

时间:2020-02-05 07:19:11

标签: python python-3.x pandas dataframe

如果要匹配/找到全文,请在这里搜索paper_title列中reference列的值,获取该引用行的_id (不是{{ _id行中的1}} 匹配的位置,并将paper_title保存在_id列中。

paper_title_in

出局:

dataframe

预期结果:

Expected Result

最后是具有唯一值的最终结果数据帧:

请注意,In[1]: d ={ "_id": [ "Y100", "Y100", "Y100", "Y101", "Y101", "Y101", "Y102", "Y102", "Y102" ] , "paper_title": [ "translation using information on dialogue participants", "translation using information on dialogue participants", "translation using information on dialogue participants", "#emotional tweets", "#emotional tweets", "#emotional tweets", "#supportthecause: identifying motivations to participate in online health campaigns", "#supportthecause: identifying motivations to participate in online health campaigns", "#supportthecause: identifying motivations to participate in online health campaigns" ] , "reference": [ "beattie, gs (2005, november) #supportthecause: identifying motivations to participate in online health campaigns may 31, 2017, from", "burton, n (2012, june 5) depressive realism retrieved may 31, 2017, from", "gotlib, i h, 27 hammen, c l (1992) #supportthecause: identifying motivations to participate in online health campaigns new york: wiley", "paul ekman 1992 an argument for basic emotions cognition and emotion, 6(3):169200", "saif m mohammad 2012a #tagspace: semantic embeddings from hashtags in mail and books to appear in decision support systems", "robert plutchik 1985 on emotion: the chickenand-egg problem revisited motivation and emotion, 9(2):197200", "alastair iain johnston, rawi abdelal, yoshiko herrera, and rose mcdermott, editors 2009 translation using information on dialogue participants cambridge university press", "j richard landis and gary g koch 1977 the measurement of observer agreement for categorical data biometrics, 33(1):159174", "tomas mikolov, kai chen, greg corrado, and jeffrey dean 2013 #emotional tweets arxiv:13013781" ] } import pandas as pd df=pd.DataFrame(d) df 列的所有paper_title_in标题都作为列表显示在_id列中。

Final_dataframe

我尝试了此操作,但是它返回的reference的{​​{1}}列的_id比匹配的paper_title列要搜索。预期结果数据框给出了更清晰的思路。在那里看看。

paper_presented_in

1 个答案:

答案 0 :(得分:0)

因此,要解决您的问题,您将需要两个字典和一个列表以临时存储一些值。

# A list to store unique paper titles
unique_paper_title


# A dict to store mapping of unique paper to unique ids
mapping_dict_paper_to_id = dict()

# A dict to store mapping unique idx to the ids
mapping_id_to_idx = dict()


# This gives us the unique paper title's list
unique_paper_title = df["paper_title"].unique()



# Storing values in the dict mapping_dict_paper_to_id

for value in unique_paper_title:
    mapping_dict_paper_to_id[value] = df["_id"][df["paper_title"]==value].unique()[0]



# Storing values in the dict mapping_id_to_idx

for value in unique_paper_title:

    # this gives us the indexes of the matched string ie. the paper_title
    idx_list = df[df['reference'].str.contains(value)].index

    # Storing values in the dictionary
    for idx in idx_list:
        mapping_id_to_idx[idx] = mapping_dict_paper_to_id[value]


# This loops check if the index have any refernce's id and then updates the paper_present_in field accordingly

for i in df.index:
    if i in mapping_id_to_idx:
        df['paper_present_in'][i] = mapping_id_to_idx[i]
    else:
        df['paper_present_in'][i] = "None"

上面的代码将检查并更新数据框中的搜索值。