Question

I already have a dataset from which I only want to extract computer science terms, so I need to compare list1 against that dataset for this task.

https://www.aminer.org/oag2019

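list1 = ["Document types", "Surveys and overviews", "Reference works", "General conference proceedings", "Biographies", "General literature", "Computing standards, RFCs and guidelines", "Cross-computing tools and techniques", ......]

list1 contains 2112 ACM computer science terms in total.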

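I have to compare list1 (string comparison) against a dataframe column of the following form:

df_train14year['keywords'].head()

0    "nmr spectroscopy","mass spectrometry","nanos...
1    "plk1","cationic dialkyl histidine","crystal...
2    "case-control","child","fuel","hydrocarbons","...
3    "Ca2+ handling","CaMKII","cardiomyocyte","cont...
4
Name: keywords, dtype: object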

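Each of these lists in the dataframe holds at most 10 keywords (minimum 3), and the dataframe has millions of records.

So I have to compare every keyword with the original list1 and, if more than 3 words match between the two lists, populate a dataframe with those rows. Substring matching may also be needed.

How can I do this efficiently in Python? What I did is compare each keyword against the whole list with for loops, which ends up as three nested loops, so it is very inefficient.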

```python
# for i in range(5):
#     df.loc[i] = ['<some value for first>', '<some value for second>', '<some value for third>']

count = 0
i = 0
for index, row in df_train14year.iterrows():
    i += 1
    # if i == 50:
    #     break
    for outr in row['keywords'].split(","):
        if count > 1:
            count = 0
            break
        # [ and ] were already removed during pre-processing, so only the quotes are stripped here
        outr = outr.replace('"', "")
        for inr in computerList:
            if outr in inr:  # substring test of the keyword against each term
                count = count + 1
                if count > 10:
                    # df12.loc[i] = [index, row['keywords']]
                    df14_4_match = df14_4_match.append(
                        {'abstract': row['abstract'], 'keywords': row['keywords'],
                         'title': row['title'], 'year': row['year']},
                        ignore_index=True)
                    break
```

Answer 1

keywords = ["nmr spectroscopy", "mass spectrometry", "nanos"] is one row in the dataframe.

Preprocessing:

dfk['list_keywords']=[[x for x in j.split('[')[1].split(']')[0].split('"')[1:-1] if x not in[',',' , ',', ']] for j in dfk['keywords']]
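The preprocessing one-liner is dense, so here is a minimal sketch of the same parsing step as a standalone function. It assumes each raw cell looks like `'["a","b","c"]'`; the name `parse_keywords` is hypothetical and not part of the original answer.

```python
import re

def parse_keywords(raw):
    """Return the double-quoted keywords found in a string such as '["a","b"]'."""
    # Everything between a pair of double quotes is treated as one keyword.
    return [kw.strip() for kw in re.findall(r'"([^"]*)"', raw)]

sample = '["nmr spectroscopy","mass spectrometry","nanos"]'
parsed = parse_keywords(sample)
print(parsed)  # ['nmr spectroscopy', 'mass spectrometry', 'nanos']
```

On a dataframe this could be applied with `dfk['keywords'].map(parse_keywords)`.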

Convert the keyword lists to sets

dfk['set_keywords']=dfk['list_keywords'].map(lambda x: set(x))

Take the intersection of each keyword set with the computer list (list1) from the question to get the matching keywords

dfk['set_keywords']=dfk['set_keywords'].map(lambda x:x.intersection(proceseedComputerList))
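To illustrate why the intersection replaces the inner loops, here is a small sketch in plain Python. The term set is a hypothetical three-item stand-in for list1, and lowercasing both sides is an assumption, since the answer does not say how the computer list was processed.

```python
# Hypothetical subset of list1, lowercased once up front
computer_terms = {"document types", "surveys and overviews", "reference works"}

def matched_terms(keywords, terms):
    """Return the keywords that are exact, case-insensitive members of terms."""
    return {kw.strip().lower() for kw in keywords} & terms

row = ["Reference works", "mass spectrometry", "surveys and overviews"]
hits = matched_terms(row, computer_terms)
print(sorted(hits))  # ['reference works', 'surveys and overviews']
```

Set membership is a hash lookup, so each row costs roughly as many lookups as it has keywords (at most 10) instead of one comparison per keyword per term.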

Use this to get the number of matches

dfk['len_keywords']=dfk['set_keywords'].map(lambda x:len(x))
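Once the match count is available, the rows the question asks for can be selected with a threshold. The sketch below uses plain dictionaries as hypothetical stand-ins for dataframe rows; `count_matches`, `terms` and `records` are illustrative names, not from the original post.

```python
def count_matches(keywords, terms):
    """Number of distinct keywords that are exact, case-insensitive members of terms."""
    return len({kw.strip().lower() for kw in keywords} & terms)

# Hypothetical stand-ins for list1 and for two dataframe rows
terms = {"document types", "reference works", "biographies", "general literature"}
records = [
    {"title": "A", "keywords": ["Reference works", "Biographies",
                                "General literature", "Document types"]},
    {"title": "B", "keywords": ["nmr spectroscopy", "mass spectrometry", "Biographies"]},
]

MIN_MATCHES = 3  # the question keeps rows with more than 3 matches
kept = [r for r in records if count_matches(r["keywords"], terms) > MIN_MATCHES]
print([r["title"] for r in kept])  # ['A']
```

With the columns built in this answer, the pandas equivalent would be a boolean mask such as `dfk[dfk['len_keywords'] > 3]`, which selects all qualifying rows in one step instead of appending them one at a time.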

Sort in ascending order

dfk.head()
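The set intersection only finds exact matches, while the question mentions that substring matching may be needed. One possible sketch, assuming the same direction as the question's `outr in inr` test (the keyword is a substring of some term), joins all terms into a single newline-separated string so each keyword needs one substring search rather than a loop over 2112 terms. The names `blob` and `count_substring_hits` are hypothetical.

```python
# Hypothetical subset of list1
terms = ["document types", "surveys and overviews", "reference works"]

# Newline-separated so a keyword cannot match across two adjacent terms
blob = "\n".join(t.lower() for t in terms)

def count_substring_hits(keywords, blob):
    """Count keywords that occur as a substring of at least one term."""
    count = 0
    for kw in keywords:
        kw = kw.strip().lower()
        # Skip empty keywords and any that contain the separator itself
        if kw and "\n" not in kw and kw in blob:
            count += 1
    return count

print(count_substring_hits(["survey", "Reference", "plk1"], blob))  # 2
```

Very short keywords will match many terms under this rule, so a minimum keyword length may be worth adding before relying on the counts.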

Compare a long list with strings in a dataframe and populate a dataframe based on the matches in Python
