Unable to compare 2 Python sets containing strings

Time: 2018-07-12 21:57:14

Tags: python jupyter-notebook jupyter-lab

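I have created 2 Python sets from 2 CSV files that contain some strings.

I am trying to match the two sets so that I get their intersection (that is, the strings common to both sets).

This is what my code looks like: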

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import nltk
#using a context manager to open and read the file
#converted the text file into csv file at the source using Notepad++
with open(r'skills.csv', 'r', encoding="utf-8-sig") as f:
    myskills = f.readlines()
    #converting all the strings in the list to lowercase
    list_of_myskills = map(lambda x: x.lower(), myskills)
    set_of_myskills = set(list_of_myskills)
    #print(type(nodup_filtered_content))
print(set_of_myskills)
#open and read by line from the text file
with open(r'list_of_skills.csv', 'r') as f2:
    #using readlines() instead of read(), because it reads line by line
    #(each line as a string obj in the python list)
    contents_f2 = f2.readlines()
    #converting all the strings in the list to lowercase
    list_of_skills = map(lambda x: x.lower(), contents_f2)
    #converting into sets
    set_of_skills = set(list_of_skills)
print(set_of_skills)

This is the function I am using:

def set_compare(set1,set2):
    if(set1 & set2):
        return print('The matching skills are: ', (set1 & set2))
    else:
        print("No matching skills")

After running the code:

    set_compare(set_of_skills,set_of_myskills)

Output:

No matching skills

The contents of 'skills.csv' are:

{'critical thinking,identify user needs,business intelligence,business analysis,teamwork,database,data visualization,data analysis,relational database,mysql,oracle sql,design,entity-relationship,develop ,use-cases ,scenarios,project development ,user requirement,design,sequence diagram,state diagram,identifying,uml diagrams,html5,css3,php,clean,analyze,plot,data,python,pandas,numpy,matplotlib,ipython notebook,spyder,anaconda,jupyterlab,data analysis,data visualization,tableau,database,surveys,prototyping,logical data models,data models,requirement elicitation.,leadreship,mysq,team,prioratization,analyze,articulate,'}


The contents of the file 'list_of_skills.csv':

{'assign passwords and maintain database access,agile development,agile project methodology,amazon web services (aws),analytics,analytical,analyze and recommend database improvements,analyze impact of database changes to the business,audit database access and requests,apis,application and server monitoring tools,applications,application development,attention to detail,architecture,big data,business analytics,business intelligence,business process modeling,cloud applications,cloud based visualizations,cloud hosting services,cloud maintenance tasks,cloud management tools,cloud platforms,cloud scalability,cloud services,cloud systems administration,code,coding,computer,communication,configure database software,configuration,configuration management,content strategy,content management,continually review processes for improvement ,continuous deployment,continuous integration,critical thinking,customer support,database,data analysis,data analytics,data imports,data imports,data intelligence,data mining,data modeling,data science,data strategy,data storage,data visualization tools,data visualizations,database administration,deploying applications in a cloud environment,deployment automation tools,deployment of cloud services,design,desktop support,design,design and build database management system,design principles,design prototypes,design specifications,design tools,develop and secure network structures,develop and test methods to synchronize data ,developer,development,documentation,emerging technologies,file systems,flexibility,front end design,google analytics,hardware,help desk,identify user needs ,implement backup and recovery plan ,implementation,information architecture,information design,information systems,interaction design,interaction flows,"install, maintain, and merge databases ",installation,integrated technologies,integrating security protocols with cloud design,internet,it optimization,it security,it soft skills,it solutions,it support,languages,logical 
thinking,leadership,linux,management,messaging,methodology,metrics,microsoft office,migrating existing workloads into cloud systems,mobile applications,motivation,networks,network operations,networking,open source technology integration,operating systems,operations,optimize queries on live data,optimizing user experiences,optimizing website performance,organization,presentation,programming,problem solving,process flows,product design,product development,prototyping methods,product development,product management,product support,product training,project management,repairs,reporting,research emerging technology,responsive design,review existing solutions,search engine optimization (seo),security,self motivated,self starting,servers,software,software development,software engineering,software quality assurance (qa),solid project management capabilities ,solid understanding of company’s data needs ,storage,strong technical and interpersonal communication ,support,systems software,tablets,team building,team oriented,teamwork,technology,tech skills,technical support,technical writing,testing,time management,tools,touch input navigation,training,troubleshooting,troubleshooting break-fix scenarios,user research,user testing,usability,user-centered design,user experience,user flows,user interface,user interaction diagrams,user research,user testing,ui / ux,utilizing cloud automation tools,virtualization,visual design,web analytics,web applications,web development,web design,web technologies,wireframes,work independently,'}

Although I can see matching keywords myself, I do not understand why I am not getting them in the output.

I am not getting any errors either.

2 Answers:

Answer 0: (score: 0)

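Comparing two sets of strings does not compare the substrings of those strings. What your program is actually doing is comparable to this: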

foo = {'ABC', 'DEF', 'GHI'}
bar = {'AB', 'CD', 'DE', 'FG', 'HI'}

foo.intersection(bar) # returns set(), the empty set

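Just because strings in different sets share characters does not mean the sets have an intersection. The string 'ABC' is in the first set but not the second, the string 'AB' is in the second but not the first, and so on.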
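The same effect can be reproduced with the question's approach. The sketch below uses short, made-up stand-ins for what `readlines()` returns when each file holds a single long line; the variable names are illustrative only.

```python
# Hypothetical stand-ins for what f.readlines() returns for a one-line file.
my_lines = ["critical thinking,teamwork,database\n"]
all_lines = ["agile development,critical thinking,database,teamwork\n"]

# Same steps as in the question: lowercase each line, then build a set.
set_a = set(line.lower() for line in my_lines)
set_b = set(line.lower() for line in all_lines)

print(len(set_a), len(set_b))  # 1 1
print(set_a & set_b)           # set()
```

Each set holds exactly one long string, and the two strings differ, so the intersection is empty even though the strings share several skills.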

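It is also unclear exactly what kind of intersection between the two CSVs you are trying to compute. Are you looking for individual cells that appear in both files? Do they also have to match by column? If you provide more information about the expected output, I can edit this answer to be more helpful.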

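[Edit] Based on your comment, it looks like what you want is to split those giant strings on commas so that the elements of the sets are individual cells. Currently, each set has only one element, and that element is just one giant string containing many skills. If you replace

list_of_myskills = map(lambda x: x.lower(), myskills)

with

list_of_myskills = [y.strip().lower() for x in myskills for y in x.split(',')]

and replace the other line accordingly, you will most likely get close to what you expect.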
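One caveat with a plain `split(',')`: 'list_of_skills.csv' contains the quoted cell "install, maintain, and merge databases ", which would be cut into three pieces. A possible alternative, shown here only as a sketch, is to let the standard `csv` module do the splitting. The helper name `load_skills` is hypothetical, and the in-memory samples are shortened stand-ins for the real files.

```python
import csv
import io

def load_skills(file_obj):
    """Return a set of lowercased, stripped cell values from a CSV file object."""
    skills = set()
    for row in csv.reader(file_obj):
        for cell in row:
            cell = cell.strip().lower()
            if cell:  # skip empty cells, e.g. after a trailing comma
                skills.add(cell)
    return skills

def set_compare(set1, set2):
    """Print and return the skills common to both sets."""
    common = set1 & set2
    if common:
        print("The matching skills are:", sorted(common))
    else:
        print("No matching skills")
    return common

# In-memory stand-ins for the two files (shortened samples of the data above).
mine = io.StringIO('critical thinking,identify user needs,teamwork,mysql,\n')
known = io.StringIO(
    'critical thinking,identify user needs ,'
    '"install, maintain, and merge databases ",teamwork,\n'
)

common = set_compare(load_skills(mine), load_skills(known))
```

With the real files, the same helper could be called on `open('skills.csv', newline='', encoding='utf-8-sig')` instead of the `io.StringIO` objects.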

Answer 1: (score: 0)

This works: change the .csv files so that they contain the skill words separated by ",", with one line per file.

import pandas as pd
myskills = pd.read_csv("skills.csv",header=None)
set_of_my_skills = set(myskills.iloc[0,])
list_of_skills = pd.read_csv("list_of_skills.csv",header=None)
set_of_skills = set(list_of_skills.iloc[0,])
print(set_of_my_skills & set_of_skills)

{'business intelligence', 'design', 'critical thinking', 'data analysis', 'database', 'teamwork'}
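Note that 'identify user needs' is missing from this result even though both files list it: in 'list_of_skills.csv' the entry has a trailing space, and set membership is an exact string comparison. A small sketch of the effect, using two made-up sets rather than the real files:

```python
mine = {"identify user needs", "teamwork"}
known = {"identify user needs ", "teamwork"}  # note the trailing space

print(mine & known)  # {'teamwork'}

# Stripping whitespace from every element makes the two spellings identical.
mine_clean = {s.strip() for s in mine}
known_clean = {s.strip() for s in known}
print(sorted(mine_clean & known_clean))  # ['identify user needs', 'teamwork']
```

Stripping the values after loading them (as the first answer does with `strip()`) would likely recover such entries.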

skills.csv : critical thinking,identify user needs,business intelligence,business analysis,teamwork,database,data visualization,data analysis,relational database,mysql,oracle sql,design,entity-relationship,develop ,use-cases ,scenarios,project development ,user requirement,design,sequence diagram,state diagram,identifying,uml diagrams,html5,css3,php,clean,analyze,plot,data,python,pandas,numpy,matplotlib,ipython notebook,spyder,anaconda,jupyterlab,data analysis,data visualization,tableau,database,surveys,prototyping,logical data models,data models,requirement elicitation.,leadreship,mysq,team,prioratization,analyze,articulate         
list_of_skills.csv: assign passwords and maintain database access,agile development,agile project methodology,amazon web services (aws),analytics,analytical,analyze and recommend database improvements,analyze impact of database changes to the business,audit database access and requests,apis,application and server monitoring tools,applications,application development,attention to detail,architecture,big data,business analytics,business intelligence,business process modeling,cloud applications,cloud based visualizations,cloud hosting services,cloud maintenance tasks,cloud management tools,cloud platforms,cloud scalability,cloud services,cloud systems administration,code,coding,computer,communication,configure database software,configuration,configuration management,content strategy,content management,continually review processes for improvement ,continuous deployment,continuous integration,critical thinking,customer support,database,data analysis,data analytics,data imports,data imports,data intelligence,data mining,data modeling,data science,data strategy,data storage,data visualization tools,data visualizations,database administration,deploying applications in a cloud environment,deployment automation tools,deployment of cloud services,design,desktop support,design,design and build database management system,design principles,design prototypes,design specifications,design tools,develop and secure network structures,develop and test methods to synchronize data ,developer,development,documentation,emerging technologies,file systems,flexibility,front end design,google analytics,hardware,help desk,identify user needs ,implement backup and recovery plan ,implementation,information architecture,information design,information systems,interaction design,interaction flows,"install, maintain, and merge databases ",installation,integrated technologies,integrating security protocols with cloud design,internet,it optimization,it security,it soft skills,it solutions,it 
support,languages,logical thinking,leadership,linux,management,messaging,methodology,metrics,microsoft office,migrating existing workloads into cloud systems,mobile applications,motivation,networks,network operations,networking,open source technology integration,operating systems,operations,optimize queries on live data,optimizing user experiences,optimizing website performance,organization,presentation,programming,problem solving,process flows,product design,product development,prototyping methods,product development,product management,product support,product training,project management,repairs,reporting,research emerging technology,responsive design,review existing solutions,search engine optimization (seo),security,self motivated,self starting,servers,software,software development,software engineering,software quality assurance (qa),solid project management capabilities ,solid understanding of company’s data needs ,storage,strong technical and interpersonal communication ,support,systems software,tablets,team building,team oriented,teamwork,technology,tech skills,technical support,technical writing,testing,time management,tools,touch input navigation,training,troubleshooting,troubleshooting break-fix scenarios,user research,user testing,usability,user-centered design,user experience,user flows,user interface,user interaction diagrams,user research,user testing,ui / ux,utilizing cloud automation tools,virtualization,visual design,web analytics,web applications,web development,web design,web technologies,wireframes,work independently