I have an Excel sheet containing many software names, such as Visual Studio 2012, Visual Studio 2013, Visual Studio 2017, Adobe Reader English, Adobe Reader Deutsche, Power shell 4.0, Power shell 2.0, Power Shell 5.0.
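I want to keep only one relevant entry per software product. For example, in this case I want my output to be Visual Studio 2013, Power shell 4.0, Adobe Reader English, with the remaining entries left out. I am doing this with NLP in Python. I have removed all junk characters and version numbers, but I am not sure how to proceed from there.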
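Any ideas on how to build on this? After reducing two software names to text without digits or junk characters, I tried sequence matching on them, but the results were neither accurate nor efficient.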
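For reference, the cleaning step described above (dropping version numbers and junk characters) could look roughly like the sketch below. The helper name `normalize_name` and the exact regular expressions are assumptions made for illustration, not the code actually used, and the pattern only keeps ASCII letters.

```python
import re

def normalize_name(name):
    """Reduce a software title to a comparison key (hypothetical helper)."""
    # Drop version-like tokens such as "2013", "4.0" or "v2.1".
    no_version = re.sub(r"\bv?\d+(?:\.\d+)*\b", " ", name, flags=re.IGNORECASE)
    # Replace everything that is not an ASCII letter or whitespace with a space.
    letters_only = re.sub(r"[^A-Za-z\s]", " ", no_version)
    # Lower-case and collapse runs of whitespace.
    return " ".join(letters_only.lower().split())

print(normalize_name("Visual Studio 2013"))
print(normalize_name("Power Shell 5.0"))
```

With this kind of key, "Power shell 4.0" and "Power Shell 5.0" collapse to the same string, while "Adobe Reader English" and "Adobe Reader Deutsche" still differ, which is where some form of fuzzy grouping is needed.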
import pandas as pd
from nltk.tokenize import wordpunct_tokenize

df = pd.read_csv('C:\\Users\\533471\\Desktop\\Book2.csv', encoding='Windows-1252')
saved_column = df.RowLabels[:]
second_column = df.RowLabels[:]
print(saved_column)

for eachcol in saved_column:
    eachword = eachcol.split()
    print(eachword)
    for secondcol in second_column:
        sentence = None
        wordo = None
        punct = None
        x = []
        copy = []
        secondword = secondcol.split()[:]
        # Proceed only if the first word of one name occurs in the first word of the other
        if eachword[0] in secondword[0]:
            print("true")
            sentence = eachword[:]
            sentence += secondword
            # Split each token on punctuation
            for token in sentence:
                word = wordpunct_tokenize(token)
                if wordo is None:
                    wordo = word
                else:
                    wordo += word
            # Keep only purely alphabetic tokens (drops punctuation and numbers)
            punct = [item for item in wordo if item.isalpha()]
            t = punct[:]
            t.reverse()
            # Walk forward until the first word repeats
            for p in punct:
                print(p)
                if len(x) > 0:
                    print(x, "Appended")
                    a = str(p)
                    x += [p]
                    if p == x[0]:
                        break
                else:
                    print("list is empty")
                    x += [p]
            x.pop()
            # Walk backward until the first word of the forward list is reached
            for z in t:
                print(z)
                if len(copy) > 0:
                    print(copy, "appended")
                    copy += [z]
                    if z == punct[0]:
                        break
                else:
                    print("list is empty")
                    copy += [z]
            print(copy)
        else:
            print("false")
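
One possible direction, instead of comparing names character by character, is to compare them as sets of words and group titles whose word sets overlap enough. The sketch below is only an illustration under stated assumptions: the helper names (`key_tokens`, `jaccard`, `group_titles`), the 0.5 threshold, and the rule of keeping the first title seen in each group are all hypothetical choices, since the question does not say how the "relevant" version should be selected.

```python
import re

def key_tokens(name):
    """Lower-cased alphabetic words of a title; digits and punctuation are ignored."""
    return frozenset(re.findall(r"[a-z]+", name.lower()))

def jaccard(a, b):
    """Fraction of words shared by two word sets (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def group_titles(titles, threshold=0.5):
    """Greedily cluster titles whose word sets overlap by at least `threshold`."""
    groups = []  # each entry: (word set of the first member, list of titles)
    for title in titles:
        tokens = key_tokens(title)
        for rep_tokens, members in groups:
            if jaccard(tokens, rep_tokens) >= threshold:
                members.append(title)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append((tokens, [title]))
    return [members for _, members in groups]

titles = [
    "Visual Studio 2012", "Visual Studio 2013", "Visual Studio 2017",
    "Adobe Reader English", "Adobe Reader Deutsche",
    "Power shell 4.0", "Power shell 2.0", "Power Shell 5.0",
]
groups = group_titles(titles)
# Placeholder selection rule (an assumption): keep the first title in each group.
picked = [members[0] for members in groups]
print(picked)
```

On the sample names this yields three groups, one each for Visual Studio, Adobe Reader, and Power shell. Word-level comparison will not merge spellings such as "PowerShell" and "Power shell", so a character-level measure may still be needed for those cases.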