I am trying to merge strings in a DataFrame called df, which looks like this:
import pandas as pd

s = ['vic', 'tory', 'ban', 'ana']
df = pd.DataFrame(s, columns=['Tokens'])
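Note that I will only be using this for a language other than English.
What I want to do is combine rows of the df column and look up the combined word in a dictionary. If the word exists, it is saved to another dataset and its parts are removed from df. For example, I combine df[0] and df[1], which gives "victory"; I look it up in the dictionary and find it. Then "vic" and "tory" are removed from df. How should I approach this? Any help is appreciated.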
Answer 0 (score: 1)
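If you have a list of strings and want to check whether combinations of consecutive strings form a word, you can iterate over the strings and check the possible combinations. Built-in Python tools are enough for this: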
LIMIT = 3  # max amount of strings to combine

def process_strings(strings, words):
    ans = []
    stop = len(strings)
    current = 0
    # iterate over strings
    while current < stop:
        word = ''
        counter = 0
        # iterate over LIMIT strings starting from current string
        while True:
            # check boundary conditions
            if counter >= LIMIT or current + counter >= stop:
                current += 1
                break
            word += strings[current + counter]
            # word found among words
            if word in words:
                current += 1 + counter
                ans.append(word)
                # print('found word: {}'.format(word))
                break
            # word not found
            else:
                counter += 1
    return ans

words = {'victory', 'banana', 'python'}
strings = [
'vic', 'tory',
'mo', 'th', 'er',
'ban', 'ana',
'pyt', 'on',
'vict', 'ory',
'pyt', 'hon',
'vi', 'ct', 'or', 'y',
'ba', 'na', 'na']
words_found = process_strings(strings, words)
print('found words:\n{}'.format(words_found))
Output:
found words:
['victory', 'banana', 'victory', 'python', 'banana']
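The question also asks for the matched fragments to be removed from df, which the function above does not do. A minimal sketch of one way to get both results follows, assuming the same greedy left-to-right scan; the name `split_found_and_remaining` and its `limit` parameter are hypothetical and not part of the original answer.

```python
def split_found_and_remaining(strings, words, limit=3):
    """Greedy left-to-right scan: return (found_words, leftover_fragments)."""
    found = []
    remaining = []
    i = 0
    while i < len(strings):
        match_len = 0
        combined = ''
        # try joining 1..limit consecutive fragments starting at position i
        for n in range(1, limit + 1):
            if i + n > len(strings):
                break
            combined += strings[i + n - 1]
            if combined in words:
                match_len = n
                break
        if match_len:
            # the fragments that formed a word are consumed, not kept
            found.append(combined)
            i += match_len
        else:
            # no word starts here, so the fragment stays in the data
            remaining.append(strings[i])
            i += 1
    return found, remaining


words = {'victory', 'banana'}
tokens = ['vic', 'tory', 'mo', 'ban', 'ana', 'th']
found, remaining = split_found_and_remaining(tokens, words)
print(found)      # ['victory', 'banana']
print(remaining)  # ['mo', 'th']
```

With the DataFrame from the question, the input could be taken from `df['Tokens'].tolist()` and the reduced frame rebuilt with `pd.DataFrame(remaining, columns=['Tokens'])`.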
Edit
A revised version that handles 1) combining any number of strings and 2) cases such as words = {'victory', 'victor'}, strings = ['vi', 'ct', 'or', 'y'], where both words can be found:
def process_strings(strings, words):
    MAXLEN = max(map(len, words))
    ans = []
    stop = len(strings)
    current = 0
    # iterate over strings
    while current < stop:
        word = ''
        counter = 0
        # iterate over some amount of strings starting from current string
        while True:
            # check boundary conditions
            if len(word) > MAXLEN or current + counter >= stop:
                current += 1
                break
            word += strings[current + counter]
            # word found among words
            if word in words:
                ans.append(word)
            # there is no case `word not found`, exit only by boundary condition (length of the combined substrings)
            counter += 1
    return ans
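
To illustrate the two-word case, here is a compact sketch of the same idea written with for loops. The name `find_all_words` is hypothetical; it is meant to return the same results as the revised function, not to replace it. Like the revised function, it assumes `words` is non-empty, since `max` raises an error on an empty collection.

```python
def find_all_words(strings, words):
    """Return every dictionary word formed by joining consecutive fragments."""
    max_len = max(map(len, words))
    found = []
    for start in range(len(strings)):
        combined = ''
        for fragment in strings[start:]:
            combined += fragment
            if len(combined) > max_len:
                break  # longer than any dictionary word, so stop extending
            if combined in words:
                found.append(combined)
    return found


result = find_all_words(['vi', 'ct', 'or', 'y'], {'victory', 'victor'})
print(result)  # ['victor', 'victory']
```

Because every start position is tried and a match does not end the inner loop, overlapping words are all reported. The trade-off is that fragments are never marked as consumed, so this variant does not tell you which rows to remove.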