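I have a pandas DataFrame whose column values are lists of strings. Each list can contain one or more strings. Strings made up of several words should be split into individual words, so that every list contains only single words. In the DataFrame below, only the sent_tags column has lists containing strings of variable length.
DataFrame: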
import pandas as pd
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated since pandas 1.0
df = pd.DataFrame({
    "fruit_tags": [["'apples'", "'oranges'", "'pears'"],
                   ["'melons'", "'peaches'", "'kiwis'"]],
    "sent_tags": [["'apples'", "'sweeter than oranges'", "'pears sweeter than apples'"],
                  ["'melons'", "'sweeter than peaches'", "'kiwis sweeter than melons'"]],
})
print(df)
fruit_tags sent_tags
0 ['apples', 'oranges', 'pears'] ['apples', 'sweeter than oranges', 'pears sweeter than apples']
1 ['melons', 'peaches', 'kiwis'] ['melons', 'sweeter than peaches', 'kiwis sweeter than melons']
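My attempt:
I decided to use word_tokenize from the NLTK library to break such strings into single words. I do get the tokenized words for one selected position in the list, but I cannot combine them back into a single list for each row: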
from nltk.tokenize import word_tokenize
df['sent_tags'].str[1].str.strip("'").apply(lambda x:word_tokenize(x.lower()))
#Output
0 [sweeter, than, oranges]
1 [sweeter, than, peaches]
Name: sent_tags, dtype: object
Desired result:
fruit_tags sent_tags
0 ['apples', 'oranges', 'pears'] ['apples', 'sweeter', 'than', 'oranges', 'pears', 'sweeter', 'than', 'apples']
1 ['melons', 'peaches', 'kiwis'] ['melons', 'sweeter', 'than', 'peaches', 'kiwis', 'sweeter', 'than', 'melons']
Answer 0 (score: 2)
Use a flattening list comprehension that applies all the text functions (strip, lower and split):
s = df['sent_tags'].apply(lambda x: [z for y in x for z in y.strip("'").lower().split()])
Or:
s = [[z for y in x for z in y.strip("'").lower().split()] for x in df['sent_tags']]
df['sent_tags'] = s
print(df)
fruit_tags \
0 ['apples', 'oranges', 'pears']
1 ['melons', 'peaches', 'kiwis']
sent_tags
0 [apples, sweeter, than, oranges, pears, sweeter, than, apples]
1 [melons, sweeter, than, peaches, kiwis, sweeter, than, melons]
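The one-liner above can also be unpacked into a named helper, which makes the three steps (strip, lower, split) easier to follow. This is only a sketch: `split_tags` and its `requote` flag are hypothetical names introduced here, and the small DataFrame is a shortened version of the one in the question. The `requote` option is an assumption about the desired result, where each word still appears wrapped in single quotes.

```python
import pandas as pd

def split_tags(tags, requote=False):
    """Split every multi-word tag in one list into single lowercase words."""
    words = [
        word
        for tag in tags  # each string in the row's list
        # drop the outer quote characters, lowercase, then split on whitespace
        for word in tag.strip("'").lower().split()
    ]
    # Optionally wrap each word in single quotes again, as in the input data.
    return [f"'{w}'" for w in words] if requote else words

df = pd.DataFrame({
    "sent_tags": [["'apples'", "'sweeter than oranges'"],
                  ["'melons'", "'kiwis sweeter than melons'"]],
})
df["sent_tags"] = df["sent_tags"].apply(split_tags)
print(df["sent_tags"].tolist())
```

Calling `split_tags(tags, requote=True)` instead would keep the literal quotes around each word, if that is what the desired output is meant to show.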
Answer 1 (score: 0)
Another possible approach:
df['sent_tags'].apply(lambda x: [item for elem in [y.split() for y in x] for item in elem])
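Note that the snippet above splits each string as it is, so the literal quote characters stay attached to the first and last word of every phrase, and the case is left unchanged. A sketch of the same flattening idea using `itertools.chain.from_iterable` from the standard library is shown below, once without and once with quote stripping. The `strip("'")` call is an addition that is not part of this answer, and the one-row DataFrame is a shortened stand-in for the data in the question.

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({"sent_tags": [["'apples'", "'sweeter than oranges'"]]})

# Same behaviour as the answer: split only, quote characters untouched.
raw = df["sent_tags"].apply(
    lambda x: list(chain.from_iterable(y.split() for y in x))
)

# Assumed adjustment: strip the outer quotes before splitting.
clean = df["sent_tags"].apply(
    lambda x: list(chain.from_iterable(y.strip("'").split() for y in x))
)

print(raw.iloc[0])
print(clean.iloc[0])
```

Neither call modifies df in place; assign the result back to df['sent_tags'] to keep it.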