I have the following DataFrame:
import pandas
df = pandas.DataFrame({"terms" : [[['the', 'boy', 'and', 'the goat'],['a', 'girl', 'and', 'the cat']], [['fish', 'boy', 'with', 'the dog'],['when', 'girl', 'find', 'the mouse'], ['if', 'dog', 'see', 'the cat']]]})
The result I want is:
df2 = pandas.DataFrame({"terms" : ['the boy and the goat','a girl and the cat', 'fish boy with the dog','when girl find the mouse', 'if dog see the cat']})
Is there a simple way to do this without a for loop that walks through every row, element, and substring, like this:
result = pandas.DataFrame()
for i in range(len(df.terms.tolist())):
    x = df.terms.tolist()[i]
    for y in x:
        z = str(y).replace(",",'').replace("'",'').replace('[','').replace(']','')
        flattened = pandas.DataFrame({'flattened_term':[z]})
        result = result.append(flattened)
print(result)
Thanks.
Answer 0 (score: 3)
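There is definitely no way to avoid loops here, at least not implicit ones. Pandas was not designed to handle list objects as elements; it works very well with numeric data and reasonably well with strings. In any case, your fundamental problem is that you are calling pandas.DataFrame.append inside a loop, which is a quadratic-time algorithm (the entire DataFrame is re-created on every iteration). You can probably get away with the following instead, which should be significantly faster: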
>>> df
                                               terms
0  [[the, boy, and, the goat], [a, girl, and, the...
1  [[fish, boy, with, the dog], [when, girl, find...
>>> pandas.DataFrame([' '.join(term) for row in df.itertuples() for term in row.terms])
                          0
0      the boy and the goat
1        a girl and the cat
2     fish boy with the dog
3  when girl find the mouse
4        if dog see the cat
>>>
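For reference, here is a self-contained sketch of the same idea that names the output column "terms" so it matches the desired df2 from the question. The variable names flat and exploded are only illustrative. The second variant assumes a pandas version that provides Series.explode (0.25 or later); it still iterates internally, so it is a convenience rather than a way to avoid looping.

```python
import pandas as pd

df = pd.DataFrame({"terms": [
    [['the', 'boy', 'and', 'the goat'], ['a', 'girl', 'and', 'the cat']],
    [['fish', 'boy', 'with', 'the dog'], ['when', 'girl', 'find', 'the mouse'],
     ['if', 'dog', 'see', 'the cat']],
]})

# Collect every joined sentence in a plain list first, then build the
# DataFrame once. This avoids re-creating the frame on each iteration.
flat = [' '.join(words) for sentences in df['terms'] for words in sentences]
df2 = pd.DataFrame({'terms': flat})
print(df2)

# Alternative (assumes pandas >= 0.25): explode the outer lists into rows,
# then join each inner list of words with a space.
exploded = (
    df['terms']
    .explode()
    .str.join(' ')
    .reset_index(drop=True)
    .to_frame(name='terms')
)
print(exploded)
```

Both variants construct the result in a single step instead of appending row by row, which is where the speedup over the original loop comes from.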