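I'm trying to merge several sets of word data. Each csv file I read in (there are 4) contains one column listing every unique word in a book and one column with the number of times that word appears. The word columns from all of these csv files should be combined into a single column in the new matrix file I'm trying to create, but when I merge each csv file and its data, an empty DataFrame is returned.
The csv files look like this: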
Word Count
Thou 100
O 20
Hither 8
I would like them merged like this:
Word Book1 Book2 Book3
Thou 50 0 88
Hello 32 35 27
No 89 38 0
Yes 80 99 0
import os
from os import listdir
from os.path import isfile, join
import pandas as pd
dataPath = 'data/'
fileNames = [f for f in listdir(dataPath) if isfile(join(dataPath, f))]
columns = [os.path.splitext(x)[0] for x in fileNames]
columns.remove('rows')
columns.remove('cols')
columns.remove('matrix')
columns.insert(0, "Word")
wordData = []
matrix = pd.DataFrame(columns=columns)
for file in fileNames:
    if '.txt' in file:
        continue
    elif 'matrix' in file:
        continue
    else:
        myFile = open(f"./data/{file}", "r")
        readFile = myFile.read()
        dataVector = pd.read_csv(f"./data/{file}", sep=",")
        #print(dataVector)
        matrix.merge(dataVector, how="outer", on=["Word"])
        print(matrix)
        myFile.close()
pd.set_option("display.max_rows", None, "display.max_columns", None)
matrix = matrix.fillna(0)
matrix.to_csv(path_or_buf="./data/matrix.csv")
Answer 0 (score: 1)
I think this may be what you need.
Data:
import pandas as pd
book_list = []
book_list.append(pd.DataFrame({'Word': ['a', 'b'], 'Count': [1, 2]}))
book_list.append(pd.DataFrame({'Word': ['b', 'c'], 'Count': [3, 4]}))
book_list.append(pd.DataFrame({'Word': ['d', 'e', 'f'], 'Count': [5, 6, 7]}))
book_list.append(pd.DataFrame({'Word': ['c', 'e'], 'Count': [8, 9]}))
Code:
result = None
for idx_book, book in enumerate(book_list):
    if result is None:
        result = book
    else:
        result = result.merge(book, how="outer", on=["Word"], suffixes=(idx_book-1, idx_book))
Result:
Word Count0 Count1 Count2 Count3
0 a 1.0 NaN NaN NaN
1 b 2.0 3.0 NaN NaN
2 c NaN 4.0 NaN 8.0
3 d NaN NaN 5.0 NaN
4 e NaN NaN 6.0 9.0
5 f NaN NaN 7.0 NaN
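A note on why the question's `matrix` stayed empty: `DataFrame.merge` returns a new DataFrame rather than modifying the caller in place, so the return value has to be assigned back. The loop above also relies on `suffixes` to tell the `Count` columns apart, which only applies when both sides share a column name. A minimal sketch of a variant that renames each `Count` column before merging follows; the `books` dictionary and the names `Book1` to `Book3` are illustrative assumptions standing in for the real csv files.

```python
import pandas as pd

# Hypothetical in-memory stand-ins for the per-book csv files.
books = {
    "Book1": pd.DataFrame({"Word": ["Thou", "Hello"], "Count": [50, 32]}),
    "Book2": pd.DataFrame({"Word": ["Hello", "No"], "Count": [35, 38]}),
    "Book3": pd.DataFrame({"Word": ["Thou", "Hello"], "Count": [88, 27]}),
}

result = None
for name, book in books.items():
    # Rename Count so every book contributes a uniquely named column.
    book = book.rename(columns={"Count": name})
    if result is None:
        result = book
    else:
        # merge returns a new DataFrame, so the result must be reassigned.
        result = result.merge(book, how="outer", on="Word")

# Words missing from a book get a count of 0 instead of NaN.
result = result.fillna(0)
```

Because every column already carries its book name, no suffix handling is needed, however many books are merged.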
Answer 1 (score: 0)
I ended up solving this with the following lambda function:
from functools import reduce
matrix = reduce(lambda left, right: pd.merge(left, right, on=['Word'], how='outer'), wordData).fillna(0)
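For `reduce` to work this way, `wordData` has to be a list of DataFrames, one per csv file. The sketch below shows one way that list might be built, under two assumptions not stated in the answer: each count column is renamed to its book's name before merging (repeatedly merging frames that all have a column called `Count` may produce clashing suffixed names), and the csv contents are supplied as in-memory strings (the hypothetical `csv_texts`) instead of real files.

```python
from functools import reduce
from io import StringIO

import pandas as pd

# Hypothetical csv contents keyed by book name; real code would pass
# file paths to pd.read_csv instead of StringIO objects.
csv_texts = {
    "Book1": "Word,Count\nThou,50\nHello,32\nNo,89\n",
    "Book2": "Word,Count\nHello,35\nNo,38\nYes,99\n",
}

# One DataFrame per book, with Count renamed to the book's name.
word_data = [
    pd.read_csv(StringIO(text)).rename(columns={"Count": name})
    for name, text in csv_texts.items()
]

# Fold the list into a single frame with pairwise outer merges on Word.
matrix = reduce(
    lambda left, right: pd.merge(left, right, on="Word", how="outer"),
    word_data,
).fillna(0)

# The outer merge introduces NaN, which turns counts into floats;
# cast them back to integers after filling.
count_cols = [c for c in matrix.columns if c != "Word"]
matrix[count_cols] = matrix[count_cols].astype(int)
```

Passing `index=False` to `matrix.to_csv` would keep the row index out of the saved file, if that column is not wanted.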