I have a large file of word/tag pairs saved like this:
This/DT gene/NN called/VBN gametocide/NN
Now I want to put these pairs into a DataFrame with their counts, like this:
     | DT  NN  ...
This |  1   0
Gene |  0   1
...
I tried counting the pairs with a dict and then putting them into a DataFrame:
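from collections import defaultdict
import pandas as pd

# Read the whole file and split it into word/tag tokens
with open("data.txt", "r") as file:
    train = file.read()
words = train.split()

# Count how often each word/tag token occurs
data = defaultdict(int)
for i in words:
    data[i] += 1

# Fill the DataFrame one cell at a time
matrixB = pd.DataFrame()
for elem, count in data.items():
    word, tag = elem.split('/')
    matrixB.loc[tag, word] = count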
But this takes a very long time (the file has 300,000 of these pairs). Is there a faster way?
Answer 0: (score: 1)
What is wrong with the answer to your other question?
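from collections import Counter
import pandas as pd

with open('data.txt') as f:
    train = f.read()
# Count each (word, tag) pair
c = Counter(tuple(x.split('/')) for x in train.split())
# Tuple keys become a MultiIndex; unstack moves the tags into columns
s = pd.Series(c)
df = s.unstack().fillna(0)
print(df)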
Output:
            DT  NN  VBN
This         1   0    0
called       0   0    1
gametocide   0   1    0
gene         0   1    0
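A possible variation on the snippet above, offered only as a sketch: it counts the pairs while iterating over the input line by line instead of reading everything at once, and passes `fill_value=0` to `unstack` in place of the separate `fillna(0)` call. The function name `count_pairs` and the use of `rpartition` (to split on the last slash) are assumptions made for illustration, and an in-memory `io.StringIO` stands in for the real `data.txt`.

```python
from collections import Counter
import io

import pandas as pd


def count_pairs(lines):
    """Count word/tag tokens from an iterable of text lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            # Split on the last slash so a word containing '/' stays intact
            word, sep, tag = token.rpartition('/')
            if not sep:
                continue  # skip tokens that have no slash at all
            counts[(word, tag)] += 1
    # Tuple keys become a MultiIndex; unstack moves the tags into columns
    return pd.Series(counts).unstack(fill_value=0)


# Stand-in for open('data.txt'); a real file object can be passed the same way
sample = io.StringIO("This/DT gene/NN called/VBN gametocide/NN\n")
df = count_pairs(sample)
print(df)
```

With a real file this would be called as `count_pairs(f)` inside a `with open('data.txt') as f:` block.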
Answer 1: (score: 0)
I think this question is very similar... why did you post it twice?
>>> from collections import Counter
>>> import pandas as pd
>>> text = "This/DT gene/NN called/VBN gametocide/NN"
>>> pd.Series(Counter(tuple(pair.split('/')) for pair in text.split())).unstack().fillna(0)
            DT  NN  VBN
This         1   0    0
called       0   0    1
gametocide   0   1    0
gene         0   1    0
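For comparison, here is a sketch of another way to build the same word-by-tag count table, using `pd.crosstab` on a two-column frame rather than a `Counter`. This is an alternative technique, not what either answer uses; the column names `word` and `tag` are made up for illustration, and whether it is faster on the 300,000-pair file would have to be timed.

```python
import pandas as pd

text = "This/DT gene/NN called/VBN gametocide/NN"

# Split every token on its last slash into a [word, tag] pair
pairs = [tok.rsplit('/', 1) for tok in text.split()]
frame = pd.DataFrame(pairs, columns=['word', 'tag'])

# crosstab counts how often each (word, tag) combination occurs
table = pd.crosstab(frame['word'], frame['tag'])
print(table)
```

Combinations that never occur come out as 0, so no separate fill step is needed here.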