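Edit: I forgot to mention that this has to be done in pandas.
I am having trouble reading a file into a pandas DataFrame. My attempt is shown below.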
If I try it with a .txt file containing something like "Hello this is a test", it works fine, but with the actual readme.md it fails with an error. This is the code I am running:
import pandas as pd
import matplotlib.pyplot as plt
dataframe = pd.read_csv('/home/leon/Desktop/Uni/ML Lab/Text.txt',
                        delim_whitespace=True, header=None)
print(dataframe)
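I am reading it into a DataFrame so that I can count the number of unique words and how often each word occurs overall. Sorry for the beginner question, but I have only just started with Python! Regards.
Answer 0 (score: 0)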
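A pandas DataFrame is not a good fit for this task. You should just load the file, split each line into words, and aggregate the counts from there. You can do that by reading the file, splitting it line by line, and then flattening the resulting list of lists. Finally, you can use Counter from collections to do the aggregation.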
from collections import Counter

with open("README.md") as f:
    # Split each line into words on whitespace
    file_split = [line.split() for line in f]
# Flatten the per-line lists into a single list of words
file_split_flatten = [val for sublist in file_split for val in sublist]
# Counter is a dict subclass, so it serves as the count dictionary directly
count_dict = Counter(file_split_flatten)
Then you can access a count like this:
print(count_dict['Tensorflow'])
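Since the question asks for both the number of distinct words and the total number of words, here is a minimal sketch of how both can be read off the same Counter. The sample text and the names `sample_text`, `words` and `word_counts` are made up for illustration; in practice the words would come from the file as above.

```python
from collections import Counter

# Hypothetical stand-in for the file contents, so the sketch runs on its own.
sample_text = """Tensorflow Object Detection API
The Tensorflow API is an open source framework"""

# Split every line on whitespace and flatten into one list of words.
words = [word for line in sample_text.splitlines() for word in line.split()]

word_counts = Counter(words)

unique_words = len(word_counts)          # number of distinct words
total_words = sum(word_counts.values())  # number of words overall

print(unique_words, total_words)
print(word_counts.most_common(2))
```

Note that this splits on whitespace only, so punctuation stays attached to words and "Tensorflow" and "TensorFlow" count as different words.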
Answer 1 (score: 0)
See if this helps:
>>> import pandas as pd
>>> dataframe = pd.read_table('README.md.1', skip_blank_lines=True)
>>> dataframe = dataframe.rename(columns={'# Tensorflow Object Detection API':'Tensorflow'})
>>> dataframe.head()
Tensorflow
0 Creating accurate machine learning models capa...
1 multiple objects in a single image remains a c...
2 The TensorFlow Object Detection API is an open...
3 TensorFlow that makes it easy to construct, tr...
4 models. At Google we’ve certainly found this ...
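If the counting really has to stay in pandas, one possible continuation from a one-column frame like the one above is to split each line into words, explode the lists into one row per word, and count. This is only a sketch: the small frame below is made up in place of the real README, and it assumes pandas 0.25 or later for `Series.explode`.

```python
import pandas as pd

# Hypothetical stand-in for the frame produced by read_table above.
dataframe = pd.DataFrame({"Tensorflow": [
    "Creating accurate machine learning models",
    "The TensorFlow Object Detection API is an open source framework",
    "TensorFlow makes it easy to construct models",
]})

# One row per word: split each line on whitespace, then explode the lists.
words = dataframe["Tensorflow"].str.split().explode()

word_counts = words.value_counts()  # occurrences per word
unique_words = words.nunique()      # number of distinct words
total_words = len(words)            # number of words overall

print(word_counts.head())
print(unique_words, total_words)
```

`word_counts` is a Series indexed by word, so a single count can be looked up with `word_counts["TensorFlow"]`.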