Trouble reading readme.md with pandas

Time: 2018-11-06 13:08:21

Tags: python pandas parsing dataframe io

Edit: I forgot to mention that this has to be done in pandas.

I'm having trouble reading a file into a pandas DataFrame. Here is what I tried:

import pandas as pd
import matplotlib.pyplot as plt

# Read the file, splitting each line into columns on whitespace
dataframe = pd.read_csv('/home/leon/Desktop/Uni/ML Lab/Text.txt',
                        delim_whitespace=True, header=None)
print(dataframe)

This works when I try it with a .txt file containing something like "Hello this is a test", but with the actual readme.md it fails with an error.

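I'm reading it into a DataFrame so that I can compute the number of unique words and how often each word occurs overall. Sorry for the beginner question, but I have only just started with Python! Regards.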

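2 answers:

Answer 0: (score: 0)

A pandas DataFrame is not a good fit for this task. You should simply load the file, split each line into words, and aggregate the counts from there. To do that, read the file, split every line on whitespace, and flatten the resulting list of lists. Finally, use Counter from collections to do the counting.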

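from collections import Counter

# Split every line of the file into whitespace-separated words
with open("README.md") as f:
    file_split = [line.split() for line in f]

# Flatten the list of per-line word lists into a single list of words
file_split_flatten = [val for sublist in file_split for val in sublist]

# Count the words once and store the result as a plain dict
count_dict = dict(Counter(file_split_flatten))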

You can then look up the count for a word:

print(count_dict['Tensorflow'])
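The question also asks for the number of unique words and the total word count. A minimal sketch of how both could be read off a Counter follows; the sample text is a hypothetical stand-in for the contents of README.md, not the real file.

```python
from collections import Counter

# Hypothetical sample text standing in for the contents of README.md.
sample_text = """# Tensorflow Object Detection API
Tensorflow makes it easy to train models
models built with Tensorflow
"""

# str.split() with no arguments splits on any whitespace, including newlines.
words = sample_text.split()

word_counts = Counter(words)

total_words = sum(word_counts.values())  # every occurrence of every word
unique_words = len(word_counts)          # number of distinct words
top_two = word_counts.most_common(2)     # the two most frequent words

print(total_words, unique_words, top_two)
```

Note that this counts tokens exactly as they appear, so "Tensorflow" and "TensorFlow" would be counted separately, and punctuation such as "#" is treated as a word. Unlike a plain dict, a Counter returns 0 for a word that never occurs instead of raising KeyError.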

Answer 1: (score: 0)

See if this helps:

>>> import pandas as pd
>>> dataframe = pd.read_table('README.md.1', skip_blank_lines=True)
>>> dataframe = dataframe.rename(columns={'# Tensorflow Object Detection API':'Tensorflow'})
>>> dataframe.head()
                                          Tensorflow
0  Creating accurate machine learning models capa...
1  multiple objects in a single image remains a c...
2  The TensorFlow Object Detection API is an open...
3  TensorFlow that makes it easy to construct, tr...
4  models.  At Google we’ve certainly found this ...
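Since the question requires pandas, here is a hedged sketch of how the word counts could be computed once the text is held line by line in a pandas Series. The sample lines are hypothetical; for a real file they could come from something like `open(path).read().splitlines()`. It assumes pandas 0.25 or later, where `Series.explode` is available.

```python
import pandas as pd

# Hypothetical lines standing in for the lines of README.md.
sample_lines = [
    "# Tensorflow Object Detection API",
    "",
    "Tensorflow makes it easy to train models",
    "models built with Tensorflow",
]

lines = pd.Series(sample_lines)

# Split each line on whitespace, then give every word its own row.
# Empty lines become NaN after explode(), so drop them.
words = lines.str.split().explode().dropna()

# Occurrences of each word, sorted from most to least frequent.
word_counts = words.value_counts()

total_words = int(word_counts.sum())
unique_words = int(words.nunique())

print(word_counts.head())
print(total_words, unique_words)
```

Keeping one word per row avoids the problem of forcing free text into a fixed number of columns, which is what `read_csv` with a whitespace delimiter tries to do.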