Average number of words and characters per comment in Python

Time: 2018-08-20 13:14:56

Tags: python

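I am trying to calculate the number of words and characters in each comment of the comment column in this dataset, add them as 2 new columns, and then take the averages. I tried the code below, but it only computes a single figure for the whole dataset rather than a value for each row.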

with open('diabetes-public-review-comment','r') as f:
    lines = f.readlines()
    print(sum(len(line) for line in lines)/len(lines))

1 answer:

Answer 0 (score: 0)

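If you are willing to use a third-party library, you can use Pandas. A more verbose but more memory-efficient version of the same idea can be written in plain Python with the csv module together with len and str.split.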

x = '''Comment
Table 5.2.3 (SMPG Plots results metadata): the Result Identifier should not be null. 
"such as": so there are other uses. It is recognized that the FDA currently does not accept files in Dataset-XML format (but they should). However, Dataset-XML is even not mentioned in the document - it should. Many of the "rules" (such as "maximum 40 characters") are based on the limitations of SAS Transport 5 that do not apply to Dataset-XML. A comment on this in the text where such rules are mentioned would be appropriate.
All "Type" columns in the model refer to SAS-XPT. It would be much better if the "define-XML" datatypes are listed, e.g. "integer" for --DY, "date/datetime" for --DTC, "text" for --ORRES, etc..
Space needed in middle of classifyinghypoglycemia'''

from io import StringIO

import pandas as pd

# replace StringIO(x) with 'file.csv'
df = pd.read_csv(StringIO(x), delimiter='|')  # use an arbitrary delimiter not used in the file

# calculate and assign new columns
df['Characters'] = df['Comment'].str.len()
df['Words'] = df['Comment'].str.split().str.len()

# export to CSV
df.to_csv('out.csv', index=False)

print(df)

#                                              Comment  Characters  Words
# 0  Table 5.2.3 (SMPG Plots results metadata): the...          85     13
# 1  such as: so there are other uses. It is recogn...         427     75
# 2  All "Type" columns in the model refer to SAS-X...         193     31
# 3  Space needed in middle of classifyinghypoglycemia          49      6

Then calculate the averages with:

mean_characters = df['Characters'].mean()
mean_words = df['Words'].mean()
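
The plain-Python alternative mentioned at the start of this answer is not spelled out, so here is a minimal sketch of what it might look like. It assumes the same layout as the Pandas example (one header line, one comment per row, and a '|' delimiter that does not occur in the text). The sample data and the helper name comment_stats are made up for illustration; replace StringIO(sample) with an open file object for real data.

```python
import csv
from io import StringIO

# Hypothetical sample data, for illustration only.
sample = '''Comment
Space needed in middle of classifyinghypoglycemia
Result Identifier should not be null.
'''


def comment_stats(source, delimiter='|'):
    """Yield (comment, character_count, word_count) for each data row."""
    reader = csv.reader(source, delimiter=delimiter)
    next(reader, None)  # skip the header row
    for row in reader:
        if not row:  # skip blank lines
            continue
        comment = row[0]
        yield comment, len(comment), len(comment.split())


# Keep running totals so the rows never have to be held in memory at once.
total_characters = total_words = row_count = 0
for comment, characters, words in comment_stats(StringIO(sample)):
    total_characters += characters
    total_words += words
    row_count += 1

if row_count == 0:
    raise ValueError('no comments found')

mean_characters = total_characters / row_count
mean_words = total_words / row_count
print(mean_characters, mean_words)
```

Because comment_stats is a generator, the per-row counts can also be written out with csv.writer inside the same loop if the two extra columns are needed in an output file.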