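I am trying to compute the average word count and character count of the comment column in this dataset by adding 2 new columns. I tried the code below, but it only produces a single figure for the whole dataset instead of a value for each row.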
with open('diabetes-public-review-comment', 'r') as f:
    lines = f.readlines()
# average line length (in characters) over the whole file
print(sum(len(line) for line in lines) / len(lines))
Answer 0 (score: 0)
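If you are willing to use a third-party library, you can use Pandas. A more verbose but memory-efficient version of the same idea is possible in plain Python with the csv module and len / str.split.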
x = '''Comment
Table 5.2.3 (SMPG Plots results metadata): the Result Identifier should not be null.
"such as": so there are other uses. It is recognized that the FDA currently does not accept files in Dataset-XML format (but they should). However, Dataset-XML is even not mentioned in the document - it should. Many of the "rules" (such as "maximum 40 characters") are based on the limitations of SAS Transport 5 that do not apply to Dataset-XML. A comment on this in the text where such rules are mentioned would be appropriate.
All "Type" columns in the model refer to SAS-XPT. It would be much better if the "define-XML" datatypes are listed, e.g. "integer" for --DY, "date/datetime" for --DTC, "text" for --ORRES, etc..
Space needed in middle of classifyinghypoglycemia'''
from io import StringIO
import pandas as pd
# replace StringIO(x) with 'file.csv'
df = pd.read_csv(StringIO(x), delimiter='|')  # use an arbitrary delimiter that does not occur in the file
# calculate and assign new columns
df['Characters'] = df['Comment'].str.len()
df['Words'] = df['Comment'].str.split().str.len()
# export to CSV
df.to_csv('out.csv', index=False)
print(df)
# Comment Characters Words
# 0 Table 5.2.3 (SMPG Plots results metadata): the... 85 13
# 1 such as: so there are other uses. It is recogn... 427 75
# 2 All "Type" columns in the model refer to SAS-X... 193 31
# 3 Space needed in middle of classifyinghypoglycemia 49 6
Then compute the means with:
mean_characters = df['Characters'].mean()
mean_words = df['Words'].mean()
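The plain-Python route mentioned at the start of this answer could look roughly like the sketch below. It is only an illustration under assumptions: the function name `add_counts` and its `delimiter` parameter are made up for this example, the input is assumed to have a header line followed by one comment per row, and `'|'` is assumed not to occur in the data. Rows are streamed one at a time, so the whole file never has to sit in memory.

```python
import csv
import io


def add_counts(src, dst, delimiter='|'):
    """Copy rows from src to dst, appending per-row character and word counts.

    `add_counts` and `delimiter` are hypothetical names for this sketch.
    Returns (mean_characters, mean_words) over the data rows.
    """
    reader = csv.reader(src, delimiter=delimiter)
    writer = csv.writer(dst)

    header = next(reader)  # assumes the first line is a header such as "Comment"
    writer.writerow(header + ['Characters', 'Words'])

    rows = total_chars = total_words = 0
    for row in reader:
        if not row:  # skip blank lines
            continue
        comment = row[0]
        n_chars = len(comment)          # characters in the comment
        n_words = len(comment.split())  # whitespace-separated words
        writer.writerow([comment, n_chars, n_words])
        rows += 1
        total_chars += n_chars
        total_words += n_words

    if rows == 0:
        return 0.0, 0.0
    return total_chars / rows, total_words / rows


# Small in-memory demo; with real files, open both with newline=''.
sample = io.StringIO(
    "Comment\n"
    "Space needed in middle of classifyinghypoglycemia\n"
    "the Result Identifier should not be null.\n"
)
out = io.StringIO()
mean_characters, mean_words = add_counts(sample, out)
print(out.getvalue())
print(mean_characters, mean_words)
```

For real data, something like `open('file.csv', newline='')` for the input and `open('out.csv', 'w', newline='')` for the output would replace the two `StringIO` objects. Note that `csv.reader` applies its own quote handling, so the counts for comments that begin with a double quote may differ slightly from a raw `len(line)`.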