Question

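I have to write code that reads data from a text file. The text file has a specific format: it is similar to a comma-separated values (CSV) file that stores tabular data. I also need to be able to perform calculations on the file's data.

Here is the format description of the file:

A dataset must start with a declaration of its name: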

@relation name

followed by a list of all the attributes in the dataset

@attribute attribute_name specification

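If an attribute is nominal, the specification contains a list of the possible attribute values in curly brackets:

@attribute nominal_attribute {first_value, second_value, third_value}

If an attribute is numeric, the specification is replaced by the keyword numeric: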

@attribute numeric_attribute numeric

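After the attribute declarations, the actual data is introduced by the

@data

tag, which is followed by a list of all the instances. The instances are listed in comma-separated format, with a question mark representing a missing value.

Comments are lines starting with % and are ignored.

I must be able to split this data on the commas, and I must know which data item is associated with which attribute.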

Example dataset files: 1: https://drive.google.com/open?id=0By6GDPYLwp2cSkd5M0J0ZjczVW8 2: https://drive.google.com/open?id=0By6GDPYLwp2cejB5SVlhTFdubnM

I have no experience with parsing and no experience with Python either, so I wanted to ask the experts for an easy way to do this.

Thanks

Answer 1

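Here is a simple solution I came up with:

The idea is to read the file line by line and apply a rule depending on the type of line encountered.

As you can see in the sample input, there are roughly four types of input you can encounter.

A comment, which starts with '%' -> no action is needed here.
A blank line, i.e. '\n' -> no action is needed here.
A line that starts with @, which is either an attribute declaration or the relation name (the @data marker also starts with @ and is simply skipped).
If it is none of these, then it is the data itself.

The code follows simple if-else logic, taking an action at each step based on the four rules above.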

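with open("../Downloads/Reading_Data_Files.txt", "r") as dataFl:
    lines = [line.strip() for line in dataFl]  # strip the trailing newline once

relationName = None
attribute = []
data = []
for line in lines:
    # Blank line, comment, or the @data marker itself: nothing to do.
    # Matching on the start of the line avoids skipping rows that merely contain "data".
    if not line or line.startswith("%") or line.lower().startswith("@data"):
        pass
    elif line.startswith("@"):
        if line.lower().startswith("@relation"):
            relationName = line.split()[1]
        elif line.lower().startswith("@attribute"):
            attribute.append(line.split()[1])  # keep only the attribute name
    else:
        # A data row: split on commas and trim the whitespace around each value.
        data.append(list(map(lambda x : x.strip(),line.split(","))))

print("Relation Name is : %s" %relationName)
print("Attributes are " + ','.join(attribute))
print(data)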

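If you want to see which value belongs to which attribute, here is a solution that is essentially the same as the one above with a small tweak. The only problem with the solution above is that the output is a list of lists, so telling which attribute is which is difficult. A better solution is therefore to annotate each data element with its corresponding attribute name. The output will be of the form: {'distance': '45', 'temperature': '75', 'BusArrival': 'on_time', 'Students': '25'}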

with open("/Users/sreejithmenon/Downloads/Reading_Data_Files.txt", "r") as dataFl:
    lines = [line.strip() for line in dataFl]  # strip the trailing newline once

relationName = None
attribute = []
data = []
for line in lines:
    # Blank line, comment, or the @data marker itself: nothing to do.
    if not line or line.startswith("%") or line.lower().startswith("@data"):
        pass
    elif line.startswith("@"):
        if line.lower().startswith("@relation"):
            relationName = line.split()[1]
        elif line.lower().startswith("@attribute"):
            attribute.append(line.split()[1])  # keep only the attribute name
    else:
        dataLine = list(map(lambda x : x.strip(),line.split(",")))
        # Each line of data is now a dictionary keyed by attribute name.
        # zip stops at the shorter sequence, so a short row does not raise IndexError.
        dataDict = dict(zip(attribute, dataLine))
        data.append(dataDict)

print("Relation Name is : %s" %relationName)
print("Attributes are " + ','.join(attribute))
print(data)
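The format description in the question also mentions attribute specifications (numeric, or a list of nominal values) and '?' for missing values, which the code above does not record. Here is a rough sketch of one way to keep that information. The function name parse_dataset and the tiny inline sample are made up for illustration; they are not taken from the example files, and an @attribute line without a specification would raise an error in this sketch.

```python
import io

def parse_dataset(lines):
    """Parse lines in the format described in the question (hypothetical helper)."""
    relation = None
    attributes = []   # (name, spec) pairs, in declaration order
    rows = []
    in_data = False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # blank line or comment
        lowered = line.lower()
        if lowered.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lowered.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{") and spec.endswith("}"):
                # nominal attribute: keep the list of allowed values
                spec = [v.strip() for v in spec[1:-1].split(",")]
            else:
                spec = spec.lower()  # e.g. "numeric"
            attributes.append((name, spec))
        elif lowered.startswith("@data"):
            in_data = True
        elif in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, spec), value in zip(attributes, values):
                if value == "?":
                    row[name] = None          # missing value
                elif spec == "numeric":
                    row[name] = float(value)  # numeric attributes become floats
                else:
                    row[name] = value
            rows.append(row)
    return relation, attributes, rows

# Made-up sample input, only to show the call.
sample = io.StringIO("""% a tiny made-up dataset
@relation bus
@attribute distance numeric
@attribute BusArrival {on_time, before, after}
@data
45, on_time
?, before
""")
relation, attributes, rows = parse_dataset(sample)
print(relation)
print(attributes)
print(rows)
```

Passing an open file object instead of the StringIO should work the same way, since both yield lines when iterated.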

You can use a pandas DataFrame for further analysis, slicing, querying, and so on. The following link should help you get started: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

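Edit: an explanation of the line asked about in the comments: dataLine = list(map(lambda x : x.strip(),line.split(","))) The split(<delimiter>) function splits a string wherever the delimiter occurs and returns the pieces as a list (an iterable).

For example, "hello, world".split(",") returns ['hello', ' world']. Note the space in front of "world".

map is a function that applies a function (its first argument) to every element of an iterable (its second argument). It is commonly used as shorthand for applying a transformation to each element of an iterable. strip() removes any leading or trailing whitespace. The lambda expression is a small anonymous function; here it simply applies strip. map() takes each element from the iterable, passes it to the lambda function, and collects the returned values into the final result. Please read more about the map function online. Prerequisite: lambda expressions.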
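To make the explanation above concrete, here is a small sketch that runs each step on a made-up data line (the line itself is an assumption, shaped like the rows in the output above). It also shows a list comprehension, which gives the same result as the map/lambda version.

```python
line = "45, 75, on_time, 25\n"  # made-up data line, including the trailing newline

pieces = line.split(",")                            # split on commas only
with_map = list(map(lambda x: x.strip(), pieces))   # the form used in the answer
with_comprehension = [x.strip() for x in pieces]    # equivalent list comprehension

print(pieces)       # the raw pieces still carry spaces and the newline
print(with_map)     # whitespace removed from both ends of every piece
print(with_map == with_comprehension)
```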

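Second part from the comments: When I enter 'print(data[0])' it prints all the data with their attributes. What if I want to print only the number of students in the 5th row? What if I want to multiply the number of students by the corresponding temperature and store the result in a new column with the corresponding index?

print(data[0]) should give you the first row as is, together with its attribute names; it should look something like this.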

data[0]
Out[63]: 
{'BusArrival': 'on_time',
 'Students': '25',
 'distance': '45',
 'temperature': '75'}
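If you would rather not bring in another library, both questions can also be answered directly on the list of dictionaries. This is only a sketch: the rows below are typed in by hand from the output shown in this answer, and the key studentTempProd is just a name picked for the new field.

```python
# Rows copied by hand from the output in this answer; values are still strings.
data = [
    {"BusArrival": "on_time", "Students": "25", "distance": "45", "temperature": "75"},
    {"BusArrival": "before", "Students": "12", "distance": "40", "temperature": "70"},
    {"BusArrival": "after", "Students": "49", "distance": "50", "temperature": "80"},
    {"BusArrival": "on_time", "Students": "24", "distance": "44", "temperature": "74"},
    {"BusArrival": "before", "Students": "15", "distance": "38", "temperature": "75"},
]

# The 5th row is at index 4 because list indices start at 0.
students_fifth_row = int(data[4]["Students"])
print(students_fifth_row)

# Store the product in each row under a new key (name chosen for illustration).
for row in data:
    row["studentTempProd"] = int(row["Students"]) * int(row["temperature"])

print([row["studentTempProd"] for row in data])
```

Note that int() will fail on a '?' value, so rows with missing data would need to be skipped or handled separately.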

I suggest you use pandas DataFrames for quick manipulation of the data.

import pandas as pd
df = pd.DataFrame(data)
df
Out[69]: 
  BusArrival Students distance temperature
0     on_time       25       45          75
1      before       12       40          70
2       after       49       50          80
3     on_time       24       44          74
4      before       15       38          75
    # and so on

Now, to extract just the row at index 5 (iloc is zero-based, so this is the sixth row),

df.iloc[5]
Out[73]: 
BusArrival     after
Students          45
distance          49
temperature       85
Name: 5, dtype: object

The product of students and temperature is now simple:

df['Students'] = df['Students'].astype('int') # making sure they are not strings
df['temperature'] = df['temperature'].astype('int') 
df['studentTempProd'] = df['Students'] * df['temperature']

df
Out[82]: 
   BusArrival  Students distance  temperature  studentTempProd
0     on_time        25       45           75             1875
1      before        12       40           70              840
2       after        49       50           80             3920
3     on_time        24       44           74             1776
4      before        15       38           75             1125
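One caveat with astype('int'): the question says missing values appear as '?', and converting such a string to int raises an error. A possible workaround, sketched here on a small made-up frame, is pd.to_numeric with errors="coerce", which turns anything unparseable into NaN. The column values below are assumptions for illustration only.

```python
import pandas as pd

# Made-up frame with one missing value marked as '?'.
df = pd.DataFrame({
    "Students": ["25", "?", "49"],
    "temperature": ["75", "70", "80"],
})

# Convert to numbers; '?' cannot be parsed and becomes NaN.
for column in ["Students", "temperature"]:
    df[column] = pd.to_numeric(df[column], errors="coerce")

# NaN propagates through the multiplication, so the missing row stays missing.
df["studentTempProd"] = df["Students"] * df["temperature"]
print(df)
```

The trade-off is that a column containing NaN is stored as float rather than int.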

You can do a lot more with pandas, such as extracting only the rows where the bus arrival was 'on_time', and so on.
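As a sketch of that kind of filtering, a boolean mask selects the matching rows. The frame below is rebuilt by hand from the first five rows shown above so that the snippet runs on its own.

```python
import pandas as pd

# First five rows from the output above, typed in by hand.
df = pd.DataFrame({
    "BusArrival": ["on_time", "before", "after", "on_time", "before"],
    "Students": [25, 12, 49, 24, 15],
    "temperature": [75, 70, 80, 74, 75],
})

# The comparison yields a True/False Series; indexing with it keeps the True rows.
on_time = df[df["BusArrival"] == "on_time"]
print(on_time)

# Any further calculation can then be restricted to those rows.
mean_students = on_time["Students"].mean()
print(mean_students)
```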

Text Parser in Python
