Converting sequential data from a .txt file into a dataframe

Asked: 2020-11-05 11:18:13

Tags: python dataframe machine-learning data-science text-mining

Hello Data Science community, I am new to data science and Python programming. This is the structure of my txt file, although many values are missing:

#*Improved Channel Routing by Via Minimization and Shifting.
#@Chung-Kuan Cheng
David N. Deutsch
#t1988
#cDAC
#index131751
#%133716
#%133521
#%134343
#!Channel routing area improvement by means of via minimization and via shifting in two dimensions (compaction) is readily achievable. Routing feature area can be minimized by wire straightening. The implementation of algorithms for each of these procedures has produced a solution for Deutsch's Difficult Example
 the standard channel routing benchmark
 that is more than 5% smaller than the best result published heretofore. Suggestions for possible future work are also given.

#*A fast simultaneous input vector generation and gate replacement algorithm for leakage power reduction.
#@Lei Cheng
Liang Deng
Deming Chen
Martin D. F. Wong
#t2006
#cDAC
#index131752
#%132550
#%530568
#%436486
#%134259
#%283007
#%134422
#%282140
#%1134324
#!Input vector control (IVC) technique is based on the observation that the leakage current in a CMOS logic gate depends on the gate input state
 and a good input vector is able to minimize the leakage when the circuit is in the sleep mode. The gate replacement technique is a very effective method to further reduce the leakage current. In this paper
 we propose a fast algorithm to find a low leakage input vector with simultaneous gate replacement. Results on MCNC91 benchmark circuits show that our algorithm produces 14% better leakage current reduction with several orders of magnitude speedup in runtime for large circuits compared to the previous state-of-the-art algorithm. In particular
 the average runtime for the ten largest combinational circuits has been dramatically reduced from 1879 seconds to 0.34 seconds.

#*On the Over-Specification Problem in Sequential ATPG Algorithms.
#@Kwang-Ting Cheng
Hi-Keung Tony Ma
#t1992
#cDAC
#index131756
#%455537
#%1078626
#%131745
#!The authors show that some ATPG (automatic test pattern generation) programs may err in identifying untestable faults. These test generators may not be able to find the test sequence for a testable fault
 even allowed infinite run time
 and may mistakenly claim it as untestable. The main problem of these programs is that the underlying combinational test generation algorithm may over-specify the requirements at the present state lines. A necessary condition that the underlying combinational test generation algorithm must satisfy is considered to ensure a correct sequential ATPG program. It is shown that the simple D-algorithm satisfies this condition while PODEM and the enhanced D-algorithm do not. The impact of over-specification on the length of the generated test sequence was studied. Over-specification caused a longer test sequence. Experimental results are presented

#*Device and architecture co-optimization for FPGA power reduction.
#@Lerong Cheng
Phoebe Wong
Fei Li
Yan Lin
Lei He
#t2005
#cDAC
#index131759
#%214244
#%215701
#%214503
#%282575
#%214411
#%214505
#%132929
#!Device optimization considering supply voltage Vdd and threshold voltage Vt tuning does not increase chip area but has a great impact on power and performance in the nanometer technology. This paper studies the simultaneous evaluation of device and architecture optimization for FPGA. We first develop an efficient yet accurate timing and power evaluation method
 called trace-based model. By collecting trace information from cycle-accurate simulation of placed and routed FPGA benchmark circuits and re-using the trace for different Vdd and Vt
 we enable the device and architecture co-optimization for hundreds of combinations. Compared to the baseline FPGA which has the architecture same as the commercial FPGA used by Xilinx
 and has Vdd suggested by ITRS but Vt optimized by our device optimization
 architecture and device co-optimization can reduce energy-delay product by 20.5% without any chip area increase compared to the conventional FPGA architecture. Furthermore
 considering power-gating of unused logic blocks and interconnect switches
 our co-optimization method reduces energy-delay product by 54.7% and chip area by 8.3%. To the best of our knowledge
 this is the first in-depth study on architecture and device co-optimization for FPGAs.

I want to convert this into a dataframe using Python. The lines starting with #@ are the authors, #! is the abstract, #* is the title, #% are the references, #index is the id, and #c is the venue. Each article starts with its title; the problem probably has to do with the abstracts.

I tried different approaches, for example:

import csv
with open('names7.csv', 'w', encoding="utf-8") as csvfile:
    fieldnames = ["Venue", "Year", "Authors","Title","id","ListCitation","NbrCitations","Abstract"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    with open(r"C:\Users\lenovo\Downloads\1.txt", "r", encoding="utf-8") as f:
        cnt = 1
        for line in f :           
            if line.startswith('#*'):
                writer.writerow({'Title': line})
                cnt += 1
            elif line.startswith('#@'):
                writer.writerow({'Authors': line})
                cnt += 1
            elif line.startswith("#t"):
                writer.writerow({'Year': line})
                cnt += 1
            elif line.startswith("#!"):
                writer.writerow({'Abstract': line})
                cnt += 1
            elif line.startswith("#c"):
                writer.writerow({'Venue': line})
                cnt +=1
            elif line.startswith("#index"):
                writer.writerow({'id': line}) 
                cnt +=1
            else:
                writer.writerow({'ListCitation': line}) 
                cnt +=1
    f.close()

I tried this approach but it did not work. I want to convert the file into a dataframe with the columns described above. How can I convert this file into a dataframe and store the result in a CSV file?

Output of the code from answer 1:

The output I want:

For example, in this case the Abstract column has blank lines between paragraphs, which causes problems for me, so this case must be handled; likewise, the References column can contain many references, so those must be taken into account as well.

#*Total power reduction in CMOS circuits via gate sizing and multiple threshold voltages.
#@Feng Gao
John P. Hayes
#t2005
#cDAC
#index132139
#%437038
#%437006
#%436596
#%285977
#%1135402
#%132206
#%194016
#%143061
#!Minimizing power consumption is one of the most important objectives in IC design. Resizing gates and assigning different Vt's are common ways to meet power and timing budgets. We propose an automatic implementation of both these techniques using a mixedinteger linear programming model called MLP-exact
 which minimizes a circuit's total active-mode power consumption. Unlike previous linear programming methods which only consider local optimality
 MLP-exact
 can find a true global optimum. An efficient
 non-optimal way to solve the MLP model
 called MLP-fast

 is also described. We present a set of benchmark experiments which show that MLP-fast
 is much faster than MLP-exact

 while obtaining designs with only slightly higher power consumption. Furthermore
 the designs generated by MLP-fast
 consume 30% less power than those obtained by conventional
 sensitivity-based methods.

3 Answers:

Answer 0 (score: 0)

csv.DictWriter().writerow() takes a dictionary representing the entire row for one record. What is happening in your code is that every call to writerow creates a brand-new row instead of adding to the current one. What you should do instead is:

  1. Define a variable, say row, as a dictionary
  2. Store the values of the entire record in this variable
  3. Write this dictionary to the CSV with a single writerow call

This writes the whole row to the CSV instead of writing a new row for every new value.
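The accumulate-then-write pattern the steps above describe can be sketched like this (a minimal, self-contained illustration with made-up sample lines, not the full parser):

```python
import csv
import io

# minimal sketch: build one complete row dict per article, then write it once
lines = ["#*Some Title", "#t1988", "#cDAC"]
prefixes = {"#*": "Title", "#t": "Year", "#c": "Venue"}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Title", "Year", "Venue"])
writer.writeheader()

row = {}
for line in lines:
    for prefix, col in prefixes.items():
        if line.startswith(prefix):
            row[col] = line[len(prefix):]  # strip the tag, keep the value
            break
writer.writerow(row)  # one call per article, not one per field

print(buf.getvalue())
```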

That is not the only problem we have to fix, though. Every line in the text document that has no leading tag is treated as a value without one. For example, the abstract starting with #!Channel spans three lines of text, but only the first line is recognized as the abstract; the other two are treated as something else.
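One convenient way to detect such untagged continuation lines is that Python's str.startswith also accepts a tuple of prefixes; the is_continuation helper below is just an illustration:

```python
# str.startswith accepts a tuple: True if the line starts with any of the prefixes
prefixes = ("#*", "#@", "#t", "#c", "#index", "#%", "#!")

def is_continuation(line: str) -> bool:
    """A line that starts with none of the known tags belongs to the previous field."""
    return not line.startswith(prefixes)

print(is_continuation("#!Channel routing area improvement ..."))  # False: starts a new field
print(is_continuation(" the standard channel routing benchmark")) # True: continuation line
```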

Below is an improved, documented version that uses a dictionary to map field names to their corresponding tags. To add a new case, simply modify the keys dictionary and fieldnames.

"""
#@ are authors
#! are abstracts
#* are titles
#% are references
#index are index
#c are venues using python Each article start by its title , the problem may have to do with abstracts
"""

# use dictionary to store fieldnames with corresponding id's/tags
keys = {
        'Venue': '#c',
        'Year':'#t',
        'Authors':'#@',
        'Title':'#*',
        'id': '#index',
        'References': '#%',
        'Abstract': '#!',
}

fieldnames = ["Venue", "Year", "Authors", "Title","id","Abstract", 'References']

outFile = 'names7.csv' # path to csv output
inFile = r"1.txt" # path to input text file

import csv
with open(outFile, 'w', encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    with open(inFile, "r", encoding="utf-8") as f:
        row = dict()
        stack = [] # (key, value) pairs, used to store repeated values such as references
        prev = "" # key of the last filled column; must persist across lines
        for line in f:
            line = line.strip() # remove whitespace from beginning and end of line
            # a line starting with no known tag continues the previous field;
            # this handles values that span several lines or contain blank lines
            if prev in row and not any(line.startswith(p) for p in keys.values()):
                row[prev] += ' ' + line
                continue # go to next line in text
            for col in fieldnames:
                # initiate or append to the current value; handles repeated tags (References #%)
                if col in keys and line.startswith(keys[col]):
                    # remove the tag prefix
                    line = line[len(keys[col]):]
                    if col in row:
                        stack.append((col, line))
                    else:
                        row[col] = line
                    prev = col # remember the column in case the value spans lines
                    break # go to next line in text
        writer.writerow(row)
        for col, line in stack:
            row[col] = line
            writer.writerow(row)

Results produced for the test case given above.
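Since the original goal is a dataframe, the generated CSV can then simply be loaded back with pandas (assuming pandas is available; names7.csv stands for the file written above, replaced here by an inline sample so the snippet is self-contained):

```python
import io
import pandas as pd

# in practice: df = pd.read_csv("names7.csv", dtype={"id": str})
# here, a tiny inline sample stands in for the generated file
sample = io.StringIO(
    "Venue,Year,Authors,Title,id,Abstract,References\n"
    "DAC,1988,Chung-Kuan Cheng David N. Deutsch,Improved Channel Routing,131751,...,133716\n"
)
df = pd.read_csv(sample, dtype={"id": str})  # keep ids as strings, not integers
print(df.shape)  # (1, 7)
```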


Answer 1 (score: 0)

An update of the previous answer, adapted to this specific text file.

"""
#@ are authors
#! are abstracts
#* are titles
#% are references
#index are index
#c are venues using python Each article start by its title , the problem may have to do with abstracts
"""

# use dictionary to store fieldnames with corresponding id's/tags
keys = {
    '#c': 'Venue',
    '#t': 'Year',
    '#@': 'Authors',
    '#*': 'Title',
    '#index': 'id',
    '#%': 'References',
    '#!': 'Abstract'
}


fieldnames = ["Venue", "Year", "Authors", "Title", "NbrAuthor", "id", "ListCitation", "NbrCitation", "References", "NbrReferences", "Abstract"]


# References and Authors are each collected onto one line
# Count the number of authors and references
# We want to fill Authors, NbrAuthor, and NbrCitation

outFile = 'names7.csv' # path to csv output
inFile = r"1.txt" # path to input text file

import csv
import re
with open(outFile, 'w', encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    with open(inFile, "r", encoding="utf-8") as f:
        row = dict()
        prev = ""
        for line in f.readlines():
            line = line.strip() # remove any leading or trailing whitespace
            # drop the last parenthesized group on the line, if any
            query = re.findall(r'\([^)]*\)', line)
            if len(query) > 0:
                line = line.replace(query[-1], '')
            # if none of the keys match, then belongs to previous key
            if prev != "" and not any([line.startswith(k) for k in keys]):
                if prev == 'Abstract':
                    row[prev] += " " + line
                else:
                    row[prev] += ", " + line
            else:
                for k in keys:
                    prefix = ""
                    if line.startswith(k):
                        # remove prefix
                        prefix = k
                        line = line[len(prefix):]
                        if keys[k] in row:
                            if keys[k] == "References":
                                row[keys[k]] += ", " + line
                            else:
                                row[keys[k]] += " " + line
                        else:
                            row[keys[k]] = line
                        prev = keys[k]
        # count the number of authors and references; the input contains no citation info
        row["NbrAuthor"] = row["Authors"].count(',') + 1
        row["NbrCitation"] = 0
        row["NbrReferences"] = row["References"].count(',') + 1
        writer.writerow(row)

Answer 2 (score: 0)

Edit: added a clause to the if statement.

prefixes = {
    '#*': 'Title',
    '#@': 'Authors',
    '#t': 'Year',
    '#c': 'Venue',
    '#index': 'id',
    '#%': 'References',
    '#!': 'Abstract',
}


outFile = 'names7.csv' # path to csv output
inFile = r"1.txt" # path to input text file

import csv
import re
with open(outFile, 'w', encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=list(prefixes.values()) + ['NbrAuthor', 'NbrCitations', 'ListCitations'])
    writer.writeheader()
    with open(inFile, "r", encoding="utf-8") as f:
        row = dict()
        prev = ''
        for line in f.readlines():
            # remove leading and trailing whitespace
            line = line.strip()

            # remove close brackets at end of lines
            # query = re.findall(r'\([^)]*\)', line)
            # if len(query) > 0:
            #     line = line.replace(query[-1], '')

            for prefix, col in prefixes.items():
                if line.startswith(prefix):
                    line = line[len(prefix):]
                    if col in ('Authors', 'Abstract', 'References'):
                        # multi-line / repeated fields are accumulated below;
                        # only initialize on the first occurrence
                        row.setdefault(col, "")
                    else:
                        row[col] = line
                    prev = prefix
                    break

            # special cases: fields whose value spans several lines or repeated tags
            try:
                if prev == '#@':
                    row['Authors'] = line if row['Authors'] == "" else row['Authors'] + ', ' + line
                elif prev == '#%':
                    row['References'] = line if row['References'] == "" else row['References'] + ', ' + line
                elif prev == '#!':
                    row['Abstract'] = line if row['Abstract'] == "" else row['Abstract'] + ' ' + line
            except Exception as e:
                print(e)
                
            if len(line) == 0:
                row['NbrAuthor'] = row['Authors'].count(',') + 1
                row['NbrCitations'] = 0
                writer.writerow(row)
                prev = ''
                row = dict()
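For completeness, the conversion can also be sketched end-to-end without csv.DictWriter, by accumulating one dict per article and handing the list to pandas. This is a minimal sketch, not a drop-in replacement for the answers above: the parse helper is hypothetical, pandas is assumed to be installed, and repeated tags are joined with "; ".

```python
import pandas as pd

# tag -> column mapping, following the question's format
PREFIXES = {"#*": "Title", "#@": "Authors", "#t": "Year", "#c": "Venue",
            "#index": "id", "#%": "References", "#!": "Abstract"}

def parse(text: str) -> pd.DataFrame:
    records, row, prev = [], {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#*") and row:  # a new title begins the next article
            records.append(row)
            row, prev = {}, None
        for prefix, col in PREFIXES.items():
            if line.startswith(prefix):
                value = line[len(prefix):]
                # repeated tags (e.g. several #% references) are joined with "; "
                row[col] = row[col] + "; " + value if col in row else value
                prev = col
                break
        else:
            # no tag matched: continuation of the previous field (e.g. a wrapped abstract)
            if prev is not None:
                row[prev] += " " + line
    if row:
        records.append(row)
    return pd.DataFrame(records, columns=list(PREFIXES.values()))

sample = """#*Title One
#@Author A
Author B
#t2005
#cDAC
#index1
#%10
#%11
#!An abstract
 that wraps."""
df = parse(sample)
print(df[["Title", "References", "Abstract"]])
```

The DataFrame can then be saved with df.to_csv("names7.csv", index=False), which covers both halves of the question.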