Extract text from a txt file and convert it into a df

Date: 2020-03-06 10:48:59

Tags: python python-3.x pandas dataframe

I have a txt file with the following contents:

google.com('172.217.163.46', 443)
        commonName: *.google.com
        issuer: GTS CA 1O1
        notBefore: 2020-02-12 11:47:11
        notAfter:  2020-05-06 11:47:11

facebook.com('31.13.79.35', 443)
        commonName: *.facebook.com
        issuer: DigiCert SHA2 High Assurance Server CA
        notBefore: 2020-01-16 00:00:00
        notAfter:  2020-04-15 12:00:00

How can I convert this into a df?

I tried this and had partial success:

import numpy as np
import pandas as pd
from io import StringIO

with open("out.txt", "r") as f:
    data = f.read()

a = (pd.read_csv(StringIO(data),
                 header=None,
                 # use a delimiter not present in the text file
                 # forces pandas to read data into one column
                 sep="/",
                 names=['string'])
     # limit number of splits to 1
     .string.str.split(':', n=1, expand=True)
     .rename({0: 'Name', 1: 'temp'}, axis=1)
     .assign(temp=lambda x: np.where(x.Name.str.strip()
                                     # look for a string that ends
                                     # with a bracket
                                     .str.match(r'(.*[)]$)'),
                                     x.Name,
                                     x.temp),
             Name=lambda x: x.Name.str.replace(r'(.*[)]$)', 'Name')
             )
     # remove whitespace
     .assign(Name=lambda x: x.Name.str.strip())
     .pivot(columns='Name', values='temp')
     .ffill()
     .dropna(how='any')
     .reset_index(drop=True)
     .rename_axis(None, axis=1)
     .filter(['Name', 'commonName', 'issuer', 'notBefore', 'notAfter'])
     )

But this loops and gives me multiple records, as if a single row were duplicated many times.

2 answers:

Answer 0 (score: 1)

The file is not in CSV format, so you should not read it with read_csv; parse it by hand instead. You could do it like this:

import pandas as pd

with open("out.txt") as fd:
    cols = {'commonName','issuer','notBefore','notAfter'}  # columns to keep
    rows = []                                              # list of records
    for line in fd:
        line = line.strip()
        if ':' in line:
            elt = line.split(':', 1)                       # data line: parse it
            if elt[0] in cols:
                rec[elt[0]] = elt[1]
        elif len(line) > 0:
            rec = {'Name': line}                           # initial line of a block
            rows.append(rec)

a = pd.DataFrame(rows)         # and build the dataframe from the list of records

It gives:

                                Name       commonName                                   issuer               notAfter             notBefore
0  google.com('172.217.163.46', 443)     *.google.com                               GTS CA 1O1    2020-05-06 11:47:11   2020-02-12 11:47:11
1   facebook.com('31.13.79.35', 443)   *.facebook.com   DigiCert SHA2 High Assurance Server CA    2020-04-15 12:00:00   2020-01-16 00:00:00
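
Note that because the values come from splitting on ':', each one keeps a leading space (visible in the output above). If that matters, a minimal cleanup sketch, assuming the DataFrame is named a as above, is:

# strip the stray whitespace left over from the split on ':'
a = a.apply(lambda col: col.str.strip())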

Answer 1 (score: 0)

Try this:

import pandas as pd

# ==============
# read text file
# ==============
with open('in.txt') as file:
    lines = file.readlines()

# ==============
# create a dict
# ==============
mydict = {}
for i in range(0,len(lines),6):

    # ==============
    # add "Name" to dict
    # ==============
    if 'Name' not in mydict:
        mydict['Name']=[]

    mydict['Name'].append(lines[i].strip('\n'))

    # ==============
    # add other cols to dict
    # ==============
    for line in lines[i+1:i+5]:
        key,*value = line.strip().strip('\n').split(':',maxsplit=1)
        if key not in mydict:
            mydict[key]=[]
        mydict[key].append(''.join(value).strip())

pd.DataFrame(mydict)

Output:

+----+-----------------------------------+----------------+----------------------------------------+---------------------+---------------------+
|    | Name                              | commonName     | issuer                                 | notBefore           | notAfter            |
|----+-----------------------------------+----------------+----------------------------------------+---------------------+---------------------|
|  0 | google.com('172.217.163.46', 443) | *.google.com   | GTS CA 1O1                             | 2020-02-12 11:47:11 | 2020-05-06 11:47:11 |
|  1 | facebook.com('31.13.79.35', 443)  | *.facebook.com | DigiCert SHA2 High Assurance Server CA | 2020-01-16 00:00:00 | 2020-04-15 12:00:00 |
+----+-----------------------------------+----------------+----------------------------------------+---------------------+---------------------+
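
Both answers leave notBefore and notAfter as plain strings. If you need real timestamps, one optional follow-up (a sketch assuming you assign the result to a variable, e.g. df = pd.DataFrame(mydict)) is to run those columns through pd.to_datetime:

df = pd.DataFrame(mydict)
# optional: parse the certificate validity columns into datetimes
df['notBefore'] = pd.to_datetime(df['notBefore'])
df['notAfter'] = pd.to_datetime(df['notAfter'])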