Converting a specially formatted text document into a Pandas DataFrame

Date: 2019-04-22 19:10:11

Tags: python pandas

I have a text file in the following format:

1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 

I need to convert this text into a data frame of the following form:

Id   Term    weight
1    frack   0.733
1    shale   0.700
10   space   0.645
10   station 0.327
10   nasa    0.258
4    celebr  0.262
4    bahar   0.345

How can I do this?

9 answers:

Answer 0 (score: 12)

Here is an optimized way to parse the file with re, first grabbing the id and then parsing the data tuples. This takes advantage of the fact that file objects are iterable: when you iterate over an open file, you get its individual lines as strings, from which you can extract the meaningful data elements.

import re
import pandas as pd

SEP_RE = re.compile(r":\s+")
DATA_RE = re.compile(r"(?P<term>[a-z]+)\s+(?P<weight>\d+\.\d+)", re.I)


def parse(filepath: str):
    def _parse(filepath):
        with open(filepath) as f:
            for line in f:
                id, rest = SEP_RE.split(line, maxsplit=1)
                for match in DATA_RE.finditer(rest):
                    yield [int(id), match["term"], float(match["weight"])]
    return list(_parse(filepath))

Example:

>>> df = pd.DataFrame(parse("/Users/bradsolomon/Downloads/doc.txt"),
...                   columns=["Id", "Term", "weight"])
>>> 
>>> df
   Id     Term  weight
0   1    frack   0.733
1   1    shale   0.700
2  10    space   0.645
3  10  station   0.327
4  10     nasa   0.258
5   4   celebr   0.262
6   4    bahar   0.345

>>> df.dtypes
Id          int64
Term       object
weight    float64
dtype: object

Walkthrough

SEP_RE looks for the initial separator: a literal : followed by one or more whitespace characters. It stops after the first split thanks to maxsplit=1. Granted, this assumes your data is strictly formatted, i.e. that the entire dataset consistently follows the example format laid out in your question.

After that, DATA_RE.finditer() handles each (term, weight) pair extracted from rest. The string rest itself looks like frack 0.733, shale 0.700,. .finditer() gives you multiple match objects, where you can use the ["key"] notation to access elements of a given named capture group, such as (?P<term>[a-z]+).

An easy way to visualize this is to use an example line from your file as a string:

>>> line = "1: frack 0.733, shale 0.700,\n"
>>> SEP_RE.split(line, maxsplit=1)
['1', 'frack 0.733, shale 0.700,\n']

Now you have the initial ID and the rest of the components, which you can unpack into two identifiers:

>>> id, rest = SEP_RE.split(line, maxsplit=1)
>>> it = DATA_RE.finditer(rest)
>>> match = next(it)
>>> match
<re.Match object; span=(0, 11), match='frack 0.733'>
>>> match["term"]
'frack'
>>> match["weight"]
'0.733'

An even better way to visualize this is with pdb. Give it a try if you dare ;)
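
For example, here is a small sketch (assuming SEP_RE and DATA_RE from above are in scope) of a helper that drops into pdb so you can poke at id, rest, and each match interactively:

import pdb

def inspect_line(line):
    id, rest = SEP_RE.split(line, maxsplit=1)
    pdb.set_trace()  # step through here and inspect id, rest, and the matches
    return [(int(id), m["term"], float(m["weight"])) for m in DATA_RE.finditer(rest)]

inspect_line("1: frack 0.733, shale 0.700,\n")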

Disclaimer

This is one of those questions that calls for a particular type of solution, one that may not generalize well if you loosen the restrictions on your data format.

For instance, it assumes that each Term can only contain uppercase or lowercase ASCII letters and nothing else. If your identifiers include other Unicode characters, you will want to look at other re character classes, such as \w.
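
For instance, a looser pattern built on \w (an illustrative variant, not the pattern used above) would also accept digits, underscores, and non-ASCII word characters in the term:

>>> DATA_RE_LOOSE = re.compile(r"(?P<term>\w+)\s+(?P<weight>\d+(?:\.\d+)?)")
>>> [m.group("term", "weight") for m in DATA_RE_LOOSE.finditer("café_42 0.5, nasa 0.258,")]
[('café_42', '0.5'), ('nasa', '0.258')]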

Answer 1 (score: 4)

You can use the DataFrame constructor if you massage your input into the appropriate format. Here is one way:

import pandas as pd
from itertools import chain

text="""1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 """

df = pd.DataFrame(
    list(
        chain.from_iterable(
            map(lambda z: (y[0], *z.strip().split()), y[1].split(",")) for y in 
            map(lambda x: x.strip(" ,").split(":"), text.splitlines())
        )
    ), 
    columns=["Id", "Term", "weight"]
)

print(df)
#   Id     Term weight
#0   1    frack  0.733
#1   1    shale  0.700
#2  10    space  0.645
#3  10  station  0.327
#4  10     nasa  0.258
#5   4   celebr  0.262
#6   4    bahar  0.345

Explanation

I assume you have read your file into the string text. The first thing you want to do is strip the surrounding commas/whitespace and then split on the colon (:):

print(list(map(lambda x: x.strip(" ,").split(":"), text.splitlines())))
#[['1', ' frack 0.733, shale 0.700'], 
# ['10', ' space 0.645, station 0.327, nasa 0.258'], 
# ['4', ' celebr 0.262, bahar 0.345']]

The next step is to split on the commas to separate the values, and to assign the Id to each group of values:

print(
    [
        list(map(lambda z: (y[0], *z.strip().split()), y[1].split(","))) for y in 
        map(lambda x: x.strip(" ,").split(":"), text.splitlines())
    ]
)
#[[('1', 'frack', '0.733'), ('1', 'shale', '0.700')],
# [('10', 'space', '0.645'),
#  ('10', 'station', '0.327'),
#  ('10', 'nasa', '0.258')],
# [('4', 'celebr', '0.262'), ('4', 'bahar', '0.345')]]

Finally, we flatten this output with itertools.chain.from_iterable, and the result can then be passed directly to the DataFrame constructor.

Note: the * tuple unpacking is a Python 3 feature.
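
As a minimal illustration of both pieces (the *-unpacking inside a tuple and the flattening step), using one of the intermediate values shown above as y:

from itertools import chain

y = ['10', ' space 0.645, station 0.327, nasa 0.258']
z = ' space 0.645'
print((y[0], *z.strip().split()))
#('10', 'space', '0.645')

print(list(chain.from_iterable([[1, 2], [3], [4, 5]])))
#[1, 2, 3, 4, 5]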

Answer 2 (score: 4)

Assuming your data sits in a text file (untitled.txt) that looks exactly like the input shown in the question:

df = pd.read_csv('untitled.txt', sep=': ', header=None)
df.set_index(0, inplace=True)

# split the `,`
df = df[1].str.strip().str.split(',', expand=True)

#    0             1              2           3
#--  ------------  -------------  ----------  ---
# 1  frack 0.733   shale 0.700
#10  space 0.645   station 0.327  nasa 0.258
# 4  celebr 0.262  bahar 0.345

# stack and drop empty
df = df.stack()
df = df[~df.eq('')]

# split ' '
df = df.str.strip().str.split(' ', expand=True)

# edit to give final expected output:

# rename index and columns for reset_index
df.index.names = ['Id', 'to_drop']
df.columns = ['Term', 'weight']

# final df
final_df  = df.reset_index().drop('to_drop', axis=1)
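
One note on the read_csv call at the top of this answer (an aside, not from the original): a separator longer than one character, such as ': ', is treated as a regular expression, which makes pandas fall back to the Python parsing engine and may emit a ParserWarning. Passing engine='python' explicitly keeps the same behaviour without the warning:

df = pd.read_csv('untitled.txt', sep=': ', engine='python', header=None)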

Answer 3 (score: 3)

Just to throw in my two cents: you could write yourself a parser and feed the result into pandas:

import pandas as pd
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

file = """
1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 
"""

grammar = Grammar(
    r"""
    expr    = (garbage / line)+

    line    = id colon pair*
    pair    = term ws weight sep? ws?
    garbage = ws+

    id      = ~"\d+"
    colon   = ws? ":" ws?
    sep     = ws? "," ws?

    term    = ~"[a-zA-Z]+"
    weight  = ~"\d+(?:\.\d+)?"

    ws      = ~"\s+"
    """
)

tree = grammar.parse(file)

class PandasVisitor(NodeVisitor):
    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_pair(self, node, visited_children):
        term, _, weight, *_ = visited_children
        return (term.text, weight.text)

    def visit_line(self, node, visited_children):
        id, _, pairs = visited_children
        return [(id.text, *pair) for pair in pairs]

    def visit_garbage(self, node, visited_children):
        return None

    def visit_expr(self, node, visited_children):
        return [item
                for lst in visited_children
                for sublst in lst if sublst
                for item in sublst]

pv = PandasVisitor()
out = pv.visit(tree)

df = pd.DataFrame(out, columns=["Id", "Term", "weight"])
print(df)

This yields

   Id     Term weight
0   1    frack  0.733
1   1    shale  0.700
2  10    space  0.645
3  10  station  0.327
4  10     nasa  0.258
5   4   celebr  0.262
6   4    bahar  0.345

Here we build a grammar out of the pieces we might encounter: either a line or whitespace (garbage). A line consists of an id (e.g. 1), followed by a colon (:), whitespace, and pairs of a term and a weight, each eventually followed by a separator (sep).

Afterwards, we need a NodeVisitor class to actually do something with the retrieved AST.
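
A quick way to sanity-check the grammar and visitor on a single line (an illustrative snippet, assuming grammar and pv from above are already defined):

single = "10: space 0.645, station 0.327, nasa 0.258, \n"
print(pv.visit(grammar.parse(single)))
# [('10', 'space', '0.645'), ('10', 'station', '0.327'), ('10', 'nasa', '0.258')]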

Answer 4 (score: 1)

This can be done entirely with pandas:

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(u"""1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258, 
4: celebr 0.262, bahar 0.345 """), sep=":", header=None)

#df:
    0                                          1
0   1                 frack 0.733, shale 0.700, 
1  10   space 0.645, station 0.327, nasa 0.258, 
2   4                 celebr 0.262, bahar 0.345 

Column 1 is then turned into lists (one list of "term weight" strings per row), which are expanded row by row:

df[1] = df[1].str.split(",", expand=False)

dfs = []
for idx, rows in df.iterrows():
    print(rows)
    dfslice = pd.DataFrame({"Id": [rows[0]]*len(rows[1]), "terms": rows[1]})
    dfs.append(dfslice)
newdf = pd.concat(dfs, ignore_index=True)

# this creates newdf:
   Id           terms
0   1     frack 0.733
1   1     shale 0.700
2   1                
3  10     space 0.645
4  10   station 0.327
5  10      nasa 0.258
6  10                
7   4    celebr 0.262
8   4    bahar 0.345 

Now we just need to str-split the terms column and drop the empty rows:

newdf["terms"] = newdf["terms"].str.strip()
newdf = newdf.join(newdf["terms"].str.split(" ", expand=True))
newdf.columns = ["Id", "terms", "Term", "Weights"]
newdf = newdf.drop("terms", axis=1).dropna()

The resulting newdf:

   Id     Term Weights
0   1    frack   0.733
1   1    shale   0.700
3  10    space   0.645
4  10  station   0.327
5  10     nasa   0.258
7   4   celebr   0.262
8   4    bahar   0.345
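
As an aside (not part of the original answer), on pandas 0.25+ the iterrows loop that builds newdf could be replaced with DataFrame.explode, which expands each element of the lists in column 1 into its own row; the str-split and dropna steps afterwards stay the same:

newdf = (df.explode(1)                       # one row per "term weight" string
           .rename(columns={0: "Id", 1: "terms"})
           .reset_index(drop=True))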

Answer 5 (score: 0)

Here is another take on your question: create a list that will hold an [Id, Term, Weight] list for every pair, and then build the DataFrame from it.

import pandas as pd
file=r"give_your_path".replace('\\', '/')
my_list_of_lists=[]#creating an empty list which will contain lists of [Id Term Weight]
with open(file,"r+") as f:
    for line in f.readlines():#looping every line
        my_id=[line.split(":")[0]]#storing the Id in order to use it in every term
        #skip the empty piece left after a trailing comma instead of always dropping the last piece
        for term in [s.strip().split(" ") for s in line[line.find(":")+1:].split(",") if s.strip()]:
            my_list_of_lists.append(my_id+term)
df=pd.DataFrame.from_records(my_list_of_lists)#turning the lists to dataframe
df.columns=["Id","Term","weight"]#giving columns their names

Answer 6 (score: 0)

Can I assume that there is only one space before 'TERM'?

import pandas as pd

df=pd.DataFrame(columns=['ID','Term','Weight'])
with open('C:/random/d1','r') as readObject:
    for line in readObject:
        line=line.rstrip('\n')
        tempList1=line.split(':')
        tempList2=tempList1[1]
        tempList2=tempList2.strip().rstrip(',')  # drop surrounding spaces and any trailing comma
        tempList2=tempList2.split(',')
        for item in tempList2:
            e=item.split()  # split on whitespace, ignoring the leading space after each comma
            tempRow=[tempList1[0], e[0], e[1]]
            df.loc[len(df)]=tempRow
print(df)

Answer 7 (score: 0)

Perhaps this makes it easier to understand what is going on. You only need to adapt the code to read your file instead of the txt variable (a small sketch of that change follows the output below).

import pandas as pd

txt = """1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345"""

data = []
for line in txt.splitlines():
    key, values = line.split(':')
    for elements in values.split(','):
        if elements:
            term, weight = elements.split()
            data.append({'Id': key, 'Term': term, 'Weight': weight})

df = pd.DataFrame(data)

DF:

   Id    Term  Weight
0   1    frack  0.733
1   1    shale  0.700
2  10    space  0.645
3  10  station  0.327
4  10     nasa  0.258
5   4   celebr  0.262
6   4    bahar  0.345
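
For instance, reading from a file instead of the txt variable could look like this (the path is a placeholder):

with open('path/filename.txt') as f:
    txt = f.read()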

Answer 8 (score: -3)

1) You can read the file line by line.

2) Then you can split off the index on ':' and the values on ','

1)

with open('path/filename.txt','r') as filename:
   content = filename.readlines()

2)

content = [x.split(':') for x in content]

This will give you the following result:

content =[
    ['1','frack 0.733, shale 0.700,'],
    ['10', 'space 0.645, station 0.327, nasa 0.258,'],
    ['4','celebr 0.262, bahar 0.345 ']]
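
From here, one possible way to finish (a sketch, not part of the original answer) is to split each value string and feed the rows to the DataFrame constructor:

import pandas as pd

rows = []
for idx, values in content:
    for chunk in values.split(','):
        if chunk.strip():                      # skip the empty piece left by a trailing comma
            term, weight = chunk.split()
            rows.append([int(idx), term, float(weight)])

df = pd.DataFrame(rows, columns=['Id', 'Term', 'weight'])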