I have a text file in the following format:
1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345
I need to convert this text into a dataframe with the following format:
Id Term weight
1 frack 0.733
1 shale 0.700
10 space 0.645
10 station 0.327
10 nasa 0.258
4 celebr 0.262
4 bahar 0.345
How can I do this?
Answer 0 (score: 12)
Here is an optimized way to parse the file with re, first grabbing the id and then parsing the data tuples. This takes advantage of the fact that file objects are iterable: when you iterate over an open file, you get the individual lines as strings, from which you can extract the meaningful data elements.
import re
import pandas as pd

SEP_RE = re.compile(r":\s+")
DATA_RE = re.compile(r"(?P<term>[a-z]+)\s+(?P<weight>\d+\.\d+)", re.I)

def parse(filepath: str):
    def _parse(filepath):
        with open(filepath) as f:
            for line in f:
                id, rest = SEP_RE.split(line, maxsplit=1)
                for match in DATA_RE.finditer(rest):
                    yield [int(id), match["term"], float(match["weight"])]
    return list(_parse(filepath))
Example:
>>> df = pd.DataFrame(parse("/Users/bradsolomon/Downloads/doc.txt"),
... columns=["Id", "Term", "weight"])
>>>
>>> df
Id Term weight
0 1 frack 0.733
1 1 shale 0.700
2 10 space 0.645
3 10 station 0.327
4 10 nasa 0.258
5 4 celebr 0.262
6 4 bahar 0.345
>>> df.dtypes
Id int64
Term object
weight float64
dtype: object
SEP_RE looks for the initial separator: a literal : followed by one or more whitespace characters. It stops after the first split is found because of maxsplit=1. Granted, this assumes your data is strictly formatted: the format of your entire dataset consistently follows the example format laid out in the question.

After that, DATA_RE.finditer() deals with each (term, weight) pair extracted from rest. The string rest itself looks like frack 0.733, shale 0.700,. .finditer() gives you multiple match objects, where you can use ["key"] notation to access the element from a given named capture group, such as (?P<term>[a-z]+).
An easy way to visualize this is to use an example line from the file as a string:
>>> line = "1: frack 0.733, shale 0.700,\n"
>>> SEP_RE.split(line, maxsplit=1)
['1', 'frack 0.733, shale 0.700,\n']
Now you have the initial ID and the rest of the components, which you can unpack into two identifiers.
>>> id, rest = SEP_RE.split(line, maxsplit=1)
>>> it = DATA_RE.finditer(rest)
>>> match = next(it)
>>> match
<re.Match object; span=(0, 11), match='frack 0.733'>
>>> match["term"]
'frack'
>>> match["weight"]
'0.733'
An even better way to visualize this is with pdb. Give it a try if you dare ;)
This is one of those problems that demands a particular type of solution, one that may not generalize well if you loosen the restrictions on your data format.

For instance, it assumes that each Term can only take uppercase or lowercase ASCII letters, nothing else. If you have other Unicode characters as identifiers, you would want to look into other re characters, such as \w.
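As a sketch of that relaxation (the \w-based pattern below is my own assumption, not part of the original answer), a Unicode-friendly variant might look like:

```python
import re

# \w matches Unicode word characters by default in Python 3,
# so this pattern also accepts terms like "café" or "naïve".
DATA_RE_UNICODE = re.compile(r"(?P<term>\w+)\s+(?P<weight>\d+\.\d+)")

rest = "café 0.733, naïve 0.700,"
pairs = [(m["term"], m["weight"]) for m in DATA_RE_UNICODE.finditer(rest)]
print(pairs)  # [('café', '0.733'), ('naïve', '0.700')]
```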
Answer 1 (score: 4)
You can use the DataFrame constructor if you massage your input into the appropriate format. Here is one way to do it:
import pandas as pd
from itertools import chain

text = """1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345 """

df = pd.DataFrame(
    list(
        chain.from_iterable(
            map(lambda z: (y[0], *z.strip().split()), y[1].split(",")) for y in
            map(lambda x: x.strip(" ,").split(":"), text.splitlines())
        )
    ),
    columns=["Id", "Term", "weight"]
)
print(df)
#   Id     Term weight
# 0  1    frack  0.733
# 1  1    shale  0.700
# 2 10    space  0.645
# 3 10  station  0.327
# 4 10     nasa  0.258
# 5  4   celebr  0.262
# 6  4    bahar  0.345
Explanation
I assume you have read your file into the string text. The first thing you want to do is strip leading/trailing commas and whitespace before splitting on ::
print(list(map(lambda x: x.strip(" ,").split(":"), text.splitlines())))
#[['1', ' frack 0.733, shale 0.700'],
# ['10', ' space 0.645, station 0.327, nasa 0.258'],
# ['4', ' celebr 0.262, bahar 0.345']]
The next step is to split on the commas to separate the values, and assign the Id to each set of values:
print(
    [
        list(map(lambda z: (y[0], *z.strip().split()), y[1].split(","))) for y in
        map(lambda x: x.strip(" ,").split(":"), text.splitlines())
    ]
)
#[[('1', 'frack', '0.733'), ('1', 'shale', '0.700')],
# [('10', 'space', '0.645'),
# ('10', 'station', '0.327'),
# ('10', 'nasa', '0.258')],
# [('4', 'celebr', '0.262'), ('4', 'bahar', '0.345')]]
Finally, we use itertools.chain.from_iterable to flatten this output, which can then be passed straight to the DataFrame constructor.

Note: the * tuple unpacking is a Python 3 feature.
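To see what chain.from_iterable contributes here, a minimal illustration (my own example, not from the original answer):

```python
from itertools import chain

# Each inner list corresponds to one input line; chain.from_iterable
# flattens them into a single stream of rows for the DataFrame constructor.
grouped = [[('1', 'frack', '0.733'), ('1', 'shale', '0.700')],
           [('4', 'celebr', '0.262')]]
flat = list(chain.from_iterable(grouped))
print(flat)  # [('1', 'frack', '0.733'), ('1', 'shale', '0.700'), ('4', 'celebr', '0.262')]
```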
Answer 2 (score: 4)
Assuming your data (a csv file) looks like the given example:
df = pd.read_csv('untitled.txt', sep=': ', header=None)
df.set_index(0, inplace=True)
# split the `,`
df = df[1].str.strip().str.split(',', expand=True)
# 0 1 2 3
#-- ------------ ------------- ---------- ---
# 1 frack 0.733 shale 0.700
#10 space 0.645 station 0.327 nasa 0.258
# 4 celebr 0.262 bahar 0.345
# stack and drop empty
df = df.stack()
df = df[~df.eq('')]
# split ' '
df = df.str.strip().str.split(' ', expand=True)
# edit to give final expected output:
# rename index and columns for reset_index
df.index.names = ['Id', 'to_drop']
df.columns = ['Term', 'weight']
# final df
final_df = df.reset_index().drop('to_drop', axis=1)
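Since the snippet above assumes an on-disk file, here is a self-contained version of the same steps using an in-memory buffer in place of 'untitled.txt' (the StringIO stand-in and engine='python' are my additions):

```python
from io import StringIO
import pandas as pd

# In-memory stand-in for the 'untitled.txt' file
data = StringIO("1: frack 0.733, shale 0.700,\n"
                "10: space 0.645, station 0.327, nasa 0.258,\n"
                "4: celebr 0.262, bahar 0.345")
# multi-character separators require the python engine
df = pd.read_csv(data, sep=': ', header=None, engine='python')
df.set_index(0, inplace=True)
df = df[1].str.strip().str.split(',', expand=True)  # split the `,`
df = df.stack()                                     # stack and
df = df[~df.eq('')]                                 # drop empty
df = df.str.strip().str.split(' ', expand=True)     # split ' '
df.index.names = ['Id', 'to_drop']
df.columns = ['Term', 'weight']
final_df = df.reset_index().drop('to_drop', axis=1)
print(final_df)
```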
Answer 3 (score: 3)
Just to throw in my two cents: you could write yourself a parser and feed the result into pandas:
import pandas as pd
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

file = """
1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345
"""

grammar = Grammar(
    r"""
    expr    = (garbage / line)+
    line    = id colon pair*
    pair    = term ws weight sep? ws?
    garbage = ws+

    id     = ~"\d+"
    colon  = ws? ":" ws?
    sep    = ws? "," ws?
    term   = ~"[a-zA-Z]+"
    weight = ~"\d+(?:\.\d+)?"
    ws     = ~"\s+"
    """
)
tree = grammar.parse(file)
class PandasVisitor(NodeVisitor):
    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_pair(self, node, visited_children):
        term, _, weight, *_ = visited_children
        return (term.text, weight.text)

    def visit_line(self, node, visited_children):
        id, _, pairs = visited_children
        return [(id.text, *pair) for pair in pairs]

    def visit_garbage(self, node, visited_children):
        return None

    def visit_expr(self, node, visited_children):
        return [item
                for lst in visited_children
                for sublst in lst if sublst
                for item in sublst]

pv = PandasVisitor()
out = pv.visit(tree)
df = pd.DataFrame(out, columns=["Id", "Term", "weight"])
print(df)
This produces
Id Term weight
0 1 frack 0.733
1 1 shale 0.700
2 10 space 0.645
3 10 station 0.327
4 10 nasa 0.258
5 4 celebr 0.262
6 4 bahar 0.345
Here, we build a grammar out of the pieces of information we may encounter: either a line or whitespace. A line consists of an id (e.g. 1), followed by a colon (:), whitespace, and pairs of term and weight, eventually followed by a sep separator.

Then we need a NodeVisitor class to actually do something with the retrieved AST.
Answer 4 (score: 1)
It is possible to do this solely with pandas:
from io import StringIO

import pandas as pd

df = pd.read_csv(StringIO(u"""1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345 """), sep=":", header=None)
# df:
#     0                                         1
# 0   1                frack 0.733, shale 0.700,
# 1  10   space 0.645, station 0.327, nasa 0.258,
# 2   4                celebr 0.262, bahar 0.345
Turn column 1 into a list, then expand it:
df[1] = df[1].str.split(",", expand=False)

dfs = []
for idx, rows in df.iterrows():
    dfslice = pd.DataFrame({"Id": [rows[0]]*len(rows[1]), "terms": rows[1]})
    dfs.append(dfslice)
newdf = pd.concat(dfs, ignore_index=True)
# this creates newdf:
Id terms
0 1 frack 0.733
1 1 shale 0.700
2 1
3 10 space 0.645
4 10 station 0.327
5 10 nasa 0.258
6 10
7 4 celebr 0.262
8 4 bahar 0.345
Now we need to split the strings in the last column and drop the empty rows:
newdf["terms"] = newdf["terms"].str.strip()
newdf = newdf.join(newdf["terms"].str.split(" ", expand=True))
newdf.columns = ["Id", "terms", "Term", "Weights"]
newdf = newdf.drop("terms", axis=1).dropna()
Resulting newdf:
Id Term Weights
0 1 frack 0.733
1 1 shale 0.700
3 10 space 0.645
4 10 station 0.327
5 10 nasa 0.258
7 4 celebr 0.262
8 4 bahar 0.345
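Note that the Weights in newdf are still strings; if numeric values are needed, one possible follow-up conversion (this step is my addition, not part of the original answer):

```python
import pandas as pd

# Hypothetical slice of the newdf above, with string-typed weights
newdf = pd.DataFrame({"Id": [1, 1], "Term": ["frack", "shale"],
                      "Weights": ["0.733", "0.700"]})
# astype(float) converts the string column to float64
newdf["Weights"] = newdf["Weights"].astype(float)
print(newdf.dtypes)
```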
Answer 5 (score: 0)
Here is another take on your question. Build a list that will contain a list of [Id, Term, weight] for each entry, then generate the dataframe from it.
import pandas as pd

file = r"give_your_path".replace('\\', '/')
my_list_of_lists = []  # empty list which will contain the [Id, Term, weight] lists
with open(file, "r+") as f:
    for line in f.readlines():  # loop over every line
        my_id = [line.split(":")[0]]  # store the Id so it can be reused for every term
        # skip empty chunks instead of dropping the last one, so lines
        # without a trailing comma keep their final pair
        for term in [s.strip().split(" ") for s in line[line.find(":")+1:].split(",") if s.strip()]:
            my_list_of_lists.append(my_id + term)

df = pd.DataFrame.from_records(my_list_of_lists)  # turn the lists into a dataframe
df.columns = ["Id", "Term", "weight"]  # give the columns their names
Answer 6 (score: 0)
Can I assume that there is only one space before 'TERM'?
import pandas as pd

df = pd.DataFrame(columns=['ID', 'Term', 'Weight'])
with open('C:/random/d1', 'r') as readObject:
    for line in readObject:
        line = line.rstrip('\n')
        tempList1 = line.split(':')
        tempList2 = tempList1[1]
        tempList2 = tempList2.rstrip(',')
        tempList2 = tempList2.split(',')
        for item in tempList2:
            # split() without an argument also discards the leading
            # space in front of each term
            e = item.split()
            tempRow = [tempList1[0], e[0], e[1]]
            df.loc[len(df)] = tempRow
print(df)
Answer 7 (score: 0)
Maybe this makes it easier to understand what is going on. You only need to update the code so that it reads from your file instead of the txt variable.
import pandas as pd

txt = """1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345"""

data = []
for line in txt.splitlines():
    key, values = line.split(':')
    for elements in values.split(','):
        if elements:
            term, weight = elements.split()
            data.append({'Id': key, 'Term': term, 'Weight': weight})

df = pd.DataFrame(data)
df:
Id Term Weight
0 1 frack 0.733
1 1 shale 0.700
2 10 space 0.645
3 10 station 0.327
4 10 nasa 0.258
5 4 celebr 0.262
6 4 bahar 0.345
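Following the answer's suggestion to read from a file rather than the txt variable, one possible adaptation (the function name and path handling are mine, not part of the original answer):

```python
import pandas as pd

def parse_file(path):
    """Parse 'id: term weight, term weight,' lines into a DataFrame."""
    data = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            key, values = line.split(':')
            for elements in values.split(','):
                if elements.strip():
                    term, weight = elements.split()
                    data.append({'Id': key, 'Term': term, 'Weight': weight})
    return pd.DataFrame(data)
```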
Answer 8 (score: -3)
1) You can read the file line by line.
2) Then you can split on ':' for the index and on ',' for the values.
1)
with open('path/filename.txt','r') as filename:
    content = filename.readlines()
2)
content = [x.split(':') for x in content]
This will give you the following result:
content =[
['1','frack 0.733, shale 0.700,'],
['10', 'space 0.645, station 0.327, nasa 0.258,'],
['4','celebr 0.262, bahar 0.345 ']]
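The answer stops at the split step; one possible way to continue from content to the requested dataframe (this continuation is my addition, not part of the original answer):

```python
import pandas as pd

# The nested-list shape produced by the split step above
content = [['1', ' frack 0.733, shale 0.700,'],
           ['10', ' space 0.645, station 0.327, nasa 0.258,'],
           ['4', ' celebr 0.262, bahar 0.345 ']]

rows = []
for idx, values in content:
    for chunk in values.split(','):
        if chunk.strip():  # ignore empty chunks left by trailing commas
            term, weight = chunk.split()
            rows.append([idx, term, float(weight)])

df = pd.DataFrame(rows, columns=["Id", "Term", "weight"])
print(df)
```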