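I have a file.txt (tab-delimited) that I need to turn into a data frame: basically, parse the file line by line and create one unique column per key in the final data frame. Where a row has no information for a column, write "Na" as the empty value. Note that "CS_" is the pattern that follows each ":". I am thinking of a pandas DataFrame, but any help would be much appreciated. Suggestions in R are welcome too.
Input: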
Japan Cases:CS_1 People:CS_2 Life:CS_3
Australia People:CS_4 Transportation:CS_Ground
Spain Life:CS_5 Language:CS_Spanish
Output:
           Cases  People  Life  Transportation  Language
Japan      CS_1   CS_2    CS_3  Na              Na
Australia  Na     CS_4    Na    CS_Ground       Na
Spain      Na     Na      CS_5  Na              CS_Spanish
Answer 0 (score: 0)
Given:
>>> from io import StringIO
>>> import pandas as pd
>>> infile = """Japan Cases:CS_1 People:CS_2 Life:CS_3
... Australia People:CS_4 Transportation:CS_Ground
... Spain Life:CS_5 Language:CS_Spanish"""
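Iterate through the file line by line:
1. Split off the first field (the country) at the first whitespace character (\s or \t)
2. Split the rest of the line on whitespace
3. Split each piece on : into a column name and a value
4. Store the country under the Key entry of the row dictionary
5. Append the row dictionary to a list
Code: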
>>> row_dicts = []
>>> for line in StringIO(infile):
...     k, therest = line.split(None, 1)  # Step 1: first whitespace, space or tab.
...     _row = dict(kv.split(':', 1) for kv in therest.split())  # Step 2-3.
...     _row['Key'] = k  # Step 4.
...     row_dicts.append(_row)  # Step 5.
...
Convert the list of dictionaries to a pd.DataFrame (the column order may differ between pandas versions):
>>> pd.DataFrame(row_dicts)
  Cases        Key    Language  Life People Transportation
0  CS_1      Japan         NaN  CS_3   CS_2            NaN
1   NaN  Australia         NaN   NaN   CS_4      CS_Ground
2   NaN      Spain  CS_Spanish  CS_5    NaN            NaN
Use .set_index to make the country column, Key, the index.
>>> pd.DataFrame(row_dicts).set_index('Key')
          Cases    Language  Life People Transportation
Key
Japan      CS_1         NaN  CS_3   CS_2            NaN
Australia   NaN         NaN   NaN   CS_4      CS_Ground
Spain       NaN  CS_Spanish  CS_5    NaN            NaN
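To get the exact layout asked for in the question, with the string "Na" in empty cells and the columns in order of first appearance, the steps above could be wrapped up roughly as follows. This is only a sketch: the function name `parse_rows` and the in-memory `lines` list are made up for illustration, and it assumes every field after the country has the form `name:value`.

```python
import pandas as pd


def parse_rows(lines):
    """Build a DataFrame from lines shaped like 'Country name:value name:value ...'.

    `parse_rows` is a hypothetical helper, not part of pandas.
    """
    countries = []  # index labels, in file order
    records = []    # one dict of column -> value per line
    columns = []    # column names in order of first appearance
    for line in lines:
        fields = line.split()  # splits on spaces and tabs alike
        if not fields:
            continue  # skip blank lines
        country, pairs = fields[0], fields[1:]
        row = dict(pair.split(":", 1) for pair in pairs)
        for name in row:
            if name not in columns:
                columns.append(name)
        countries.append(country)
        records.append(row)
    df = pd.DataFrame(records, index=countries)
    # Fix the column order explicitly and replace missing cells with "Na".
    return df.reindex(columns=columns).fillna("Na")


# Sample data from the question, tab-delimited as described there.
lines = [
    "Japan\tCases:CS_1\tPeople:CS_2\tLife:CS_3",
    "Australia\tPeople:CS_4\tTransportation:CS_Ground",
    "Spain\tLife:CS_5\tLanguage:CS_Spanish",
]
df = parse_rows(lines)
print(df)
```

With a real file, an open file handle can be passed instead of the list, for example `with open("file.txt") as fh: df = parse_rows(fh)`.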
Answer 1 (score: 0)
You could use (with a generator and comprehensions):
import re
import pandas as pd

string = """
Japan Cases:CS_1 People:CS_2 Life:CS_3
Australia People:CS_4 Transportation:CS_Ground
Spain Life:CS_5 Language:CS_Spanish
"""

# key:value pairs; the value uses \w+ (not \d+) so CS_Ground and CS_Spanish match too
rx = re.compile(r'(?P<key>\w+):(?P<value>CS_\w+)')
# the country is the first word on the line
rxc = re.compile(r'(?P<country>\w+)')

# one dict per non-empty line: the country plus every key/value pair found on it
dft = (dict({'Country': item.group('country')},
            **{m.group('key'): m.group('value') for m in rx.finditer(line)})
       for line in string.split("\n")
       for item in [rxc.match(line)]
       if item)

df = pd.DataFrame(dft)
print(df)
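This uses two regular expressions, one for the country and one for the key/value pairs, and then builds df from the resulting dictionaries.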
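To see what each regular expression picks up, it can help to run them on a single line. The sketch below uses `CS_\w+` for the value, since `CS_\d+` would only capture values with a numeric suffix such as CS_4 and would skip CS_Ground. The names `pair_rx`, `country_rx` and `digits_only` are chosen here for illustration.

```python
import re

pair_rx = re.compile(r'(?P<key>\w+):(?P<value>CS_\w+)')
country_rx = re.compile(r'(?P<country>\w+)')

line = "Australia People:CS_4 Transportation:CS_Ground"

# match() anchors at the start of the line, so it returns the first word.
country = country_rx.match(line).group('country')
# finditer() walks over every key:value pair on the line.
pairs = {m.group('key'): m.group('value') for m in pair_rx.finditer(line)}
print(country)  # Australia
print(pairs)    # {'People': 'CS_4', 'Transportation': 'CS_Ground'}

# For comparison: a digits-only value pattern misses CS_Ground.
digits_only = re.compile(r'(?P<key>\w+):(?P<value>CS_\d+)')
narrow = {m.group('key'): m.group('value') for m in digits_only.finditer(line)}
print(narrow)   # {'People': 'CS_4'}
```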