I'm still new to parsing in Python, so I need some help with a join. I have an .asc file and a .dat file. The .asc file looks like this:
sta cnt assgnum cty pdpass ptype patnum
"IL" "" 1 "Chicago" 10030271 "0" 3930271
"PA" "" 1 "Bedford" 10156902 "0" 3930272
"MO" "" 1 "St. Louis" 10112031 "0" 3930273
"IL" "" 1 "Chicago" 10030276 "0" 3930276
and the .dat file looks like this:
cod cod_fix pdpass standard_name uspto_assignee
"US institute" 32 12832332 "& AEROSPACE FOUND" 835951
"02 US corporation" "No change" 11624193 "& COMMUNICATIONS" 721167
"03 Foreign corp, incl. state-owned" "No change" 12549858 "& DESIGN LTD" 806186
"03 Foreign corp, incl. state-owned" "No change" 11170486 "& FR" 182855
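This is ugly, so I'm trying to save each of them as a .csv file with all the quotes removed (around sta, cty, ptype, cod, and so on).
The end goal is to join the two datasets on their shared unique identifier (pdpass), but first the quotes need to be stripped.
My stripping code is as follows:
import csv
import re
with open("C:\\......FILE.asc", "r") as fin:
    with open("C:\\.....FILE.csv", "w") as fout:
        for line in fin:
            newline = map(str.strip, line.split('"'))
            csv.writer(fout).writerow(newline)
This gives me somewhat formatted results, but I'm not sure where to go from here. Anyone?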
Answer 0 (score: 1)
These files look tab-delimited. If so, just use the pandas library to read them as CSV, with a tab as the separator.
import pandas as pd
pd.read_csv("C:\\......FILE.asc", sep = '\t')
You can then clean up the quotes with map, or with apply on each row.
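As a rough sketch of how this could continue toward the join on pdpass, the example below reads two small in-memory samples and merges them. It assumes the real files are tab-delimited; the inline sample text and the variable names (asc_text, dat_text, left, right, merged) are made up for illustration. With read_csv's default quotechar of '"', the surrounding double quotes are normally removed at parse time, so a separate cleanup pass may not be needed.

```python
import io

import pandas as pd

# Hypothetical tab-delimited samples standing in for the .asc and .dat files.
asc_text = (
    'sta\tcnt\tassgnum\tcty\tpdpass\tptype\tpatnum\n'
    '"IL"\t""\t1\t"Chicago"\t10030271\t"0"\t3930271\n'
    '"PA"\t""\t1\t"Bedford"\t10156902\t"0"\t3930272\n'
)
dat_text = (
    'cod\tcod_fix\tpdpass\tstandard_name\tuspto_assignee\n'
    '"02 US corporation"\t"No change"\t10030271\t"& COMMUNICATIONS"\t721167\n'
)

# Quoted fields are unquoted by the parser; empty fields come back as NaN by default.
left = pd.read_csv(io.StringIO(asc_text), sep="\t")
right = pd.read_csv(io.StringIO(dat_text), sep="\t")

# Inner join: keep only the pdpass values present in both tables.
merged = left.merge(right, on="pdpass", how="inner")
print(merged)
```

For real files, the io.StringIO(...) arguments would be replaced by the file paths, and merged.to_csv(..., index=False) could write the result back out.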
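Answer 1 (score: 0)
It's late (in my time zone). This code shows a basic way to read the two files and merge them on the pdpass variable, assuming the input files are tab-delimited. (If they aren't tab-delimited, I can provide other code.)
Almost forgot: I changed some of the pdpass values in one of the files so that its records "join" to corresponding records in the other file.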
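import csv

complete = {}

# Index the first file by pdpass, keeping the remaining columns as a list.
with open('FILE.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t')
    for row in reader:
        complete[row['pdpass']] = [row[col] for col in ['sta', 'cnt', 'assgnum', 'cty', 'ptype', 'patnum']]

# Append the second file's columns to the record with the same pdpass.
# Raises KeyError if a pdpass in FILE.dat has no match in FILE.csv.
with open('FILE.dat', newline='') as datfile:
    reader = csv.DictReader(datfile, delimiter='\t')
    for row in reader:
        complete[row['pdpass']].extend(row[col] for col in ['cod', 'cod_fix', 'standard_name', 'uspto_assignee'])

for pdpass in complete:
    print(pdpass, complete[pdpass])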
Here is the resulting output:
10030276 ['IL', '', '1', 'Chicago', '0', '3930276', '02 US corporation', 'No change', '& COMMUNICATIONS', '721167']
10156902 ['PA', '', '1', 'Bedford', '0', '3930272', '03 Foreign corp, incl. state-owned', 'No change', '& FR', '182855']
10030271 ['IL', '', '1', 'Chicago', '0', '3930271', '03 Foreign corp, incl. state-owned', 'No change', '& DESIGN LTD', '806186']
10112031 ['MO', '', '1', 'St. Louis', '0', '3930273', 'US institute', '32', '& AEROSPACE FOUND', '835951']
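Since the question's end goal is a quote-free .csv, here is a minimal sketch of one way to write the joined records back out using only the standard library. It uses inline sample rows rather than the real files, assumes tab-delimited input, and assumes pdpass is unique within the second table (if it repeats, the last row wins). The helper names read_tab_delimited, merge_on_pdpass, and to_csv_text are hypothetical, not part of the csv module.

```python
import csv
import io


def read_tab_delimited(text):
    """Parse tab-delimited text into a list of dicts; the csv module strips the double quotes."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def merge_on_pdpass(left_rows, right_rows):
    """Inner-join two lists of row dicts on their shared 'pdpass' column."""
    right_by_key = {row["pdpass"]: row for row in right_rows}
    merged = []
    for row in left_rows:
        match = right_by_key.get(row["pdpass"])
        if match is not None:  # skip rows without a partner instead of raising KeyError
            merged.append({**row, **match})
    return merged


def to_csv_text(rows):
    """Serialize merged rows as comma-separated text with a header line."""
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical samples; the second .asc row has no match and is dropped by the join.
asc_text = (
    'sta\tcnt\tassgnum\tcty\tpdpass\tptype\tpatnum\n'
    '"IL"\t""\t1\t"Chicago"\t10030271\t"0"\t3930271\n'
    '"MO"\t""\t1\t"St. Louis"\t10112031\t"0"\t3930273\n'
)
dat_text = (
    'cod\tcod_fix\tpdpass\tstandard_name\tuspto_assignee\n'
    '"03 Foreign corp, incl. state-owned"\t"No change"\t10030271\t"& DESIGN LTD"\t806186\n'
)

merged = merge_on_pdpass(read_tab_delimited(asc_text), read_tab_delimited(dat_text))
csv_text = to_csv_text(merged)
print(csv_text)
```

Note that the writer puts quotes back around any field that itself contains a comma (such as the cod value here). That is needed for the output to remain valid CSV, so not every quote can be dropped once the delimiter becomes a comma.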
Input files:
FILE.CSV
sta cnt assgnum cty pdpass ptype patnum
"IL" "" 1 "Chicago" 10030271 "0" 3930271
"PA" "" 1 "Bedford" 10156902 "0" 3930272
"MO" "" 1 "St. Louis" 10112031 "0" 3930273
"IL" "" 1 "Chicago" 10030276 "0" 3930276
FILE.DAT
cod cod_fix pdpass standard_name uspto_assignee
"US institute" 32 10112031 "& AEROSPACE FOUND" 835951
"02 US corporation" "No change" 10030276 "& COMMUNICATIONS" 721167
"03 Foreign corp, incl. state-owned" "No change" 10030271 "& DESIGN LTD" 806186
"03 Foreign corp, incl. state-owned" "No change" 10156902 "& FR" 182855