I have a txt file that looks like this:
Alabama[edit]
Auburn (Auburn University, Edward Via College of Osteopathic Medicine)
Birmingham (University of Alabama at Birmingham, Birmingham School of
Alaska[edit]
Anchorage[21] (University of Alaska Anchorage)
Fairbanks (University of Alaska Fairbanks)[16]
I want to read the txt file into a DataFrame that looks like this:
state county
Alabama Auburn
Alabama Birmingham
Alaska Anchorage
Alaska Fairbanks
What I have so far is:
    import re
    import pandas as pd

    university_towns = open('university_towns.txt', 'r')
    df_university_towns = pd.DataFrame(columns=['State', 'RegionName'])

    # compile the patterns once, outside the loop
    state_pattern = re.compile(r'\[edit\]')
    county_pattern = re.compile(r'\(')  # "(" must be escaped in a regex

    # loop over each line of the file object
    # determine if each line is state or county.
    # if the line has [edit], it's a state
    for line in university_towns:
        state_pattern_m = state_pattern.search(line)
        county_pattern_m = county_pattern.search(line)
        if state_pattern_m:
            # extract everything before [edit]
            print(state_pattern_m.start())
            end_position = state_pattern_m.start()
            print(line[0:end_position])
            state_name = line[0:end_position]
        if county_pattern_m:
            # extract everything before (
            pass  # not shown in the original post
This code only gives me something like this:
State County
Alabama Auburn
Birmingham
.
.
.
Answer 0 (score: 0)
This should do it:
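    key = None
    for line in t:  # t is the open file object
        if '[edit]' in line:
            # strip() drops the trailing newline left after removing [edit]
            key = line.replace('[edit]', '').strip()
            continue
        if key:
            # Use a regex to extract what you need
            print(key, line.split(' ')[0])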
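I'm not sure what your data looks like, so adjust the regex to strip the bracketed [] markers from the title (I'm guessing it is a title), and possibly use a regex in place of the '[edit]' in line check.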
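A minimal sketch of the same idea carried through to rows, assuming every state line ends with [edit], every town line has its name before the first "(", and the only other bracketed noise is numeric footnote markers such as [21]. The names parse_university_towns and FOOTNOTE are made up for illustration; they are not from the question.

```python
import re

# Numeric footnote markers such as [21] or [16] (assumed to be the only bracketed noise).
FOOTNOTE = re.compile(r'\[\d+\]')


def parse_university_towns(lines):
    """Return a list of (state, region) tuples from the raw lines."""
    rows = []
    state = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith('[edit]'):
            # A state header: remember it for the town lines that follow.
            state = line[:-len('[edit]')].strip()
            continue
        if state is None:
            continue  # town line before any state header; ignore it
        # Keep only the text before the first "(" and drop footnote markers.
        region = line.split('(', 1)[0]
        region = FOOTNOTE.sub('', region).strip()
        rows.append((state, region))
    return rows


sample = [
    'Alabama[edit]\n',
    'Auburn (Auburn University, Edward Via College of Osteopathic Medicine)\n',
    'Alaska[edit]\n',
    'Anchorage[21] (University of Alaska Anchorage)\n',
    'Fairbanks (University of Alaska Fairbanks)[16]\n',
]
rows = parse_university_towns(sample)
print(rows)
```

With the real file you would pass the open file object instead of sample, and the resulting list can be handed to pandas with pd.DataFrame(rows, columns=['State', 'RegionName']). Collecting the rows in a list first and building the DataFrame once avoids appending to it row by row inside the loop.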