Question

I have a txt file that looks like this:

    Alabama[edit]
    Auburn (Auburn University, Edward Via College of Osteopathic Medicine)
    Birmingham (University of Alabama at Birmingham, Birmingham School of 
    Alaska[edit]
    Anchorage[21] (University of Alaska Anchorage)
    Fairbanks (University of Alaska Fairbanks)[16]

I want to read the txt file into a data frame that looks like this:

state     county
Alabama   Auburn
Alabama   Birmingham
Alaska    Anchorage
Alaska    Fairbanks

What I have so far is:

import re
import pandas as pd

university_towns = open('university_towns.txt', 'r')
df_university_towns = pd.DataFrame(columns=['State', 'RegionName'])
# loop over each line of the file object
# determine if each line is state or county.
# if the line has [edit], it's state
state_pattern = re.compile(r'\[edit\]')
county_pattern = re.compile(r'\(')  # a literal '(' must be escaped in a regex
for line in university_towns:
    state_pattern_m = state_pattern.search(line)
    county_pattern_m = county_pattern.search(line)
    if state_pattern_m:
        #extract everything before \[edit]
        print(state_pattern_m.start())
        end_position = state_pattern_m.start()
        print(line[0:end_position])
        state_name = line[0:end_position]
    if county_pattern_m:
        #extract everything before (

This code only gives me something like this:

State  County
Alabama Auburn
        Birmingham
.
.
.

Answer 1

This should do it:

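key = None

# `t` is assumed to be the open file object, e.g. open('university_towns.txt')
for line in t:
    if '[edit]' in line:
        # State line: remember its name without the marker or the newline
        key = line.replace('[edit]', '').strip()
        continue
    if key:
        # Use a regex to extract what you need
        print(key, line.split(' ')[0])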

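I'm not sure exactly what your data looks like, so adjust the regex to strip the bracketed markers such as [21] from the names (I'm guessing they are footnote markers on the titles), and possibly use a regex in place of the '[edit]' in line check.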
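
To end up with the data frame from the question rather than printed lines, the same loop can collect rows and pass them to pandas. The following is only a minimal sketch: the names `parse_towns`, `FOOTNOTE`, and `SAMPLE` are made up for illustration, the sample text is copied from the excerpt in the question, and it assumes that every state line ends with [edit] and that each town name comes before the first '('.

```python
import io
import re

import pandas as pd

# Sample text copied from the question; for the real data, pass the open
# file object (e.g. open('university_towns.txt')) instead of io.StringIO.
SAMPLE = """Alabama[edit]
Auburn (Auburn University, Edward Via College of Osteopathic Medicine)
Birmingham (University of Alabama at Birmingham, Birmingham School of
Alaska[edit]
Anchorage[21] (University of Alaska Anchorage)
Fairbanks (University of Alaska Fairbanks)[16]
"""

# Assumption: bracketed numbers such as [21] are footnote markers to drop.
FOOTNOTE = re.compile(r'\[\d+\]')


def parse_towns(lines):
    """Return a DataFrame with one (state, county) row per town line."""
    rows = []
    state = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith('[edit]'):
            # State line: remember the name for the town lines that follow
            state = line[:-len('[edit]')].strip()
            continue
        if state is None:
            continue  # ignore anything before the first state line
        # Keep the text before the first '(' and drop footnote markers
        name = FOOTNOTE.sub('', line.split('(', 1)[0]).strip()
        rows.append((state, name))
    return pd.DataFrame(rows, columns=['state', 'county'])


df = parse_towns(io.StringIO(SAMPLE))
print(df)
```

Building a list of tuples and constructing the DataFrame once at the end avoids appending to a DataFrame row by row inside the loop.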
