Question

I have a text file.

Its contents can change each time, including the number of lines, and each line has the following form:

string (can contain one word, two or even more) ^ string of one word
Example:

level country ^ layla
hello sandra  ^ organization
hello people ^ layla
hello samar  ^ organization

I want to use pandas to create a DataFrame such that:

item0   ( country, people)
item1    (sandra , samar)

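The idea is that every time layla appears, for example, we take the rightmost name on its line and add it to the second column shown above (in this case country, people), and we call layla item0 and use it as the index of the DataFrame. I can't seem to arrange this, and I don't know how to write the logic that finds every repeated value after the "^" and returns a list of the rightmost names that belong to it. My attempt so far doesn't really do it: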

def text_file(file):

    lines = []
    with open(file) as f:
        for l in f:
            l_dict = l.split(" ")
            lines.append(l_dict)
    return lines

def items(file_of_text):

    list_of_items = text_file(file_of_text)
    for a in list_of_items:
        for b in a:
            # unfinished: I don't know what a[-1] should be compared with
            if a[-1] == ...:
                pass



def main():

    file_of_text = "text.txt"

if __name__ == "__main__":
    main()

Answer 1

Assume your file is named file_of_text.txt and contains the following:

level country ^ layla
hello sandra  ^ organization
hello people ^ layla
hello samar  ^ organization

You can use the following lines of code to get the data from the file into a DataFrame similar to the desired output:

import re
import pandas as pd

def main(myfile):
    # Open the file and read the lines
    with open(myfile, 'r') as f:
        text = f.readlines()

    # Split the lines into lists
    text = list(map(lambda x: re.split(r"\s[\^\s]*",x.strip()), text))

    # Put it in a DataFrame
    data = pd.DataFrame(text, columns = ['A','B','C'])

    # Create an output DataFrame with rows "item0" and "item1"
    final_data = pd.DataFrame(['item0','item1'],columns=['D'])

    # Create your desired column
    final_data['E'] = data.groupby('C')['B'].apply(lambda x: tuple(x.values)).values

    print(final_data)

if __name__ == "__main__":
    myfile = "file_of_text.txt"
    main(myfile)

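The idea is to read the lines from the text file and split each line with the split function from the re module. The result is then passed to the DataFrame constructor to produce a DataFrame named data, which is used to build the desired DataFrame final_data. This relies on every line having exactly two words before the ^, as in the sample. The result should look like this: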

# data

       A        B             C
0  level  country         layla
1  hello   sandra  organization
2  hello   people         layla
3  hello    samar  organization


# final_data

       D                  E
0  item0  (country, people)
1  item1    (sandra, samar)
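
The split above depends on each line having exactly two words before the ^, and the item0/item1 labels are written out by hand. Below is a minimal sketch of one way to relax both, assuming the goal is the rightmost word before the ^ as in the question. The helper names parse_line and build_frame and the column names name, key and names are made up for this illustration and are not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
SAMPLE = """\
level country ^ layla
hello sandra  ^ organization
hello people ^ layla
hello samar  ^ organization
"""

def parse_line(line):
    """Return (rightmost word before '^', key after '^'), or None if the line is unusable."""
    left, caret, right = line.rpartition("^")
    words = left.split()   # any number of words may precede the '^'
    key = right.strip()
    if not caret or not words or not key:
        return None
    return words[-1], key

def build_frame(lines):
    """Group the rightmost words by key and label the groups item0, item1, ..."""
    pairs = [p for p in map(parse_line, lines) if p is not None]
    data = pd.DataFrame(pairs, columns=["name", "key"])
    # sort=False keeps the keys in order of first appearance.
    grouped = data.groupby("key", sort=False)["name"].apply(tuple)
    result = grouped.to_frame(name="names")
    result.index = [f"item{i}" for i in range(len(result))]
    return result

final_data = build_frame(SAMPLE.splitlines())
print(final_data)
```

To read from a real file instead, the lines could come from open(myfile) in place of SAMPLE.splitlines().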

If anything is unclear, look through the script and ask further questions.

I hope this helps.

Answer 2

Start with pandas read_csv(), specifying '^' as the delimiter and using arbitrary column names

import pandas as pd

df = pd.read_csv('data.csv', delimiter='^', names=['A', 'B'])
print (df)
                A              B
0  level country           layla
1  hello sandra     organization
2   hello people           layla
3   hello samar     organization
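
As the printed frame suggests, the spaces around the ^ stay in both columns after reading. Below is a small sketch of stripping them, assuming the same sample data. io.StringIO stands in for data.csv here so the snippet runs on its own.

```python
import io
import pandas as pd

# In-memory stand-in for data.csv (assumption made for this illustration).
raw = io.StringIO(
    "level country ^ layla\n"
    "hello sandra  ^ organization\n"
    "hello people ^ layla\n"
    "hello samar  ^ organization\n"
)

# A single-character separator is matched literally.
df = pd.read_csv(raw, sep="^", names=["A", "B"])

# Remove the padding left over from " ^ " on both sides.
df["A"] = df["A"].str.strip()
df["B"] = df["B"].str.strip()
print(df)
```

Stripping column B matters for the later grouping, since otherwise the group keys keep their leading space.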

Then we split to get the values we want. The expand argument is new in pandas 0.16, I believe

df['A'] = df['A'].str.split(' ', expand=True)[1]
print(df)
         A              B
0  country          layla
1   sandra   organization
2   people          layla
3    samar   organization
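
Taking column 1 of the split works when every line has exactly two words before the ^. The question says there can be one word, two, or more, so here is a hedged variant that picks the last word instead. The sample values in names are invented to show the different cases.

```python
import pandas as pd

# Invented left-hand sides with one, two and four words.
names = pd.Series(["level country ", "hello sandra  ", "people ", "well known hello samar"])

# split() with no pattern splits on runs of whitespace and ignores padding;
# .str[-1] then takes the last element of each resulting list.
last_word = names.str.split().str[-1]
print(last_word.tolist())  # ['country', 'sandra', 'people', 'samar']
```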

Then we group by column B and apply the tuple function. Note: we reset the index so that we can use it later

g = df.groupby('B')['A'].apply(tuple).reset_index()
print(g)
              B                  A
0          layla  (country, people)
1   organization    (sandra, samar)
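
For readers unsure what the groupby step does, it can be pictured as the plain-Python loop below. This is only a sketch of the same grouping logic, not how pandas implements it, and the pairs list is the sample data written out by hand.

```python
from collections import defaultdict

# (name, key) pairs from the sample: the word before '^' and the word after it.
pairs = [
    ("country", "layla"),
    ("sandra", "organization"),
    ("people", "layla"),
    ("samar", "organization"),
]

# Collect every name under the key it appeared with.
groups = defaultdict(list)
for name, key in pairs:
    groups[key].append(name)

# Freeze each list into a tuple, as apply(tuple) does per group.
grouped = {key: tuple(names) for key, names in groups.items()}
print(grouped)
```

One difference to keep in mind: the dict keeps keys in order of first appearance, while groupby sorts the keys unless sort=False is passed.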

Create a new column using the string 'item' and the index

g['item'] = 'item' + g.index.astype(str)
print (g[['item','A']])
    item                  A
0  item0  (country, people)
1  item1    (sandra, samar)
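
The question asks for item0 and item1 to be the index rather than a regular column. Below is a short sketch of that last step, assuming g looks like the grouped frame printed earlier. Here g is rebuilt by hand so the snippet is self-contained, and result is a name chosen for this illustration.

```python
import pandas as pd

# Hand-built copy of the grouped frame from the previous step.
g = pd.DataFrame({
    "B": ["layla", "organization"],
    "A": [("country", "people"), ("sandra", "samar")],
})

# Build the labels from the positional index, then promote them to the index.
g["item"] = "item" + g.index.astype(str)
result = g.set_index("item")[["A"]]
print(result)
```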

Read from a file and manipulate it in Python
