Question

I read a CSV file and converted it into a pandas DataFrame with two text columns. One of the columns contains many rows of this form:

<suggested-actions-list text =""is this a test?"">suggested-action>Yes</suggested-action><suggested-action>No</suggested-action></suggested-actions-list>"

<choice-list text=""some text""> <choice-option>option1</choice-option> <choice-option>option2</choice-option> <choice-option>option3</choice-option></choice-list>

I want to select the text between the angle-bracket tags, so that I get a result like this:

""is this a test?"" Yes No
""some text"" option1 option2 option3

Can anyone give me a hint? Thanks!

Answer 1

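import re

# Sample input, with the two malformed spots from the question repaired
s = """
<suggested-actions-list text =""is this a test?""><suggested-action>Yes</suggested-action><suggested-action>No</suggested-action></suggested-actions-list>

<choice-list text=""some text""> <choice-option>option1</choice-option><choice-option>option2</choice-option> <choice-option>option3</choice-option></choice-list>
"""

# Replace each tag with a space, keeping the ""...""-quoted attribute value if the tag has one
x = re.sub(r'<(?:.*?)("".*"")?>', r'\1 ', s)
# Collapse runs of spaces into a single space
x = re.sub(r'[ ]+', ' ', x)

print(x)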

Output:

""is this a test?"" Yes No 

""some text"" option1 option2 option3

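Note: I had to repair the original text slightly, adding a < before the first suggested-action and removing the " at the end of the first element. Let me know whether that is acceptable, or whether we also need to handle this in the code.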
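To apply the same idea to one cell at a time, as you would for a DataFrame column, the two substitutions can be wrapped in a function. This is only a minimal sketch: the helper name strip_tags and the list rows are made up for illustration, and it assumes the input has already been repaired as described in the note above.

```python
import re

# Same pattern as in the answer above: match a tag, optionally capturing
# a ""...""-quoted attribute value that sits right before the closing '>'.
TAG = re.compile(r'<(?:.*?)("".*"")?>')

def strip_tags(cell):
    """Hypothetical helper: drop the tags, keep the quoted attribute text."""
    text = TAG.sub(r'\1 ', cell)           # each tag becomes its captured text (if any) plus a space
    return re.sub(r' +', ' ', text).strip()  # collapse repeated spaces, trim the ends

# Illustrative rows, assumed to be already repaired
rows = [
    '<suggested-actions-list text =""is this a test?""><suggested-action>Yes</suggested-action>'
    '<suggested-action>No</suggested-action></suggested-actions-list>',
    '<choice-list text=""some text""> <choice-option>option1</choice-option> '
    '<choice-option>option2</choice-option> <choice-option>option3</choice-option></choice-list>',
]

for row in rows:
    print(strip_tags(row))
# ""is this a test?"" Yes No
# ""some text"" option1 option2 option3
```

With pandas, something like df["col"].apply(strip_tags) should apply the function to every cell, where "col" stands for whatever your column is actually called.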

Answer 2

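1. Read the whole file with readlines(), which gives you a list of lines.

2. Use regular expressions to collect the text and the options for each line into a list of lists.

3. Load the list of lists into a DataFrame.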

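import re
import pandas as pd

df_list = []
# Read the file into a list of lines
with open('filename.txt', 'r') as f:
    data = f.readlines()
for row in data:
    # Capture the attribute value between '=' and the first '>'
    m = re.search(r'=(.+?)>', row)
    if m is None:
        continue  # no text attribute on this line
    text = m.group(1)
    # Replace the tags with spaces, then split what is left into the options
    options = re.sub(r'<.*?>', ' ', row).split()
    df_list.append([text] + options)
# Rows with fewer options than the longest row are padded with missing values
data_df = pd.DataFrame(df_list)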
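If the rows are well formed apart from the doubled quotes, an XML parser is a possible alternative to regular expressions. The sketch below swaps in the standard library's xml.etree.ElementTree. It assumes the "" pairs are leftover CSV escaping that can be collapsed to a single ", and the function name extract_texts is made up for illustration. Unlike the regex versions, it returns the attribute value without the surrounding quotes.

```python
import xml.etree.ElementTree as ET

def extract_texts(fragment):
    """Hypothetical helper: return [text attribute, option 1, option 2, ...]."""
    # Assumption: doubled quotes are CSV escaping; collapse them to get valid XML
    root = ET.fromstring(fragment.replace('""', '"'))
    # Attribute of the outer element first, then the text of each direct child
    return [root.get("text", "")] + [(child.text or "").strip() for child in root]

sample = ('<choice-list text=""some text""> <choice-option>option1</choice-option> '
          '<choice-option>option2</choice-option> <choice-option>option3</choice-option></choice-list>')
print(extract_texts(sample))
# ['some text', 'option1', 'option2', 'option3']
```

ET.fromstring raises ParseError on malformed input, such as the unrepaired first row in the question, so those rows would still need fixing or a try/except around the call.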

Python: selecting text between angle brackets
