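Extracting a pattern from a string using Python

Time: 2017-04-27 22:57:42

Tags: python

I am trying to read data from a column in an Excel file and then extract the user IDs from each row. So far, I have been able to extract the IDs with the following code and write the results to an Excel file.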

import xlrd
import pandas as pd


#Input File Path
file='file1.xlsx'
workbook = xlrd.open_workbook(file)

#open first worksheet
sheet=workbook.sheet_by_index(0)

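# extract details from the column at index 4 (0-based, i.e. the fifth column)
description = sheet.col_values(4)

my_series = pd.Series(description)
# raw string so that \d reaches the regex engine unchanged
numbers = my_series.str.findall(r'\d+')
# list(...) is needed on Python 3, where map returns a lazy iterator
All_Ids_mapped = [list(map(int, x)) for x in numbers]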
df = pd.DataFrame(All_Ids_mapped)

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('extracted_ids.xlsx', engine='xlsxwriter')

# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')

# Close the Pandas Excel writer and output the Excel file.
writer.close()

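My problem now is that the column contains several kinds of IDs, so I want to extract only the IDs that follow the string 'user with id'. For example, the strings in the column look like this: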

The user with id '123' started discussion with the user with id '456' in the discussion thread with id '5000'.

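Since I am only interested in user IDs, I want to update my search pattern to include that text. I tried the following, but it gives me no output.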

  numbers=my_series.str.findall('^user with id.+\d+')

Please help me write the correct expression for str.findall.

Thanks.

1 answer:

Answer 0: (score: 0)

Using the re module, I get the following result:

>>> import re
>>> series = "The user with id '123' started discussion with the user with id '456' in the discussion thread with id '5000'."
>>> re.findall(r"user with id '\d+'", series)
["user with id '123'", "user with id '456'"]

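Are these the expected matches? Since the resulting matches are in order, it should not be hard to select one by index and extract the id from it.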
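
One way to get the ids directly, rather than the whole matched phrase, is to wrap the digits in a capture group. The sketch below assumes the ids always appear in single quotes right after the text "user with id", as in the sample string; the names `USER_ID_PATTERN` and `extract_user_ids` are made up for illustration.

```python
import re

# Parentheses form a capture group, so findall returns only the digits
# inside the quotes instead of the whole matched phrase.
USER_ID_PATTERN = r"user with id '(\d+)'"


def extract_user_ids(text):
    """Return the user ids found in text as a list of ints (hypothetical helper)."""
    return [int(match) for match in re.findall(USER_ID_PATTERN, text)]


sample = ("The user with id '123' started discussion with the user with id '456' "
          "in the discussion thread with id '5000'.")

print(extract_user_ids(sample))  # [123, 456]

# The pattern from the question starts with '^', which only matches at the
# very beginning of the string. The sample starts with "The", so it finds nothing.
print(re.findall(r"^user with id.+\d+", sample))  # []
```

Since `Series.str.findall` applies the same kind of search to every element, passing this pattern to `my_series.str.findall(...)` should give one list of id strings per row, with the thread id left out.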