I am trying to build a DataFrame from a dataset stored as plain text. The text file has the following format:
product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A1RXYH9ROBAKEZ
review/profileName: A. Igoe
review/helpfulness: 0/0
review/score: 1.0
review/time: 1233360000
review/summary: Don't buy!
review/text: First of all, the company took my money and sent me an email telling me the product was shipped. A week and a half later I received another email telling me that they are sorry, but they don't actually have any of these items, and if I received an email telling me it has shipped, it was a mistake.When I finally got my money back, I went through another company to buy the product and it won't work with my phone, even though it depicts that it will. I have sent numerous emails to the company - I can't actually find a phone number on their website - and I still have not gotten any kind of response. What kind of customer service is that? No one will help me with this problem. My advice - don't waste your money!
product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A7L6E1KSJTAJ6
review/profileName: Steven Martz
review/helpfulness: 0/0
review/score: 5.0
review/time: 1191456000
review/summary: Mobile Action Bluetooth Mobile Phone Tool Software MA-730
review/text: Great product- tried others and this is a ten compared to them. Real easy to use and sync's easily. Definite recommended buy to transfer data to and from your Cell.
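So I need to generate a DataFrame that has ProductID, Title, Price, etc. as column headers, with the corresponding data from each record in the rows.
The final DataFrame I want is: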
ID          Title                        Price    UserID          ProfileName  Helpfulness  Score  Time        summary
B000JVER7W  Mobile Action MA730          unknown  A1RXYH9ROBAKEZ  A. Igoe      0/0          1.0    1233360000  Don't buy!
            Handset Manager - Bluetooth
            Data Suite
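and so on for every review in the dataset, with the details extracted using regular expressions. I am a beginner with regular expressions, so I have not been able to do this. I tried the following (assume the variable dataset holds the entire contents of the text file):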
pattern = "product\productId:\s(.*)\s"
a = re.search(pattern, dataset)
With this I got the output:
>> a.group(1)
"B000JVER7W product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite product/price: unknown review/userId: A1RXYH9ROBAKEZ review/profileName: A. Igoe review/helpfulness: 0/0 review/score: 1.0 review/time: 1233360000 review/summary: Dont buy! review/text: First of all, the company took my money and sent me an email telling me the product was shipped. A week and a half later I received another email telling me that they are sorry, but they don't actually have any of these items, and if I received an email telling me it has shipped, it was a mistake.When I finally got my money back, I went through another company to buy the product and it won't work with my phone, even though it depicts that it will. I have sent numerous emails to the company - I can't actually find a phone number on their website - and I still have not gotten any kind of response. What kind of customer service is that? No one will help me with this problem. My advice - don't waste your money!"
But what I want is:
>> a.group(1)
"["B000JVER7W", "A000123js" ...]"
and similarly for all the other fields.
Is this feasible?
Thanks in advance.
Answer 0 (score: 1)
You can do this even without any regular expressions, by building a dictionary and then passing it to pandas.DataFrame().
Try this:
import pandas as pd

# Assumes the records in the file are separated by a blank line.
with open("your_file_name") as file:
    product_details = file.read().split("\n\n")

product_dict = {"ID": [], "Title": [], "Price": [], "UserID": [],
                "ProfileName": [], "Helpfulness": [], "Score": [], "Time": [], "summary": []}

for product in product_details:
    if not product.strip():
        continue  # skip empty chunks, e.g. after a trailing blank line
    fields = product.strip().split("\n")
    # Split on the first ":" only, so values that contain a colon stay intact,
    # and strip the space that follows the colon.
    product_dict["ID"].append(fields[0].split(":", 1)[1].strip())
    product_dict["Title"].append(fields[1].split(":", 1)[1].strip())
    product_dict["Price"].append(fields[2].split(":", 1)[1].strip())
    product_dict["UserID"].append(fields[3].split(":", 1)[1].strip())
    product_dict["ProfileName"].append(fields[4].split(":", 1)[1].strip())
    product_dict["Helpfulness"].append(fields[5].split(":", 1)[1].strip())
    product_dict["Score"].append(fields[6].split(":", 1)[1].strip())
    product_dict["Time"].append(fields[7].split(":", 1)[1].strip())
    product_dict["summary"].append(fields[8].split(":", 1)[1].strip())

dataframe = pd.DataFrame(product_dict)
print(dataframe)
Output
The first row looks like this:
ID          Title                        Price    UserID          ProfileName  Helpfulness  Score  Time        summary
B000JVER7W  Mobile Action MA730          unknown  A1RXYH9ROBAKEZ  A. Igoe      0/0          1.0    1233360000  Don't buy!
            Handset Manager - Bluetooth
            Data Suite
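The sample in the question shows the records following each other without a blank line, so splitting on "\n\n" may not apply to every file. As a rough sketch of the same no-regex idea that does not rely on a separator, the code below starts a new record whenever it sees a product/productId line. The COLUMNS mapping, the parse_reviews helper, and the shortened review/text values are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical mapping from the keys in the file to the column names used above.
COLUMNS = {
    "product/productId": "ID",
    "product/title": "Title",
    "product/price": "Price",
    "review/userId": "UserID",
    "review/profileName": "ProfileName",
    "review/helpfulness": "Helpfulness",
    "review/score": "Score",
    "review/time": "Time",
    "review/summary": "summary",
}


def parse_reviews(text):
    """Return one dict per review; a product/productId line opens a new review."""
    records = []
    for line in text.splitlines():
        # partition splits on the first ": " only, so later colons stay in the value
        key, sep, value = line.partition(": ")
        if not sep or key not in COLUMNS:
            continue  # blank lines, review/text, and unrecognised lines are skipped
        if key == "product/productId":
            records.append({})
        if records:
            records[-1][COLUMNS[key]] = value.strip()
    return records


# Two records from the question; the review/text values are shortened here.
sample = """product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A1RXYH9ROBAKEZ
review/profileName: A. Igoe
review/helpfulness: 0/0
review/score: 1.0
review/time: 1233360000
review/summary: Don't buy!
review/text: First of all, the company took my money ...
product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A7L6E1KSJTAJ6
review/profileName: Steven Martz
review/helpfulness: 0/0
review/score: 5.0
review/time: 1191456000
review/summary: Mobile Action Bluetooth Mobile Phone Tool Software MA-730
review/text: Great product- tried others ...
"""

df = pd.DataFrame(parse_reviews(sample), columns=list(COLUMNS.values()))
print(df)
```

Because each record is a dict, a review that lacks one of the fields simply ends up with a missing value in that column instead of shifting the remaining fields.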
Answer 1 (score: 0)
You have a typo in your pattern: change "\" to "/". Also use \s* and findall:
import re
pattern = r"product/productId:\s*(.*)\s*"
mo = re.findall(pattern, text)  # text holds the contents of the file
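The findall approach can probably be extended to every column by running one pattern per field. The following is only a sketch: it assumes each field sits on its own line, as in the sample shown in the question, and that every review contains all nine fields (otherwise the lists have different lengths and pandas.DataFrame raises an error). The FIELDS mapping, the field helper, and the shortened review/text values are made up for illustration.

```python
import re
import pandas as pd

# Two records from the question; the review/text values are shortened here.
text = """product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A1RXYH9ROBAKEZ
review/profileName: A. Igoe
review/helpfulness: 0/0
review/score: 1.0
review/time: 1233360000
review/summary: Don't buy!
review/text: First of all, the company took my money ...
product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A7L6E1KSJTAJ6
review/profileName: Steven Martz
review/helpfulness: 0/0
review/score: 5.0
review/time: 1191456000
review/summary: Mobile Action Bluetooth Mobile Phone Tool Software MA-730
review/text: Great product- tried others ...
"""

# Hypothetical mapping from column names to the keys used in the file.
FIELDS = {
    "ID": "product/productId",
    "Title": "product/title",
    "Price": "product/price",
    "UserID": "review/userId",
    "ProfileName": "review/profileName",
    "Helpfulness": "review/helpfulness",
    "Score": "review/score",
    "Time": "review/time",
    "summary": "review/summary",
}


def field(name, text):
    """Return every value of the given key, one entry per matching line."""
    # re.MULTILINE makes ^ and $ match at each line boundary, and "." does not
    # match a newline, so each capture stops at the end of its own line.
    # [ \t]* is used instead of \s* so an empty value cannot swallow the next line.
    pattern = rf"^{re.escape(name)}:[ \t]*(.*)$"
    return re.findall(pattern, text, flags=re.MULTILINE)


ids = field("product/productId", text)
df = pd.DataFrame({column: field(key, text) for column, key in FIELDS.items()})
print(df)
```

All values come back as strings; columns such as Score or Time would still need converting if numeric types are wanted.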