I have read the question "Load data from txt with pandas". However, my data is in a slightly different format. Here is a sample of the data:
product/productId: B003AI2VGA
review/userId: A141HP4LYPWMSR
review/profileName: Brian E. Erland "Rainbow Sphinx"
review/helpfulness: 7/7
review/score: 3.0
review/time: 1182729600
review/summary: "There Is So Much Darkness Now ~ Come For The Miracle"
review/text: Synopsis: On the daily trek from Juarez, Mexico to ...
product/productId: B003AI2VGA
review/userId: A328S9RN3U5M68
review/profileName: Grady Harp
review/helpfulness: 4/4
review/score: 3.0
review/time: 1181952000
review/summary: Worthwhile and Important Story Hampered by Poor Script and Production
review/text: THE VIRGIN OF JUAREZ is based on true events...
.
.
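I plan to do sentiment analysis, so I only want the text and score lines from each section. How can I do this with pandas? Or do I need to read the file and parse each line myself to extract the reviews and ratings?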
Answer 0 (score: 0)
Here is one way:
import pandas as pd
from io import StringIO
mystr = StringIO("""product/productId: B003AI2VGA
review/userId: A141HP4LYPWMSR
review/profileName: Brian E. Erland "Rainbow Sphinx"
review/helpfulness: 7/7
review/score: 3.0
review/time: 1182729600
review/summary: "There Is So Much Darkness Now ~ Come For The Miracle"
review/text: Synopsis: On the daily trek from Juarez, Mexico to ...
product/productId: B003AI2VGA
review/userId: A328S9RN3U5M68
review/profileName: Grady Harp
review/helpfulness: 4/4
review/score: 3.0
review/time: 1181952000
review/summary: Worthwhile and Important Story Hampered by Poor Script and Production
review/text: THE VIRGIN OF JUAREZ is based on true events...""")
# replace mystr with 'file.txt'
# error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (pandas >= 1.3) replaces it
df = pd.read_csv(mystr, header=None, sep='|', on_bad_lines='skip')
df = pd.DataFrame(df[0].str.split(':', n=1).values.tolist())
df = df.loc[df[0].isin({'review/text', 'review/score'})]
Result:
               0                                                  1
4   review/score                                                3.0
7    review/text   Synopsis: On the daily trek from Juarez, Mexi...
12  review/score                                                3.0
15   review/text    THE VIRGIN OF JUAREZ is based on true events...
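For sentiment analysis it is usually more convenient to have one row per review, with score and text as columns. The following is a minimal sketch of that reshaping step, not part of the original answer. It assumes every review contributes exactly one score line and one text line, in that order; the names long_df, review_id and wide are illustrative.

```python
import pandas as pd

# Long-form frame shaped like the result above: one key column, one value column.
long_df = pd.DataFrame(
    [
        ["review/score", " 3.0"],
        ["review/text", " Synopsis: On the daily trek from Juarez, Mexico to ..."],
        ["review/score", " 3.0"],
        ["review/text", " THE VIRGIN OF JUAREZ is based on true events..."],
    ],
    index=[4, 7, 12, 15],
    columns=["key", "value"],
)

# Assumption: each review has exactly one score and one text line, so counting
# the repeats of each key numbers the reviews 0, 1, 2, ...
review_id = long_df.groupby("key").cumcount()

# Pivot to one row per review and shorten 'review/score' to 'score', etc.
wide = (
    long_df.assign(review=review_id)
    .pivot(index="review", columns="key", values="value")
    .rename(columns=lambda c: c.split("/")[-1])
)

# Clean up the leading space left by the split on ':' and make score numeric.
wide["score"] = wide["score"].str.strip().astype(float)
wide["text"] = wide["text"].str.strip()

print(wide)
```

If a review is missing its score or text, the cumcount numbering drifts and rows get misaligned, so this only holds for files as regular as the sample.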
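Answer 1 (score: 0)
Actually, I am not aware that pandas can read this file format directly.
I suggest writing a Python program that reads your file and outputs a CSV file. Let's call it sentiment.csv:
productId,userId,profileName,helpfulness,score,time,summary,text
B003AI2VGA,A141HP4LYPWMSR,Brian E. Erland "Rainbow Sphinx",7/7,3.0,1182729600,"There Is So Much Darkness Now ~ Come For The Miracle","Synopsis: On the daily trek from Juarez, Mexico to ..."
B003AI2VGA,A328S9RN3U5M68,Grady Harp,4/4,3.0,1181952000,Worthwhile and Important Story Hampered by Poor Script and Production,THE VIRGIN OF JUAREZ is based on true events...
Then simply use: df = pd.read_csv('sentiment.csv')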
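The conversion program itself is not shown in the answer. Here is one possible sketch, assuming each record repeats the same keys and that a repeated key marks the start of the next record. The function names parse_records and convert and the FIELDS list are hypothetical, chosen for illustration.

```python
import csv
import io

# Field names in the order they appear in each record of the sample data.
FIELDS = ["productId", "userId", "profileName", "helpfulness",
          "score", "time", "summary", "text"]


def parse_records(lines):
    """Yield one dict per review from an iterable of 'prefix/key: value' lines."""
    record = {}
    for line in lines:
        line = line.strip()
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split(":", 1)  # split on the first colon only
        name = key.split("/")[-1]        # 'review/score' -> 'score'
        if name not in FIELDS:
            continue
        # Assumption: a key seen twice means a new record has started.
        if name in record:
            yield record
            record = {}
        record[name] = value.strip()
    if record:
        yield record


def convert(src, dst):
    """Write the records parsed from text stream src to text stream dst as CSV."""
    writer = csv.DictWriter(dst, fieldnames=FIELDS)
    writer.writeheader()
    for rec in parse_records(src):
        writer.writerow(rec)


# Small in-memory demonstration using lines from the sample data.
sample = io.StringIO(
    "product/productId: B003AI2VGA\n"
    "review/score: 3.0\n"
    "review/text: Synopsis: On the daily trek from Juarez, Mexico to ...\n"
    "product/productId: B003AI2VGA\n"
    "review/score: 3.0\n"
    "review/text: THE VIRGIN OF JUAREZ is based on true events...\n"
)
out = io.StringIO()
convert(sample, out)
print(out.getvalue())
```

For a real file, open the input and output with open('file.txt') and open('sentiment.csv', 'w', newline='') and pass both handles to convert. The csv module quotes fields that contain commas, so review text such as "Juarez, Mexico" does not break the column layout.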
Answer 2 (score: 0)
I think the answer from @sanrio is probably the most straightforward, but here is an option that does the string manipulation in pandas:
import numpy as np
import pandas as pd

with open('your_text_file.txt') as f:
    text_lines = f.readlines()
# create pandas Series object where each value is a text line from your file
s = pd.Series(text_lines)
# remove the new-lines
s = s.str.strip()
# extract some columns using regex and represent in a dataframe
df = s.str.split(r'\s?(.*)/([^:]*):(.*)', expand=True)
# remove irrelevant columns
df = df.replace('', np.nan).dropna(how='all', axis=1)
def gb_organize(df_):
    """
    Organize a sub-dataframe from group-by operation.
    """
    df_ = df_.dropna()
    return pd.DataFrame(df_[3].values, index=df_[2].values).T
# pass a Series object to .groupby to iterate over consecutive non-null rows
df_result = df.groupby(df.isna().all(axis=1).cumsum(), group_keys=False).apply(gb_organize)
df_result = df_result.set_index(['productId', 'userId'])
# then you can access the records you want with the following:
df_result[['score', 'text']]
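The grouping above relies on blank lines separating the records. If the file has no blank lines, as in the sample shown in the question, a variant can number the records by their productId lines instead. This is a sketch under that assumption (every record starts with a productId line and no key repeats within a record); it swaps str.split for str.extract with named groups, and the names parts, record_id and records are illustrative.

```python
import io

import pandas as pd

# A few lines from the sample data; in practice read them from the file.
raw = io.StringIO(
    "product/productId: B003AI2VGA\n"
    "review/helpfulness: 7/7\n"
    "review/score: 3.0\n"
    "review/text: Synopsis: On the daily trek from Juarez, Mexico to ...\n"
    "product/productId: B003AI2VGA\n"
    "review/helpfulness: 4/4\n"
    "review/score: 3.0\n"
    "review/text: THE VIRGIN OF JUAREZ is based on true events...\n"
)
s = pd.Series(raw.read().splitlines()).str.strip()

# Named groups: the part after the slash is the key, the rest is the value.
# Lines that do not match become NaN rows and are dropped.
parts = s.str.extract(r"^[^/:]+/(?P<key>[^:]+):\s*(?P<value>.*)$").dropna()

# Assumption: every record starts with a productId line, so a running count
# of those lines labels each row with its record number (1, 2, ...).
record_id = (parts["key"] == "productId").cumsum()

# One row per record, one column per key.
records = parts.assign(record=record_id).pivot(
    index="record", columns="key", values="value"
)

result = records[["score", "text"]]
print(result)
```

pivot raises a ValueError if a key appears twice within one record, which makes violations of the assumption visible instead of silently misaligning rows.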