I have a fairly large dataset stored in a DataFrame. So large, in fact, that sorting the dataset down to produce a sample dataset crashed my text editor, so I am providing a link to the dataset I am working with instead:
For planning purposes, I need to retrieve the vocabulary of words from the question, article title, and paragraph context columns.
However, it seems that somewhere in the process of splitting and merging columns I inadvertently created some words by joining two other words end to end (for example, "raised" and "in" becoming "raisedin", and "catalans" and "what" becoming "catalanswhat").
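For illustration only, here is a minimal sketch (using made-up column values, not rows from the dataset) of the kind of merge that can produce these run-together tokens: pandas' `Series.str.cat` glues strings end to end when no separator is passed.

```python
import pandas as pd

question = pd.Series(["Who are the catalans"])
title = pd.Series(["what language do they speak"])

# With the default sep=None the strings are glued together,
# so "catalans" + "what" becomes the bogus token "catalanswhat".
merged = question.str.cat(title)
print(merged.str.split()[0])     # [..., 'catalanswhat', ...]

# Passing sep=" " keeps the word boundary intact.
merged_ok = question.str.cat(title, sep=" ")
print(merged_ok.str.split()[0])  # [..., 'catalans', 'what', ...]
```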
### Loading JSON datasets
```python
import json
import re

import pandas as pd

regex = re.compile(r'\W+')

def readFile(filename):
    with open(filename) as file:
        fields = []
        JSON = json.loads(file.read())
        for article in JSON["data"]:
            articleTitle = article["title"]
            for paragraph in article["paragraphs"]:
                paragraphContext = paragraph["context"]
                for qas in paragraph["qas"]:
                    question = qas["question"]
                    for answer in qas["answers"]:
                        fields.append({"question":question,"answer_text":answer["text"],"answer_start":answer["answer_start"],"paragraph_context":paragraphContext,"article_title":articleTitle})
    fields = pd.DataFrame(fields)
    fields["question"] = fields["question"].str.replace(regex," ")
    assert not (fields["question"].str.contains("catalanswhat").any())
    fields["paragraph_context"] = fields["paragraph_context"].str.replace(regex," ")
    fields["answer_text"] = fields["answer_text"].str.replace(regex," ")
    assert not (fields["answer_text"].str.contains("catalanswhat").any())
    fields["article_title"] = fields["article_title"].str.replace("_"," ")
    assert not (fields["article_title"].str.contains("catalanswhat").any())
    return fields

# Load training dataset.
trainingData = readFile("train-v1.1.json")

# Vocabulary functions
def vocabulary():
    data_frame = trainingData
    data_frame = data_frame.astype("str")
    text_split = pd.concat((data_frame["question"],data_frame["paragraph_context"],data_frame["article_title"]),ignore_index=True)
    text_split = text_split.str.split()
    words = set()
    text_split.apply(words.update)
    return words

def vocabularySize():
    return len(vocabulary())
```
Alternate code that also fails:
```python
def vocabulary():
    data_frame = trainingData
    data_frame = data_frame.astype("str")
    concat = data_frame["question"].str.cat(sep=" ",others=[data_frame["paragraph_context"],data_frame["article_title"]])
    concat = concat.str.split(" ")
    words = set()
    concat.apply(words.update)
    print(words)
    assert "raisedin" not in words
    return words
```
### Answer 0 (score: 0)
Here is how I solved the problem:
```python
import pandas as pd
from pandas import json_normalize  # on older pandas versions: from pandas.io.json import json_normalize
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_json('train-v1.1.json')
words = []

for idx, row in df.iterrows():
    # title
    words.append(json_normalize(df['data'][idx])['title'].str.replace("_"," ").to_string(index = False))
    # paragraph context
    words.append(json_normalize(df['data'][idx], record_path = 'paragraphs')['context'].to_string(index = False))
    # question
    words.append(json_normalize(df['data'][idx], record_path = ['paragraphs', 'qas'])['question'].to_string(index = False))

vectorizer = CountVectorizer()
count = vectorizer.fit_transform(words)
vectorizer.get_feature_names()
```
sklearn has a function that does what you want here: it collects all of the individual words in a body of text. To use it, we need to get all of the data into a single list or Series.
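A minimal, self-contained sketch of that idea (the two strings below are placeholders, not rows from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Any list of strings works; each string can hold an entire question, context, or title.
docs = [
    "How long does it take for a broiler raised in a barn",
    "Poultry farming is the raising of domesticated birds",
]

vectorizer = CountVectorizer()
vectorizer.fit_transform(docs)

# One entry per distinct (lower-cased) token across all of the strings.
# On scikit-learn >= 1.0 this is spelled get_feature_names_out().
print(vectorizer.get_feature_names())
```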
We build that list by first reading the file. I noticed that it has many JSON objects embedded in it, so next we iterate over the different JSON entries, extract the data we need, and add it to the words list.
We pull out the information we need like this:
```python
json_normalize(df['data'][idx], record_path = ['paragraphs', 'qas'])['question'].to_string(index = False)
```
We look at the data column of df, which holds the individual JSON objects. We navigate down through the JSON via record_path until we reach the records we want. Next we take the column we need, convert it all to one string, and append that new string to the main words list. We do this for each of the different JSON entries.
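To make the record_path navigation concrete, here is a tiny sketch on a hand-built structure shaped like one entry of df['data'] (the field values are invented):

```python
from pandas import json_normalize

article = {
    "title": "Poultry",
    "paragraphs": [
        {
            "context": "Poultry farming is the raising of domesticated birds.",
            "qas": [{"question": "What is poultry farming?", "id": "q1"}],
        }
    ],
}

# No record_path: one row per article, which gives us the titles.
print(json_normalize(article)["title"])
# record_path='paragraphs': one row per paragraph, which gives us the contexts.
print(json_normalize(article, record_path="paragraphs")["context"])
# record_path=['paragraphs', 'qas']: one row per question.
print(json_normalize(article, record_path=["paragraphs", "qas"])["question"])
```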
If you want to do more string manipulation (for example, replacing the "_" in the titles with a space), you can do it inside the for loop or on the main words list afterwards. I only did it for the title in my case.
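For instance, the "do it on the main words list afterwards" option could be a one-liner like this (a sketch that assumes the words list built in the loop above):

```python
# Apply the same underscore clean-up to every collected string after the loop.
words = [w.replace("_", " ") for w in words]
```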
Finally, we count the words. We create a CountVectorizer called vectorizer and fit and transform the list with it. At the end we can look inside the CountVectorizer through the get_feature_names() function to see every individual word. Note that if there are typos in the text, they will show up there as well.
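And since the original goal was a vocabulary, the fitted vectorizer also gives the vocabulary size directly, analogous to the vocabularySize() function in the question:

```python
# get_feature_names() already returns unique tokens, so its length is the vocabulary size.
print(len(vectorizer.get_feature_names()))
```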
Edit:
You can use the code below to search for words and see where they occur. Change the values in checking to whatever you want to look for.
```python
df = pd.read_json('train-v1.1.json')
vectorizer = CountVectorizer()
checking = ['raisedin']

for idx, row in df.iterrows():
    title = []
    para = []
    quest = []
    getTitle = json_normalize(df['data'][idx])['title'].str.replace("_"," ")
    getPara = json_normalize(df['data'][idx], record_path = 'paragraphs')['context']
    getQuest = json_normalize(df['data'][idx], record_path = ['paragraphs', 'qas'])['question']
    title.append(getTitle.str.replace("_"," ").to_string(index = False))
    para.append(getPara.to_string(index = False))
    quest.append(getQuest.to_string(index = False))
    for word in checking:
        for allwords in [getTitle, getPara, getQuest]:
            count = vectorizer.fit_transform(allwords)
            test = vectorizer.get_feature_names()
            if word in test:
                print(getTitle)
                print(f"{word} is in: " + allwords.loc[allwords.str.contains(word)])
```
```
0    Poultry
Name: title, dtype: object
93    raisedin is in: How long does it take for an broiler raisedin...
Name: question, dtype: object
```