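I have a function that I apply to every row of a pandas DataFrame. Its main purpose is to query a REST API (Azure's Text Analytics API) and return a list of the entities it recognizes.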
import pandas as pd

# `client` is assumed to be an already-authenticated Azure Text Analytics client.

def get_entity_rec(row):
    try:
        textcon = row._c0[0:5000]      # first 5000 characters, sent for entity recognition
        doc = [textcon]
        textconlang = row._c0[0:1000]  # first 1000 characters, sent for language detection
        doclang = [textconlang]
        # Get language
        response = client.detect_language(documents=doclang, country_hint='us')[0]
        row['language'] = response.primary_language.name
        result = client.recognize_entities(documents=doc)[0]
        row['items'] = [[entity.text, entity.category, entity.subcategory, entity.confidence_score] for entity in result.entities]
        return row
    except Exception as err:
        print("Encountered exception. {}".format(err))
        return row  # without this, a failed row would come back as None

d = {'_c0': ['London', 'Paris'], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
pd_df2 = df.apply(get_entity_rec, axis=1)
pd_df2
Originally I had a for loop like this:
a = []
for entity in result.entities:
    b = [entity.text, entity.category, entity.subcategory, entity.confidence_score]
    a.append(b)
row['items'] = a
return row
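From reading around, though, it seemed that a list comprehension should perform better. After making the change, however, I get almost the same run time (sometimes the for loop is even slightly faster).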
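To check how much the list building itself contributes, the two variants can be timed in isolation, without any API calls. This is only a rough sketch: `Entity` is a hypothetical stand-in for the objects in `result.entities`, and the count of 50 entities per document is an arbitrary assumption.

```python
import timeit
from collections import namedtuple

# Hypothetical stand-in for the entity objects returned by the API (assumption).
Entity = namedtuple("Entity", ["text", "category", "subcategory", "confidence_score"])
entities = [Entity("London", "Location", "GPE", 0.99) for _ in range(50)]

def with_loop(entities):
    # Original approach: append one list per entity.
    a = []
    for entity in entities:
        b = [entity.text, entity.category, entity.subcategory, entity.confidence_score]
        a.append(b)
    return a

def with_comprehension(entities):
    # Rewritten approach: build the same list with a comprehension.
    return [[entity.text, entity.category, entity.subcategory, entity.confidence_score]
            for entity in entities]

runs = 10_000
loop_time = timeit.timeit(lambda: with_loop(entities), number=runs)
comp_time = timeit.timeit(lambda: with_comprehension(entities), number=runs)

# Report the average cost of a single call in microseconds.
print("for loop:      {:.2f} microseconds per call".format(loop_time / runs * 1e6))
print("comprehension: {:.2f} microseconds per call".format(comp_time / runs * 1e6))
```

Both functions return the same list, so whatever gap this prints is the most the rewrite could save per row. Each row also makes two requests to the API, which presumably take far longer than building a short list, and that may be why the overall run time barely changes.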