My file has lines like the following:
"voc_sales_dac" "QVN" "BE" "FR" "21513287expe" "21513287" "expe" "10" "7" "vehicule livrée mais vendeur en congé donc vehicule receptioné plus tard"
"voc_sales_dac" "QVN" "CH" "FR" "21207010reco" "21207010" "reco" "10" "10" "A ma fille"
What I do is tokenize the text in field 10, first into sentences and then into individual words, in order to extract the starting position of each word within the text.
What I want to get is a dictionary like this:
maped { 21513287expe: { vehicule: 0,
livrée: 10,
mais: 17,
vendeur: 22,
en: 30,
congé: 33,
donc: 39,
vehicule: 44,
recepcioné: 53,
plus: 64,
tard: 69
},
21207010reco: { A: 0,
ma: 3,
fille: 6
},
}
What I did:
import csv
import re

import nltk.data
from nltk.tokenize import TreebankWordTokenizer

W_tokenizer = TreebankWordTokenizer()
S_tokenizer = nltk.data.load('tokenizers/punkt/PY3/french.pickle')

pattern = re.compile("[a-zá-úä-üâ-ûà-ùç]+")

with open('FR_test.csv', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter="\t", skipinitialspace=True)
    for row in reader:
        phrases = S_tokenizer.tokenize(row[9])
        for v in phrases:
            tokens = W_tokenizer.tokenize(v)
            maped = {row[4]: {w: row[9].index(w)} for w in tokens if pattern.match(w)}
Is it possible to achieve this with a dictionary comprehension?
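A nested dictionary comprehension can build the whole mapping in one pass. Note two issues in the attempt above: `maped` is reassigned on every loop iteration (so only the last word of the last row survives), and `row[9].index(w)` always returns the offset of the *first* occurrence, which is wrong for repeated words such as "vehicule". The sketch below sidesteps both by using `re.finditer`, which yields each match together with its own offset; it hard-codes the two data rows from the question (as `(id, text)` pairs) instead of reading FR_test.csv, so it runs standalone:

```python
import re

# Words including accented Latin-1 letters (a broader pattern than the
# asker's lowercase-only one, so that "A" in "A ma fille" also matches).
pattern = re.compile(r"[A-Za-zÀ-ÿ]+")

# (row[4], row[9]) pairs taken from the two sample lines in the question.
rows = [
    ("21513287expe",
     "vehicule livrée mais vendeur en congé donc vehicule receptioné plus tard"),
    ("21207010reco", "A ma fille"),
]

# Outer comprehension: one entry per row id.
# Inner comprehension: each word mapped to its start offset; m.start() is
# the position of *that* match, so repeated words get their own offsets
# (the dict keeps the last occurrence, as in the desired output).
maped = {
    key: {m.group(): m.start() for m in pattern.finditer(text)}
    for key, text in rows
}

print(maped)
```

Exact offsets may differ from the sample output in the question if the original text contains extra spaces, but each value is the true start index of the word in that row's text.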