Python dictionary with comprehension

Time: 2017-04-30 11:21:18

Tags: python-3.x dictionary list-comprehension

My file has lines like these:

"voc_sales_dac" "QVN"   "BE"    "FR"    "21513287expe"  "21513287"  "expe"  "10"    "7" "vehicule livrée mais vendeur en congé donc vehicule receptioné plus tard"
"voc_sales_dac" "QVN"   "CH"    "FR"    "21207010reco"  "21207010"  "reco"  "10"    "10"    "A ma fille"

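What I am doing is tokenizing the text in field 10, first into sentences and then into individual words, in order to extract the starting position of each word in the text.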

What I want to get is a dictionary like this:

maped { 21513287expe: { vehicule: 0,
                        livrée: 10,
                        mais: 17,
                        vendeur: 22,
                        en: 30,
                        congé: 33,
                        donc: 39,
                        vehicule: 44,
                        receptioné: 53,
                        plus: 64,
                        tard: 69
                       },
        21207010reco: { A: 0,
                        ma: 3,
                        fille: 6
                      },
      }  

What I have done:

import nltk.data
from nltk.tokenize import TreebankWordTokenizer
W_tokenizer = TreebankWordTokenizer()
S_tokenizer = nltk.data.load('tokenizers/punkt/PY3/french.pickle')
import csv
import re

pattern = re.compile("[a-zá-úä-üâ-ûà-ùç]+")

with open('FR_test.csv', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter="\t",skipinitialspace=True)
    for row in reader:
        phrases = S_tokenizer.tokenize(row[9])
        for v in phrases:
            tokens = W_tokenizer.tokenize(v)
            maped={row[4]:{w:row[9].index(w)} for w in tokens if pattern.match(w)}

Is it possible to achieve this with a dictionary comprehension?

1 answer:

Answer 0: (score: 1)

Try this:

#standardSQL
SELECT
 sonnetsCorp,
 count(distinct word) cnt,
 count(distinct word)/sum(count(distinct word)) over (partition by sonnetsCorp) ratio
FROM (
  SELECT
   *,
   corpus = 'sonnets' AS sonnetsCorp
  FROM `bigquery-public-data.samples.shakespeare`
)
GROUP BY sonnetsCorp;
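
For the dictionary described in the question, one possible Python sketch is shown below. It is not the asker's NLTK pipeline: it swaps the sentence and word tokenizers for a plain regular expression (an assumption made to keep the example self-contained), and the names WORD, word_offsets and build_map are hypothetical. The outer dictionary is built with a comprehension, while the inner one uses a short loop so that a repeated word such as "vehicule" keeps its first position instead of being overwritten. Offsets are the 0-based character positions Python reports, so they may differ slightly from the hand-written numbers in the question.

```python
import csv
import io
import re

# Assumption: a "word" is a run of Unicode letters (no digits, no underscore).
WORD = re.compile(r"[^\W\d_]+")


def word_offsets(text):
    """Map each word to the 0-based offset of its first occurrence in text."""
    offsets = {}
    for match in WORD.finditer(text):
        # setdefault keeps the earliest offset when a word appears again.
        offsets.setdefault(match.group(), match.start())
    return offsets


def build_map(rows):
    """row[4] is the record id and row[9] the free text, as in the question."""
    return {row[4]: word_offsets(row[9]) for row in rows}


# Two sample records in the question's layout (assumed to be tab-separated).
sample = "\n".join([
    "\t".join(['"voc_sales_dac"', '"QVN"', '"BE"', '"FR"', '"21513287expe"',
               '"21513287"', '"expe"', '"10"', '"7"',
               '"vehicule livrée mais vendeur en congé donc vehicule '
               'receptioné plus tard"']),
    "\t".join(['"voc_sales_dac"', '"QVN"', '"CH"', '"FR"', '"21207010reco"',
               '"21207010"', '"reco"', '"10"', '"10"', '"A ma fille"']),
])

reader = csv.reader(io.StringIO(sample), delimiter="\t", skipinitialspace=True)
maped = build_map(reader)

for key, words in maped.items():
    print(key, words)
```

To read the real file, the io.StringIO(sample) argument would be replaced by the open('FR_test.csv', encoding='utf-8') handle from the question. If every occurrence of a word is needed rather than only the first, the inner dictionary could instead map each word to a list of offsets.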