Question

I have a list of sentences represented as strings, like this:

original_format = ["This is a question", "This is another question", "And one more too"]

I want to convert this list into the set of unique words in the corpus. Given the list above, the output would look like this:

{'And', 'This', 'a', 'another', 'is', 'more', 'one', 'question', 'too'}

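I have already found a way to do this, but it takes a long time to run. I'm interested in a more efficient way to convert from one format to the other (especially since my actual dataset contains more than 200k sentences).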

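For reference, what I'm doing now is creating an empty set for the vocabulary, then looping over each sentence (split on spaces) and taking the union with the vocabulary set. Using the original_format variable defined above, it looks like this: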

vocab = set()
for q in original_format:
    vocab = vocab.union(set(q.split(' ')))

Can you help me run this conversion more efficiently?

Answer 1

You can use itertools.chain together with set. This avoids nested for loops and list construction.

from itertools import chain

original_format = ["This is a question", "This is another question", "And one more too"]

res = set(chain.from_iterable(i.split() for i in original_format))

print(res)

{'And', 'This', 'a', 'another', 'is', 'more', 'one', 'question', 'too'}

Or, for a truly functional approach:

from itertools import chain
from operator import methodcaller

res = set(chain.from_iterable(map(methodcaller('split'), original_format)))
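One likely reason the loop in the question is slow is that vocab.union(...) builds a brand-new set on every iteration, while set.update adds to the existing set in place. The sketch below puts the in-place loop next to the chain version so the two can be compared; the function names build_vocab_update and build_vocab_chain are made up for illustration and are not part of any library.

```python
from itertools import chain


def build_vocab_update(sentences):
    """Accumulate words in place instead of creating a new set per sentence."""
    vocab = set()
    for sentence in sentences:
        # update() mutates vocab; union() would return a fresh copy each time
        vocab.update(sentence.split())
    return vocab


def build_vocab_chain(sentences):
    """Flatten all the split sentences lazily and build the set once."""
    return set(chain.from_iterable(s.split() for s in sentences))


original_format = ["This is a question", "This is another question", "And one more too"]

# Both approaches produce the same vocabulary
assert build_vocab_update(original_format) == build_vocab_chain(original_format)
print(sorted(build_vocab_chain(original_format)))
```

Which of the two is faster on 200k sentences is worth measuring on your own data, for example with timeit.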

Answer 2

Use a simple set comprehension:

{j for i in original_format for j in i.split()}

Output:

{'And', 'This', 'a', 'another', 'is', 'more', 'one', 'question', 'too'}
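Note that the answers call split() with no argument, while the question uses split(' '). The two differ when a sentence has repeated, leading, or trailing spaces. A small sketch of the difference follows; the helper name vocab_from_sentences and the messy sample data are invented for illustration.

```python
def vocab_from_sentences(sentences):
    """Set comprehension over whitespace-split words (split() with no argument)."""
    return {word for sentence in sentences for word in sentence.split()}


# Hypothetical input with a double space and leading/trailing spaces
messy = ["This  is a question ", " And one more"]

# split(' ') keeps empty strings between and around the extra spaces
with_space_arg = set()
for sentence in messy:
    with_space_arg.update(sentence.split(' '))

no_arg = vocab_from_sentences(messy)

print('' in with_space_arg)  # True
print('' in no_arg)          # False
```

If the real data may contain irregular spacing, split() with no argument avoids ending up with an empty string in the vocabulary.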
