Question

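I have a list of phrases, and I want to find out which two-word sequences occur most often across all of them.

I tried using regular expressions and other code, but I can't find the right way to do this.

Can anyone help?

For example: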

I am purchasing a wallet
a wallet for 20$
purchasing a bag

I would expect to find that

a wallet occurs 2 times
purchasing a occurs 2 times

Answer 1

<?php
$string = "I am purchasing a wallet a wallet for 20$ purchasing a bag";
// split the string into words
$words  = explode(' ', $string);

// chunk into blocks of two, i.e. [0,1][2,3]...
$chunks = array_chunk($words, 2);

// remove the first array element
unset($words[0]);
// chunk into blocks of two again;
// since the first element is removed, the real blocks are [1,2][3,4]...
$alternateChunks = array_chunk($words, 2);
// merge both sets of chunks
$totalChunks = array_merge($chunks, $alternateChunks);

$finalChunks = array();
foreach ($totalChunks as $t)
{
    // skip a leftover single word at the end of a chunk list
    if (count($t) < 2) {
        continue;
    }
    // turn each chunk into a phrase joined with +
    // (+ can be replaced with a space if needed;
    // it is used instead of whitespace to keep the associative keys working)
    $finalChunks[] = implode('+', $t);
}
// count how often each phrase occurs in the array
$result = array_count_values($finalChunks);
echo "<pre>";
print_r($result);

Answer 2

Try putting it into an array with explode and counting the values with array_count_values.

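<?php
$text = "whatever"; // all phrases joined into one string

// split the text into words
$text_array = explode(' ', $text);
$double_words = array();

// pair each word with the word before it
for ($c = 1; $c < count($text_array); $c++)
{
  $double_words[] = $text_array[$c - 1] . ' ' . $text_array[$c];
}

// count how often each pair occurs
$result = array_count_values($double_words);
var_dump($result);

?>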

I've now updated this to the two-word version. Does that work for you?

array(9) { 
  ["I am"]=> int(1) 
  ["am purchasing"]=> int(1) 
  ["purchasing a"]=> int(2) 
  ["a wallet"]=> int(2) 
  ["wallet a"]=> int(1) 
  ["wallet for"]=> int(1) 
  ["for 20$"]=> int(1) 
  ["20$ purchasing"]=> int(1) 
  ["a bag"]=> int(1) 
}
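
One thing to watch for in the output above: pairs such as "wallet a" and "20$ purchasing" span two different phrases, because everything was joined into one string first. A minimal sketch that counts pairs inside each phrase separately, assuming the phrases are available as an array and PHP 7 or later (the function name countWordPairs is made up for this example):

```php
<?php
// Hypothetical helper: counts consecutive word pairs within each phrase,
// so no pair is formed across the boundary between two phrases.
function countWordPairs(array $phrases): array
{
    $counts = array();
    foreach ($phrases as $phrase) {
        // Split on any run of whitespace and drop empty pieces
        $words = preg_split('/\s+/', $phrase, -1, PREG_SPLIT_NO_EMPTY);
        for ($i = 1; $i < count($words); $i++) {
            $pair = $words[$i - 1] . ' ' . $words[$i];
            $counts[$pair] = ($counts[$pair] ?? 0) + 1;
        }
    }
    return $counts;
}

$phrases = array(
    'I am purchasing a wallet',
    'a wallet for 20$',
    'purchasing a bag',
);
$pairCounts = countWordPairs($phrases);
arsort($pairCounts); // highest counts first
print_r($pairCounts);
```

For the three example phrases this gives a count of 2 for "a wallet" and "purchasing a" and a count of 1 for every other pair.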

Answer 3

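Put them all in an array, then access them by the current word index and the next word index.

I think this should do the trick. It grabs pairs of words, except at the end of the string, where you only get a single word.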

$str = "I purchased a wallet because I wanted a wallet a wallet a wallet";
$words = explode(" ", $str);

$array_results = array();
for ($i = 0; $i < count($words); $i++) {
  if ($i < count($words) - 1) {
     $pair = $words[$i] . " " . $words[$i+1];
     // Have to check if the key is in use yet to avoid a notice
     $array_results[$pair] = isset($array_results[$pair]) ? $array_results[$pair] + 1 : 1;
  }
  // At the end of the array, just use a single word
  else {
     $array_results[$words[$i]] = isset($array_results[$words[$i]]) ? $array_results[$words[$i]] + 1 : 1;
  }
}

// Sort the results
// use arsort() instead to get the highest first
asort($array_results);
print_r($array_results);

// Prints:
Array
(
    [I wanted] => 1
    [wanted a] => 1
    [wallet] => 1
    [because I] => 1
    [wallet because] => 1
    [I purchased] => 1
    [purchased a] => 1
    [wallet a] => 2
    [a wallet] => 4
)

Update: changed ++ to + 1, because ++ didn't work when I tested it...
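
To read the winner off a counts array like the one printed above, one option is max() combined with array_keys(), which also copes with ties. A small sketch, with the counts copied from that printed result (the helper name mostFrequentKeys is invented for this example):

```php
<?php
// Hypothetical helper: returns every key that shares the highest count.
function mostFrequentKeys(array $counts): array
{
    if (count($counts) === 0) {
        return array();
    }
    // array_keys() with a search value returns all keys holding that value
    return array_keys($counts, max($counts), true);
}

$counts = array(
    'I wanted'       => 1,
    'wanted a'       => 1,
    'wallet'         => 1,
    'because I'      => 1,
    'wallet because' => 1,
    'I purchased'    => 1,
    'purchased a'    => 1,
    'wallet a'       => 2,
    'a wallet'       => 4,
);

// All entries tied for first place; here that is just "a wallet"
print_r(mostFrequentKeys($counts));

// The two highest entries; array_slice keeps string keys as they are
arsort($counts);
print_r(array_slice($counts, 0, 2));
```

If only the ranking matters, arsort() alone is enough; the helper is mainly useful when several pairs share the top count.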

Answer 4

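I hesitate to suggest this, because it is an extremely brute-force way of going about it:

Take your string of words, split it with the explode(" ", $string); command, then run it through a for loop that checks each pair of consecutive words against every pair of consecutive words in the string.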

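$string = "I am purchasing a wallet a wallet for 20$ purchasing a bag";
$words = explode(" ", $string);
$count = array();
$last = count($words) - 1; // a pair can start at every index except the last
for ($t = 0; $t < $last; $t++)
{
    for ($i = 0; $i < $last; $i++)
    {
        if (($words[$t] . " " . $words[$t+1]) == ($words[$i] . " " . $words[$i+1])) {
            // a pair that occurs k times ends up with a count of k*k,
            // so the ranking is still correct
            $key = $words[$i] . " " . $words[$i+1];
            $count[$key] = isset($count[$key]) ? $count[$key] + 1 : 1;
        }
    }
}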

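So the nested for loops step in, grab the first two words, compare them against every other pair of consecutive words, then grab the next two words and do it again. Every pair will have a count of at least 1 (it always matches itself), but sorting the resulting array by size will give you the most repeated values.

Note that this runs (n-1)*(n-1) iterations, which can get unwieldy quickly.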

Answer 5

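Since you used the excel tag, I thought I'd give it a try, and it's actually really easy.

1. Split the string using space as the delimiter. Data > Text to Columns... > Delimited > Delimiter: Space. Each word is now in its own cell.
2. Transpose the result (not strictly necessary, but easier to visualize). Copy, then Edit > Paste Special... > Transpose.
3. Make cells containing consecutive word pairs. So if your words are in cells B5:B15, cell C5 should be =B5&" "&B6 (and drag down).
4. Count the occurrences of each word pair: in cell D5, =COUNTIF($C$5:$C$15,"="&C5), and drag down.
5. Highlight the winner. Select C5:D15, Format > Conditional Formatting... > Formula Is =$D5=MAX($D$5:$D$15) and choose e.g. a yellow background.

Note that step 4 is somewhat inefficient, because the count for a word pair is computed once for every occurrence of that pair. If this is a concern, you can first make a list of unique word pairs using Data > Filter > Advanced Filter... > Unique records only.

An automated VBA solution can easily be made by recording a macro of the above and then making a few small edits.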

Answer 6

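One approach is to use SPLIT or a regular expression to break the sentences into words and store each sentence in an array. Then take the array and create a dictionary object. When adding a term to the dictionary, if it already exists, add 1 to its .Value to keep the count.

Here is some sample code (far from perfect, since it is only meant to show the overall concept) that takes all the strings in column A and produces a word frequency list in columns B and C. It's not exactly what you want, but I hope it gives you some ideas on how to go about it: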

Sub FrequencyList()

Dim vArray As Variant
Dim myDict As Variant
Set myDict = CreateObject("Scripting.Dictionary")
Dim i As Long
Dim cell As range

With myDict
    For Each cell In range("A1", cells(Rows.count, "A").End(xlUp))
        vArray = Split(cell.Value, " ")
        For i = LBound(vArray) To UBound(vArray)
            If Not .exists(vArray(i)) Then
                .Add vArray(i), 1
            Else
                .Item(vArray(i)) = .Item(vArray(i)) + 1
            End If
        Next
    Next
    range("B1").Resize(.count).Value = Application.Transpose(.keys)
    range("C1").Resize(.count).Value = Application.Transpose(.items)
    End With

End Sub

Get the two most frequent words across several strings

6 answers: