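I am trying to convert .wav files to text so that we can run speech analytics on the phone calls my company answers. I have a working prototype, but it is slow and takes a long time to transcribe even 100 files. I need to be able to process about 30k files a day.
Here is my code so far. It runs in steps, and the user has to start each step one after another.
The first step gets the files from the S3 server.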
S3.php
<?php
require 'aws-autoloader.php';
include 'db.php'; // my DB connection file
set_time_limit(0);
ignore_user_abort(true);
ini_set('max_execution_time', 0); // disable the time limit so the script does not time out
$credentials = new Aws\Credentials\Credentials('', '');
$s3Client = new Aws\S3\S3Client([ // S3 client connection
'version' => 'latest',
'region' => 'us-east-1',
'credentials' => $credentials
//'debug' => true
]);
echo "<br>";
$objects = $s3Client->getIterator('ListObjects', array(
"Bucket" => 'bucket_name',
"Prefix" => "folder1/folder2/folder3/2017/10/05/"
));
$i = 0;
foreach ($objects as $object) {
try {
if ($i == 140) break; // Counter that limits the run to 140 files
if ($object['Size'] > 482000 and $object['Size'] < 2750000) { // file filtering: skip objects that are too small or too big
echo $object['Key'] . "<br>";
$i++;
$cmd = $s3Client->getCommand('GetObject', [
'Bucket' => 'bucket_name',
'Key' => $object['Key']
]);
// Create a signed URL from a completely custom HTTP request that
// will last for 10 minutes from the current time
$signedUrl = $s3Client->createPresignedRequest($cmd, '+10 minutes');
ob_start();
echo $url = (string)$signedUrl->getUri();
ob_end_flush();
ob_flush();
flush();
$filename = parse_url($url, PHP_URL_PATH);
$arr = explode("_", basename($filename));
$filename = $arr[0] . ".wav";
file_put_contents('uploads/' . basename($filename), fopen($url, 'r')); // Storing the files in uploads folder on my Linux server
$sql = "INSERT INTO `audioFiles` (`audioFile`) VALUES (?)"; // Insert the file name into the DB to keep track of it
$STH = $DBH->prepare($sql);
$STH->execute([basename($filename)]); // bind the name as a parameter instead of concatenating it into the SQL
}
//print_r($object);
} catch (Exception $e) {
print_r($e);
}
}
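After downloading the files, I need to split each recording into its left and right channels and use the first 5 seconds of the right one. I am doing this because transcribing the whole call is costly, and this is more of a pilot application that has to scale to thousands of files before we can justify transcribing the full duration of each file.
Below is the part of the script that splits the channels and extracts the first 5 seconds. I get the names of the files whose split marker is 0 from the DB, split them, then update the DB with the new names and set the marker to 1.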
Split.php
$sql = "SELECT audioFile FROM audioFiles WHERE split = 0"; // SQL to get file names
$sql_update = "UPDATE audioFiles SET split = 1 WHERE audioFile IN ("; // SQL to update split files
.
.
while ($fileName = $STH->fetch()) {
echo $output = shell_exec("sox --i " . $location . " | grep Channels | sed 's/^.*: //'"); // to check if the file has stereo or mono recording
if ($output == 2) {
$left = substr($location, 0, $extension_pos) . '.CALLER' . substr($location, $extension_pos);
$right = substr($location, 0, $extension_pos) . '.AGENT' . substr($location, $extension_pos);
$ap = substr($location, 0, $extension_pos) . '.AGENT.AP' . substr($location, $extension_pos);
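exec("sox $location $left remix 1 "); // left channel = caller
exec("sox $location $right remix 2 "); // right channel = agent
exec("sox $right $ap trim 0 5"); // first 5 seconds of the agent channel, not of the stereo original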
$sql_update .= "'" . $fileName[0] . "',";
$sql_update_agentTranscript = "UPDATE audioFiles SET agentFile ='" . $right . "', agentAP ='".$ap ."' WHERE audioFile ='" . $fileName[0] . "'";
$STH1 = $DBH->prepare($sql_update_agentTranscript);
$STH1->execute();
} else if ($output == 1) {
$right = substr($location, 0, $extension_pos) . '.AGENT' . substr($location, $extension_pos);
$ap = substr($location, 0, $extension_pos) . '.AGENT.AP' . substr($location, $extension_pos);
exec("cp $location $right");
exec("sox $location $ap trim 0 5");
$sql_update .= "'" . $fileName[0] . "',";
$sql_update_agentTranscript = "UPDATE audioFiles SET agentFile ='" . $right . "', agentAP ='".$ap ."' WHERE audioFile ='" . $fileName[0] . "'";
$STH1 = $DBH->prepare($sql_update_agentTranscript);
$STH1->execute();
} else {
echo "Something is wrong. The file does not have 1 or 2 channels, or the code is wrong - ".$fileName[0];
echo "<br>";
$ap = substr($location, 0, $extension_pos) . '.AGENT.AP' . substr($location, $extension_pos);
}
} // end of the while loop over files
$sql_update = substr($sql_update, 0, -1); // drop the trailing comma
$sql_update .= ")";
error_log($sql_update, 0);
$STH = $DBH->prepare($sql_update);
$STH->execute();
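The channel split and the trim can probably be folded into a single sox call per file, since sox applies effects in the order given (`remix 2`, then `trim 0 5`). A minimal sketch, assuming the agent is on channel 2 as in the script above; `buildAgentClipCommand` is a hypothetical helper and is not part of the original script:

```php
<?php
// Hypothetical helper: build one sox command that extracts the first
// $seconds seconds of the agent side. For stereo input the agent is assumed
// to be on channel 2; mono input is trimmed as-is.
function buildAgentClipCommand(string $src, string $dst, int $channels, int $seconds = 5): string
{
    $effects = [];
    if ($channels === 2) {
        $effects[] = 'remix 2'; // keep only the right (agent) channel
    }
    $effects[] = 'trim 0 ' . $seconds; // keep the first $seconds seconds
    // escapeshellarg() protects against spaces or quotes in file names
    return 'sox ' . escapeshellarg($src) . ' ' . escapeshellarg($dst) . ' ' . implode(' ', $effects);
}

echo buildAgentClipCommand('uploads/call.wav', 'uploads/call.AGENT.AP.wav', 2), "\n";
```

The returned string would be passed to `exec()` in place of the separate `remix` and `trim` calls, which saves one sox process and one intermediate file per recording.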
Below is the script that converts the 5-second files to text.
IBM.php
<?php
.
//get file name from DB with marker set as 1 from previous script.
$url = 'https://stream.watsonplatform.net/speech-to-text/api/v1/recognize?model=en-US_NarrowbandModel&profanity_filter=false';
$headers = array(
"Content-Type: audio/wav",
"Transfer-Encoding: chunked");
.
if($STH->rowCount() > 0) {
while ($fileName = $STH->fetch()) {
$file = fopen($fileName[0], 'r');
$size = filesize($fileName[0]);
$fileData = fread($file, $size);
// cURL request: send the audio to the IBM API and convert it to text.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERPWD, "$username:$password");
curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_POSTFIELDS, $fileData); // raw audio bytes as the POST body; CURLOPT_INFILE is only needed for PUT-style uploads
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$executed = curl_exec($ch);
curl_close($ch);
$result = json_decode($executed);
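$match = "thank you for calling"; // The phrase to look for in the transcribed text
$transcript = isset($result->results[0]->alternatives[0]->transcript) ? $result->results[0]->alternatives[0]->transcript : ''; // empty string if the response has no transcript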
if(strpos($transcript,$match) !== false){
//Update DB with STH1->execute() to say that matching text is found.
} else {
//Update DB with STH2->execute() to say that matching text is not found.
}
}
}
else{
echo "No more files to convert.";
}
?>
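The above works for converting speech to text with IBM Watson. I am just adding it in case anyone wants to use it.
I think this whole three-step process is fine for a few hundred calls, but it will either not work or be too expensive to run for thousands of calls.
The steps are: download the files from S3 (S3.php), split the channels and cut the first 5 seconds (Split.php), and send the 5-second clips to IBM Watson (IBM.php).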
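For scale: 30k files a day leaves 86,400 / 30,000 = 2.88 seconds per file if everything runs one file at a time, so the Watson requests probably need to overlap. A minimal sketch of running several prepared cURL handles concurrently with PHP's `curl_multi_*` functions; `runBatch` is a hypothetical helper, the batch size is left to the caller, and any concurrency limits on the IBM side are not accounted for:

```php
<?php
// Hypothetical helper: run a batch of prepared cURL handles concurrently
// and return each response body under the same key as its handle.
function runBatch(array $handles): array
{
    $multi = curl_multi_init();
    foreach ($handles as $ch) {
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // capture the body instead of printing it
        curl_multi_add_handle($multi, $ch);
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running) {
            curl_multi_select($multi, 1.0); // wait for activity instead of busy-looping
        }
    } while ($running && $status === CURLM_OK);

    $bodies = [];
    foreach ($handles as $key => $ch) {
        $bodies[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);
    return $bodies;
}
```

Each handle would be set up as in IBM.php (URL, credentials, headers, `CURLOPT_POSTFIELDS`) before being passed in, and each returned body decoded with `json_decode()` as before.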
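I need help optimizing this process and making it faster than it is now. I am hoping there is a way to send the files straight from S3 to IBM Watson as a stream, limited to 5 seconds per file. I think this should be possible, but I do not know how to do it.
Do I need to rebuild this completely? If so, what other options are there?
Any suggestions or ideas would be helpful.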
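One possible direction, as an untested sketch: S3 `GetObject` accepts a `Range` parameter, so only the leading bytes of each recording would need to be fetched. The numbers below assume uncompressed 16-bit PCM at 8 kHz stereo with a 44-byte header, which may not match the actual recordings, and `leadingByteRange` is a hypothetical helper. A file cut this way still carries a header that declares the full length, so whatever decodes it has to tolerate that.

```php
<?php
// Hypothetical helper: HTTP Range value covering the WAV header plus the
// first $seconds seconds of audio. Assumes uncompressed PCM and a fixed-size
// header (44 bytes for a canonical WAV); real files may differ.
function leadingByteRange(int $sampleRate, int $channels, int $bitsPerSample, int $seconds = 5, int $headerBytes = 44): string
{
    $bytesPerSecond = $sampleRate * $channels * intdiv($bitsPerSample, 8);
    $lastByte = $headerBytes + $bytesPerSecond * $seconds - 1; // the Range end is inclusive
    return 'bytes=0-' . $lastByte;
}

// Assumed format: 8 kHz, stereo, 16-bit samples.
$range = leadingByteRange(8000, 2, 16);
echo $range, "\n"; // bytes=0-160043
```

The value would be passed as `'Range' => $range` to `$s3Client->getObject()` alongside `Bucket` and `Key`, so that each download is roughly 160 KB instead of the whole call.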
PS: I apologize for my code indentation.