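I am trying to find the similarity between two text files using cosine similarity. It works when I supply the texts directly as strings, but I want to get the result after reading text files from my computer.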
import java.util.Hashtable;
import java.util.LinkedList;
// Calculates the cosine similarity between two texts/documents (words separated by spaces)
public class Cosine_Similarity
{
public class values
{
int val1;
int val2;
values(int v1, int v2)
{
this.val1=v1;
this.val2=v2;
}
public void Update_VAl(int v1, int v2)
{
this.val1=v1;
this.val2=v2;
}
}//end of class values
public double Cosine_Similarity_Score(String Text1, String Text2)
{
double sim_score=0.0000000;
//1. Identify distinct words from both documents
String [] word_seq_text1 = Text1.split(" ");
String [] word_seq_text2 = Text2.split(" ");
Hashtable<String, values> word_freq_vector = new Hashtable<String, Cosine_Similarity.values>();
LinkedList<String> Distinct_words_text_1_2 = new LinkedList<String>();
//prepare word frequency vector by using Text1
for(int i=0;i<word_seq_text1.length;i++)
{
String tmp_wd = word_seq_text1[i].trim();
if(tmp_wd.length()>0)
{
if(word_freq_vector.containsKey(tmp_wd))
{
values vals1 = word_freq_vector.get(tmp_wd);
int freq1 = vals1.val1+1;
int freq2 = vals1.val2;
vals1.Update_VAl(freq1, freq2);
word_freq_vector.put(tmp_wd, vals1);
}
else
{
values vals1 = new values(1, 0);
word_freq_vector.put(tmp_wd, vals1);
Distinct_words_text_1_2.add(tmp_wd);
}
}
}
//prepare word frequency vector by using Text2
for(int i=0;i<word_seq_text2.length;i++)
{
String tmp_wd = word_seq_text2[i].trim();
if(tmp_wd.length()>0)
{
if(word_freq_vector.containsKey(tmp_wd))
{
values vals1 = word_freq_vector.get(tmp_wd);
int freq1 = vals1.val1;
int freq2 = vals1.val2+1;
vals1.Update_VAl(freq1, freq2);
word_freq_vector.put(tmp_wd, vals1);
}
else
{
values vals1 = new values(0, 1);
word_freq_vector.put(tmp_wd, vals1);
Distinct_words_text_1_2.add(tmp_wd);
}
}
}
//calculate the cosine similarity score.
double VectAB = 0.0000000;
double VectA_Sq = 0.0000000;
double VectB_Sq = 0.0000000;
for(int i=0;i<Distinct_words_text_1_2.size();i++)
{
values vals12 = word_freq_vector.get(Distinct_words_text_1_2.get(i));
double freq1 = (double)vals12.val1;
double freq2 = (double)vals12.val2;
System.out.println(Distinct_words_text_1_2.get(i)+"#"+freq1+"#"+freq2);
VectAB=VectAB+(freq1*freq2);
VectA_Sq = VectA_Sq + freq1*freq1;
VectB_Sq = VectB_Sq + freq2*freq2;
}
System.out.println("VectAB "+VectAB+" VectA_Sq "+VectA_Sq+" VectB_Sq "+VectB_Sq);
sim_score = ((VectAB)/(Math.sqrt(VectA_Sq)*Math.sqrt(VectB_Sq)));
return(sim_score);
}
public static void main(String[] args)
{
Cosine_Similarity cs1 = new Cosine_Similarity();
System.out.println("[Word # VectorA # VectorB]");
double sim_score = cs1.Cosine_Similarity_Score("this is text file one", "this is text file two");
System.out.println("Cosine similarity score = "+sim_score);
}
}
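As a sanity check, the score for the two hard-coded strings in main can be worked out by hand. Treating each text as a vector of word counts over the six distinct words (this, is, text, file, one, two), which is what the code above builds, gives A = (1, 1, 1, 1, 1, 0) and B = (1, 1, 1, 1, 0, 1):

```latex
\[
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{1 + 1 + 1 + 1 + 0 + 0}{\sqrt{5}\,\sqrt{5}}
  = \frac{4}{5}
  = 0.8
\]
```

So the program should report a score of about 0.8, up to floating-point rounding.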
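Answer 0 (score: 1)
Your code compares two text strings, not two files, so to compare two files you only need to convert each of them into a text string. To do that, you can read each file line by line and join the lines using a space as the separator.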
public static void main(String[] args) throws IOException {
Cosine_Similarity cs = new Cosine_Similarity();
// read file 1 and convert into a String
String text1 = Files.readAllLines(Paths.get("path/to/file1")).stream().collect(Collectors.joining(" "));
// read file 2 and convert into a String
String text2 = Files.readAllLines(Paths.get("path/to/file2")).stream().collect(Collectors.joining(" "));
double score = cs.Cosine_Similarity_Score(text1, text2);
System.out.println("Cosine similarity score = " + score);
}
By the way, read and follow the Java naming conventions!
An example:
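import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
public class CosineSimilarity {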
private static class Values {
private int val1;
private int val2;
private Values(int v1, int v2) {
this.val1 = v1;
this.val2 = v2;
}
public void updateValues(int v1, int v2) {
this.val1 = v1;
this.val2 = v2;
}
}//end of class values
public double score(String text1, String text2) {
//1. Identify distinct words from both documents
String[] text1Words = text1.split(" ");
String[] text2Words = text2.split(" ");
Map<String, Values> wordFreqVector = new HashMap<>();
List<String> distinctWords = new ArrayList<>();
//prepare word frequency vector by using Text1
for (String text : text1Words) {
String word = text.trim();
if (!word.isEmpty()) {
if (wordFreqVector.containsKey(word)) {
Values vals1 = wordFreqVector.get(word);
int freq1 = vals1.val1 + 1;
int freq2 = vals1.val2;
vals1.updateValues(freq1, freq2);
wordFreqVector.put(word, vals1);
} else {
Values vals1 = new Values(1, 0);
wordFreqVector.put(word, vals1);
distinctWords.add(word);
}
}
}
//prepare word frequency vector by using Text2
for (String text : text2Words) {
String word = text.trim();
if (!word.isEmpty()) {
if (wordFreqVector.containsKey(word)) {
Values vals1 = wordFreqVector.get(word);
int freq1 = vals1.val1;
int freq2 = vals1.val2 + 1;
vals1.updateValues(freq1, freq2);
wordFreqVector.put(word, vals1);
} else {
Values vals1 = new Values(0, 1);
wordFreqVector.put(word, vals1);
distinctWords.add(word);
}
}
}
//calculate the cosine similarity score.
double vectAB = 0.0000000;
double vectA = 0.0000000;
double vectB = 0.0000000;
for (int i = 0; i < distinctWords.size(); i++) {
Values vals12 = wordFreqVector.get(distinctWords.get(i));
double freq1 = vals12.val1;
double freq2 = vals12.val2;
System.out.println(distinctWords.get(i) + "#" + freq1 + "#" + freq2);
vectAB = vectAB + freq1 * freq2;
vectA = vectA + freq1 * freq1;
vectB = vectB + freq2 * freq2;
}
System.out.println("VectAB " + vectAB + " VectA_Sq " + vectA + " VectB_Sq " + vectB);
return ((vectAB) / (Math.sqrt(vectA) * Math.sqrt(vectB)));
}
public static void main(String[] args) throws IOException {
CosineSimilarity cs = new CosineSimilarity();
String text1 = Files.readAllLines(Paths.get("path/to/file1")).stream().collect(Collectors.joining(" "));
String text2 = Files.readAllLines(Paths.get("path/to/file2")).stream().collect(Collectors.joining(" "));
double score = cs.score(text1, text2);
System.out.println("Cosine similarity score = " + score);
}
}
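The same calculation can probably be written more compactly by keeping one word-count map per text instead of a shared map of value pairs. The following is only a sketch under a few assumptions that are not in the original answer: the class name CosineSketch and its method names are hypothetical, tokens are split on any run of whitespace rather than on single spaces, and a score of 0.0 is returned when either text has no words (the code above would yield NaN in that case).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; the names here are illustrative, not from the answer.
class CosineSketch {

    // Counts how often each whitespace-separated token occurs in the text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                freq.merge(word, 1, Integer::sum);
            }
        }
        return freq;
    }

    // Sum of squared counts, i.e. the squared length of the count vector.
    static double squaredNorm(Map<String, Integer> freq) {
        double sum = 0.0;
        for (int count : freq.values()) {
            sum += (double) count * count;
        }
        return sum;
    }

    // Cosine similarity of the two word-count vectors.
    // Assumed convention: returns 0.0 if either text contains no words.
    static double score(String text1, String text2) {
        Map<String, Integer> a = termFrequencies(text1);
        Map<String, Integer> b = termFrequencies(text2);

        // Only words present in both texts contribute to the dot product.
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }

        double normA = squaredNorm(a);
        double normB = squaredNorm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // The two sample strings from the question; the result is roughly 0.8.
        System.out.println(score("this is text file one", "this is text file two"));
    }
}
```

Using Map.merge removes the explicit containsKey/get/put sequence, and the separate list of distinct words is no longer needed because words missing from one text contribute zero to the dot product anyway.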
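Answer 1 (score: 0)
You can specify the files you want by passing their paths on the command line when you run the program, and then use them in your code through args. For example, you would run the program as java Cosine_Similarity path_to_text1 path_to_text2 and call: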
double sim_score = cs1.Cosine_Similarity_Score(args[0], args[1]);
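Currently, all you are doing is comparing two strings. For short strings, you can simply pass them as arguments. If you want to use actual files, you need to pass the file paths as arguments, convert each file's contents into a single string, and then compare those strings. Take a look at this answer: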
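Independently of the linked answer, the steps described here could be sketched roughly as follows. The class name FileTextLoader and the method readAsSingleString are hypothetical names introduced for illustration, and UTF-8 is an assumed file encoding. The sketch only loads the two files into strings; passing them to Cosine_Similarity_Score from the question is indicated in a comment.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class; not part of the original question or answers.
class FileTextLoader {

    // Reads all lines of the file (assumed to be UTF-8) and joins them with
    // single spaces, so the whole file becomes one space-separated string.
    static String readAsSingleString(Path path) throws IOException {
        return String.join(" ", Files.readAllLines(path, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: java FileTextLoader path_to_text1 path_to_text2");
            return;
        }
        // args[0] and args[1] are file paths, not the texts themselves.
        String text1 = readAsSingleString(Paths.get(args[0]));
        String text2 = readAsSingleString(Paths.get(args[1]));

        // The loaded strings would then go to the scorer from the question, e.g.
        // new Cosine_Similarity().Cosine_Similarity_Score(text1, text2)
        System.out.println("Text 1: " + text1);
        System.out.println("Text 2: " + text2);
    }
}
```

The key point is that args[0] and args[1] are only paths; passing them straight to the scoring method would compare the path strings rather than the file contents.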