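Remove duplicate rows from a CSV file based on a string - Java

Time: 2016-01-24 21:09:26

Tags: java csv opencsv

I recently scraped some review data from TripAdvisor and now have a dataset structured as follows.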

Organization,Address,Reviewer,Review Title,Review,Review Count,Help Count,Attraction Count,Restaurant Count,Hotel Count,Location,Rating Date,Rating

Temple of the Tooth (Sri Dalada Maligawa),Address: Sri Dalada Veediya Kandy 20000 Sri Lanka,WowLao,Temple tour,Visits to places of worship always bring home to me the power of superstition. The Temple of the Tooth was no exception. But I couldn't help but marvel at the fervor with which some devotees were praying. One tip though: the shrine that houses the Tooth  is open only twice a day and so it's best to check these timings ...   More,89,48,7,0,0,Vientiane,2 days ago,3

Temple of the Tooth (Sri Dalada Maligawa),Address: Sri Dalada Veediya Kandy 20000 Sri Lanka,WowLao,Temple tour,Visits to places of worship always bring home to me the power of superstition. The Temple of the Tooth was no exception. But I couldn't help but marvel at the fervor with which some devotees were praying. One tip though: the shrine that houses the Tooth  is open only twice a day and so it's best to check these timings  though I would imagine that the crowds would be at a peak.,89,48,7,0,0,Vientiane,2 days ago,3

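As you can see, the first row contains a partial review and the second row contains the full review.

What I want to do is detect duplicates like this, remove the rows with the partial review, and keep the rows with the full review.

I noticed that every partial review ends with "More". Could that be used somehow to filter out the partial reviews?

How can I solve this using OpenCSV?

4 answers:

Answer 0: (score: 1)

Note: you must not use data from other web services commercially without explicit permission.

That said: basically, OpenCSV gives you an enumeration of arrays. Each array is one of your lines.

You need to copy your rows into another, more semantic data structure. Judging from your header row, I would create a bean like this.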

public class TravelRow {
   String organization;
   String address;
   String reviewer;
   String reviewTitle;
   String review; // you get it... 

   public TravelRow(String[] row) {
       // assign row-index to property
       this.organization = row[0];
       // you get it ...
   }
}

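You will probably want to generate getXXX/setXXX methods for it.

Now you need to find the primary key of a row; I suggest organization. Iterate over the rows, create a bean for each one, and add it to a HashMap keyed by organization.

If the organization is already in the map, compare the current review with the stored one. If the new review is longer, or the stored review ends with "... More", replace the object in the map.

After iterating over all rows, you have a Map containing the reviews you want.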

Map<String, TravelRow> result = new HashMap<String, TravelRow>();
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
String [] nextLine;
while ((nextLine = reader.readNext()) != null) {
   // nextLine[] is an array of values from the line
   if( result.containsKey(nextLine[0]) ) {
       // compare the review
       if( reviewNeedsUpdate(result.get(nextLine[0]), nextLine[4]) ) {
           result.get(nextLine[0]).setReview(nextLine[4]); // update only the review, create a new object, if you like
       }
   }
   else {
       // create TravelRow with array using the constructor eating the line
       result.put(nextLine[0], new TravelRow(nextLine));
   }
}

reviewNeedsUpdate(TravelRow row, String review) compares review with row.review and returns true if the new review is the better one. You can extend this function until it fits your needs.

private boolean reviewNeedsUpdate( TravelRow row, String review ) {
    // endsWith is case-sensitive; the sample data ends with a capitalized "More"
    return ( row.review.endsWith("More") && !review.endsWith("More") ); 
}
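
One caveat about the key: in the sample data every row for the same attraction has the same Organization value, so keying on organization alone would keep only one review per attraction. Below is a minimal sketch that keys on organization, reviewer and review title instead. That key is my assumption, not something from the question, and the class name CompositeKeyDedup and its methods are made up for illustration. It uses only the JDK and works on the String[] rows that CSVReader hands you.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of OpenCSV.
class CompositeKeyDedup {
    // Column positions taken from the header row in the question.
    static final int ORGANIZATION = 0, REVIEWER = 2, TITLE = 3, REVIEW = 4;

    // Assumption: a truncated review ends with "More" (ignoring trailing whitespace).
    static boolean isPartial(String review) {
        return review.trim().endsWith("More");
    }

    // True when the candidate text should replace the stored one.
    static boolean isBetter(String stored, String candidate) {
        if (isPartial(stored) != isPartial(candidate)) {
            return isPartial(stored); // a complete review beats a partial one
        }
        return candidate.length() > stored.length(); // otherwise prefer the longer text
    }

    static List<String[]> dedupe(List<String[]> rows) {
        Map<String, String[]> best = new LinkedHashMap<>(); // keeps first-seen order
        for (String[] row : rows) {
            String key = String.join("|", row[ORGANIZATION], row[REVIEWER], row[TITLE]);
            String[] stored = best.get(key);
            if (stored == null || isBetter(stored[REVIEW], row[REVIEW])) {
                best.put(key, row);
            }
        }
        return new ArrayList<>(best.values());
    }
}
```

Calling CompositeKeyDedup.dedupe(reader.readAll()) should then give one row per reviewer and title. A complete review that genuinely ends with the word "More" would be misclassified as partial, so treat the check as a heuristic.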

Answer 1: (score: 0)

How about the following:

 HashMap<String, String[]> preferredReviews = new HashMap<>();
 int indexOfReview = 4;
 CSVReader reader = new CSVReader(new FileReader("reviews.csv"));
 String [] nextLine;
 while ((nextLine = reader.readNext()) != null) {
     String reviewId = nextLine[0];
     String[] prevReview = preferredReviews.get(reviewId);
     if (prevReview == null || prevReview[indexOfReview].length() < nextLine[indexOfReview].length()) {
         preferredReviews.put(reviewId, nextLine);
     }
 }

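The second clause of the if statement compares lengths to decide which row to keep. What I like about this approach is that if, for some reason, there is no full-length review, you at least get the short one.

But it could be changed to check for "... More" instead of looking at the length.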

 HashMap<String, String[]> preferredReviews = new HashMap<>();
 int indexOfReview = 4;
 CSVReader reader = new CSVReader(new FileReader("reviews.csv"));
 String [] nextLine;
 while ((nextLine = reader.readNext()) != null) {
     String reviewId = nextLine[0];
     // keep only rows whose review does NOT carry the truncation marker
     if (!nextLine[indexOfReview].endsWith("... More")) {
         preferredReviews.put(reviewId, nextLine);
     }       
 }
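
Note that in the sample row from the question there are several spaces between the dots and "More", so a literal endsWith("... More") may not match. A whitespace-tolerant check could look like the sketch below. The class name TruncationMarker and its methods are made up for illustration, and the pattern assumes the marker is always three dots followed by "More".

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenCSV.
class TruncationMarker {
    // Three dots, optional whitespace, "More", optional trailing whitespace, at the end of the text.
    private static final Pattern MARKER = Pattern.compile("\\.\\.\\.\\s*More\\s*$");

    // True if the review text ends with the truncation marker.
    static boolean isTruncated(String review) {
        return MARKER.matcher(review).find();
    }

    // Returns the review text without the marker and without surrounding whitespace.
    static String stripMarker(String review) {
        return MARKER.matcher(review).replaceFirst("").trim();
    }
}
```

In the loop above you would then test !TruncationMarker.isTruncated(nextLine[indexOfReview]) instead of calling endsWith.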

Answer 2: (score: 0)

Suppose you define a class Rating to store the relevant data.

class Rating {
    public String review; // consider using getters/setters instead of public fields

    Rating(String review) {
        this.review = review;
    }
}

Read the contents of the CSV.

Set<Rating> readCSV() throws Exception { // FileReader and readAll() throw checked exceptions
  List<String[]> csv = new CSVReader(new FileReader("reviews.csv")).readAll();
  List<Rating> ratings = csv.stream()
      .map(row -> new Rating(row[4])) // add the other attributes
      .collect(Collectors.toList());
  return mergeRatings(ratings);
}

We will use a TreeSet to sort out the duplicates. This requires a custom comparator that treats a review as equal to any review it is a prefix of, so the set discards items that are already present.

class RatingMergerComparator implements Comparator<Rating> {
    @Override
    public int compare(Rating rating1, Rating rating2) {
        // one review is a prefix of the other: treat them as the same item
        if (rating1.review.startsWith(rating2.review) || rating2.review.startsWith(rating1.review)) {
            return 0;
        }
        return rating1.review.compareTo(rating2.review);
    }
}

Create the mergeRatings method.

void removeMoreEndings(List<Rating> ratings) {
    String ending = "... More";
    for (Rating rating : ratings) {
        if (rating.review.endsWith(ending)) {
            // cut off the marker, then any whitespace left in front of it
            rating.review = rating.review.substring(0, rating.review.length() - ending.length()).trim();
        }
    }
}

Set<Rating> mergeRatings(List<Rating> ratings) {
    removeMoreEndings(ratings); // remove all "... More" endings
    // sort ratings by length in descending order; since the set discards certain items,
    // it is important that the longer ones come first so that they are the ones kept
    ratings.sort(Comparator.comparing((Rating rating) -> rating.review.length()).reversed());
    TreeSet<Rating> mergedRatings = new TreeSet<>(new RatingMergerComparator());
    mergedRatings.addAll(ratings);
    return mergedRatings;
}

Update

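I may have misread the OP. The solution above performs very well even when the records that have to be merged are far apart in the CSV. If you are sure that the partial and the full review are always consecutive, the above may be overkill.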
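
For the consecutive case, a single pass that only compares each row with the one kept just before it may be enough. The following is a rough sketch under that assumption. The class name AdjacentMerge and its methods are made up for illustration, it works on plain String[] rows, and it assumes the marker is a run of dots followed by "More".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenCSV.
class AdjacentMerge {
    static final int REVIEW = 4;         // index of the Review column in the question's header
    static final String MARKER = "More"; // assumed ending of a truncated review

    static boolean isPartial(String review) {
        return review.trim().endsWith(MARKER);
    }

    // Review text without a trailing "... More" marker.
    static String core(String review) {
        String text = review.trim();
        if (!text.endsWith(MARKER)) {
            return text;
        }
        text = text.substring(0, text.length() - MARKER.length()).trim();
        while (text.endsWith(".")) {     // drop the dots in front of the marker
            text = text.substring(0, text.length() - 1);
        }
        return text.trim();
    }

    // True if 'partial' is a truncated review and 'other' starts with the same text.
    static boolean extendsPartial(String partial, String other) {
        return isPartial(partial) && core(other).startsWith(core(partial));
    }

    static List<String[]> merge(List<String[]> rows) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            if (!out.isEmpty()) {
                String[] prev = out.get(out.size() - 1);
                if (extendsPartial(prev[REVIEW], row[REVIEW])) {
                    out.set(out.size() - 1, row); // row completes the previous partial row
                    continue;
                }
                if (extendsPartial(row[REVIEW], prev[REVIEW])) {
                    continue;                     // row is a truncated copy of the previous row
                }
            }
            out.add(row);
        }
        return out;
    }
}
```

This keeps a partial review when no full version sits next to it, which matches the fallback behaviour of the length-based answer.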

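Answer 3: (score: 0)

It depends on how you read the data.

If you read the data as beans using a MappingStrategy, you can create your own filter by implementing the CsvToBeanFilter interface and injecting it into the CsvToBean class. Lines are then read (allowed) or skipped depending on the condition in the allowLine method. The Javadoc for CsvToBeanFilter provides a good example. In your case, you would allow all lines whose Review column does not end with "More".

If you are just using CSVReader / CSVParser, it is a bit trickier. You need to read the header and find out which column is Review. Then, as you read each line, look at the element at that index and, if it ends with "More", do not process it.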
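
The CSVReader variant could be sketched roughly as follows. The class name HeaderFilter and the method dropPartial are made up for illustration, and the code uses only the JDK: it expects the rows as a List<String[]> (for example the result of CSVReader.readAll()) with the header as the first element.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenCSV.
class HeaderFilter {
    // Returns the header plus every data row whose value in the named column
    // does not end with "More" (ignoring trailing whitespace).
    static List<String[]> dropPartial(List<String[]> rows, String columnName) {
        List<String[]> kept = new ArrayList<>();
        if (rows.isEmpty()) {
            return kept;
        }
        String[] header = rows.get(0);
        int index = Arrays.asList(header).indexOf(columnName); // locate the column by name
        if (index < 0) {
            throw new IllegalArgumentException("No column named " + columnName);
        }
        kept.add(header);
        for (String[] row : rows.subList(1, rows.size())) {
            boolean partial = row.length > index && row[index].trim().endsWith("More");
            if (!partial) {
                kept.add(row);
            }
        }
        return kept;
    }
}
```

Keep in mind that this drops every partial review, including those for which no full version exists elsewhere in the file.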