I recently scraped some review data from TripAdvisor and now have a dataset with the following structure.
Organization,Address,Reviewer,Review Title,Review,Review Count,Help Count,Attraction Count,Restaurant Count,Hotel Count,Location,Rating Date,Rating
Temple of the Tooth (Sri Dalada Maligawa),Address: Sri Dalada Veediya Kandy 20000 Sri Lanka,WowLao,Temple tour,Visits to places of worship always bring home to me the power of superstition. The Temple of the Tooth was no exception. But I couldn't help but marvel at the fervor with which some devotees were praying. One tip though: the shrine that houses the Tooth is open only twice a day and so it's best to check these timings ... More,89,48,7,0,0,Vientiane,2 days ago,3
Temple of the Tooth (Sri Dalada Maligawa),Address: Sri Dalada Veediya Kandy 20000 Sri Lanka,WowLao,Temple tour,Visits to places of worship always bring home to me the power of superstition. The Temple of the Tooth was no exception. But I couldn't help but marvel at the fervor with which some devotees were praying. One tip though: the shrine that houses the Tooth is open only twice a day and so it's best to check these timings though I would imagine that the crowds would be at a peak.,89,48,7,0,0,Vientiane,2 days ago,3
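As you can see, the first row contains a partial review and the second row contains the full review.
What I want to do is detect duplicates like this, delete the rows with the partial review, and keep the rows with the full review.
I noticed that every partial review ends with "More". Could that be used to filter out the partial reviews?
How can I solve this with OpenCSV?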
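Answer 0: (score: 1)
Note: data from other web services may not be used commercially without explicit permission.
That said: basically, openCSV gives you an enumeration of arrays. The arrays are your lines.
You need to copy your lines into a more semantic data structure. Judging from your header row, I would create a bean like this.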
public class TravelRow {
String organization;
String address;
String reviewer;
String reviewTitle;
String review; // you get it...
public TravelRow(String[] row) {
// assign row-index to property
this.organization = row[0];
// you get it ...
}
}
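You may want to generate getXXX and setXXX functions for it.
Now you need to find a primary key for the rows; I suggest organisation.
Iterate over the rows, create a bean for each one, and add it to a HashMap keyed by organisation.
If the organisation is already in the map, compare the current review with the stored one. If the new review is longer, or the stored review ends with ... more, replace the object in the map.
After iterating over all rows, you have a Map containing the reviews you want.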
Map<String, TravelRow> result = new HashMap<String, TravelRow>(); // key: organisation (column 0)
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
String [] nextLine;
while ((nextLine = reader.readNext()) != null) {
// nextLine[] is an array of values from the line
if( result.containsKey(nextLine[0]) ) {
// compare the review
if( reviewNeedsUpdate(result.get(nextLine[0]), nextLine[4]) ) {
result.get(nextLine[0]).setReview(nextLine[4]); // update only the review, create a new object, if you like
}
}
else {
// create TravelRow with array using the constructor eating the line
result.put(nextLine[0], new TravelRow(nextLine));
}
}
reviewNeedsUpdate(TravelRow row, String review) compares review with row.review and returns true if the new review is the better one. You can extend this function until it fits your needs.
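private boolean reviewNeedsUpdate( TravelRow row, String review ) {
// partial reviews in the sample data end with "... More" (capital M), so match that marker exactly
return ( row.review.endsWith("... More") && !review.endsWith("... More") );
}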
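The check above only looks at the ending, while the description also mentions preferring the longer review. A minimal sketch of a combined rule follows, assuming the marker is exactly "... More" as in the sample rows. The class name ReviewCheck and the helper isPartial are hypothetical and only serve the illustration.

```java
class ReviewCheck {
    // Assumption: partial reviews end with this exact marker, as in the sample data.
    static final String PARTIAL_MARKER = "... More";

    static boolean isPartial(String review) {
        return review.trim().endsWith(PARTIAL_MARKER);
    }

    // Returns true if the candidate review should replace the stored one.
    static boolean reviewNeedsUpdate(String stored, String candidate) {
        if (isPartial(stored) && !isPartial(candidate)) {
            return true;  // a full review always beats a partial one
        }
        if (!isPartial(stored) && isPartial(candidate)) {
            return false; // never replace a full review with a partial one
        }
        // both partial or both full: fall back to the longer text
        return candidate.length() > stored.length();
    }

    public static void main(String[] args) {
        System.out.println(reviewNeedsUpdate("Nice place ... More", "Nice place to visit."));
    }
}
```

Working on plain strings rather than on TravelRow keeps the rule easy to test on its own; the loop above could call it as reviewNeedsUpdate(result.get(nextLine[0]).review, nextLine[4]).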
Answer 1: (score: 0)
How about the following:
HashMap<String, String[]> preferredReviews = new HashMap<>();
int indexOfReview = 4;
CSVReader reader = new CSVReader(new FileReader("reviews.csv"));
String [] nextLine;
while ((nextLine = reader.readNext()) != null) {
String reviewId = nextLine[0];
String[] prevReview = preferredReviews.get(reviewId);
if (prevReview == null || prevReview[indexOfReview].length() < nextLine[indexOfReview].length()) { // String.length() is a method
preferredReviews.put(reviewId, nextLine);
}
}
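The second clause of the if statement does a length comparison to decide which review to keep. What I like about this approach is that if, for some reason, there is no full-size review, you at least get the short one.
But it could be changed to check for "... More" instead of looking at the length.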
HashMap<String, String[]> preferredReviews = new HashMap<>();
int indexOfReview = 4;
CSVReader reader = new CSVReader(new FileReader("reviews.csv"));
String [] nextLine;
while ((nextLine = reader.readNext()) != null) {
String reviewId = nextLine[0];
if (!nextLine[indexOfReview].endsWith("... More")){ // keep only rows whose review is not truncated
preferredReviews.put(reviewId, nextLine);
}
}
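One thing to watch in both variants: column 0 is the organization, so every review of the same attraction shares one map key and only a single review per attraction survives. A possible sketch that keys on organization, reviewer and review title instead, and that still falls back to the partial review when no full one exists. It assumes those three columns identify a review and that the rows are already available as String arrays (for example from CSVReader). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PreferredReviews {
    static final int REVIEW = 4; // index of the Review column in the sample header

    // Assumption: organization + reviewer + review title identify one review.
    static String key(String[] row) {
        return row[0] + "\t" + row[2] + "\t" + row[3];
    }

    static boolean isPartial(String[] row) {
        return row[REVIEW].endsWith("... More");
    }

    static Collection<String[]> select(List<String[]> rows) {
        Map<String, String[]> preferred = new LinkedHashMap<>(); // keeps input order
        for (String[] row : rows) {
            String[] prev = preferred.get(key(row));
            // store the first row seen; later replace it only if it was partial and this one is full
            if (prev == null || (isPartial(prev) && !isPartial(row))) {
                preferred.put(key(row), row);
            }
        }
        return preferred.values();
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"Temple", "addr", "WowLao", "Temple tour", "Visits ... More"});
        rows.add(new String[] {"Temple", "addr", "WowLao", "Temple tour", "Visits to places of worship."});
        for (String[] row : select(rows)) {
            System.out.println(row[REVIEW]);
        }
    }
}
```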
Answer 2: (score: 0)
Assume you define a class Rating to store the relevant data.
class Rating {
    public String review; // consider using getters/setters instead of public fields
    Rating(String review) {
        this.review = review;
    }
}
readCSV reads the contents of the CSV.
// Exception handling omitted: FileReader and readAll() throw checked exceptions
Set<Rating> readCSV() throws Exception {
    List<String[]> csv = new CSVReader(new FileReader("reviews.csv")).readAll();
    List<Rating> ratings = csv.stream()
        .map(row -> new Rating(row[4])) // add the other attributes
        .collect(Collectors.toList());
    return mergeRatings(ratings);
}
We will use a TreeSet to sort out the duplicates. This requires a custom comparator, so that items already in the set are discarded.
class RatingMergerComparator implements Comparator<Rating> {
    @Override
    public int compare(Rating rating1, Rating rating2) {
        // two reviews count as equal if one is a prefix of the other
        if (rating1.review.startsWith(rating2.review) ||
            rating2.review.startsWith(rating1.review)) {
            return 0;
        }
        return rating1.review.compareTo(rating2.review);
    }
}
Create the mergeRatings method.
void removeMoreEndings(List<Rating> ratings) {
    String marker = "... More";
    for (Rating rating : ratings) {
        if (rating.review.endsWith(marker)) {
            // drop the marker and the space in front of it
            rating.review = rating.review.substring(0, rating.review.length() - marker.length()).trim();
        }
    }
}
Set<Rating> mergeRatings(List<Rating> ratings) {
    removeMoreEndings(ratings); // remove all "... More" endings
    // sort ratings by length in descending order: the set discards items it considers equal,
    // so the longer ones must come first in order to be kept
    ratings.sort(Comparator.comparing((Rating rating) -> rating.review.length()).reversed());
    TreeSet<Rating> mergedRatings = new TreeSet<>(new RatingMergerComparator());
    mergedRatings.addAll(ratings);
    return mergedRatings;
}
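Update
I may have misread the OP. The solution above gives pretty good performance even if the records that have to be merged are far apart in the CSV. If you are sure that a partial review and its full review are always consecutive, the above may be overkill.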
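For the consecutive case, a single pass that only compares each row with the previously kept one might be enough. The following is a rough sketch under that assumption, with rows given as String arrays and the review in column 4. ConsecutiveDedup, isPartialOf and dropPartials are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class ConsecutiveDedup {
    static final String MARKER = "... More";
    static final int REVIEW = 4; // index of the Review column in the sample header

    // True if 'partial' ends with the marker and its text is a prefix of 'full'.
    static boolean isPartialOf(String partial, String full) {
        if (!partial.endsWith(MARKER)) {
            return false;
        }
        String prefix = partial.substring(0, partial.length() - MARKER.length()).trim();
        return !prefix.isEmpty() && full.startsWith(prefix);
    }

    // Assumption: a partial review and its full version sit on adjacent rows.
    static List<String[]> dropPartials(List<String[]> rows) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            if (!kept.isEmpty()) {
                String[] last = kept.get(kept.size() - 1);
                if (isPartialOf(last[REVIEW], row[REVIEW])) {
                    kept.set(kept.size() - 1, row); // full review replaces its partial
                    continue;
                }
                if (isPartialOf(row[REVIEW], last[REVIEW])) {
                    continue; // partial arrives after its full review: skip it
                }
            }
            kept.add(row);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"Temple", "addr", "WowLao", "Temple tour", "Visits to places ... More"});
        rows.add(new String[] {"Temple", "addr", "WowLao", "Temple tour", "Visits to places of worship."});
        System.out.println(dropPartials(rows).size());
    }
}
```

This avoids sorting and the TreeSet entirely, at the cost of missing duplicates that are not adjacent.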
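Answer 3: (score: 0)
It depends on how you read the data.
If you read the data into beans using a MappingStrategy, you can create your own filter with the CsvToBeanFilter interface and inject it into the CsvToBean class. Lines are then read (allowed) or skipped according to the condition in the allowLine method. The Javadoc for CsvToBeanFilter provides a good example; in your case you would allow every line whose Review column does not end with "More".
If you are just using CSVReader / CSVParser, it gets a bit trickier. You need to read the header and find out which column Review is. Then, as you read each line, look at the element at that index and, if it ends with "More", do not process it.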
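The plain-reader variant could look roughly like the sketch below. It assumes the rows, header first, have already been read into String arrays (for example with CSVReader), and it matches the stricter "... More" marker from the sample data rather than just "More", so that a full review that happens to end in the word "More" is not dropped. ReviewColumnFilter and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class ReviewColumnFilter {
    // Finds the index of the column whose header matches the given name exactly.
    static int indexOf(String[] header, String name) {
        for (int i = 0; i < header.length; i++) {
            if (header[i].trim().equalsIgnoreCase(name)) {
                return i;
            }
        }
        throw new IllegalArgumentException("No column named " + name);
    }

    // Assumption: rows.get(0) is the header row. The header is kept in the result.
    static List<String[]> fullReviewsOnly(List<String[]> rows) {
        List<String[]> result = new ArrayList<>();
        if (rows.isEmpty()) {
            return result;
        }
        int reviewIdx = indexOf(rows.get(0), "Review");
        result.add(rows.get(0));
        for (String[] row : rows.subList(1, rows.size())) {
            if (!row[reviewIdx].trim().endsWith("... More")) {
                result.add(row); // not truncated: process it
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"Organization", "Address", "Reviewer", "Review Title", "Review"});
        rows.add(new String[] {"Temple", "addr", "WowLao", "Temple tour", "Visits ... More"});
        rows.add(new String[] {"Temple", "addr", "WowLao", "Temple tour", "Visits to places of worship."});
        System.out.println(fullReviewsOnly(rows).size());
    }
}
```

Unlike the map-based answers, this drops a partial review even when no full version of it exists in the file.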