Removing HTML tags from a vector in R when the spacing between tags and text varies

Time: 2019-02-15 04:57:24

Tags: html r regex

I have the following vector:

vec<-c("\n\t\t\t\n\t\t\t\n\t\t\t\t8900 E Runstack Rd \n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\tScottsdale, AZ \n\t\t\t\t\t85251\n\t\t\t" ,
       "\n\t\t\t\n\t\t\t\n\t\t\t\t330 Orange Boulevard\n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\tBeverly Hills, CA \n\t\t\t\t\t90212\n\t\t\t" ,
       "\n\t\t\t\n\t\t\t\n\t\t\t\t645 Newport Center Drive \n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\tNewport Beach, CA \n\t\t\t\t\t92660\n\t\t\t" ,
       "\n\t\t\t\n\t\t\t\n\t\t\t\t5000 Westlake Depot Road \n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\tPalo Alto, CA \n\t\t\t\t\t94304\n\t\t\t" ,
       "\n\t\t\t\n\t\t\t\n\t\t\t\t646 Lucern Road\n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\tSan Diego, CA \n\t\t\t\t\t92108\n\t\t\t" )

I want to remove all of the \n and \t. I tried the following:

str_replace_all(vec, "\n|\t", " ")
[1] "             8900 E Runstack Rd                 Scottsdale,  AZ        85251    "         
[2] "             330 Orange Boulevard                Beverly Hills,  CA        90212    "     
[3] "             645 Newport Center Drive                 Newport Beach,  CA        92660    "
[4] "             5000 Westlake Depot Road                 Palo Alto,  CA        94304    "    
[5] "             646 Lucern Road                San Diego,  CA        92108    " 

But this converts them to whitespace. I also tried:

str_replace_all(vec, "\n|\t", "")
[1] "8900 E Runstack Rd Scottsdale, AZ 85251"          "330 Orange BoulevardBeverly Hills, CA 90212"
[3] "645 Newport Center Drive Newport Beach, CA 92660" "5000 Westlake Depot Road Palo Alto, CA 94304"
[5] "646 Lucern RoadSan Diego, CA 92108"

But notice that in some cases a space is missing where there should be one (e.g. index 2, 330 Orange BoulevardBeverly Hills, CA 90212). The problem is that \n is attached directly to the end of the text in some elements, whereas in others a space precedes it. How can I replace \n with a space only when it touches the character immediately before it, and with nothing in all other cases? I am looking for the following result:

[1] "8900 E Runstack Rd Scottsdale, AZ 85251"          "330 Orange Boulevard Beverly Hills, CA 90212"
[3] "645 Newport Center Drive Newport Beach, CA 92660" "5000 Westlake Depot Road Palo Alto, CA 94304"
[5] "646 Lucern Road San Diego, CA 92108"

I can get the above by calling str_squish(vec) after running str_replace_all(vec, "\n|\t", " "), but I would like a one-line solution.

2 Answers:

Answer 0 (score: 1)

It can be done in one line, but we lose readability and the pattern does get more complicated.

gsub("^[\\\n|\\\t]+([0-9a-zA-Z ,]+)[\\\n|\\\t]+([a-zA-Z ,]+)[\\\n|\\\t]+([0-9]{5})[\\\n|\\\t]+$","\\1 \\2 \\3",vec)

Here we exploit the fact that each address follows the pattern
  1. street address
  2. city, state
  3. 5-digit ZIP code
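The three-part structure listed above can also be written with slightly tidier character classes. The following is only a sketch, not the answerer's code: it assumes base R with `perl = TRUE` (so that lazy quantifiers are available), and the names `vec_sample` and `address_pattern` are hypothetical. It captures each part without its trailing space, so the joined result has single spaces whether or not a space preceded the line break.

```r
# Two elements from the question: the first has a space before the line
# break after the street, the second does not.
vec_sample <- c(
  "\n\t\t\t\n\t\t\t\n\t\t\t\t8900 E Runstack Rd \n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\tScottsdale, AZ \n\t\t\t\t\t85251\n\t\t\t",
  "\n\t\t\t\n\t\t\t\n\t\t\t\t330 Orange Boulevard\n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\tBeverly Hills, CA \n\t\t\t\t\t90212\n\t\t\t"
)

# Hypothetical pattern: street, "city, state", 5-digit ZIP, each separated by
# a run of newlines/tabs. The lazy groups stop before any trailing spaces.
address_pattern <- paste0(
  "^[\n\t]*",            # leading newlines/tabs
  "([^\n\t]+?) *[\n\t]+", # 1: street address
  "([^\n\t]+?) *[\n\t]+", # 2: city, state
  "([0-9]{5})",           # 3: ZIP code
  "[\n\t]*$"             # trailing newlines/tabs
)

cleaned <- sub(address_pattern, "\\1 \\2 \\3", vec_sample, perl = TRUE)
print(cleaned)
# [1] "8900 E Runstack Rd Scottsdale, AZ 85251"
# [2] "330 Orange Boulevard Beverly Hills, CA 90212"
```

Like the original one-liner, this only works while every element really has exactly these three parts; an element that does not match is returned unchanged.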

Answer 1 (score: 0)

Try: stringr::str_remove_all(vec,"[\n|\t]") The result, shown below, can be put back into your data.

[1] "8900 E Runstack Rd Scottsdale,  AZ  85251"         
[2] "330 Orange BoulevardBeverly Hills,  CA  90212"     
[3] "645 Newport Center Drive Newport Beach,  CA  92660"
[4] "5000 Westlake Depot Road Palo Alto,  CA  94304"    
[5] "646 Lucern RoadSan Diego,  CA  92108" 

As @Sada93 pointed out in a comment, we lose a space in the second element. This is not the best way to re-introduce the space, but here it is:

gsub("BoulevardBeverly","Boulevard Beverly",vec1)#vec1 is the result of the above transformation

Another way to re-introduce spaces (for illustration only):

vec1<-stringr::str_replace_all(vec,"[\n|\t]","")
vec2<-stringr::str_remove_all(vec1," ")
gsub("([0-9])([a-zA-Z])","\\1 \\2",vec2)
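Hard-coding the missing space only repairs word pairs that are known in advance. A more general sketch, assuming base R only (no stringr), is to collapse every run of whitespace into a single space and then trim both ends in one expression, which mirrors what `str_squish` does. The names `vec_demo` and `squished` are hypothetical.

```r
# Two elements from the question, one with and one without a space
# before the line break that follows the street name.
vec_demo <- c(
  "\n\t\t\t\n\t\t\t\n\t\t\t\t645 Newport Center Drive \n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\tNewport Beach, CA \n\t\t\t\t\t92660\n\t\t\t",
  "\n\t\t\t\n\t\t\t\n\t\t\t\t646 Lucern Road\n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\tSan Diego, CA \n\t\t\t\t\t92108\n\t\t\t"
)

# Replace each run of spaces, tabs and newlines with one space, then strip
# the single leading and trailing space that remains.
squished <- trimws(gsub("[[:space:]]+", " ", vec_demo))
print(squished)
# [1] "645 Newport Center Drive Newport Beach, CA 92660"
# [2] "646 Lucern Road San Diego, CA 92108"
```

Because a space followed by newlines and a bare newline both become exactly one space, the result does not depend on whether the source text had a space before the line break.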