I have a question about (pre)processing text data. Each row of my CSV file holds data structured like this:
row = "['Adventure' 'African elephant' 'Animal' 'Ball game' 'Bay' 'Body of water' 'Communication Device' 'Electronic device']"
Expected result after the transformation:
[adventure, african_elephant, animal, ball_game, bay, body_of_water, communication_device, electronic_device]
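Question: What is the best and most efficient way to do this for a large number of documents (100,000)? Both RegEx and non-RegEx solutions in Python are welcome.
Solutions: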
%%time
import ast
row = "['Adventure' 'African elephant' 'Animal' 'Ball game' 'Bay' 'Body of water' 'Communication Device' 'Electronic device']"
row = ast.literal_eval(','.join(['_'.join(i.lower().split()) for i in row.split("' '")]))[0].split(',')
row
CPU times: user 43 µs, sys: 1 µs, total: 44 µs
Wall time: 48.2 µs
%%time
row = "['Adventure' 'African elephant' 'Animal' 'Ball game' 'Bay' 'Body of water' 'Communication Device' 'Electronic device']"
import re  # needed for re.findall below
row = [w.lower().replace(' ', '_') for w in re.findall(r"'([^']*)'", row)]
row
CPU times: user 25 µs, sys: 1e+03 ns, total: 26 µs
Wall time: 29.1 µs
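A single `%%time` run on one row is noisy, so the two timings above are only a rough guide. A minimal sketch of a fairer comparison, assuming the 100,000 rows all look like the sample row (the function names `with_split` and `with_regex` are made up for this illustration), could use `timeit` over the full batch:

```python
import ast
import re
import timeit

ROW = ("['Adventure' 'African elephant' 'Animal' 'Ball game' 'Bay' "
       "'Body of water' 'Communication Device' 'Electronic device']")

def with_split(row):
    # Non-regex approach from the question: split on "' '", then literal_eval.
    joined = ",".join("_".join(i.lower().split()) for i in row.split("' '"))
    return ast.literal_eval(joined)[0].split(",")

def with_regex(row):
    # Regex approach from the question: capture everything between single quotes.
    return [w.lower().replace(" ", "_") for w in re.findall(r"'([^']*)'", row)]

# Assumption: 100,000 copies of the sample row stand in for the real CSV rows.
rows = [ROW] * 100_000

split_seconds = timeit.timeit(lambda: [with_split(r) for r in rows], number=1)
regex_seconds = timeit.timeit(lambda: [with_regex(r) for r in rows], number=1)

print(f"split + literal_eval: {split_seconds:.3f} s")
print(f"re.findall:           {regex_seconds:.3f} s")
```

The absolute numbers depend on the machine and Python version, so it is worth running this on a sample of the real data before picking one approach.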
Answer 0 (score: 1)
A simple list comprehension
import ast
document = "['Adventure' 'African elephant' 'Animal' 'Ball game' 'Bay' 'Body of water' 'Communication Device' 'Electronic device']"
ast.literal_eval(','.join(['_'.join(i.lower().split()) for i in document.split("' '")]))
Output (a list containing a single string)
['adventure,african_elephant,animal,ball_game,bay,body_of_water,communication_device,electronic_device']
Now, if you need a list of strings
ast.literal_eval(','.join(['_'.join(i.lower().split()) for i in document.split("' '")]))[0].split(',')
Output
['adventure',
'african_elephant',
'animal',
'ball_game',
'bay',
'body_of_water',
'communication_device',
'electronic_device']
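The round trip through `ast.literal_eval` is not strictly needed here. As a rough alternative sketch, assuming every row starts with `['`, ends with `']`, and separates labels with exactly `' '` (the helper name `normalize_row` is hypothetical), plain string slicing does the same job:

```python
def normalize_row(row):
    # Drop the leading "['" and the trailing "']".
    inner = row.strip()[2:-2]
    if not inner:
        # An empty list such as "[]" yields no labels.
        return []
    # Labels are separated by "' '"; lowercase each and replace spaces with underscores.
    return [label.lower().replace(" ", "_") for label in inner.split("' '")]

document = "['Adventure' 'African elephant' 'Animal' 'Ball game' 'Bay' 'Body of water' 'Communication Device' 'Electronic device']"
print(normalize_row(document))
```

Unlike `'_'.join(i.lower().split())`, `replace(" ", "_")` does not collapse runs of whitespace, so this version assumes the words inside a label are separated by single spaces.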
Answer 1 (score: 1)
This should work
import re
document = "['Adventure' 'African elephant' 'Animal' 'Ball game' 'Bay' 'Body of water' 'Communication Device' 'Electronic device']"
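# Extract the quoted labels, then lowercase them and replace spaces with underscores.
# The result is named labels so that the built-in list is not shadowed.
labels = [w.lower().replace(' ', '_') for w in re.findall(r"'([^']*)'", document)]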
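For many documents, the same idea can be wrapped in a function that reuses one compiled pattern. This is only a sketch: `LABEL_RE`, `normalize_rows`, and the two sample rows are names and data invented for the illustration, and since the `re` module already caches compiled patterns, the speed gain from precompiling is likely to be small.

```python
import re

# Compiled once and reused for every row: text between a pair of single quotes.
LABEL_RE = re.compile(r"'([^']*)'")

def normalize_rows(rows):
    # One list of normalized labels per input row.
    return [
        [label.lower().replace(" ", "_") for label in LABEL_RE.findall(row)]
        for row in rows
    ]

rows = ["['Adventure' 'African elephant']", "['Ball game' 'Bay']"]
result = normalize_rows(rows)
print(result)
# [['adventure', 'african_elephant'], ['ball_game', 'bay']]
```

In practice `rows` would come from the CSV column, for example by iterating over a `csv.reader`.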
Answer 2 (score: 1)
You can use the following code:
>>> import re
>>> row = "['Adventure' 'African elephant' 'Animal' 'Ball game' 'Bay' 'Body of water' 'Communication Device' 'Electronic device']"
>>> [w.replace(' ', '_') for w in re.findall(r"'([^']*)'", row.lower())]
['adventure', 'african_elephant', 'animal', 'ball_game', 'bay', 'body_of_water', 'communication_device', 'electronic_device']
Details:
row.lower() converts the input string to lowercase
re.findall turns the lowercase input string into a list by finding the substrings enclosed in single quotes
replace substitutes an underscore for each space in every element of the list