Question

I have a question about (pre)processing text data. Each row of my CSV contains data structured like this:

row = "['Adventure' 'African elephant' 'Animal' 'Ball game' 'Bay' 'Body of water' 'Communication Device' 'Electronic device']"

Expected result after the transformation:

[adventure, african_elephant, animal, ball_game, bay, body_of_water, communication_device, electronic_device]

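Question: what is the best and most efficient way to do this for a large number of documents (100,000)? Both RegEx and non-RegEx solutions in Python are welcome.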

Solutions:

%%time
import ast
row = "['Adventure' 'African elephant' 'Animal' 'Ball game' 'Bay' 'Body of water' 'Communication Device' 'Electronic device']"
row = ast.literal_eval(','.join(['_'.join(i.lower().split()) for i in row.split("' '")]))[0].split(',')
row

CPU times: user 43 µs, sys: 1 µs, total: 44 µs
Wall time: 48.2 µs

%%time
import re
row = "['Adventure' 'African elephant' 'Animal' 'Ball game' 'Bay' 'Body of water' 'Communication Device' 'Electronic device']"
row = [w.lower().replace(' ', '_') for w in re.findall(r"'([^']*)'", row)]
row

CPU times: user 25 µs, sys: 1e+03 ns, total: 26 µs
Wall time: 29.1 µs
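
A single `%%time` run at the microsecond scale is noisy, so timing both approaches over many rows may give a more useful comparison. The following is a minimal sketch, assuming the 100,000 documents are available as a list of strings. The names `parse_split`, `parse_regex`, `QUOTED`, and `rows` are made up for illustration, and the repeated sample row is only a stand-in for real data. Actual timings will depend on the machine.

```python
import ast
import re
import timeit

ROW = "['Adventure' 'African elephant' 'Animal' 'Ball game' 'Bay' 'Body of water' 'Communication Device' 'Electronic device']"

# Compile the pattern once instead of on every call.
QUOTED = re.compile(r"'([^']*)'")

def parse_split(row):
    # Non-regex approach: split on the "' '" separator, normalize each part,
    # then let ast.literal_eval strip the surrounding brackets and quotes.
    joined = ','.join('_'.join(part.lower().split()) for part in row.split("' '"))
    return ast.literal_eval(joined)[0].split(',')

def parse_regex(row):
    # Regex approach: capture every single-quoted substring, then normalize it.
    return [w.lower().replace(' ', '_') for w in QUOTED.findall(row)]

# Stand-in for the real documents (assumption: one string per document).
rows = [ROW] * 100_000

for func in (parse_split, parse_regex):
    seconds = timeit.timeit(lambda: [func(r) for r in rows], number=1)
    print(f"{func.__name__}: {seconds:.3f} s for {len(rows)} rows")
```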

Answer 1

A simple list comprehension

import ast
document = "['Adventure' 'African elephant' 'Animal' 'Ball game' 'Bay' 'Body of water' 'Communication Device' 'Electronic device']"
ast.literal_eval(','.join(['_'.join(i.lower().split()) for i in document.split("' '")]))

Output (as a list containing a single string)

['adventure,african_elephant,animal,ball_game,bay,body_of_water,communication_device,electronic_device']

Now, if you need a list of strings

ast.literal_eval(','.join(['_'.join(i.lower().split()) for i in document.split("' '")]))[0].split(',')

Output

['adventure',
 'african_elephant',
 'animal',
 'ball_game',
 'bay',
 'body_of_water',
 'communication_device',
 'electronic_device']
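
The same split-based idea can probably be written without `ast.literal_eval` by stripping the outer brackets and quotes directly. This is only a sketch: the function name `parse_labels` is made up, and it assumes every row is non-empty and that no label itself starts or ends with a bracket or a single quote.

```python
def parse_labels(row):
    # Drop the leading "['" and trailing "']", then split on the "' '" separator.
    items = row.strip("[]'").split("' '")
    # Lowercase each label and join its words with underscores.
    return ['_'.join(item.lower().split()) for item in items]

document = "['Adventure' 'African elephant' 'Animal' 'Ball game' 'Bay' 'Body of water' 'Communication Device' 'Electronic device']"
print(parse_labels(document))
```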

Answer 2

This should work

import re
document = "['Adventure' 'African elephant' 'Animal' 'Ball game' 'Bay' 'Body of water' 'Communication Device' 'Electronic device']"
matches = re.findall(r"'([^']*)'", document)  # named `matches` to avoid shadowing the built-in `list`
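
Note that the pattern `'([^']*)'` assumes no label contains an apostrophe. If the rows came from printing a NumPy string array (an assumption, the question does not say), a label such as `Rubik's cube` would be wrapped in double quotes instead. A hedged sketch of a pattern that accepts both quote styles follows; the names `QUOTED` and `extract` are illustrative only.

```python
import re

# Matches either '...' or "...". For each match, at most one of the two
# capture groups is non-empty.
QUOTED = re.compile(r"'([^']*)'|\"([^\"]*)\"")

def extract(row):
    # findall returns (single_quoted, double_quoted) tuples; keep whichever matched.
    return [single or double for single, double in QUOTED.findall(row)]

row = "['Ball game' \"Rubik's cube\" 'Bay']"
print(extract(row))
```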

Answer 3

You can use the following code:


>>> import re
>>> row = "['Adventure' 'African elephant' 'Animal' 'Ball game' 'Bay' 'Body of water' 'Communication Device' 'Electronic device']"
>>> [w.replace(' ', '_') for w in re.findall(r"'([^']*)'", row.lower())]
['adventure', 'african_elephant', 'animal', 'ball_game', 'bay', 'body_of_water', 'communication_device', 'electronic_device']

Details:

row.lower(): converts the input string to lowercase
re.findall(r"'([^']*)'", ...): converts the lowercased input string into a list by finding the substrings enclosed in single quotes
w.replace(' ', '_'): replaces spaces with underscores in each element of the list
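
Since the rows come from a CSV file, the same steps could be applied while reading it with the standard `csv` module. The sketch below uses an in-memory file so that it is self-contained. The column names `id` and `labels` and the helper name `parse_labels` are assumptions, as the question does not show the real file layout.

```python
import csv
import io
import re

QUOTED = re.compile(r"'([^']*)'")

def parse_labels(cell):
    # Lowercase the cell, extract the quoted labels, replace spaces with underscores.
    return [w.replace(' ', '_') for w in QUOTED.findall(cell.lower())]

# In-memory stand-in for the real file; the column names are hypothetical.
sample = io.StringIO(
    'id,labels\n'
    '1,"[\'Adventure\' \'African elephant\' \'Animal\']"\n'
    '2,"[\'Bay\' \'Body of water\']"\n'
)

parsed = [parse_labels(record["labels"]) for record in csv.DictReader(sample)]
print(parsed)
# [['adventure', 'african_elephant', 'animal'], ['bay', 'body_of_water']]
```

For a real file, `open(path, newline='')` would take the place of the `io.StringIO` object.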
