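I have to convert all the arguments in a character vector of text into a format that is easy to reference: an R list with 3 columns (presenter, time, and text) (sorry, I should have been clearer).
For example, the presenter should be
# HARPER'S
the time should be
# [Day 1, 9:00 A.M.]
and the text should be the rest of the argument.
I need to count the number of arguments in the text (each passage that begins with a header such as
# HARPER'S [Day 1, 9:00 A.M.]
is one argument). I want to create a new list object named "arguments", where each element of the list is a sublist containing three elements ("presenter", "time", and "text").
Then I need to extract the presenter names and times into two character vectors (also removing the indentation) and store them as the "presenter" and "time" elements of that argument's sublist.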
This is the text:
[1] "HARPER'S [Day 1, 9:00 A.M.]: When the computer was young, the word hacking was"
[2] "used to describe the work of brilliant students who explored and expanded the"
[3] "uses to which this new technology might be employed. There was even talk of a"
[4] "\"hacker ethic.\" Somehow, in the succeeding years, the word has taken on dark"
[5] "connotations, suggestion the actions of a criminal. What is the hacker ethic,"
[6] "and does it survive?"
[7] ""
[8] "ADELAIDE [Day 1, 9:25 A.M.]: the hacker ethic survives, and it is a fraud. It"
[9] "survives in anyone excited by technology's power to turn many small,"
[10] "insignificant things into one vast, beautiful thing. It is a fraud because"
[11] "there is nothing magical about computers that causes a user to undergo"
[12] "religious conversion and devote himself to the public good. Early automobile"
[13] "inventors were hackers too. At first the elite drove in luxury. Later"
[14] "practically everyone had a car. Now we have traffic jams, drunk drivers, air"
[15] "pollution, and suburban sprawl. The old magic of an automobile occasionally"
[16] "surfaces, but we possess no delusions that it automatically invades the"
[17] "consciousness of anyone who sits behind the wheel. Computers are power, and"
[18] "direct contact with power can bring out the best or worst in a person. It's"
[19] "tempting to think that everyone exposed to the technology will be grandly"
[20] "inspired, but, alas, it just ain't so."
[21] ""
[22] "BRAND [Day 1, 9:54 A.M.]: The hacker ethic involves several things. One is"
[23] "avoiding waste; insisting on using idle computer power -- often hacking into a"
[24] "system to do so, while taking the greatest precautions not to damage the"
[25] "system. A second goal of many hackers is the free exchange of technical"
[26] "information. These hackers feel that patent and copyright restrictions slow"
[27] "down technological advances. A third goal is the advancement of human"
[28] "knowledge for its own sake. Often this approach is unconventional. People we"
[29] "call crackers often explore systems and do mischief. The are called hackers by"
[30] "the press, which doesn't understand the issues."
[31] ""
[32] "KK [Day 1, 11:19 A.M.]: The hacker ethic went unnoticed early on because the"
[33] "explorations of basement tinkerers were very local. Once we all became"
[34] "connected, the work of these investigations rippled through the world. today"
[35] "the hacking spirit is alive and kicking in video, satellite TV, and radio. In"
[36] "some fields they are called chippers, because the modify and peddle altered"
[37] "chips. Everything that was once said about \"phone phreaks\" can be said about"
[38] "them too."
I tried to count the number of arguments.
length(grep("^([A-Z]+'*[A-Z]*)", text_data))
arguments = list(presenters = regmatches(text_data, regexpr("^([A-Z]+'*[A-Z]*)", text_data)), time = regmatches(text_data, regexpr("(\\[.*\\])", text_data)), text = regmatches(paste(unlist(text_data), collapse =" ")), regexpr("(:\\s.*)", regmatches(paste(unlist(text_data), collapse =" "))))
text_data
The length of the list "arguments" should be 55.
Example output: example data output format
Thank you very much for your help.
Answer 0 (score: 1)
Given the way you want to capture the text, this regex can do the job: it captures the presenter, time, and text in three groups, and re.findall finds every match and puts it in a list, where each match is a tuple holding those three pieces of information as its elements. The regex is:
(.*?)\s+(\[[^[\]]*\]):\s*([\w\W]*?)(?=\n\n|\Z)
Sample Python code:
import re
s = """HARPER'S [Day 1, 9:00 A.M.]: When the computer was young, the word hacking was
used to describe the work of brilliant students who explored and expanded the
uses to which this new technology might be employed. There was even talk of a
\"hacker ethic.\" Somehow, in the succeeding years, the word has taken on dark
connotations, suggestion the actions of a criminal. What is the hacker ethic,
and does it survive?

ADELAIDE [Day 1, 9:25 A.M.]: the hacker ethic survives, and it is a fraud. It
survives in anyone excited by technology's power to turn many small,
insignificant things into one vast, beautiful thing. It is a fraud because
there is nothing magical about computers that causes a user to undergo
religious conversion and devote himself to the public good. Early automobile
inventors were hackers too. At first the elite drove in luxury. Later
practically everyone had a car. Now we have traffic jams, drunk drivers, air
pollution, and suburban sprawl. The old magic of an automobile occasionally
surfaces, but we possess no delusions that it automatically invades the
consciousness of anyone who sits behind the wheel. Computers are power, and
direct contact with power can bring out the best or worst in a person. It's
tempting to think that everyone exposed to the technology will be grandly
inspired, but, alas, it just ain't so.

BRAND [Day 1, 9:54 A.M.]: The hacker ethic involves several things. One is
avoiding waste; insisting on using idle computer power -- often hacking into a
system to do so, while taking the greatest precautions not to damage the
system. A second goal of many hackers is the free exchange of technical
information. These hackers feel that patent and copyright restrictions slow
down technological advances. A third goal is the advancement of human
knowledge for its own sake. Often this approach is unconventional. People we
call crackers often explore systems and do mischief. The are called hackers by
the press, which doesn't understand the issues.

KK [Day 1, 11:19 A.M.]: The hacker ethic went unnoticed early on because the
explorations of basement tinkerers were very local. Once we all became
connected, the work of these investigations rippled through the world. today
the hacking spirit is alive and kicking in video, satellite TV, and radio. In
some fields they are called chippers, because the modify and peddle altered
chips. Everything that was once said about \"phone phreaks\" can be said about
them too."""
argument = re.findall(r'(.*?)\s+(\[[^[\]]*\]):\s*([\w\W]*?)(?=\n\n|\Z)', s)
print(argument)
This prints a list of tuples, each containing three elements: presenter, time, and text.
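To get from these tuples to the structure the question asks for (a list named arguments whose elements each hold presenter, time, and text), one option is to turn each tuple into a dict. The following is a minimal sketch, not part of the original answer: the shortened sample string and the helper name parse_arguments are assumptions for illustration, and stripping the brackets from the time is a design choice rather than a requirement.

```python
import re

# Same pattern as in the answer above: presenter, bracketed time, then text
# up to the next blank line or the end of the string.
PATTERN = re.compile(r'(.*?)\s+(\[[^[\]]*\]):\s*([\w\W]*?)(?=\n\n|\Z)')

def parse_arguments(s):
    """Return a list of dicts with keys 'presenter', 'time', and 'text'."""
    arguments = []
    for presenter, time, text in PATTERN.findall(s):
        arguments.append({
            "presenter": presenter.strip(),
            "time": time.strip("[]"),
            # Join the wrapped lines of one argument into a single line.
            "text": " ".join(text.split()),
        })
    return arguments

# Shortened sample input (an assumption for illustration).
sample = """HARPER'S [Day 1, 9:00 A.M.]: What is the hacker ethic,
and does it survive?

ADELAIDE [Day 1, 9:25 A.M.]: the hacker ethic survives, and it is a fraud."""

arguments = parse_arguments(sample)
print(len(arguments))
print(arguments[0]["presenter"], "|", arguments[0]["time"])
```

The number of arguments is then simply len(arguments), which answers the counting part of the question without a separate pass over the text.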
Answer 1 (score: 1)
library(magrittr)
library(data.table)

text2df <- function(text) {
  # Chunk boundaries: the first line, every blank separator line, and the last line
  idx <- c(1, which(text == ""), length(text))
  apply(matrix(c(idx[-length(idx)], idx[-1]), ncol = 2), 1, function(id1_id2) {
    presenter_text <- text[id1_id2[1]:id1_id2[2]]
    first_row <- paste(presenter_text[1:2], collapse = "") # presenter_text[1] can be ''
    # Presenter: everything before " ["
    presenter_name <- strsplit(first_row, split = " [", fixed = TRUE)[[1]][1]
    # Time: everything before "]: ", minus the leading "PRESENTER ["
    presentation_time <- strsplit(first_row, split = "]: ", fixed = TRUE)[[1]][1] %>%
      gsub(paste0(presenter_name, " ["), "", ., fixed = TRUE)
    # Text: the rest of the header row plus the remaining lines of the chunk
    presentation_text <- paste(c(
      gsub(paste0(presenter_name, " [", presentation_time, "]:"), "", first_row, fixed = TRUE) %>%
        stringi::stri_trim_left() # remove leading spaces
      , presenter_text[3:length(presenter_text)] %>% .[!is.na(.)] # filter NA if only one row of text
    ), collapse = "")
    data.table(presenter = presenter_name, time = presentation_time, text = presentation_text)
  }) %>% rbindlist
}
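For readers following along in Python, the same idea (split the character vector into chunks at the empty strings, then take the header apart) can be sketched as below. This is a rough Python counterpart, not a translation of the R function above; the name lines_to_records and the shortened input are assumptions for illustration.

```python
def lines_to_records(lines):
    """Group lines into arguments at empty strings and split each header."""
    chunks, current = [], []
    for line in lines:
        if line == "":
            # A blank line closes the current argument.
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(line)
    if current:
        chunks.append(current)

    records = []
    for chunk in chunks:
        joined = " ".join(chunk)
        # Header layout assumed: PRESENTER [time]: text
        header, _, text = joined.partition("]: ")
        presenter, _, time = header.partition(" [")
        records.append({"presenter": presenter, "time": time, "text": text})
    return records

# Shortened input in the shape of the question's character vector (assumed).
text_data = [
    "KK [Day 1, 11:19 A.M.]: Everything that was once said about",
    "them too.",
    "",
    "BRAND [Day 1, 9:54 A.M.]: The hacker ethic involves several things.",
]
records = lines_to_records(text_data)
print(records[0])
```

Joining the lines of a chunk with a single space keeps the words at line ends from running together.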
Answer 2 (score: 1)
Here is your input:
text_data = """HARPER'S [Day 1, 9:00 A.M.]: When the computer was young, the word hacking was
used to describe the work of brilliant students who explored and expanded the
uses to which this new technology might be employed. There was even talk of a
\"hacker ethic.\" Somehow, in the succeeding years, the word has taken on dark
connotations, suggestion the actions of a criminal. What is the hacker ethic,
and does it survive?

ADELAIDE [Day 1, 9:25 A.M.]: the hacker ethic survives, and it is a fraud. It
survives in anyone excited by technology's power to turn many small,
insignificant things into one vast, beautiful thing. It is a fraud because
there is nothing magical about computers that causes a user to undergo
religious conversion and devote himself to the public good. Early automobile
inventors were hackers too. At first the elite drove in luxury. Later
practically everyone had a car. Now we have traffic jams, drunk drivers, air
pollution, and suburban sprawl. The old magic of an automobile occasionally
surfaces, but we possess no delusions that it automatically invades the
consciousness of anyone who sits behind the wheel. Computers are power, and
direct contact with power can bring out the best or worst in a person. It's
tempting to think that everyone exposed to the technology will be grandly
inspired, but, alas, it just ain't so.

BRAND [Day 1, 9:54 A.M.]: The hacker ethic involves several things. One is
avoiding waste; insisting on using idle computer power -- often hacking into a
system to do so, while taking the greatest precautions not to damage the
system. A second goal of many hackers is the free exchange of technical
information. These hackers feel that patent and copyright restrictions slow
down technological advances. A third goal is the advancement of human
knowledge for its own sake. Often this approach is unconventional. People we
call crackers often explore systems and do mischief. The are called hackers by
the press, which doesn't understand the issues.

KK [Day 1, 11:19 A.M.]: The hacker ethic went unnoticed early on because the
explorations of basement tinkerers were very local. Once we all became
connected, the work of these investigations rippled through the world. today
the hacking spirit is alive and kicking in video, satellite TV, and radio. In
some fields they are called chippers, because the modify and peddle altered
chips. Everything that was once said about \"phone phreaks\" can be said about
them too."""
Use regex to extract the three variables:
import re
argument = re.findall(r"(?P<presenter>[A-Z']+).\[(?P<time>\w.+)\].\s+(?P<text>[\w\W]*?)(?=\n\n|\Z)", text_data)
Just in case you want to turn them into a dictionary:
mydict = {'presenter': [], 'time': [], 'text': []}
for i in argument:
    mydict['presenter'].append(i[0])
    mydict['time'].append(i[1])
    mydict['text'].append(i[2])
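Because the pattern uses named groups, the names can also be used directly instead of positional indexes. The sketch below is an alternative, not the answer's original code: it uses re.finditer and the match object's groupdict() to build one dict per argument. The pattern is a slightly tightened raw-string variant of the one above, and the shortened text_data is an assumption for illustration.

```python
import re

# Shortened sample input (an assumption for illustration).
text_data = """BRAND [Day 1, 9:54 A.M.]: The hacker ethic involves several things.

KK [Day 1, 11:19 A.M.]: The hacker ethic went unnoticed early on."""

pattern = re.compile(
    r"(?P<presenter>[A-Z']+)\s\[(?P<time>[^\]]+)\]:\s+(?P<text>[\w\W]*?)(?=\n\n|\Z)"
)

# groupdict() maps each group name to the text it captured.
rows = [m.groupdict() for m in pattern.finditer(text_data)]

# Column-oriented dict, the same shape as mydict above.
columns = {key: [row[key] for row in rows] for key in ("presenter", "time", "text")}
print(columns["presenter"])
```

The row-oriented list (rows) is closer to the list of sublists the question describes, while the column-oriented dict matches mydict.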
Or if you want to save them to a csv file:
import csv
# newline="" lets the csv module control line endings itself
with open("filename.csv", "w", newline="") as mycsv:
    writers = csv.writer(mycsv)
    header = ['presenter', 'time', 'text']
    writers.writerow(header)
    for item in argument:
        writers.writerow(item)
To load your csv file:
import pandas as pd
df = pd.read_csv("filename.csv")
df
Output:
presenter | time | text
--------------------------------------------------------------------------------------
0 HARPER'S | Day 1, 9:00 A.M. | When the computer was young, the word hacking ...
1 ADELAIDE | Day 1, 9:25 A.M. | the hacker ethic survives, and it is a fraud. ...
2 BRAND | Day 1, 9:54 A.M. | The hacker ethic involves several things. One...
3 KK | Day 1, 11:19 A.M. | The hacker ethic went unnoticed early on becau...
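If pandas is not available, the standard-library csv module can read the data back as well. A minimal sketch, assuming the same three-column layout; it writes to an in-memory io.StringIO buffer instead of filename.csv so that it runs without touching the disk, and the two sample rows are shortened stand-ins for the regex output.

```python
import csv
import io

# Shortened stand-ins for the tuples returned by re.findall (assumed values).
argument = [
    ("HARPER'S", "Day 1, 9:00 A.M.", "What is the hacker ethic,\nand does it survive?"),
    ("KK", "Day 1, 11:19 A.M.", "Everything can be said about them too."),
]

# newline="" keeps the buffer from translating the line endings csv writes.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["presenter", "time", "text"])
writer.writerows(argument)

# Rewind and read the rows back as dicts keyed by the header row.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["time"])
```

Fields that contain commas or line breaks, such as the time and the multi-line text, are quoted by the writer and restored by the reader.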
Answer 3 (score: 0)
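import re
# Example input (an assumption; the original answer does not define `line`)
line = "HARPER'S [Day 1, 9:00 A.M.]: When the computer was young, the word hacking was\n"
matchObj = re.search(r'(.*?)\[(.*?)\](.*\s)', line)
print(matchObj.group(1))
print(matchObj.group(2))
print(matchObj.group(3))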
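This may help. Using groups, you can extract the characters; if you want to change some of the logic, you can change it inside the "()" parentheses.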
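As written, the pattern in this answer leaves the trailing space in group 1 and the leading colon in group 3, and (.*\s) has to end on a whitespace character, so the last word can be cut off when the line has no trailing newline. A possible tightening is sketched below; the sample line and the adjusted pattern are assumptions for illustration, not part of the original answer.

```python
import re

# Example header line (an assumption; `line` is not defined in the answer).
line = "HARPER'S [Day 1, 9:00 A.M.]: When the computer was young, the word hacking was"

# \s* and : sit outside the groups, so they are matched but not captured.
match = re.search(r"(.*?)\s*\[(.*?)\]:\s*(.*)", line)
if match:
    presenter, time, text = match.groups()
    print(presenter)
    print(time)
    print(text)
```

Since . does not match a newline by default, this handles one line at a time; multi-line arguments would still need to be joined first or matched with a pattern that spans lines.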