Question

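I have a set of documents containing preprocessed text from HTML pages. They were given to me in this form. I want to extract only the words from them: no numbers, no common words, and no single letters. Here is the first problem I'm facing.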

Suppose I have a cell array:

{'!' '!!' '!!!!)'  '!!!!thanks' '!!dogsbreath'    '!)'    '!--[endif]--'    '!--[if'}

I want the cell array to keep only the entries that contain words, like this:

{'!!!!thanks' '!!dogsbreath' '!--[endif]--' '!--[if'}

and then convert it into this cell array:

{'thanks' 'dogsbreath' 'endif' 'if'}

Is there a way to do this?

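Updated requirement: Thanks for all your answers. However, I've run into a problem. Let me illustrate (note that the cell values are text extracted from HTML documents, so they may contain non-ASCII values) -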

{'!/bin/bash'    '![endif]'    '!take-a-long'    '!â€“photo'}

This gives me the answer

{'bin'    'bash'    'endif'    'take'    'a'    'long'    'â'    'photo' }

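My questions:

1. Why are bin/bash and take-a-long split into three cells? It isn't a problem for me, but why does it happen? Can it be avoided, so that all the words coming from a single cell are combined into one?
2. Note that '!â€“photo' contains a non-ASCII character â, which essentially stands for a. Can a step be incorporated so that this conversion happens automatically?
3. I noticed that the text "it? __________ About the Author:" gives me "__________" as a word. Why is that?
4. Likewise, the text "2. areoplane 3. cactus 4. a_rinny_boo... 5. trumpet 6. window 7. curtain ... 173. gypsy_wagon..." returns the words 'areoplane' 'cactus' 'a_rinny_boo' 'trumpet' 'window' 'curtain' 'gypsy_wagon'. I want 'a_rinny_boo' and 'gypsy_wagon' to come out as 'a' 'rinny' 'boo' 'gypsy' 'wagon'. Can this be done?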

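Update 1: Based on all the suggestions I received, I have written a function that does most of what I need, except for the two new issues above.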

function [Text_Data] = raw_txt_gn(filename)

% This function converts the text documents into raw text.
% It removes commas, empty cells and other special characters.
% It also converts all the words of the text documents to lowercase.

T = textread(filename, '%s');

% find all the important indices
ind1=find(ismember(T,':WebpageTitle:'));
T1 = T(ind1+1:end,1);

% Remove things which are not basically words
not_words = {'##','-',':ImageSurroundingText:',':WebpageDescription:',':WebpageKeywords:',' '};

T2 = {}; count = 1;
for j=1:length(T1)    
    x = T1{j};
    ind=find(ismember(not_words,x), 1);
    if isempty(ind)

        B = regexp(x, '\w*', 'match');
        B(cellfun('isempty', B)) = []; % Clean out empty cells
        B = [B{:}]; % Flatten cell array

        % convert the string to lowercase
        % so that while generating the features the case sensitivity is
        % handled well
        x = lower(B);        

        T2{count,1} = x;
        count = count+1;
    end
end
T2 = T2(~cellfun('isempty',T2));


% Getting the common words in the english language
% found from Wikipedia
not_words2 = {'the','be','to','of','and','a','in','that','have','i'};
not_words2 = [not_words2, 'it' 'for' 'not' 'on' 'with' 'he' 'as' 'you' 'do' 'at'];
not_words2 = [not_words2, 'this' 'but' 'his' 'by' 'from' 'they' 'we' 'say' 'her' 'she'];
not_words2 = [not_words2, 'or' 'an' 'will' 'my' 'one' 'all' 'would' 'there' 'their' 'what'];
not_words2 = [not_words2, 'so' 'up' 'out' 'if' 'about' 'who' 'get' 'which' 'go' 'me'];
not_words2 = [not_words2, 'when' 'make' 'can' 'like' 'time' 'no' 'just' 'him' 'know' 'take'];
not_words2 = [not_words2, 'people' 'into' 'year' 'your' 'good' 'some' 'could' 'them' 'see' 'other'];
not_words2 = [not_words2, 'than' 'then' 'now' 'look' 'only' 'come' 'its' 'over' 'think' 'also'];
not_words2 = [not_words2, 'back' 'after' 'use' 'two' 'how' 'our' 'work' 'first' 'well' 'way'];
not_words2 = [not_words2, 'even' 'new' 'want' 'because' 'any' 'these' 'give' 'day' 'most' 'us'];

for j=1:length(T2)
    x = T2{j};
    % if a particular cell contains any digit then make it empty
    if any(isstrprop(x, 'digit'))
        T2{j} = [];
    end
    % also remove single character cells
    if length(x)==1
        T2{j} = [];
    end
    % also remove the most common words
    % the common words are taken from the English dictionary (source:
    % Wikipedia)
    ind=find(ismember(not_words2,x), 1);
    if ~isempty(ind)
        T2{j} = [];
    end
end

Text_Data = T2(~cellfun('isempty',T2));

Update 2: I found a piece of code that shows how to check for non-ASCII characters, and incorporated it into my MATLAB function as

% discard cells that contain any non-ASCII character
if any(x >= 128)
  T2{j} = [];
end

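After removing the empty cells, my second requirement seems to be met, although any text that contains a non-ASCII character disappears entirely.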

Can my remaining requirements be met? Most of them involve the characters '_' and '-'.

Answer 1

Using the regexp approach to go straight to the final step:

A = {'!' '!!' '!!!!)'  '!!!!thanks' '!!dogsbreath'    '!)'    '!--[endif]--'    '!--[if'};

B = regexp(A, '\w*', 'match');
B(cellfun('isempty', B)) = []; % Clean out empty cells
B = [B{:}]; % Flatten cell array

\w* matches any alphabetic, numeric, or underscore character. For the example case, we get a 1x4 cell array:

B = 

    'thanks'    'dogsbreath'    'endif'    'if'

EDIT:

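Why are bin/bash and take-a-long split into three cells? It isn't a problem for me, but why does it happen? Can it be avoided, so that all the words coming from a single cell are combined into one?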

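Because I am flattening the cell array to remove the nested cells. If you remove B = [B{:}];, each cell will instead contain a nested cell holding all the matches for the corresponding input cell. You can then combine these however you like.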
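As a minimal sketch of that idea (not part of the original answer): keep the nested result and concatenate the matches of each input cell, so every input cell yields at most one output string. The variable names tokens and joined are only illustrative, and '\w+' is used in place of '\w*' so that a match can never be empty.

```matlab
A = {'!/bin/bash', '![endif]', '!take-a-long'};

% One nested cell of matches per input cell (no flattening)
tokens = regexp(A, '\w+', 'match');

% Concatenate the matches that came from the same input cell
joined = cellfun(@(c) [c{:}], tokens, 'UniformOutput', false);

% Drop inputs that produced no match at all
joined(cellfun('isempty', joined)) = [];
```

For this input, joined should be {'binbash' 'endif' 'takealong'}.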

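Note that '!â€“photo' contains a non-ASCII character â, which essentially stands for a. Can a step be incorporated so that this conversion happens automatically?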

Yes, but you would have to build it yourself based on the character codes.
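One way such a step might look, as a rough sketch only: a small hand-written lookup table from accented character codes to plain ASCII letters, applied before the regexp. The table (map_from, map_to) is a hypothetical, deliberately tiny example and would need extending for real data. Anything non-ASCII that is not in the table is replaced by a space here so that word boundaries survive.

```matlab
% '!â€“photo' built from character codes to avoid source-encoding issues
x = ['!' char(226) char(8364) char(8220) 'photo'];

% Hypothetical lookup table: à á â ã ä å è é ê ë -> plain ASCII letters
map_from = char([224:229, 232:235]);
map_to   = 'aaaaaaeeee';

% Replace every character found in the table by its ASCII counterpart
[tf, loc] = ismember(x, map_from);
x(tf) = map_to(loc(tf));

% Turn any remaining non-ASCII character into a space
x(x >= 128) = ' ';

words = regexp(x, '[a-zA-Z]+', 'match');
```

With this table, words should come out as {'a' 'photo'}.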

I noticed that the text "it? __________ About the Author:" gives me "__________" as a word. Why is that?

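As I said, the regular expression matches alphabetic, numeric, or underscore characters. You can change the filter to exclude _, which also takes care of the fourth bullet point: B = regexp(A, '[a-zA-Z0-9]*', 'match'); This matches only a-z, A-Z, and 0-9. It also excludes the non-ASCII characters, which the \w* pattern appears to match.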
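To illustrate on the strings from the question, here is a small sketch. It is a variation rather than the exact pattern above: it assumes numbers should be dropped as well, so the character class is narrowed to letters only.

```matlab
A = {'it?', '__________', '4.', 'a_rinny_boo...', '173.', 'gypsy_wagon...'};

% Letters only: underscores, digits and punctuation all act as separators
B = regexp(A, '[a-zA-Z]+', 'match');
B(cellfun('isempty', B)) = [];  % Clean out cells with no match
B = [B{:}];                     % Flatten cell array
```

Here B should be {'it' 'a' 'rinny' 'boo' 'gypsy' 'wagon'}: the run of underscores and the bare numbers produce no match at all.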

Answer 2

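I think @excaza's solution would be the go-to approach, but here is an alternative with isstrprop, using its optional input argument 'alpha' to look for letters -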

A(cellfun(@(x) any(isstrprop(x, 'alpha')), A))

Sample run -

>> A
A = 
    '!'    '!!'    '!!!!)'    '!!!!thanks'    '!!dogsbreath'    '!)'    '!--[endif]--'    '!--[if'
>> A(cellfun(@(x) any(isstrprop(x, 'alpha')), A))
ans = 
    '!!!!thanks'    '!!dogsbreath'    '!--[endif]--'    '!--[if'

To get to the final result, you can tweak this approach a bit, like so -

B = cellfun(@(x) x(isstrprop(x, 'alpha')), A,'Uni',0);
out = B(~cellfun('isempty',B))

Sample run -

A = 
    '!'    '!!'    '!!!!)'    '!!!!thanks'    '!!dogsbreath'    '!)'    '!--[endif]--'    '!--[if'
out = 
    'thanks'    'dogsbreath'    'endif'    'if'
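The other requirements in the question (no single letters, no common words) could be layered on top with logical indexing. This is only a sketch: stop_words is a shortened, hypothetical stand-in for the full common-word list in the question.

```matlab
words = {'thanks', 'a', 'the', 'dogsbreath', 'if', 'endif'};
stop_words = {'the', 'a', 'if'};   % hypothetical, shortened list

% Keep words longer than one character that are not in the stop list
keep = cellfun('length', words) > 1 & ~ismember(lower(words), stop_words);
filtered = words(keep);
```

For this input, filtered should be {'thanks' 'dogsbreath' 'endif'}.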
