Cleaning a pandas DataFrame built from 3500 different text documents (scraped articles)

Time: 2018-07-20 09:43:05

Tags: python pandas data-science

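I have more than 3500 scraped text files, collected from the same or different sources. Each file contains an article together with advertisements and other site text that came along with the scrape. I want to remove that extra text from the beginning and the end of each file, because the article itself only sits in between.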

I loaded all the text files and converted them into a pandas DataFrame.

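import io
import os

import pandas as pd

loc = 'D://Users/Desktop/tdm_input'

# Keep only regular files, in a stable order. Full paths are used so the
# working directory does not have to be changed with os.chdir().
files = sorted(f for f in os.listdir(loc) if os.path.isfile(os.path.join(loc, f)))

data = []
for f in files:
    # ISO-8859-1 maps every byte to a character, so reading never raises a decode error
    with io.open(os.path.join(loc, f), 'r', encoding='ISO-8859-1') as myfile:
        data.append(myfile.read())

# One row per file, with the whole file content in a single 'text' column
df = pd.DataFrame(data, columns=['text'])
print(df.shape)

# The output below shows the first 50 rows
df.head(50)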


text
0   What Fresh Hell Is This? January 31, 2018 ...A...
1   What Fresh Hell Is This? February 27, 2018 My ...
2   What Fresh Hell Is This? March 31, 2018 Trump ...
3   What Fresh Hell Is This? April 29, 2018 Michel...
4   Join Email List Contribute Join AMERICAblog Ac...
5   Join Email List Contribute Join AMERICAblog Ac...
6   Join Email List Contribute Join AMERICAblog Ac...
7   Join Email List Contribute Join AMERICAblog Ac...
8   Skip to content Facebook ASF On Twitter ASF On...
9   Skip to content Facebook ASF On Twitter ASF On...
10  Skip to content Facebook ASF On Twitter ASF On...
11  Skip to content Facebook ASF On Twitter ASF On...
12  Skip to content Facebook ASF On Twitter ASF On...
13  Skip to content Facebook ASF On Twitter ASF On...
14  Skip to content Facebook ASF On Twitter ASF On...
15  Skip to content Facebook ASF On Twitter ASF On...
16  Skip to content Facebook ASF On Twitter ASF On...
17  Skip to content Facebook ASF On Twitter ASF On...
18  Skip to content Facebook ASF On Twitter ASF On...
19  Skip to content Facebook ASF On Twitter ASF On...
20  Skip to content Facebook ASF On Twitter ASF On...
21  Skip to content Facebook ASF On Twitter ASF On...
22  Skip to content Facebook ASF On Twitter ASF On...
23  Skip to content Facebook ASF On Twitter ASF On...
24  Skip to content Facebook ASF On Twitter ASF On...
25  Skip to content Facebook ASF On Twitter ASF On...
26  Skip to content Facebook ASF On Twitter ASF On...
27  Skip to content Facebook ASF On Twitter ASF On...
28  Skip to content Facebook ASF On Twitter ASF On...
29  Skip to content Facebook ASF On Twitter ASF On...
30  Skip to content Facebook ASF On Twitter ASF On...
31  Skip to content Facebook ASF On Twitter ASF On...
32  Skip to content Facebook ASF On Twitter ASF On...
33  Skip to content Facebook ASF On Twitter ASF On...
34  Skip to content Facebook ASF On Twitter ASF On...
35  Skip to content Facebook ASF On Twitter ASF On...
36  Skip to content Facebook ASF On Twitter ASF On...
37  Skip to content Facebook ASF On Twitter ASF On...
38  Skip to content Facebook ASF On Twitter ASF On...
39  Skip to content Facebook ASF On Twitter ASF On...
40  Skip to content Facebook ASF On Twitter ASF On...
41  Skip to content Facebook ASF On Twitter ASF On...
42  Skip to content Facebook ASF On Twitter ASF On...
43  Skip to content Facebook ASF On Twitter ASF On...
44  French Politics An American observer comments ...
45  French Politics An American observer comments ...
46  DOW JONES, A NEWS CORP COMPANY News Corp is a ...
47  DOW JONES, A NEWS CORP COMPANY News Corp is a ...
48  DOW JONES, A NEWS CORP COMPANY News Corp is a ...
49  DOW JONES, A NEWS CORP COMPANY News Corp is a ...

This is what my DataFrame looks like. How can I clean it? The link to my data is given below.

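I have also extracted the sentences that I want to remove from the beginning and end of the text files. How can I use them to clean my pandas DataFrame?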

Here are the links:
data = https://drive.google.com/open?id=1HJbWTUMfiBV54EEtgSXTcsQLzQT1rFgz
filtered_text = https://drive.google.com/open?id=1GApPKvA82tx4CDtlOTqe99zKXS3AHiuD
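
One possible way to use a list of extracted sentences is sketched below. It is only a minimal illustration under stated assumptions, not a tested solution for the linked files: the helper name `strip_boilerplate`, the sample phrases, and the two sample rows are all hypothetical, and the format of the linked file of extracted sentences is not known here (the sketch assumes the sentences have already been read into a plain Python list). The idea is to repeatedly peel off any known sentence found at the very start or very end of a document, so that the same words appearing in the middle of an article are left alone.

```python
import pandas as pd


def strip_boilerplate(text, phrases):
    """Remove known phrases from the start and end of text, leaving the middle untouched."""
    # Longest phrases first, so a short phrase does not cut into a longer one.
    phrases = sorted({p.strip() for p in phrases if p.strip()}, key=len, reverse=True)
    text = text.strip()
    changed = True
    while changed:  # keep going until no phrase matches at either edge
        changed = False
        for phrase in phrases:
            if text.startswith(phrase):
                text = text[len(phrase):].lstrip()
                changed = True
            if text.endswith(phrase):
                text = text[:-len(phrase)].rstrip()
                changed = True
    return text


# Hypothetical stand-ins for the extracted sentences and the scraped texts.
boilerplate = ["Skip to content", "Join Email List", "Share this article"]
df = pd.DataFrame({"text": [
    "Skip to content Join Email List The actual article body. Share this article",
    "Join Email List Another article: Skip to content stays here. Share this article",
]})

# Apply the helper row by row and keep the original column for comparison.
df["clean_text"] = df["text"].apply(lambda t: strip_boilerplate(t, boilerplate))
print(df["clean_text"].tolist())
```

Because the match is an exact string comparison at the edges, it will miss boilerplate that differs in whitespace, case, or embedded dates; normalizing whitespace first or switching to regular expressions may be needed for real scraped pages.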

0 answers:

No answers