Question

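I am building web scrapers that produce lists, such as playlist information from Spotify, job descriptions from Indeed, or company lists from LinkedIn. I now have large text files that I would like to turn into a DataFrame, either by converting them to CSV or by building a dictionary.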

Text file:

Scribd
MobileQAEngineer




VitaminT
MobileQAEngineer




Welocalize
MobileQAEngineer




RWSMoravia
MobileQAEngineer

Desired output:

Scribd,MobileQAEngineer
VitaminT,MobileQAEngineer
Welocalize,MobileQAEngineer
RWSMoravia,MobileQAEngineer

I was thinking I could try something like this:

if line of text does not have 4 \n afterwards
    then it is the 1st tuple
if line of text has 4 \n afterwards
    then it is the 2nd tuple

with open(input("Enter a file to read: "),'r') as f:
    for line in f:
        newline = line + ":"
        #f.write(newline)
        print(newline)

While trying to put a ':' at the end of each line, I ended up with a ':' both before and after each line:

:
Scribd
:
MobileQAEngineer
:


:
VitaminT
:
MobileQAEngineer
:


:
Welocalize
:
MobileQAEngineer
:


:
RWSMoravia
:
MobileQAEngineer
:
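
A likely explanation for the output above: iterating over a file yields each line with its trailing "\n" still attached, so `line + ":"` places the colon after the newline, at the start of the next printed line. The sketch below strips the newline before appending. It uses `io.StringIO` as a stand-in for the opened file, and the helper name `append_colon` is made up for illustration.

```python
import io

# Stand-in for the opened file; the sample text is taken from the question.
sample = io.StringIO(
    "Scribd\nMobileQAEngineer\n\n\n\n\nVitaminT\nMobileQAEngineer\n"
)

def append_colon(lines):
    """Return each non-empty line with ':' appended after removing its newline."""
    result = []
    for line in lines:
        stripped = line.rstrip("\n")  # drop the newline that iteration keeps
        if stripped:                  # skip the blank separator lines
            result.append(stripped + ":")
    return result

labelled = append_colon(sample)
print("\n".join(labelled))
```

With the newline removed first, the colon stays on the same line as the text it follows.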

Answer 1

You can parse the data with a regex and then convert it to a DataFrame:

import re
import pandas as pd

with open('data.txt', 'r') as f:
    data = f.read()

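# Raw string so that \w is passed to the regex engine unchanged;
# each match is a (company, position) pair of adjacent non-empty lines.
m = re.findall(r'(\w+)\n(\w+)', data)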
d = {'Company': [c[0] for c in m], 'Position': [c[1] for c in m]}
df = pd.DataFrame(data=d)

Output:

      Company          Position
0      Scribd  MobileQAEngineer
1    VitaminT  MobileQAEngineer
2  Welocalize  MobileQAEngineer
3  RWSMoravia  MobileQAEngineer
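
As an alternative to the regex, the same pairing can be sketched with the standard library only, which also yields the comma-separated output the question asks for. This assumes every record consists of exactly two non-empty lines (company, then position). The helper name `pair_lines` is made up for illustration, and the sample text is inlined instead of being read from a file.

```python
import csv
import io

def pair_lines(text):
    """Group the non-empty lines of `text` into (company, position) pairs."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Pair the even-indexed lines (companies) with the odd-indexed ones (positions).
    return list(zip(lines[0::2], lines[1::2]))

# Sample data copied from the question, with blank lines between records.
sample = (
    "Scribd\nMobileQAEngineer\n\n\n\n\n"
    "VitaminT\nMobileQAEngineer\n\n\n\n\n"
    "Welocalize\nMobileQAEngineer\n\n\n\n\n"
    "RWSMoravia\nMobileQAEngineer\n"
)

rows = pair_lines(sample)

# Write the pairs as CSV into an in-memory buffer; a real file would work the same way.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text, end="")
```

Because this approach does not rely on `\w+`, it should also cope with names that contain spaces or punctuation. The resulting `rows` list can be passed to `pd.DataFrame(rows, columns=['Company', 'Position'])` if a DataFrame is needed.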

Python: Converting a large text file to a DataFrame using its format
