Question

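Suppose I have an ASCII text file whose contents are a run of single-digit integers with no delimiters between them.

I would like to split it by integer into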

v1 v2 v3 v4 v5
1  2  3  4  5

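In other words, each integer is its own variable. I know I can use read.fwf in R, but my dataset has nearly 500 variables. Is there a better way than writing widths=c(1,) and repeating the "1" 500 times?

I also tried importing the ASCII file into Excel and SPSS, but neither lets me insert variable breaks at fixed integer distances.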

Answer 1

You can determine the width of the file by reading one line as-is and then passing that width to read_fwf, using tidyverse functions.
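
A minimal sketch of that approach, assuming the readr package is installed. The temporary file, its two sample lines, and the v1, v2, ... column names are hypothetical stand-ins for the real data, not part of the original answer.

```r
library(readr)

# Hypothetical sample file standing in for the real data
path <- tempfile(fileext = ".txt")
write_lines(c("12345", "67890"), path)

# Read the first line as-is and count its characters
n_cols <- nchar(read_lines(path, n_max = 1))

# One single-character column per position in the line
df <- read_fwf(
  path,
  fwf_widths(rep(1, n_cols), col_names = paste0("v", seq_len(n_cols))),
  col_types = cols(.default = col_integer())
)
```

This assumes every line has the same length as the first one; if lines vary, the width would have to come from the longest line instead.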

Answer 2

Here is an option using read.fwf(), which was your original choice.

# for the example only, a two line source with different line lengths
input <-  textConnection("12345\n6789")

df1 <- read.fwf(input, widths = rep(1, 500))

ncol(df1)
# [1] 500

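But assuming you actually have fewer than 500 (as you say, and as is the case in this example), the surplus columns in which every value is NA can be dropped as follows. This uses the longest line to determine the number of columns kept.

df1 <- df1[, apply(!is.na(df1), 2, any)]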

df1
#   V1 V2 V3 V4 V5
# 1  1  2  3  4  5
# 2  6  7  8  9 NA

However, if missing values are not acceptable, use all() instead, so that the shortest line determines the number of columns kept.

df1 <- df1[, apply(!is.na(df1), 2, all)]

df1
#   V1 V2 V3 V4
# 1  1  2  3  4
# 2  6  7  8  9

Of course, if you know the exact line length and all lines are the same length, just set widths = rep(1, x) with x equal to that known length.
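
If the length is not known in advance, x can also be taken from the file itself. The following is a rough base R sketch under that assumption; the temporary file, its contents, and the v1, v2, ... column names are hypothetical and only illustrate the idea.

```r
# Hypothetical sample file standing in for the real data
path <- tempfile(fileext = ".txt")
writeLines(c("12345", "67890"), path)

# Take x from the longest line in the file
x <- max(nchar(readLines(path)))

# One single-character column per position
df2 <- read.fwf(path, widths = rep(1, x),
                col.names = paste0("v", seq_len(x)))
```

Reading the whole file once just to measure it is wasteful for very large inputs; in that case measuring only the first line may be enough when all lines share one length.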

Answer 3

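If you are using Excel 2010 or later, you can import the file with Power Query (also known as Get & Transform). While editing the input, there is a Split Column option that lets you specify the number of characters.

This tool is built into Excel 2016 and is available as a free Microsoft add-in for Excel 2010 and later.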

How do I split an ASCII file by integer/digit?