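I have an odd text file that contains a bunch of NUL characters (actually about 10 such files), and I'd like to replace them programmatically from within R. Here is a link to one of the files. With the help of this question I finally figured out a better-than-ad-hoc way of going into each file and find-and-replacing the nuisance characters. It turns out that each pair of them should correspond to one space ([NUL][NUL] -> " ") to maintain the intended line width of the file (which is crucial for reading these as fixed-width further down the road).

However, for robustness' sake, I would prefer a more automatable approach, ideally (for organization's sake) something I could add at the beginning of an R script I'm writing to clean up the files. This question looked promising, but the accepted answer is insufficient: readLines throws an error whenever I try to use it on these files (unless I activate skipNul).

Is there any way to get the lines of this file into R so that I can use gsub or whatever else to fix this issue, without resorting to external programs?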
You want to read the file as binary; then you can substitute the NULs, e.g. to replace them with spaces:
    r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
    r[r==as.raw(0)] = as.raw(0x20)  ## replace each NUL with 0x20 (space)
    writeBin(r, "00staff.txt")
    str(readLines("00staff.txt"))
    # chr [1:155432] "000540952Anderson Shelley J FW1949 2000R000000000000119460007620 3 0007000704002097907KGKG1616"| __truncated__ ...
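To see the effect on a small scale, here is a minimal self-contained sketch of the same binary read-and-replace. The temporary files and their contents are made up for illustration and are not the real 00staff.dat. Note that every NUL becomes its own space, so a pair of NULs ends up as two spaces, which is why collapsing pairs needs an extra step.

```r
# Build a tiny test file with two embedded NUL bytes (made-up data).
tmp_in  <- tempfile(fileext = ".dat")
tmp_out <- tempfile(fileext = ".txt")
bytes <- c(charToRaw("AB"), as.raw(c(0, 0)), charToRaw("CD\n"))
writeBin(bytes, tmp_in)

# Read the whole file as raw bytes, then overwrite every NUL with a space.
r <- readBin(tmp_in, what = raw(), n = file.info(tmp_in)$size)
r[r == as.raw(0)] <- as.raw(0x20)
writeBin(r, tmp_out)

# The cleaned file can now be read with plain readLines().
lines_out <- readLines(tmp_out)
print(lines_out)  # "AB  CD" (two spaces, one per NUL)
```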
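You can also substitute the NULs with a really rare character (such as "\01") and work on the string in place. For example, say you want to replace two NULs ("\00\00") with one space: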
    r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
    r[r==as.raw(0)] = as.raw(1)
    a = gsub("\01\01", " ", rawToChar(r), fixed=TRUE)
    s = strsplit(a, "\n", TRUE)[[1]]
    str(s)
    # chr [1:155432] "000540952Anderson Shelley J FW1949 2000R000000000000119460007620 3 0007000704002097907KGKG1616"| __truncated__
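Since the question mentions about 10 such files, the placeholder trick could be wrapped in a function and applied to each path in turn. The following is only a sketch: `read_nul_pairs` is a hypothetical helper name, the temporary files stand in for the real data, and it assumes that byte 0x01 never occurs in the files and that lines end in "\n".

```r
# Hypothetical helper: read a file as raw bytes, turn each pair of NULs into
# one space, and return the lines as a character vector.
read_nul_pairs <- function(path) {
  r <- readBin(path, what = raw(), n = file.info(path)$size)
  # Assumption: byte 0x01 is not used in the real data, so it is a safe placeholder.
  stopifnot(!any(r == as.raw(1)))
  r[r == as.raw(0)] <- as.raw(1)
  txt <- gsub("\001\001", " ", rawToChar(r), fixed = TRUE)
  strsplit(txt, "\n", fixed = TRUE)[[1]]
}

# Demo on two small temporary files (made-up contents).
paths <- c(tempfile(), tempfile())
writeBin(c(charToRaw("AB"), as.raw(c(0, 0)), charToRaw("CD\nEF\n")), paths[1])
writeBin(c(charToRaw("X"), as.raw(c(0, 0, 0, 0)), charToRaw("Y\n")), paths[2])

cleaned <- lapply(paths, read_nul_pairs)
print(cleaned)
```

An unpaired NUL would survive as a "\001" character in the result, so it may be worth checking for leftovers before parsing the lines as fixed-width.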