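I have a CSV file with 6 columns. One of the columns holds text that itself contains commas, for example BOLT,RD HD SQ SHORT NECK,METRIC.
When I read this file into R, that column overflows and the rest of the record shifts onto a new row.
Here are a few lines from the file: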
    014003051906,ETN5080,0450,BOLT KIT UPPER SHAFT WITH 5 SPEED,1.000,F
    014003051906,ETN5967,0460,SENSOR SENSOR FH BACKSHAFT SPEED,1.000,F
    014003051906,ETN64267,0470,TILT UNIT SENSOR,1.000,F

    014003065376,03M7184,0020,BOLT - M 8.0 X 1.250 X 20.0 - 8.8-Zinc,4.000,G
    014003065376,03M7386,0090,BOLT,RD HD SQ SHORT NECK,METRIC,18.000,G
    014003065376,14M7296,0090,NUT,METRIC ,HEX FLANGE,14.000,G
The last two lines are the problem. "NUT,METRIC,HEX FLANGE" should end up in a single variable.
How can I solve this?
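Before fixing anything, it can help to confirm which lines carry extra commas. The snippet below is a minimal sketch using `count.fields()` from base R on three of the sample lines; the names `sample_lines` and `n_fields` are made up for illustration.

```r
# Three of the sample lines from the question (hypothetical variable name).
sample_lines <- c(
  "014003065376,03M7184,0020,BOLT - M 8.0 X 1.250 X 20.0 - 8.8-Zinc,4.000,G",
  "014003065376,03M7386,0090,BOLT,RD HD SQ SHORT NECK,METRIC,18.000,G",
  "014003065376,14M7296,0090,NUT,METRIC ,HEX FLANGE,14.000,G"
)

# count.fields() reports how many comma-separated fields each line has.
con <- textConnection(sample_lines)
n_fields <- count.fields(con, sep = ",")
close(con)

# Lines with more than the expected 6 fields are the ones that overflow.
overflowing <- which(n_fields > 6)
print(n_fields)
print(overflowing)
```

With a real file, the connection would be replaced by the file path; any line reporting more than 6 fields contains unquoted commas in the description column.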
    # Sample data, read line by line (note the blank line between the two groups).
    data <- readLines(con = textConnection("014003051906,ETN5080 ,0450,BOLT KIT UPPER SHAFT WITH 5 SPEED,1.000,F
    014003051906,ETN5967 ,0460,SENSOR SENSOR FH BACKSHAFT SPEED,1.000,F
    014003051906,ETN64267 ,0470,TILT UNIT SENSOR,1.000,F

    014003065376,03M7184 ,0020,BOLT - M 8.0 X 1.250 X 20.0 - 8.8-Zinc,4.000,G
    014003065376,03M7386 ,0090,BOLT, RD HD SQ SHORT NECK, METRIC,18.000,G
    014003065376,14M7296 ,0090,NUT, METRIC, HEX FLANGE,14.000,G"))

    # Three comma-free fields, one free-form field, then two comma-free fields.
    pattern <- "^([^,]*),([^,]*),([^,]*),(.*),([^,]*),([^,]*)$"

    library(stringr)
    str_match(data, pattern)[, -1]  # drop column 1, which holds the full match
    #      [,1]           [,2]        [,3]   [,4]                                     [,5]     [,6]
    # [1,] "014003051906" "ETN5080 "  "0450" "BOLT KIT UPPER SHAFT WITH 5 SPEED"      "1.000"  "F"
    # [2,] "014003051906" "ETN5967 "  "0460" "SENSOR SENSOR FH BACKSHAFT SPEED"       "1.000"  "F"
    # [3,] "014003051906" "ETN64267 " "0470" "TILT UNIT SENSOR"                       "1.000"  "F"
    # [4,] NA             NA          NA     NA                                       NA       NA
    # [5,] "014003065376" "03M7184 "  "0020" "BOLT - M 8.0 X 1.250 X 20.0 - 8.8-Zinc" "4.000"  "G"
    # [6,] "014003065376" "03M7386 "  "0090" "BOLT, RD HD SQ SHORT NECK, METRIC"      "18.000" "G"
    # [7,] "014003065376" "14M7296 "  "0090" "NUT, METRIC, HEX FLANGE"                "14.000" "G"
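The result of `str_match()` is a character matrix, so a follow-up step is usually needed to get a data frame with typed columns. The sketch below shows one way to do that. It swaps in base R's `regexec()` and `regmatches()` for `stringr::str_match()` so it runs without extra packages, and the column names (`order_id`, `part_no`, and so on) are invented for illustration, since the question does not name the columns.

```r
# A few of the sample lines, including the blank separator line.
lines <- c(
  "014003051906,ETN64267 ,0470,TILT UNIT SENSOR,1.000,F",
  "",
  "014003065376,03M7386 ,0090,BOLT, RD HD SQ SHORT NECK, METRIC,18.000,G",
  "014003065376,14M7296 ,0090,NUT, METRIC, HEX FLANGE,14.000,G"
)
pattern <- "^([^,]*),([^,]*),([^,]*),(.*),([^,]*),([^,]*)$"

# Each list element holds the full match followed by the six groups,
# or character(0) when the line does not match (here: the blank line).
parts <- regmatches(lines, regexec(pattern, lines))
parts <- parts[lengths(parts) == 7]

# Stack the matches row-wise and drop the full-match column.
fields <- do.call(rbind, parts)[, -1, drop = FALSE]
records <- as.data.frame(fields, stringsAsFactors = FALSE)

# Hypothetical column names; adjust to the real meaning of each column.
names(records) <- c("order_id", "part_no", "position",
                    "description", "quantity", "flag")
records$part_no  <- trimws(records$part_no)      # remove the trailing space
records$quantity <- as.numeric(records$quantity) # text to number
print(records)
```

Filtering on the match length drops blank or malformed lines instead of keeping them as rows of `NA`; whether that is desirable depends on whether such lines should be reported or silently skipped.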
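Edit:

A regex explanation for beginners, in simple words (please forgive any inaccuracies):

The initial `^` and the final `$` mean the start and the end of the string.
Parentheses are used for grouping (`str_match()` extracts the groups).
`.` means any character, and `.*` means any number of characters.
`[^,]` means any character that is not a comma.

Put together, the pattern reads: start of string, then a substring without a comma followed by a comma (repeated 3 times), then a substring possibly containing commas, then a comma, a substring without a comma, a comma, a substring without a comma, and finally end of string.
Because the first three and the last two substrings cannot contain commas, any extra commas can only end up in the fourth one.

Only the parenthesized groups are extracted.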
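The breakdown above can also be sketched without a regular expression: split on every comma, keep the first three and the last two pieces, and glue whatever is left in the middle back together. The helper name `split_record` and its arguments are hypothetical, and the sketch assumes the last field is never empty, since `strsplit()` drops a trailing empty piece.

```r
# Split one line into 3 leading fields, 1 free-form field, 2 trailing fields.
# n_head and n_tail are illustrative parameters, not part of the original answer.
split_record <- function(line, n_head = 3, n_tail = 2) {
  pieces <- strsplit(line, ",", fixed = TRUE)[[1]]
  n <- length(pieces)
  # There must be at least one piece left over for the middle field.
  stopifnot(n >= n_head + n_tail + 1)
  middle <- paste(pieces[(n_head + 1):(n - n_tail)], collapse = ",")
  c(pieces[seq_len(n_head)], middle, pieces[(n - n_tail + 1):n])
}

rec <- split_record("014003065376,14M7296 ,0090,NUT, METRIC, HEX FLANGE,14.000,G")
print(rec)
```

This mirrors what the pattern does: only the fourth field is allowed to absorb extra commas, because the fields on either side are counted off from the two ends of the line.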