My question is how to split a column into multiple columns. I don't know why df.toPandas() does not work.

For example, I would like to turn 'df_test' into 'df_test2'. I have seen many examples that use the pandas module. Is there another way? Thanks in advance.
df_test = sqlContext.createDataFrame([
    (1, '14-Jul-15'),
    (2, '14-Jun-15'),
    (3, '11-Oct-15'),
], ('id', 'date'))
df_test2

id  day  month  year
 1   14    Jul    15
 2   14    Jun    15
 3   11    Oct    15
Answer by zero323:
Spark >= 2.2
You can skip unix_timestamp and the cast, and use to_date or to_timestamp instead:
from pyspark.sql.functions import to_date, to_timestamp

df_test.withColumn("date", to_date("date", "dd-MMM-yy")).show()
## +---+----------+
## | id|      date|
## +---+----------+
## |  1|2015-07-14|
## |  2|2015-06-14|
## |  3|2015-10-11|
## +---+----------+

df_test.withColumn("date", to_timestamp("date", "dd-MMM-yy")).show()
## +---+-------------------+
## | id|               date|
## +---+-------------------+
## |  1|2015-07-14 00:00:00|
## |  2|2015-06-14 00:00:00|
## |  3|2015-10-11 00:00:00|
## +---+-------------------+
and then apply the other datetime functions shown below.
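For example, a minimal sketch (assuming Spark >= 2.2; the intermediate column name "d" is my own choice, not part of the original answer) that combines to_date with the extraction functions from the Spark < 2.2 section to build the day, month and year columns:

from pyspark.sql.functions import to_date, dayofmonth, date_format, col

transformed = (df_test
    .withColumn("d", to_date("date", "dd-MMM-yy"))    # parse once into a helper column
    .withColumn("day", dayofmonth(col("d")).cast("string"))
    .withColumn("month", date_format(col("d"), "MMM"))
    .withColumn("year", date_format(col("d"), "yy"))  # two-digit year, matching df_test2
    .drop("d"))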
Spark < 2.2
It is not possible to derive multiple top-level columns in a single access. You can use structs or collection types with a UDF like this:
from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql import Row
from pyspark.sql.functions import udf, col

schema = StructType([
    StructField("day", StringType(), True),
    StructField("month", StringType(), True),
    StructField("year", StringType(), True)
])

def split_date_(s):
    try:
        d, m, y = s.split("-")
        return d, m, y
    except:
        return None

split_date = udf(split_date_, schema)

transformed = df_test.withColumn("date", split_date(col("date")))
transformed.printSchema()
## root
##  |-- id: long (nullable = true)
##  |-- date: struct (nullable = true)
##  |    |-- day: string (nullable = true)
##  |    |-- month: string (nullable = true)
##  |    |-- year: string (nullable = true)
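The nested fields can then be promoted to top-level columns with an ordinary select (standard DataFrame syntax, shown here for completeness):

flat = transformed.select(
    col("id"),
    col("date.day").alias("day"),
    col("date.month").alias("month"),
    col("date.year").alias("year"))
flat.show()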
But this approach is not only quite verbose in PySpark, it is also expensive.
For date-based transformations you can simply use the built-in functions:
from pyspark.sql.functions import unix_timestamp, dayofmonth, year, date_format

transformed = (df_test
    .withColumn("ts", unix_timestamp(col("date"), "dd-MMM-yy").cast("timestamp"))
    .withColumn("day", dayofmonth(col("ts")).cast("string"))
    .withColumn("month", date_format(col("ts"), "MMM"))
    .withColumn("year", year(col("ts")).cast("string"))
    .drop("ts"))
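With the sample data, transformed.show() should print something like the following (output reconstructed by hand, so take the exact formatting with a grain of salt). Note that year() gives the four-digit year; for the two-digit '15' shown in df_test2 you could use date_format(col("ts"), "yy") instead.

transformed.show()
## +---+---------+---+-----+----+
## | id|     date|day|month|year|
## +---+---------+---+-----+----+
## |  1|14-Jul-15| 14|  Jul|2015|
## |  2|14-Jun-15| 14|  Jun|2015|
## |  3|11-Oct-15| 11|  Oct|2015|
## +---+---------+---+-----+----+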
Similarly, you could use regexp_extract to split the date string.
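A minimal sketch of that approach (the pattern and column logic are my own, assuming well-formed dd-MMM-yy strings, not part of the original answer):

from pyspark.sql.functions import regexp_extract, col

# One capture group per target column: day, abbreviated month, two-digit year.
pattern = r"(\d{1,2})-(\w{3})-(\d{2})"
transformed = (df_test
    .withColumn("day", regexp_extract(col("date"), pattern, 1))
    .withColumn("month", regexp_extract(col("date"), pattern, 2))
    .withColumn("year", regexp_extract(col("date"), pattern, 3)))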
See also Derive multiple columns from a single column in a Spark DataFrame.
Note: if you use a version that is not patched against SPARK-11724, you will need a correction after unix_timestamp(...) and before cast("timestamp").