I am trying to alias the columns that result from pivoting values on a PySpark DataFrame. The problem is that the column names I pass to the alias() call do not end up as the actual column names.
A concrete example:
Starting from this DataFrame:
    import pyspark.sql.functions as func

    df = sc.parallelize([
        (217498, 100000001, 'A'), (217498, 100000025, 'A'), (217498, 100000124, 'A'),
        (217498, 100000152, 'B'), (217498, 100000165, 'C'), (217498, 100000177, 'C'),
        (217498, 100000182, 'A'), (217498, 100000197, 'B'), (217498, 100000210, 'B'),
        (854123, 100000005, 'A'), (854123, 100000007, 'A')
    ]).toDF(["user_id", "timestamp", "actions"])
which gives:
    +-------+----------+-------+
    |user_id| timestamp|actions|
    +-------+----------+-------+
    | 217498| 100000001|    'A'|
    | 217498| 100000025|    'A'|
    | 217498| 100000124|    'A'|
    | 217498| 100000152|    'B'|
    | 217498| 100000165|    'C'|
    | 217498| 100000177|    'C'|
    | 217498| 100000182|    'A'|
    | 217498| 100000197|    'B'|
    | 217498| 100000210|    'B'|
    | 854123| 100000005|    'A'|
    | 854123| 100000007|    'A'|
    +-------+----------+-------+
The problem is that calling
    df = df.groupby('user_id')\
           .pivot('actions')\
           .agg(func.count('timestamp').alias('ts_count'),
                func.mean('timestamp').alias('ts_mean'))
produces the column names
    df.columns
    ['user_id',
     'A_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L',
     'A_(avg(timestamp),mode=Complete,isDistinct=false) AS ts_mean#5',
     'B_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L',
     'B_(avg(timestamp),mode=Complete,isDistinct=false) AS ts_mean#5',
     'C_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L',
     'C_(avg(timestamp),mode=Complete,isDistinct=false) AS ts_mean#5']
which is completely impractical to work with.
I could clean up my column names with the approach shown here (a regex) or here (using withColumnRenamed()). But those are hacky workarounds that could easily break after an update.
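For reference, this is a minimal sketch of the kind of regex cleanup I mean; the helper name clean_names and the exact pattern are just illustrative assumptions based on the names shown above, not anything Spark provides:

    import re

    # Hypothetical helper: strip the generated aggregate expression and keep
    # only '<pivot value>_<alias>', e.g. 'A_ts_count'. The pattern is an
    # assumption based on the column names above and may need adjusting.
    def clean_names(df):
        pattern = re.compile(r'^(\w+)_\(.*\) AS (\w+)(#\d+L?)?$')
        new_cols = []
        for c in df.columns:
            m = pattern.match(c)
            new_cols.append(m.group(1) + '_' + m.group(2) if m else c)
        return df.toDF(*new_cols)

    df = clean_names(df)
    # df.columns -> ['user_id', 'A_ts_count', 'A_ts_mean', 'B_ts_count', ...]

But this just parses the generated names after the fact, which is exactly what I would like to avoid.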
To sum up: how can I use the columns produced by the pivot without having to parse their generated names (e.g. names like 'A_(count(timestamp),mode=Complete,isDistinct=false) AS ts_count#4L')?
Any help would be greatly appreciated! Thanks.