Consider the following data:

Name | Flag
A    | 0
A    | 1
A    | 0
B    | 0
B    | 1
B    | 1
I would like to transform it into:

Name | Total | With Flag | Percentage
A    | 3     | 1         | 33%
B    | 3     | 2         | 66%
Preferably in Spark SQL.

For example, like this:
import org.apache.spark.sql.functions.{count, mean, sum}

val df = sc.parallelize(Seq(
  ("A", 0), ("A", 1), ("A", 0),
  ("B", 0), ("B", 1), ("B", 1)
)).toDF("name", "flag")

df.groupBy($"name").agg(
  count("*").alias("total"),
  sum($"flag").alias("with_flag"),
  // The integer cast truncates (66.67 becomes 66).
  // Do you really want to truncate rather than, for example, round?
  mean($"flag").multiply(100).cast("integer").alias("percentage")
).show()

// +----+-----+---------+----------+
// |name|total|with_flag|percentage|
// +----+-----+---------+----------+
// |   A|    3|        1|        33|
// |   B|    3|        2|        66|
// +----+-----+---------+----------+
Or:
// registerTempTable is deprecated since Spark 2.0;
// use df.createOrReplaceTempView("df") there instead.
df.registerTempTable("df")

sqlContext.sql("""
  SELECT name, COUNT(*) total, SUM(flag) with_flag,
         CAST(AVG(flag) * 100 AS INT) percentage
  FROM df
  GROUP BY name""").show()

// +----+-----+---------+----------+
// |name|total|with_flag|percentage|
// +----+-----+---------+----------+
// |   A|    3|        1|        33|
// |   B|    3|        2|        66|
// +----+-----+---------+----------+
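
Both versions above truncate the percentage, which is why B comes out as 66 rather than 67. If rounding is what you actually want, here is a minimal sketch of the DataFrame version with `round` applied before the cast. It assumes a spark-shell session (so `sc`, `toDF` and the `$` syntax are available); the names `flags` and `rounded` are only illustrative, and the `orderBy` is added just to make the row order deterministic.

```scala
import org.apache.spark.sql.functions.{count, mean, round, sum}

// Same sample data as above, under a separate (illustrative) name.
val flags = sc.parallelize(Seq(
  ("A", 0), ("A", 1), ("A", 0),
  ("B", 0), ("B", 1), ("B", 1)
)).toDF("name", "flag")

val rounded = flags.groupBy($"name").agg(
  count("*").alias("total"),
  sum($"flag").alias("with_flag"),
  // Round to the nearest whole percent first, then cast to an integer.
  round(mean($"flag") * 100).cast("integer").alias("percentage")
).orderBy($"name")

rounded.show()

// +----+-----+---------+----------+
// |name|total|with_flag|percentage|
// +----+-----+---------+----------+
// |   A|    3|        1|        33|
// |   B|    3|        2|        67|
// +----+-----+---------+----------+
```

In the SQL version the corresponding change should be CAST(ROUND(AVG(flag) * 100) AS INT).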