In Spark development, repartitioning data can improve runtime performance, especially around join operations. In many cases the speedup far outweighs the cost of the shuffle itself, so repartitioning is usually worthwhile.
Spark SQL provides two main methods for repartitioning data: repartition and coalesce. The sections below compare the two.
repartition
repartition has three overloads:
- def repartition(numPartitions: Int): DataFrame
 
```scala
/**
 * Returns a new [[DataFrame]] that has exactly `numPartitions` partitions.
 * @group dfops
 * @since 1.3.0
 */
def repartition(numPartitions: Int): DataFrame = withPlan {
  Repartition(numPartitions, shuffle = true, logicalPlan)
}
```
This method returns a new [[DataFrame]] that has exactly `numPartitions` partitions.
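A minimal sketch of this overload in use. It assumes a local Spark session and uses the Spark 2.x `SparkSession` entry point for brevity (on the 1.6 API quoted above you would go through `SQLContext` instead); the object and app names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object RepartitionByCount {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("repartition-count-demo")
      .getOrCreate()
    import spark.implicits._

    val df = (1 to 100).toDF("id")

    // repartition(numPartitions) triggers a full shuffle and yields
    // exactly that many partitions, regardless of the input layout.
    val repartitioned = df.repartition(8)
    println(repartitioned.rdd.getNumPartitions) // 8

    spark.stop()
  }
}
```

Note that this always shuffles, even when shrinking the partition count; that difference is what motivates coalesce later on.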
- def repartition(partitionExprs: Column*): DataFrame
 
```scala
/**
 * Returns a new [[DataFrame]] partitioned by the given partitioning expressions preserving
 * the existing number of partitions. The resulting DataFrame is hash partitioned.
 *
 * This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
 *
 * @group dfops
 * @since 1.6.0
 */
@scala.annotation.varargs
def repartition(partitionExprs: Column*): DataFrame = withPlan {
  RepartitionByExpression(partitionExprs.map(_.expr), logicalPlan, numPartitions = None)
}
```
This method returns a new [[DataFrame]] partitioned by the given partitioning expressions, preserving the existing number of partitions. The resulting DataFrame is hash partitioned.
This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
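The DISTRIBUTE BY equivalence can be sketched as follows. This is a hedged example assuming a local Spark 2.x session (1.6 would use `registerTempTable` on a `SQLContext`); the view name `t` is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object DistributeByDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("distribute-by-demo")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")
    df.createOrReplaceTempView("t")

    // Both produce a DataFrame hash-partitioned by `key`;
    // rows with the same key end up in the same partition.
    val byApi = df.repartition($"key")
    val bySql = spark.sql("SELECT * FROM t DISTRIBUTE BY key")

    // Both take the shuffle partition count from the same setting,
    // so the resulting partition counts match.
    println(byApi.rdd.getNumPartitions == bySql.rdd.getNumPartitions) // true

    spark.stop()
  }
}
```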
- def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame
 
```scala
/**
 * Returns a new [[DataFrame]] partitioned by the given partitioning expressions into
 * `numPartitions`. The resulting DataFrame is hash partitioned.
 *
 * This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
 *
 * @group dfops
 * @since 1.6.0
 */
@scala.annotation.varargs
def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame = withPlan {
  RepartitionByExpression(partitionExprs.map(_.expr), logicalPlan, numPartitions = Some(numPartitions))
}
```
This method returns a new [[DataFrame]] partitioned by the given partitioning expressions into exactly `numPartitions` partitions. The resulting DataFrame is hash partitioned, and this too corresponds to "DISTRIBUTE BY" in SQL (Hive QL).
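A short sketch of the third overload, which combines the other two: hash partitioning by expression plus an explicit partition count. Again this assumes a local Spark 2.x session; the names are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object RepartitionByCountAndExpr {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("repartition-count-expr-demo")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

    // Hash-partition by `key` into exactly 4 partitions. Rows with
    // equal keys are co-located, which is what makes pre-partitioning
    // before a join on `key` pay off.
    val out = df.repartition(4, $"key")
    println(out.rdd.getNumPartitions) // 4

    spark.stop()
  }
}
```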

