Spark SQL
Spark SQL is Spark's module for processing structured data. It provides two programming abstractions, DataFrame and Dataset, and also acts as a distributed SQL query engine.
Hive SQL is translated into MapReduce jobs and submitted to the cluster, which greatly simplifies writing MapReduce programs, but the MapReduce computation model executes slowly. Spark SQL arose to address this: it translates the SQL into RDD operations and submits them to the cluster, so execution is much faster.
SparkSession
SparkSession is the newest entry point for Spark SQL queries. It is essentially a combination of SQLContext and HiveContext, so any API available on SQLContext or HiveContext is also available on SparkSession. SparkSession wraps a SparkContext internally, so the actual computation is still performed by the SparkContext.
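A minimal sketch of creating a SparkSession in application code (the app name and master value are placeholders):

import org.apache.spark.sql.SparkSession

// build (or reuse) the session; this is the entry point for DataFrame/Dataset and SQL
val spark = SparkSession.builder()
  .appName("SparkSQLDemo")
  .master("local[*]")
  .getOrCreate()

// the wrapped SparkContext is still reachable when plain RDDs are needed
val sc = spark.sparkContext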
DataFrame
Creating a DataFrame
In Spark SQL, SparkSession is the entry point for creating DataFrames and executing SQL. There are three ways to create a DataFrame: from a Spark data source; by converting an existing RDD; or by querying a Hive table.
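The three paths, sketched side by side (the file path, Hive table name, and an existing spark/sc are assumptions; the Hive path additionally requires Hive support, covered later):

import spark.implicits._
val fromSource = spark.read.json("examples/src/main/resources/people.json") // 1. from a data source
val fromRdd    = sc.makeRDD(List(("Andy", 30))).toDF("name", "age")          // 2. from an existing RDD
val fromHive   = spark.sql("select * from some_hive_table")                  // 3. from a Hive table query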
Creating a DataFrame by reading a JSON file
Spark reads JSON line by line; each line only has to be a valid JSON object.

scala> val rdd = spark.read.json("/opt/module/spark/spark-local/examples/src/main/resources/people.json")
rdd: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> rdd.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
① SQL-style syntax (the query is translated into SQL and executed)

scala> rdd.createTempView("user")   // a view is the result of a table query; it can be queried but not modified

scala> spark.sql("select * from user").show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

scala> spark.sql("select * from user where age is not null").show
+---+------+
|age|  name|
+---+------+
| 30|  Andy|
| 19|Justin|
+---+------+
Note: an ordinary temporary view is scoped to the current session. To make it visible across the whole application, use a global temporary view, which must be referenced by its full path, e.g. global_temp.people.
scala> rdd.createGlobalTempView("emp")   // promote to a global temporary view

scala> spark.sql("select * from user where age is not null").show
+---+------+
|age|  name|
+---+------+
| 30|  Andy|
| 19|Justin|
+---+------+

scala> spark.sql("select * from emp where age is not null").show   // SQL looks up tables in the current session by default, so the global_temp prefix is required
org.apache.spark.sql.AnalysisException: Table or view not found: emp; line 1 pos 14

scala> spark.sql("select * from global_temp.emp where age is not null").show
+---+------+
|age|  name|
+---+------+
| 30|  Andy|
| 19|Justin|
+---+------+
② DSL-style syntax: access the data in an object-oriented manner, mimicking object-oriented style
scala> rdd.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

scala> rdd.select("age").show
+----+
| age|
+----+
|null|
|  30|
|  19|
+----+

scala> rdd.select($"age"+1).show
+---------+
|(age + 1)|
+---------+
|     null|
|       31|
|       20|
+---------+
Converting an RDD to a DataFrame
Note: any conversion between RDD and DataFrame/Dataset requires import spark.implicits._ (here spark is not a package name but the name of the SparkSession object).
Prerequisite: import the implicit conversions and create an RDD.
scala> import spark.implicits._   // implicit conversion rules defined on the spark object, not a package import
import spark.implicits._

scala> val df = rdd.toDF("id", "name")
df: org.apache.spark.sql.DataFrame = [id: bigint, name: string]

scala> df.show
+----+-------+
|  id|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

scala> df.createTempView("Student")

scala> spark.sql("select * from student").show
scala> val x = sc.makeRDD(List(("a",1), ("b",4), ("c", 3)))

scala> x.collect
res36: Array[(String, Int)] = Array((a,1), (b,4), (c,3))

scala> x.toDF("name", "count")
res37: org.apache.spark.sql.DataFrame = [name: string, count: int]

scala> val y = x.toDF("name", "count")
y: org.apache.spark.sql.DataFrame = [name: string, count: int]

scala> y.show
+----+-----+
|name|count|
+----+-----+
|   a|    1|
|   b|    4|
|   c|    3|
+----+-----+
DF ---> RDD: just call rdd directly
scala> y.rdd.collect
res46: Array[org.apache.spark.sql.Row] = Array([a,1], [b,4], [c,3])

scala> df.rdd.collect
res49: Array[org.apache.spark.sql.Row] = Array([null,Michael], [30,Andy], [19,Justin])
Converting an RDD to a Dataset
Spark SQL can automatically convert an RDD that contains case classes into a DataFrame: the case class defines the table structure, and its fields become the column names via reflection. Case classes may contain complex structures such as Seqs or Arrays. A Dataset is a strongly typed collection, so the corresponding type information must be supplied.
scala> case class People(age: BigInt, name: String)
defined class People

scala> rdd.collect
res77: Array[org.apache.spark.sql.Row] = Array([null,Michael], [30,Andy], [19,Justin])

scala> val ds = rdd.as[People]
ds: org.apache.spark.sql.Dataset[People] = [age: bigint, name: string]

scala> ds.collect
res31: Array[People] = Array(People(null,Michael), People(30,Andy), People(19,Justin))
scala> case class Person(name: String, age: Long)
defined class Person

scala> val caseclassDS = Seq(Person("kris", 20)).toDS()
caseclassDS: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]

scala> caseclassDS.show
+----+---+
|name|age|
+----+---+
|kris| 20|
+----+---+

scala> caseclassDS.collect
res51: Array[Person] = Array(Person(kris,20))
Creating an RDD with textFile and converting it to a Dataset
scala> val textFileRDD = sc.textFile("/opt/module/spark/spark-local/examples/src/main/resources/people.txt")

scala> textFileRDD.collect
res78: Array[String] = Array(Michael, 29, Andy, 30, Justin, 19)

scala> case class Person(name: String, age: Long)
defined class Person

scala> textFileRDD.map(x => {val rddMap = x.split(","); Person(rddMap(0), rddMap(1).trim.toInt)}).toDS
res80: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
DS ----> RDD: just call the rdd method
scala> val DS = Seq(Person("Andy", 32)).toDS()   // this is also a way to create a Dataset
DS: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]

scala> ds.collect
res76: Array[People] = Array(People(null,Michael), People(30,Andy), People(19,Justin))

scala> ds.rdd.collect
res75: Array[People] = Array(People(null,Michael), People(30,Andy), People(19,Justin))
DF ---> DS
spark.read.json("path") already returns a DataFrame;
scala> df.collect
res72: Array[org.apache.spark.sql.Row] = Array([null,Michael], [30,Andy], [19,Justin])

scala> case class Student(id: BigInt, name: String)
defined class Student

scala> df.as[Student]
res69: org.apache.spark.sql.Dataset[Student] = [id: bigint, name: string]
DS-->DF
The idea is: once the type of each column is given, call the as method to turn the DataFrame into a Dataset. This is extremely convenient when the data arrives as a DataFrame but each field needs to be processed by type. When using these conversions, always add import spark.implicits._; otherwise toDF and toDS are not available.
scala> ds.collect
res73: Array[People] = Array(People(null,Michael), People(30,Andy), People(19,Justin))

scala> ds.toDF
res74: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
What the three have in common
(1) RDD, DataFrame, and Dataset are all distributed, resilient datasets on the Spark platform, designed to make processing very large data convenient;
(2) All three are lazy: creation and transformations such as map are not executed immediately; the traversal only starts when an action such as foreach is encountered;
(3) They share many functions, such as filter and sorting (see the sketch after this list);
(4) Many operations on DataFrame and Dataset require import spark.implicits._ (it is best to import it right after creating the SparkSession object).
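As an illustration of point (3), a minimal sketch applying the same filter through all three abstractions, assuming the people.json data and the People case class used elsewhere in these notes:

import spark.implicits._
case class People(age: BigInt, name: String)                 // as defined earlier

val df  = spark.read.json("examples/src/main/resources/people.json")  // DataFrame
val ds  = df.as[People]                                                // Dataset[People]
val rdd = ds.rdd                                                       // RDD[People]

df.filter($"age" > 20).show()                                // DataFrame: column expression
ds.filter(p => p.age != null && p.age > 20).show()           // Dataset:   typed lambda
rdd.filter(p => p.age != null && p.age > 20).foreach(println) // RDD:      same lambda, no show(); print elements directly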
Conversions among the three
RDD cares about the data, DataFrame cares about the structure, Dataset cares about the type;
① To convert an RDD to a DataFrame, structure information must be added, so call toDF and supply the structure;
② To convert an RDD to a Dataset, both structure and type information must be added, so convert the elements to the target type and then call toDS;
③ To convert a DataFrame to a Dataset, the structure information is already there and only the type information is missing, so call as[Type];
④ A DataFrame already contains the data, so to convert it to an RDD just call rdd;
⑤ A Dataset already contains the data, so to convert it to an RDD just call rdd;
⑥ A Dataset already contains both data and structure information, so to convert it to a DataFrame just call toDF (all six conversions are collected in the sketch after this list).
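All six conversions in one sketch, assuming a SparkSession named spark with import spark.implicits._ already in scope (the Emp case class and sample data are placeholders):

case class Emp(name: String, age: Long)

val rdd = spark.sparkContext.makeRDD(List(("Andy", 30L), ("Justin", 19L)))

val df     = rdd.toDF("name", "age")                       // ① RDD       -> DataFrame: add structure
val ds     = rdd.map { case (n, a) => Emp(n, a) }.toDS()   // ② RDD       -> Dataset:   add structure + type
val ds2    = df.as[Emp]                                    // ③ DataFrame -> Dataset:   add type
val rowRdd = df.rdd                                        // ④ DataFrame -> RDD[Row]
val empRdd = ds.rdd                                        // ⑤ Dataset   -> RDD[Emp]
val df2    = ds.toDF()                                     // ⑥ Dataset   -> DataFrame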
Differences among the three
Relationship: RDD, DataFrame, and Dataset are all data abstractions in Spark. RDD belongs to Spark Core, while DataFrame and Dataset belong to Spark SQL, and both are implemented on top of RDD under the hood;
Differences: RDD — advantages: ① compile-time type safety; ② object-oriented programming style; ③ data can be manipulated directly through the class's members. Disadvantages: communication and I/O require serialization and deserialization, which is costly, and frequent object creation and destruction increases GC pressure;
DataFrame introduces a schema and off-heap memory, so it avoids frequent GC and reduces memory overhead; its drawback is that it is not type safe;
Dataset combines the advantages of both and removes their drawbacks.
1. RDD: ① RDDs are generally used together with Spark MLlib; ② RDDs do not support Spark SQL operations.
2. DataFrame:
1) Unlike RDD and Dataset, every row of a DataFrame has the fixed type Row. Column values cannot be accessed directly; the individual fields can only be obtained by parsing the Row, for example:
testDF.foreach{ line =>
  val col1 = line.getAs[String]("col1")
  val col2 = line.getAs[String]("col2")
}
2) DataFrame and Dataset are generally not used together with Spark MLlib.
3) Both DataFrame and Dataset support Spark SQL operations such as select and groupBy; they can also be registered as temporary tables/views and queried with SQL statements, e.g. dataDF.createOrReplaceTempView("tmp")
spark.sql("select ROW,DATE from tmp where DATE is not null order by DATE").show(100,false)
4) DataFrame and Dataset support some particularly convenient save formats, such as CSV with a header row, so the field name of each column is immediately visible:
// save
val saveoptions = Map("header" -> "true", "delimiter" -> "\t", "path" -> "hdfs://hadoop102:9000/test")
datawDF.write.format("com.atguigu.spark.csv").mode(SaveMode.Overwrite).options(saveoptions).save()

// read
val options = Map("header" -> "true", "delimiter" -> "\t", "path" -> "hdfs://hadoop102:9000/test")
val datarDF = spark.read.options(options).format("com.atguigu.spark.csv").load()

With this save format it is easy to keep the correspondence between field names and columns, and the delimiter can be chosen freely.
3. Dataset:
1) Dataset and DataFrame have exactly the same member functions; the only difference is the type of each row.
2) A DataFrame can also be seen as Dataset[Row]. Each row has type Row, and without parsing it you cannot know which fields the row contains or what their types are; specific fields can only be extracted with the getAs method mentioned above or with pattern matching. In a Dataset, by contrast, the row type is not fixed: after defining a case class you can access the information in each row freely:
case class Coltest(col1: String, col2: Int) extends Serializable   // define field names and types

/*
 rdd
 ("a", 1)
 ("b", 1)
 ("a", 1)
*/
val test: Dataset[Coltest] = rdd.map { line =>
  Coltest(line._1, line._2)
}.toDS

test.map { line =>
  println(line.col1)
  println(line.col2)
}
As you can see, a Dataset is very convenient when a particular field of a row needs to be accessed. However, when writing highly generic functions, a Dataset is awkward: the row type is not fixed and could be any case class, so the function cannot adapt to it. In that case a DataFrame, i.e. Dataset[Row], solves the problem nicely.
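For instance, a helper written against DataFrame adapts to any schema; a minimal sketch (the function and the commented usage below are illustrative, not from the original notes):

import org.apache.spark.sql.DataFrame

def describeFirstColumn(df: DataFrame): Unit = {
  val firstCol = df.columns.head   // works whatever the schema is
  df.select(firstCol).show(5)
}

// Usage: the same function handles frames derived from any case class.
// describeFirstColumn(peopleDF)
// describeFirstColumn(stockDF)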
Creating a Spark SQL program in IDEA
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object TestSparkSql {
  def main(args: Array[String]): Unit = {
    // create the configuration object
    val conf: SparkConf = new SparkConf().setAppName("SQL").setMaster("local[*]")
    // create the session object
    val sparkSession: SparkSession = SparkSession.builder().config(conf).getOrCreate()
    // import the implicit conversions
    import sparkSession.implicits._

    // perform operations
    // TODO create a DataFrame
    val df: DataFrame = sparkSession.read.json("input/input.json")
    df.createTempView("user")
    sparkSession.sql("select * from user")
    //df.show()

    // TODO create a Dataset
    val ds: Dataset[Employ] = Seq(Employ("jing", 18)).toDS()
    //ds.show()

    // TODO convert the DataFrame to a Dataset
    val dfToDs: Dataset[Employ] = df.as[Employ]
    dfToDs.foreach(x => {
      println(x.name + "\t" + x.age)
    })

    // TODO convert an RDD to a Dataset
    val rdd: RDD[(String, Int)] = sparkSession.sparkContext.makeRDD(Array(("aa", 19)))
    val employRdd: RDD[Employ] = rdd.map { case (name, age) => Employ(name, age) }
    //employRdd.toDS().show()

    // TODO convert an RDD to a DataFrame
    //rdd.toDF().show()
    val rddToDf: DataFrame = sparkSession.sparkContext.makeRDD(Array(("kris", 18))).toDF("username", "age")

    // TODO convert the DataFrame to an RDD[Row]
    df.rdd.foreach(row => {
      println(row.getLong(0) + "," + row.getString(1))
    })

    // TODO convert the Dataset to an RDD[type]
    val dsToRdd: RDD[Employ] = df.as[Employ].rdd

    sparkSession.stop()
  }
}

case class Employ(name: String, age: BigInt)
User-defined functions (UDF)
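A minimal sketch of registering a UDF with spark.udf.register, assuming the people.json data from earlier; the addName function called in the Hive example further below is presumably registered in roughly this way (the "Name:" prefix is just a placeholder):

val df = spark.read.json("examples/src/main/resources/people.json")
df.createOrReplaceTempView("user")

// register a function that can then be used from SQL
spark.udf.register("addName", (name: String) => "Name:" + name)

spark.sql("select age, addName(name) from user").show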
Loading and saving data with Spark SQL
Generic load/save methods: load and save
With the generic read/write methods, Spark SQL only handles Parquet files by default; to read or write another file type, .format(...) must be added.
Add format("json") when reading; writing works the same way.

scala> val df = spark.read.load("/opt/module/spark/spark-local/examples/src/main/resources/users.parquet")
scala> df.show

scala> df.select("name", "color").write.save("user.parquet")   // save the data

// reading JSON data with the plain load fails:
java.lang.RuntimeException: file:/opt/module/spark/spark-local/examples/src/main/resources/people.json is not a Parquet file.

scala> spark.read.format("json").load("/opt/module/spark/spark-local/examples/src/main/resources/people.json").show

df.write.format("json").save("/..")
df.write.format("json").mode("overwrite").save("/..json")
MySQL
Spark SQL can create a DataFrame by reading data from a relational database over JDBC. After a series of computations on the DataFrame, the data can also be written back to the relational database.
Either specify the database driver path when starting the shell, or put the driver jar on Spark's classpath.
[kris@hadoop101 jars]$ cp /opt/software/mysql-connector-java-5.1.27/mysql-connector-java-5.1.27-bin.jar ./

scala> val connectionProperties = new java.util.Properties()
connectionProperties: java.util.Properties = {}

scala> connectionProperties.put("user", "root")
res0: Object = null

scala> connectionProperties.put("password", "123456")
res1: Object = null

scala> val jdbcDF2 = spark.read.jdbc("jdbc:mysql://hadoop101:3306/rdd", "test", connectionProperties)
jdbcDF2: org.apache.spark.sql.DataFrame = [id: int, name: string]

scala> jdbcDF2.show
+---+-------+
| id|   name|
+---+-------+
|  1| Google|
|  2|  Baidu|
|  3|    Ali|
|  4|Tencent|
|  5| Amazon|
+---+-------+

jdbcDF2.write.mode("append").jdbc("jdbc:mysql://hadoop101:3306/rdd", "test", connectionProperties)

scala> val rdd = sc.makeRDD(Array((6, "FaceBook")))
rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[4] at makeRDD at :24

scala> rdd.toDF("id", "name")
res5: org.apache.spark.sql.DataFrame = [id: int, name: string]

scala> val df = rdd.toDF("id", "name")
df: org.apache.spark.sql.DataFrame = [id: int, name: string]

scala> df.show
+---+--------+
| id|    name|
+---+--------+
|  6|FaceBook|
+---+--------+

scala> df.write.mode("append").jdbc("jdbc:mysql://hadoop101:3306/rdd", "test", connectionProperties)

scala> jdbcDF2.show
+---+--------+
| id|    name|
+---+--------+
|  1|  Google|
|  2|   Baidu|
|  3|     Ali|
|  4| Tencent|
|  5|  Amazon|
|  6|FaceBook|
+---+--------+
Hive
Apache Hive is the SQL engine on Hadoop. Spark SQL can be compiled with or without Hive support. With Hive support, Spark SQL can access Hive tables, use UDFs (user-defined functions), and run the Hive query language (HQL). spark-shell enables Hive support by default; in application code it is off by default and must be enabled explicitly (a single extra call).
To use the embedded Hive, nothing special is needed; it just works.
The warehouse location can be changed with the parameter: --conf spark.sql.warehouse.dir=./wear
scala> spark.sql("create table emp(name String, age Int)").show19/04/11 01:10:17 WARN HiveMetaStore: Location: file:/opt/module/spark/spark-local/spark-warehouse/emp specified for non-external table:empscala> spark.sql("load data local inpath '/opt/module/spark/spark-local/examples/src/main/resources/people.txt' into table emp").showscala> spark.sql("show tables").show+--------+---------+-----------+|database|tableName|isTemporary|+--------+---------+-----------+| default| emp| false|+--------+---------+-----------+scala> spark.sql("select * from emp").show/opt/module/spark/spark-local/spark-warehouse/emp[kris@hadoop101 emp]$ ll-rwxr-xr-x. 1 kris kris 32 4月 11 01:10 people.txt
Using an external Hive
[kris@hadoop101 spark-local]$ rm -rf metastore_db/ spark-warehouse/
[kris@hadoop101 conf]$ cp hive-site.xml /opt/module/spark/spark-local/conf/
[kris@hadoop101 spark-local]$ bin/spark-shell

scala> spark.sql("show tables").show
+--------+--------------------+-----------+
|database|           tableName|isTemporary|
+--------+--------------------+-----------+
| default|            bigtable|      false|
| default|            business|      false|
| default|                dept|      false|
| default|      dept_partition|      false|
| default|     dept_partition2|      false|
| default|     dept_partitions|      false|
| default|                 emp|      false|
...

[kris@hadoop101 spark-local]$ bin/spark-sql
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
spark-sql (default)> show tables;
Operating on Hive from code
val sparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()   // enableHiveSupport() turns on Hive support
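Put together, a minimal Hive-enabled application might look like the sketch below (object name and query are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object HiveDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HiveDemo").setMaster("local[*]")
    val spark = SparkSession.builder()
      .config(conf)
      .enableHiveSupport()            // the extra call mentioned above
      .getOrCreate()

    spark.sql("show tables").show()   // runs against the configured Hive metastore
    spark.stop()
  }
}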
A companion object is the equivalent of static members; they can be accessed directly through the class name.
A class can be given an alias, much like a property, with type ...

spark.sql("select age, addName(name) from user").show

scala> case class tbStock(ordernumber:String, locationid:String, dateid:String) extends Serializable

scala> val tbStockRdd = spark.sparkContext.textFile("/opt/module/datas/sparkData/tbStock.txt")
tbStockRdd: org.apache.spark.rdd.RDD[String] = /opt/module/datas/sparkData/tbStock.txt MapPartitionsRDD[30] at textFile at :23

scala> val tbStockDS = tbStockRdd.map(_.split("\t")).map(x => tbStock(x(0), x(1), x(2))).toDS
tbStockDS: org.apache.spark.sql.Dataset[tbStock] = [ordernumber: string, locationid: string ... 1 more field]

scala> tbStockDS.show
+-----------+----------+----------+
|ordernumber|locationid|    dateid|
+-----------+----------+----------+
|      lj111|        jd| 2018-3-13|
|      lj112|        jd| 2018-2-13|
|      lj113|        jd| 2019-1-13|
|      lj114|        jd| 2019-3-13|
|      lj115|        jd| 2018-9-13|
|      lj116|        jd|2018-11-13|
|      lj117|        jd|2017-12-13|
|      lj118|        jd| 2017-5-13|
+-----------+----------+----------+

scala> case class tbStockDetail(ordernumber:String, rownum:Int, itemid:String, number:Int, price:Double, amount:Double) extends Serializable
defined class tbStockDetail

scala> val tbStockDetailRdd = spark.sparkContext.textFile("/opt/module/datas/sparkData/tbStockDetail.txt")
tbStockDetailRdd: org.apache.spark.rdd.RDD[String] = /opt/module/datas/sparkData/tbStockDetail.txt MapPartitionsRDD[43] at textFile at :23

scala> val tbStockDetailDS = tbStockDetailRdd.map(_.split("\t")).map(x => tbStockDetail(x(0), x(1).trim().toInt, x(2), x(3).trim().toInt, x(4).trim().toDouble, x(5).trim().toDouble)).toDS
tbStockDetailDS: org.apache.spark.sql.Dataset[tbStockDetail] = [ordernumber: string, rownum: int ... 4 more fields]

scala> tbStockDetailDS.show
+-----------+------+------+------+-----+------+
|ordernumber|rownum|itemid|number|price|amount|
+-----------+------+------+------+-----+------+
|      lj111|    12|item11|    10|100.0| 300.0|
|      lj112|    12|item12|    10|100.0| 200.0|
|      lj113|    12|item13|    10|100.0| 300.0|
|      lj114|    12|item14|    10|100.0| 100.0|
|      lj115|    12|item15|    10|100.0| 300.0|
|      lj116|    12|item16|    10|100.0| 700.0|
|      lj117|    12|item17|    10|100.0| 600.0|
|      lj118|    12|item18|    10|100.0| 500.0|
+-----------+------+------+------+-----+------+
Using tbstock, tbstockdetail (amount), and tbdate, compute the number of orders and total sales per year across all orders. After joining the three tables, the order count is taken per order number and the total sales with sum(amount):

select theyear, count(tbstock.ordernumber), sum(tbstockdetail.amount)
from tbstock
join tbstockdetail on tbstock.ordernumber = tbstockdetail.ordernumber
join tbdate on tbdate.dateid = tbstock.dateid
group by tbdate.theyear
order by tbdate.theyear;

Compute the sales amount of the largest order in each year. First compute the total sales of each order:

select a.dateid, a.ordernumber, sum(b.amount) sumAmount
from tbstock a
join tbstockdetail b on a.ordernumber = b.ordernumber
group by a.dateid, a.ordernumber

select theyear, max(c.sumAmount) sumOfAmount
from tbdate
join (select a.dateid, a.ordernumber, sum(b.amount) sumAmount
      from tbstock a
      join tbstockdetail b on a.ordernumber = b.ordernumber
      group by a.dateid, a.ordernumber) c
on tbdate.dateid = c.dateid
group by tbdate.theyear
order by tbdate.theyear desc

Compute the best-selling item of each year across all orders. Goal: for each year, find the item whose sales amount (amount) is the highest that year; that item is the best seller.

1. Compute the sales amount of each item in each year (year: tbdate.theyear, item: tbstockdetail.itemid):

select tbdate.theyear, tbstockdetail.itemid, sum(tbstockdetail.amount) sumAmount
from tbdate
join tbstock on tbdate.dateid = tbstock.dateid
join tbstockdetail on tbstockdetail.ordernumber = tbstock.ordernumber
group by tbdate.theyear, tbstockdetail.itemid

2. Based on step 1, find the maximum amount among all items for each year:

select aa.theyear, max(sumAmount) maxAmount
from (select tbdate.theyear, tbstockdetail.itemid, sum(tbstockdetail.amount) sumAmount
      from tbdate
      join tbstock on tbdate.dateid = tbstock.dateid
      join tbstockdetail on tbstockdetail.ordernumber = tbstock.ordernumber
      group by tbdate.theyear, tbstockdetail.itemid) aa
group by aa.theyear

3. Join the per-item yearly sales with the per-year maximum amount (joining on both the amount and the year) to obtain the row describing the best-selling item:

select distinct e.theyear, e.itemid, f.maxAmount
from (select tbdate.theyear, tbstockdetail.itemid, sum(tbstockdetail.amount) sumAmount
      from tbdate
      join tbstock on tbdate.dateid = tbstock.dateid
      join tbstockdetail on tbstockdetail.ordernumber = tbstock.ordernumber
      group by tbdate.theyear, tbstockdetail.itemid) e
join (select aa.theyear, max(sumAmount) maxAmount
      from (select tbdate.theyear, tbstockdetail.itemid, sum(tbstockdetail.amount) sumAmount
            from tbdate
            join tbstock on tbdate.dateid = tbstock.dateid
            join tbstockdetail on tbstockdetail.ordernumber = tbstock.ordernumber
            group by tbdate.theyear, tbstockdetail.itemid) aa
      group by aa.theyear) f
on e.theyear = f.theyear and e.sumAmount = f.maxAmount
order by e.theyear
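For comparison, the first query can also be written with the DataFrame/Dataset API; a sketch assuming the tbStockDS and tbStockDetailDS Datasets built above plus an analogous tbDateDS with columns dateid and theyear:

import org.apache.spark.sql.functions._

val yearlySales = tbStockDS
  .join(tbStockDetailDS, "ordernumber")   // attach the detail lines to each order
  .join(tbDateDS, "dateid")               // attach the year via the date table
  .groupBy("theyear")
  .agg(count("ordernumber").as("orderCount"), sum("amount").as("totalAmount"))
  .orderBy("theyear")

yearlySales.show()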