<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Spark-Notes on boboboker~</title>
        <link>https://blog.mxtao.top/tags/spark-notes/</link>
        <description>Recent content in Spark-Notes on boboboker~</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>zh-cn</language>
        <copyright>All rights reserved.</copyright>
        <lastBuildDate>Thu, 29 Apr 2021 20:30:00 +0800</lastBuildDate><atom:link href="https://blog.mxtao.top/tags/spark-notes/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>Miscellaneous Spark Notes</title>
        <link>https://blog.mxtao.top/posts/platform/spark/spark-notes/</link>
        <pubDate>Sat, 04 Jul 2020 00:00:00 +0800</pubDate>
        
        <guid>https://blog.mxtao.top/posts/platform/spark/spark-notes/</guid>
        <description>&lt;h1 id=&#34;spark-相关内容随记&#34;&gt;Miscellaneous Spark Notes
&lt;/h1&gt;&lt;p&gt;A running log of Spark-related problems, observations, and thoughts.&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://cloud.tencent.com/developer/article/1038770&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Spark SQL在100TB上的自适应执行实践&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;http://spark.apache.org/docs/latest/sql-ref-functions-udf-aggregate.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;User Defined Aggregate Functions (UDAFs)&lt;/a&gt;&lt;/p&gt;
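&lt;p&gt;The linked page documents the &lt;code&gt;Aggregator&lt;/code&gt;-based UDAF API that Spark 3.0 recommends over the older &lt;code&gt;UserDefinedAggregateFunction&lt;/code&gt;. A minimal sketch, assuming an existing &lt;code&gt;SparkSession&lt;/code&gt; named &lt;code&gt;spark&lt;/code&gt;; &lt;code&gt;MyAverage&lt;/code&gt; and the column/table names in the query are illustrative only:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-scala&#34;&gt;import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions

// Mutable buffer carried between rows: running sum and row count
case class Avg(var sum: Double, var count: Long)

object MyAverage extends Aggregator[Double, Avg, Double] {
  def zero: Avg = Avg(0.0, 0L)                              // empty buffer
  def reduce(b: Avg, a: Double): Avg = { b.sum += a; b.count += 1; b }
  def merge(b1: Avg, b2: Avg): Avg = { b1.sum += b2.sum; b1.count += b2.count; b1 }
  def finish(r: Avg): Double = if (r.count == 0L) Double.NaN else r.sum / r.count
  def bufferEncoder: Encoder[Avg] = Encoders.product
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// functions.udaf (Spark 3.0+) wraps the typed Aggregator as an untyped UDAF
spark.udf.register(&#34;my_avg&#34;, functions.udaf(MyAverage))
// spark.sql(&#34;SELECT my_avg(value) FROM some_table&#34;).show()
&lt;/code&gt;&lt;/pre&gt;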
&lt;h2 id=&#34;spark-sql---datasource&#34;&gt;Spark SQL - DataSource
&lt;/h2&gt;&lt;p&gt;A custom data source is added to Spark by implementing the DataSource interfaces that Spark defines; a minimal sketch follows the reference links below.&lt;/p&gt;
&lt;p&gt;The data source API currently exists in V1 and V2 versions; &lt;del&gt;as of &lt;a class=&#34;link&#34; href=&#34;https://spark.apache.org/releases/spark-release-3-0-0.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;&lt;em&gt;Spark 3.0.0&lt;/em&gt;&lt;/a&gt; the evolution still seemed unfinished&lt;/del&gt;, the V2 refactoring was completed in the 3.0.0 release.&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://issues.apache.org/jira/browse/SPARK-25390&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Data source V2 API refactoring&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The V2 API is expected to be stabilized in the 3.2.0 release.&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://issues.apache.org/jira/browse/SPARK-25186&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Stabilize Data Source V2 API&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-data-source-api-v2.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-data-source-api-v2.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-DataSourceV2.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-DataSourceV2.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-DataSource.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-DataSource.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;http://blog.madhukaraphatak.com/categories/datasource-v2-series/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Category: datasource-v2-series&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;http://blog.madhukaraphatak.com/categories/datasource-v2-spark-three/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Category: datasource-v2-spark-three&lt;/a&gt;&lt;/p&gt;
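&lt;p&gt;A minimal, hypothetical sketch of a read-only V2 source against the refactored &lt;code&gt;org.apache.spark.sql.connector&lt;/code&gt; interfaces in Spark 3.0.x. All the &lt;code&gt;Demo*&lt;/code&gt; names, the package, and the hard-coded two-row data are made up for illustration, not an official example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-scala&#34;&gt;package com.example.datasource

import java.util

import scala.collection.JavaConverters._

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read._
import org.apache.spark.sql.types._
import org.apache.spark.sql.util.CaseInsensitiveStringMap
import org.apache.spark.unsafe.types.UTF8String

// Entry point: reachable via spark.read.format(fully-qualified class name)
class DemoSource extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    StructType(Seq(StructField(&#34;id&#34;, LongType), StructField(&#34;name&#34;, StringType)))

  override def getTable(schema: StructType, partitioning: Array[Transform],
                        properties: util.Map[String, String]): Table =
    new DemoTable(schema)
}

class DemoTable(tableSchema: StructType) extends Table with SupportsRead {
  override def name(): String = &#34;demo&#34;
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] =
    Set(TableCapability.BATCH_READ).asJava

  // ScanBuilder, Scan and Batch collapsed into one object for brevity
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    new ScanBuilder with Scan with Batch {
      override def build(): Scan = this
      override def readSchema(): StructType = tableSchema
      override def toBatch(): Batch = this
      // A single partition; a real source would split by file/block/range
      override def planInputPartitions(): Array[InputPartition] =
        Array(new InputPartition {})
      override def createReaderFactory(): PartitionReaderFactory = new DemoReaderFactory
    }
}

// Shipped to executors, so it must stay serializable (the interface already is)
class DemoReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
    new PartitionReader[InternalRow] {
      private val rows = Iterator(
        InternalRow(1L, UTF8String.fromString(&#34;a&#34;)),
        InternalRow(2L, UTF8String.fromString(&#34;b&#34;)))
      override def next(): Boolean = rows.hasNext
      override def get(): InternalRow = rows.next()
      override def close(): Unit = ()
    }
}

// Usage: spark.read.format(&#34;com.example.datasource.DemoSource&#34;).load().show()
&lt;/code&gt;&lt;/pre&gt;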
&lt;h2 id=&#34;spark-sql---csv&#34;&gt;Spark SQL - CSV
&lt;/h2&gt;&lt;p&gt;For various reasons, Spark SQL can fail to parse the data in CSV files correctly.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The example below occurred in a production environment running Hadoop 2.6.0 / Spark 2.1.1 / Scala 2.10.6 / JDK 1.7; how newer Spark versions behave is unknown. That Spark build had been heavily patched and its source is unavailable, and the offline environment only has Spark 2.4.4 / Scala 2.11; a look at that source shows the relevant code has been refactored, and the classes that threw the exceptions no longer exist.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For example, some fields contained special characters, which caused Spark SQL to truncate fields while parsing rows; columns then shifted out of alignment, some conversion functions failed outright, and the whole job failed.&lt;/p&gt;
&lt;p&gt;The fix was to force &lt;code&gt;mode=DROPMALFORMED&lt;/code&gt;, which simply drops the problematic rows. This is a configuration Spark SQL supports out of the box; it was probably in the documentation all along and simply got overlooked.&lt;/p&gt;
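&lt;p&gt;A minimal sketch, assuming an existing &lt;code&gt;SparkSession&lt;/code&gt; named &lt;code&gt;spark&lt;/code&gt;; the schema and input path are placeholders:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-scala&#34;&gt;// PERMISSIVE (the default) nulls out fields it cannot parse,
// DROPMALFORMED discards malformed rows, FAILFAST aborts on the first bad row.
val df = spark.read
  .option(&#34;header&#34;, &#34;true&#34;)
  .option(&#34;mode&#34;, &#34;DROPMALFORMED&#34;)
  .schema(&#34;id LONG, name STRING, amount DOUBLE&#34;)  // explicit schema so bad rows are detectable
  .csv(&#34;/data/input/*.csv&#34;)
&lt;/code&gt;&lt;/pre&gt;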
&lt;p&gt;The Spark documentation describes the supported CSV options in detail.&lt;/p&gt;
&lt;p&gt;Latest reference documentation: &lt;a class=&#34;link&#34; href=&#34;https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameReader.html#csv%28paths:String*%29:org.apache.spark.sql.DataFrame&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DataFrameReader#csv&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Spark 2.4.6 reference documentation: &lt;a class=&#34;link&#34; href=&#34;https://spark.apache.org/docs/2.4.6/api/scala/index.html#org.apache.spark.sql.DataFrameReader@csv%28paths:String*%29:org.apache.spark.sql.DataFrame&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;DataFrameReader#csv&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;spark-cli&#34;&gt;Spark CLI
&lt;/h2&gt;&lt;p&gt;Time to move away from the inflexible home-grown job scheduling service and get used to submitting jobs with the native CLI.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;spark-submit --name JOB-NAME --master yarn --deploy-mode cluster --conf spark.yarn.submit.waitAppCompletion=false --class com.mxtao.App --jars /xxx/xxx.jar,/xxx/xxxx.jar --queue xx --principal xxx@DOMAIN --keytab xxx.keytab main-class-in-this-jar.jar args-for-main&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://spark.apache.org/docs/latest/submitting-applications.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Submitting Applications&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Running Spark on YARN - Spark Properties&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://spark.apache.org/docs/2.4.6/submitting-applications.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Submitting Applications&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a class=&#34;link&#34; href=&#34;https://spark.apache.org/docs/2.4.6/running-on-yarn.html#spark-properties&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Running Spark on YARN - Spark Properties&lt;/a&gt;&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
