Hadoop
May 8
Spark SQL: Ten Things You Need to Know (notes on a slide deck by an overseas expert)
Spark SQL 10 Things You Need to Know
The slide deck covers Spark SQL usage and caveats from ten angles:
1. Spark SQL use cases
2. Loading data: cloud vs. local, RDDs vs. DataFrames
3. SQL vs. the DataFrame API, and how they differ
4. Schemas: implicit vs. explicit schemas, and data types
5. Loading data and saving results
6. When to use SQL, and when SQL is not a good fit
7. ETL with SQL
8. Working with JSON data
9. Reading from and writing to external databases
10. Testing your SQL against real-world conditions

Link: https://pan.baidu.com/s/15vI6c7une1is-T1CTFOuTQ Password: 7fcu
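Points 2-4 lend themselves to a short example. Below is a minimal sketch assuming Spark 2.x's Java API; "people.json" and its name/age fields are hypothetical inputs, not something from the deck.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SparkSqlQuickLook {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("SparkSqlQuickLook")
                    .master("local[*]")
                    .getOrCreate();

            // Load JSON into a DataFrame; the schema is inferred (implicit)
            Dataset<Row> people = spark.read().json("people.json");
            people.printSchema();

            // The same question asked two ways: SQL vs. the DataFrame API
            people.createOrReplaceTempView("people");
            spark.sql("SELECT name FROM people WHERE age > 21").show();
            people.filter(people.col("age").gt(21)).select("name").show();

            spark.stop();
        }
    }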
May 6
We usually connect to Linux servers with SecureCRT or Xshell. In spark-shell, if you mistype and try to correct it, you find you cannot backspace; instead of overwriting, the terminal just appends what you type, which makes it impossible to write any code. Extremely inconvenient.
If you are using SecureCRT, the fix is:
1. Open Options --> Session Options.
2. Under Terminal -> Emulation, select Linux as the terminal.
3. Under Mapped Keys, check both options under Other mappings: "Backspace sends delete" and "Delete sends backspace".
4. That completes the fix. One more optional tweak: if the session sits idle for a long time the connection drops, and you have to wait for the timeout the next time you type, which is also disruptive. Under the Terminal options, in Anti-idle, check "Send string" with \n every 300 seconds.

how-to-solve-spack-shell-can-not-delete-char
Sep 13
Mahout and Hadoop version correspondence
Which Mahout versions work with which Hadoop versions
Is there a version-compatibility issue when installing Mahout alongside Hadoop? For example, which Mahout versions run on Hadoop 1.x, and which can be used with Hadoop 2.x?

The Mahout 0.9 binaries downloadable from the official site are all compiled against Hadoop 1.x; run them directly on Hadoop 2.x and they throw exceptions.
(0.9 is built against hadoop-core-1.2.1.jar.)
The fix is to download the latest source and build it in Hadoop 2.x-compatible mode. The steps:
1. Clone the latest Mahout source to your machine with git:
    git clone https://github.com/apache/mahout.git
2. Once the source is downloaded, build it with mvn. Be sure to pass the hadoop2.version=2.4.1 property so the resulting Mahout is compatible with Hadoop 2.4.1; any 2.x version number works here.
    mvn -Dhadoop2.version=2.4.1 -DskipTests clean install

The official build instructions:
https://mahout.apache.org/developers/buildingmahout.html

There are two ways to install and configure Mahout:
First, download the source (either directly or via svn) and build it with Maven.
Second, download the binary distribution and unpack it; inside there is a lib/hadoop folder where you can check the bundled Hadoop version. Here I installed by unpacking the mahout-distribution-0.9.tar.gz binary package.

mahout-and-hadoop-map
----end

Jul 9
The log of a Hadoop WordCount run

[hadoop@itlife365 ~]$ hadoop jar wcountljs.jar com.itlife365.bigdata.hadoop.mr.wordcount.WordCountRunner /user/hadoop/input/
17/06/19 21:31:12 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
17/06/19 21:31:13 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
17/06/19 21:31:17 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/06/19 21:31:23 INFO input.FileInputFormat: Total input paths to process : 1
17/06/19 21:31:24 INFO mapreduce.JobSubmitter: number of splits:1
17/06/19 21:31:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1336611296_0001
17/06/19 21:31:39 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
17/06/19 21:31:39 INFO mapreduce.Job: Running job: job_local1336611296_0001
17/06/19 21:31:39 INFO mapred.LocalJobRunner: OutputCommitter set in config null
17/06/19 21:31:40 INFO mapreduce.Job: Job job_local1336611296_0001 running in uber mode : false
17/06/19 21:31:40 INFO mapreduce.Job:  map 0% reduce 0%
17/06/19 21:31:41 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
17/06/19 21:31:45 INFO mapred.LocalJobRunner: Waiting for map tasks
17/06/19 21:31:45 INFO mapred.LocalJobRunner: Starting task: attempt_local1336611296_0001_m_000000_0
17/06/19 21:31:49 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
17/06/19 21:31:49 INFO mapred.MapTask: Processing split: hdfs://itlife365:9000/user/hadoop/input/wordcountdemo.txt:0+155
17/06/19 21:32:21 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
17/06/19 21:32:21 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
17/06/19 21:32:21 INFO mapred.MapTask: soft limit at 83886080
17/06/19 21:32:21 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
17/06/19 21:32:21 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
17/06/19 21:32:21 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
17/06/19 21:32:32 INFO mapred.LocalJobRunner:
17/06/19 21:32:32 INFO mapred.MapTask: Starting flush of map output
17/06/19 21:32:32 INFO mapred.MapTask: Spilling map output
17/06/19 21:32:32 INFO mapred.MapTask: bufstart = 0; bufend = 402; bufvoid = 104857600
17/06/19 21:32:32 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214276(104857104); length = 121/6553600
17/06/19 21:32:34 INFO mapred.MapTask: Finished spill 0
17/06/19 21:32:35 INFO mapred.Task: Task:attempt_local1336611296_0001_m_000000_0 is done. And is in the process of committing
17/06/19 21:32:37 INFO mapred.LocalJobRunner: map
17/06/19 21:32:37 INFO mapred.LocalJobRunner: map
17/06/19 21:32:37 INFO mapred.Task: Task 'attempt_local1336611296_0001_m_000000_0' done.
17/06/19 21:32:37 INFO mapred.LocalJobRunner: Finishing task: attempt_local1336611296_0001_m_000000_0
17/06/19 21:32:37 INFO mapred.LocalJobRunner: map task executor complete.
17/06/19 21:32:37 INFO mapreduce.Job:  map 100% reduce 0%
17/06/19 21:32:38 INFO mapred.LocalJobRunner: Waiting for reduce tasks
17/06/19 21:32:38 INFO mapred.LocalJobRunner: Starting task: attempt_local1336611296_0001_r_000000_0
17/06/19 21:32:39 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
17/06/19 21:32:39 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@39882
17/06/19 21:32:42 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=363285696, maxSingleShuffleLimit=90821424, mergeThreshold=239768576, ioSortFactor=10, memToMemMergeOutputsThreshold=10
17/06/19 21:32:43 INFO reduce.EventFetcher: attempt_local1336611296_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
17/06/19 21:32:47 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1336611296_0001_m_000000_0 decomp: 466 len: 470 to MEMORY
17/06/19 21:32:48 INFO reduce.InMemoryMapOutput: Read 466 bytes from map-output for attempt_local1336611296_0001_m_000000_0
17/06/19 21:32:48 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 466, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->466
17/06/19 21:32:48 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
17/06/19 21:32:49 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/06/19 21:32:49 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
17/06/19 21:32:50 INFO mapred.Merger: Merging 1 sorted segments
17/06/19 21:32:50 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 463 bytes
17/06/19 21:32:50 INFO reduce.MergeManagerImpl: Merged 1 segments, 466 bytes to disk to satisfy reduce memory limit
17/06/19 21:32:50 INFO reduce.MergeManagerImpl: Merging 1 files, 470 bytes from disk
17/06/19 21:32:50 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
17/06/19 21:32:50 INFO mapred.Merger: Merging 1 sorted segments
17/06/19 21:32:50 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 463 bytes
17/06/19 21:32:50 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/06/19 21:32:53 INFO mapred.LocalJobRunner: reduce > reduce
17/06/19 21:32:53 INFO mapreduce.Job:  map 100% reduce 67%
17/06/19 21:32:56 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
17/06/19 21:32:56 INFO mapred.LocalJobRunner: reduce > reduce
17/06/19 21:33:02 INFO mapred.LocalJobRunner: reduce > reduce
17/06/19 21:33:02 INFO mapreduce.Job:  map 100% reduce 100%
17/06/19 21:33:06 INFO mapred.Task: Task:attempt_local1336611296_0001_r_000000_0 is done. And is in the process of committing
17/06/19 21:33:07 INFO mapred.LocalJobRunner: reduce > reduce
17/06/19 21:33:07 INFO mapred.Task: Task attempt_local1336611296_0001_r_000000_0 is allowed to commit now
17/06/19 21:33:08 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1336611296_0001_r_000000_0' to hdfs://itlife365:9000/user/hadoop/output/_temporary/0/task_local1336611296_0001_r_000000
17/06/19 21:33:08 INFO mapred.LocalJobRunner: reduce > reduce
17/06/19 21:33:08 INFO mapred.Task: Task 'attempt_local1336611296_0001_r_000000_0' done.
17/06/19 21:33:08 INFO mapred.LocalJobRunner: Finishing task: attempt_local1336611296_0001_r_000000_0
17/06/19 21:33:08 INFO mapred.LocalJobRunner: reduce task executor complete.
17/06/19 21:33:11 INFO mapreduce.Job: Job job_local1336611296_0001 completed successfully
17/06/19 21:33:12 INFO mapreduce.Job: Counters: 38
        File System Counters
                FILE: Number of bytes read=17844
                FILE: Number of bytes written=527618
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=310
                HDFS: Number of bytes written=186
                HDFS: Number of read operations=17
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=4
        Map-Reduce Framework
                Map input records=4
                Map output records=31
                Map output bytes=402
                Map output materialized bytes=470
                Input split bytes=122
                Combine input records=0
                Combine output records=0
                Reduce input groups=24
                Reduce shuffle bytes=470
                Reduce input records=31
                Reduce output records=24
                Spilled Records=62
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=1577
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=242360320
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=155
        File Output Format Counters
                Bytes Written=186
[hadoop@itlife365 ~]$

[hadoop@itlife365 ~]$ hadoop fs -cat /user/hadoop/input/*.txt
hello everyoen:
 this is a test of hadoop wordcount simple,Please feel easy!
 and is a demo will more beautiful in future.
 and is a good begin .. thinks
[hadoop@itlife365 ~]$

[hadoop@itlife365 ~]$ hadoop fs -cat /user/hadoop/output/part-r-00000
        3
..      1
a       3
and     2
beautiful       1
begin   1
demo    1
easy!   1
everyoen:       1
feel    1
future. 1
good    1
hadoop  1
hello   1
in      1
is      3
more    1
of      1
simple,Please   1
test    1
thinks  1
this    1
will    1
wordcount       1
[hadoop@itlife365 ~]$
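The post does not show the WordCountRunner source. For reference, here is a minimal sketch of the kind of job that produces a log and output like the above, written against the Hadoop 2.x mapreduce API; the class names are hypothetical stand-ins. Splitting on whitespace maps the leading space of a line to an empty-string token, which would explain the bare count of 3 on the first line of part-r-00000; no combiner is set, consistent with "Combine input records=0" in the counters.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountSketch {

        // Emit (token, 1) for every whitespace-separated token in a line
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }

        // Sum the counts collected for each token
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                ctx.write(word, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "wordcount");
            job.setJarByClass(WordCountSketch.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }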
Jul 9
Hadoop MapReduce: handling org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs... when the output directory already exists
[hadoop@itlife365 ~]$ hadoop jar wcountljs.jar com.itlife365.bigdata.hadoop.mr.wordcount.WordCountRunner
17/06/19 20:30:56 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
17/06/19 20:30:57 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://itlife365:9000/user/hadoop/output already exists
        at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
        at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:562)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1314)
        at com.itlife365.bigdata.hadoop.mr.wordcount.WordCountRunner.main(WordCountRunner.java:65)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Method 1: go into HDFS and delete the output directory by hand before each run.
Method 2: add a check in the code that deletes the directory if it exists.
In fact you can solve this in code with HDFS file operations. The idea: before the job runs, that is, before it is submitted, check whether the output folder exists and delete it if it does. The key code (this goes in the driver, before job submission):
  // Check whether the output folder already exists; delete it if so
  //Path path = new Path(otherArgs[1]); // argument 1 is the output directory (argument 0 is the input directory)
  Path path = new Path("hdfs://itlife365:9000/user/hadoop/output");
  FileSystem fileSystem = path.getFileSystem(conf); // resolve the FileSystem that owns this path
  if (fileSystem.exists(path)) {
      fileSystem.delete(path, true); // true: delete recursively, even if output has contents
  }

  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
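The same guard, packaged as a self-contained helper with its imports (a sketch; conf is the driver's Configuration, as in the fragment above):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public final class OutputDirs {
        private OutputDirs() {}

        // Delete an output directory if it already exists, so that
        // FileOutputFormat.checkOutputSpecs() does not reject the job.
        public static void deleteIfExists(Configuration conf, String dir)
                throws IOException {
            Path path = new Path(dir);
            FileSystem fs = path.getFileSystem(conf);
            if (fs.exists(path)) {
                fs.delete(path, true); // true = recursive delete
            }
        }
    }

Call OutputDirs.deleteIfExists(conf, args[1]) before job.waitForCompletion(true). Bear in mind that auto-deleting output silently discards the previous run's results; Hadoop's fail-fast default exists precisely to prevent that.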

-- Repackage the jar and the problem is solved. hadoop FileAlreadyExistsException Output