DataX概述

DataX是一个在异构的数据库/文件系统之间高速交换数据的工具,实现了在任意的数据处理系统(RDBMS/Hdfs/Local filesystem)之间的数据交换。

目前成熟的数据导入导出工具比较多,但是一般都只能用于数据导入或者导出,并且只能支持一个或者几个特定类型的数据库。
这样带来的一个问题是,如果我们拥有很多不同类型的数据库/文件系统(Mysql/Oracle/Rac/Hive/Other…),
并且经常需要在它们之间导入导出数据,那么我们可能需要开发/维护/学习使用一批这样的工具(jdbcdump/dbloader /multithread/getmerge+sqlloader/mysqldumper…)。而且以后每增加一种库类型,我们需要的工具数目将线性增 长。(当我们需要将mysql的数据导入oracle的时候,有没有过想从jdbcdump和dbloader上各掰下来一半拼在一起到冲动?)这些工具 有些使用文件中转数据,有些使用管道,不同程度的为数据中转带来额外开销,效率差别很非常大。
很多工具也无法满足ETL任务中常见的需求,比如日期格式转化,特性字符的转化,编码转换。

另外,有些时候,我们希望在一个很短的时间窗口内,将一份数据从一个数据库同时导出到多个不同类型的数据库。

大数据分析之CentOS7安装部署DataX+Datax-Web集群环境(图1)

DataX环境要求

1、Linux(CentOS)

2、JDK1.8或以上

3、Python(推荐Python2.6.X)

4、Apache Maven 3.x (Compile DataX)

一、下载

1、DataX安装部署方案一:直接下载DataX工具包--DataX下载地址

1.1、DataX安装部署方案二:下载DataX源码,自己编译--DataX源码(不推荐)

2.Datax-Web下载

从Git上直接拉源代码进行编译,在项目的根目录下执行如下命令

git clone https://github.com/WeiYe-Jing/datax-web.git

2.1Datax-Web源码下载

直接下载官方源码包

https://github.com/WeiYe-Jing/datax-web/blob/master/doc/datax-web/datax-web-deploy.md

二、上传

[root@master opt]# rz

-rw-r--r--. 1 root root 829372407 6月  26 08:55 datax.tar.gz
-rw-r--r--. 1 root root 217566120 6月  26 10:59 datax-web-2.1.2.tar.gz

三、解压缩

[root@master opt]# tar xf datax.tar.gz -C /usr/local/big/
[root@master opt]# tar xf datax-web-2.1.2.tar.gz -C /usr/local/big/

[root@master opt]# cd /usr/local/big/datax/

四、DataX安装部署方案一,DataX工具包部署安装步骤

1、自检脚本: python {YOUR_DATAX_HOME}/bin/datax.py {YOUR_DATAX_HOME}/job/job.json

[root@master datax]# python ./bin/datax.py ./job/job.json

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.

2021-06-26 08:59:38.950 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2021-06-26 08:59:38.960 [main] INFO  Engine - the machine info  =>

        osInfo: Oracle Corporation 1.8 25.291-b10
        jvmInfo:        Linux amd64 3.10.0-1160.el7.x86_64
        cpu num:        4

        totalPhysicalMemory:    -0.00G
        freePhysicalMemory:     -0.00G
        maxFileDescriptorCount: -1
        currentOpenFileDescriptorCount: -1

        GC Names        [PS MarkSweep, PS Scavenge]

        MEMORY_NAME                    | allocation_size                | init_size
        PS Eden Space                  | 256.00MB                       | 256.00MB
        Code Cache                     | 240.00MB                       | 2.44MB
        Compressed Class Space         | 1,024.00MB                     | 0.00MB
        PS Survivor Space              | 42.50MB                        | 42.50MB
        PS Old Gen                     | 683.00MB                       | 683.00MB
        Metaspace                      | -0.00MB                        | 0.00MB

2021-06-26 08:59:38.981 [main] INFO  Engine -
{
        "content":[
                {
                        "reader":{
                                "name":"streamreader",
                                "parameter":{
                                        "column":[
                                                {
                                                        "type":"string",
                                                        "value":"DataX"
                                                },
                                                {
                                                        "type":"long",
                                                        "value":19890604
                                                },
                                                {
                                                        "type":"date",
                                                        "value":"1989-06-04 00:00:00"
                                                },
                                                {
                                                        "type":"bool",
                                                        "value":true
                                                },
                                                {
                                                        "type":"bytes",
                                                        "value":"test"
                                                }
                                        ],
                                        "sliceRecordCount":100000
                                }
                        },
                        "writer":{
                                "name":"streamwriter",
                                "parameter":{
                                        "encoding":"UTF-8",
                                        "print":false
                                }
                        }
                }
        ],
        "setting":{
                "errorLimit":{
                        "percentage":0.02,
                        "record":0
                },
                "speed":{
                        "byte":10485760
                }
        }
}

2021-06-26 08:59:39.000 [main] WARN  Engine - prioriy set to 0, because NumberFormatException, the value is: null
2021-06-26 08:59:39.002 [main] INFO  PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2021-06-26 08:59:39.002 [main] INFO  JobContainer - DataX jobContainer starts job.
2021-06-26 08:59:39.005 [main] INFO  JobContainer - Set jobId = 0
2021-06-26 08:59:39.024 [job-0] INFO  JobContainer - jobContainer starts to do prepare ...
2021-06-26 08:59:39.025 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] do prepare work .
2021-06-26 08:59:39.026 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2021-06-26 08:59:39.026 [job-0] INFO  JobContainer - jobContainer starts to do split ...
2021-06-26 08:59:39.027 [job-0] INFO  JobContainer - Job set Max-Byte-Speed to 10485760 bytes.
2021-06-26 08:59:39.028 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] splits to [1] tasks.
2021-06-26 08:59:39.029 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] splits to [1] tasks.
2021-06-26 08:59:39.051 [job-0] INFO  JobContainer - jobContainer starts to do schedule ...
2021-06-26 08:59:39.057 [job-0] INFO  JobContainer - Scheduler starts [1] taskGroups.
2021-06-26 08:59:39.059 [job-0] INFO  JobContainer - Running by standalone Mode.
2021-06-26 08:59:39.069 [taskGroup-0] INFO  TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2021-06-26 08:59:39.073 [taskGroup-0] INFO  Channel - Channel set byte_speed_limit to -1, No bps activated.
2021-06-26 08:59:39.073 [taskGroup-0] INFO  Channel - Channel set record_speed_limit to -1, No tps activated.
2021-06-26 08:59:39.085 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2021-06-26 08:59:39.386 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[303]ms
2021-06-26 08:59:39.387 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] completed it's tasks.
2021-06-26 08:59:49.082 [job-0] INFO  StandAloneJobContainerCommunicator - Total 100000 records, 2600000 bytes | Speed 253.91KB/s, 10000 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.042s |  All Task WaitReaderTime 0.067s | Percentage 100.00%
2021-06-26 08:59:49.082 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
2021-06-26 08:59:49.084 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] do post work.
2021-06-26 08:59:49.085 [job-0] INFO  JobContainer - DataX Reader.Job [streamreader] do post work.
2021-06-26 08:59:49.086 [job-0] INFO  JobContainer - DataX jobId [0] completed successfully.
2021-06-26 08:59:49.088 [job-0] INFO  HookInvoker - No hook invoked, because base dir not exists or is a file: /usr/local/big/datax/hook
2021-06-26 08:59:49.094 [job-0] INFO  JobContainer -
         [total cpu info] =>
                averageCpu                     | maxDeltaCpu                    | minDeltaCpu             
                -1.00%                         | -1.00%                         | -1.00%

         [total gc info] =>
                 NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime
                 PS MarkSweep         | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s
                 PS Scavenge          | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s

2021-06-26 08:59:49.095 [job-0] INFO  JobContainer - PerfTrace not enable!
2021-06-26 08:59:49.097 [job-0] INFO  StandAloneJobContainerCommunicator - Total 100000 records, 2600000 bytes | Speed 253.91KB/s, 10000 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.042s |  All Task WaitReaderTime 0.067s | Percentage 100.00%
2021-06-26 08:59:49.099 [job-0] INFO  JobContainer -
任务启动时刻                    : 2021-06-26 08:59:39
任务结束时刻                    : 2021-06-26 08:59:49
任务总计耗时                    :                 10s
任务平均流量                    :          253.91KB/s
记录写入速度                    :          10000rec/s
读出记录总数                    :              100000
读写失败总数                    :                   0

五、DataX安装部署方案二,DataX源码编译安装部署教程

1、下载DataX源码,可根据上方链接,也可以git拉取

[root@master opt]# git clone git@github.com:alibaba/DataX.git

2、通过maven打包

[root@master opt]# cd  {DataX_source_code_home}
[root@master opt]# mvn -U clean package assembly:assembly -Dmaven.test.skip=true

2.1打包成功输出日志大致

[INFO] BUILD SUCCESS
[INFO] -----------------------------------------------------------------
[INFO] Total time: 08:59 min
[INFO] Finished at: 2021-06-26T08:59:49+08:00
[INFO] Final Memory: 133M/960M
[INFO] -----------------------------------------------------------------

2.2打包成功后的DataX包位于 {DataX_source_code_home}/target/datax/datax/

结构如下

[root@master opt]# cd  {DataX_source_code_home}
[root@master opt]# ls ./target/datax/datax/
bin conf job lib log log_perf plugin script tmp

可以通过命令查看配置模板: python datax.py -r {YOUR_READER} -w {YOUR_WRITER}

DataX (UNKNOWN_DATAX_VERSION), From Alibaba !
Copyright (C) 2010-2015, Alibaba Group. All Rights Reserved.
Please refer to the streamreader document:
    https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md 

Please refer to the streamwriter document:
     https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md 
 
Please save the following configuration as a json file and  use
     python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json 
to run the job.

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader", 
                    "parameter": {
                        "column": [], 
                        "sliceRecordCount": ""
                    }
                }, 
                "writer": {
                    "name": "streamwriter", 
                    "parameter": {
                        "encoding": "", 
                        "print": true
                    }
                }
            }
        ], 
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}

2.3根据模板配置json如下

#stream2stream.json
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
            "sliceRecordCount": 10,
            "column": [
              {
                "type": "long",
                "value": "10"
              },
              {
                "type": "string",
                "value": "hello,你好,世界-DataX"
              }
            ]
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {
            "encoding": "UTF-8",
            "print": true
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel": 5
       }
    }
  }
}

2.4启动DataX

[root@master opt]# cd {YOUR_DATAX_DIR_BIN}
[root@master bin]# python datax.py ./stream2stream.json

2.5同步结束,显示日志如下

...
2021-06-26 09:19:49.099 [job-0] INFO  JobContainer -
任务启动时刻                    : 2021-06-26 09:19:39
任务结束时刻                    : 2021-06-26 09:19:49
任务总计耗时                    :                 10s
任务平均流量                    :          253.91KB/s
记录写入速度                    :          10000rec/s
读出记录总数                    :              100000
读写失败总数                    :                   0

五、Datax可视化界面环境搭建之Datax-web部署

1、进入源码目录

[root@master big]# cd datax-web-2.1.2/
[root@master datax-web-2.1.2]# ll
总用量 28
drwxrwxrwx. 3 root root   104 6月  23 2020 bin
drwxr-xr-x. 2 root root    77 6月  26 11:00 packages
-rwxrwxrwx. 1 root root 13455 6月  23 2020 README.md
-rwxrwxrwx. 1 root root  9177 6月  23 2020 userGuid.md

2、执行一键安装脚本
进入解压后的目录,找到bin目录下面的install.sh文件,如果选择交互式的安装,则直接执行

## 前端交互安装执行如下
[root@master datax-web-2.1.2]# ./bin/install.sh

## 静默安装执行如下
[root@master datax-web-2.1.2]# ./bin/install.sh --force

## 前端交互安装详情
2021-06-26 11:00:56.126 [INFO] (2602) Creating directory: [/usr/local/big/datax-web-2.1.2/bin/../modules].
2021-06-26 11:00:56.136 [INFO] (2602)  ####### Start To Uncompress Packages ######
2021-06-26 11:00:56.142 [INFO] (2602) Uncompressing....
Do you want to decompress this package: [datax-admin_2.1.2_1.tar.gz]? (Y/N)y
2021-06-26 11:01:14.345 [INFO] (2602)  Uncompress package: [datax-admin_2.1.2_1.tar.gz] to modules directory
Do you want to decompress this package: [datax-executor_2.1.2_1.tar.gz]? (Y/N)y
2021-06-26 11:01:25.314 [INFO] (2602)  Uncompress package: [datax-executor_2.1.2_1.tar.gz] to modules directory
2021-06-26 11:01:25.652 [INFO] (2602)  ####### Finish To Umcompress Packages ######
Scan modules directory: [/usr/local/big/datax-web-2.1.2/bin/../modules] to find server under dataxweb
2021-06-26 11:01:25.661 [INFO] (2602)  ####### Start To Install Modules ######
2021-06-26 11:01:25.665 [INFO] (2602) Module servers could be installed:
 [datax-admin]  [datax-executor]
Do you want to confiugre and install [datax-admin]? (Y/N)y
2021-06-26 11:01:28.265 [INFO] (2602)  Install module server: [datax-admin]
Start to make directory
2021-06-26 11:01:28.323 [INFO] (2884)  Start to build directory
2021-06-26 11:01:28.328 [INFO] (2884) Creating directory: [/usr/local/big/datax-web-2.1.2/modules/datax-admin/bin/../logs].
2021-06-26 11:01:28.435 [INFO] (2884) Directory or file: [/usr/local/big/datax-web-2.1.2/modules/datax-admin/bin/../conf] has bee
2021-06-26 11:01:28.441 [INFO] (2884) Creating directory: [/usr/local/big/datax-web-2.1.2/modules/datax-admin/bin/../data].
end to make directory
Start to initalize database
Do you want to confiugre and install [datax-executor]? (Y/N)y
2021-06-26 11:01:30.552 [INFO] (2602)  Install module server: [datax-executor]
2021-06-26 11:01:30.619 [INFO] (2944)  Start to build directory
2021-06-26 11:01:30.625 [INFO] (2944) Creating directory: [/usr/local/big/datax-web-2.1.2/modules/datax-executor/bin/../logs].
2021-06-26 11:01:30.718 [INFO] (2944) Directory or file: [/usr/local/big/datax-web-2.1.2/modules/datax-executor/bin/../conf] has
2021-06-26 11:01:30.723 [INFO] (2944) Creating directory: [/usr/local/big/datax-web-2.1.2/modules/datax-executor/bin/../data].
2021-06-26 11:01:30.816 [INFO] (2944) Creating directory: [/usr/local/big/datax-web-2.1.2/modules/datax-executor/bin/../json].
2021-06-26 11:01:30.889 [INFO] (2602)  ####### Finish To Install Modules ######

3、配置MySQL数据库连接

[root@master datax-web-2.1.2]# vim ./modules/datax-admin/conf/bootstrap.properties

#Database
DB_HOST=47.1*4.*5.***
DB_PORT=3306
DB_USERNAME=data-web
DB_PASSWORD=JiLZXbsZbCbs
DB_DATABASE=data-web
大数据分析之CentOS7安装部署DataX+Datax-Web环境(图2)

六、DataX插件体系

经过几年积累,DataX目前已经有了比较全面的插件体系,主流的RDBMS数据库、NOSQL、大数据计算系统都已经接入。DataX目前支持数据如

大数据分析之CentOS7安装部署DataX+Datax-Web环境(图2)