    Connecting Hive & Hadoop
     

    Introduction.

    In this post, we connect Hive and Hadoop and look at how the two applications communicate with each other.

    Everything runs in a Docker Container environment.

     

    Connecting Hive & Hadoop.

     

    1. Create the hadoop config file.

    Hadoop is configured through XML files such as core-site.xml and hdfs-site.xml. The apache/hadoop image takes these settings as environment variables of the form <CONFIG-FILE>_<property>=<value> and renders them into the matching XML file at startup.

    The command below generates the hadoop_config env file in that format. Note the quoted heredoc delimiter: it keeps $HADOOP_HOME literal, so it is expanded later inside the YARN container launch scripts rather than by the host shell.

    cat <<'EOF' > /tmp/hadoop_config
    CORE-SITE.XML_fs.default.name=hdfs://namenode
    CORE-SITE.XML_fs.defaultFS=hdfs://namenode
    HDFS-SITE.XML_dfs.namenode.rpc-address=namenode:8020
    HDFS-SITE.XML_dfs.replication=1
    MAPRED-SITE.XML_mapreduce.framework.name=yarn
    MAPRED-SITE.XML_yarn.app.mapreduce.am.env=HADOOP_MAPRED_HOME=$HADOOP_HOME
    MAPRED-SITE.XML_mapreduce.map.env=HADOOP_MAPRED_HOME=$HADOOP_HOME
    MAPRED-SITE.XML_mapreduce.reduce.env=HADOOP_MAPRED_HOME=$HADOOP_HOME
    YARN-SITE.XML_yarn.resourcemanager.hostname=resourcemanager
    YARN-SITE.XML_yarn.nodemanager.pmem-check-enabled=false
    YARN-SITE.XML_yarn.nodemanager.delete.debug-delay-sec=600
    YARN-SITE.XML_yarn.nodemanager.vmem-check-enabled=false
    YARN-SITE.XML_yarn.nodemanager.aux-services=mapreduce_shuffle
    CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.maximum-applications=10000
    CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.maximum-am-resource-percent=0.1
    CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
    CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.queues=default
    CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.capacity=100
    CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.user-limit-factor=1
    CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.maximum-capacity=100
    CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.state=RUNNING
    CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.acl_submit_applications=*
    CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.acl_administer_queue=*
    CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.node-locality-delay=40
    CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.queue-mappings=
    CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.queue-mappings-override.enable=false
    EOF
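
    Once the containers are up (step 4), the rendered XML can be sanity-checked inside the namenode container. The path below assumes the image default HADOOP_HOME of /opt/hadoop, which matches the classpath visible in the beeline logs further down:

    # Inspect the core-site.xml rendered from the CORE-SITE.XML_* entries above
    docker exec -it namenode cat /opt/hadoop/etc/hadoop/core-site.xml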

     

    2. Create the hive-site.xml file.

    This file configures HiveServer2. Notably, hive.execution.engine selects Tez as the execution engine, hive.metastore.warehouse.dir sets the default warehouse location, and hive.server2.enable.doAs=false makes queries run as the hive service user instead of the connecting user.

    cat <<EOF> /tmp/hive-site.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <configuration>
        <property>
            <name>hive.server2.enable.doAs</name>
            <value>false</value>
        </property>
        <property>
            <name>hive.tez.exec.inplace.progress</name>
            <value>false</value>
        </property>
        <property>
            <name>hive.exec.scratchdir</name>
            <value>/opt/hive/scratch_dir</value>
        </property>
        <property>
            <name>hive.user.install.directory</name>
            <value>/opt/hive/install_dir</value>
        </property>
        <property>
            <name>tez.runtime.optimize.local.fetch</name>
            <value>true</value>
        </property>
        <property>
            <name>hive.exec.submit.local.task.via.child</name>
            <value>false</value>
        </property>
        <property>
            <name>mapreduce.framework.name</name>
            <value>local</value>
        </property>
        <property>
            <name>tez.local.mode</name>
            <value>true</value>
        </property>
        <property>
            <name>hive.execution.engine</name>
            <value>tez</value>
        </property>
        <property>
            <name>hive.metastore.warehouse.dir</name>
            <value>/opt/hive/data/warehouse</value>
        </property>
        <property>
            <name>metastore.metastore.event.db.notification.api.auth</name>
            <value>false</value>
        </property>
    </configuration>
    EOF
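
    The compose file in the next step bind-mounts this file into the hive container at /opt/hive/conf/hive-site.xml; once that container is running, the mount can be sanity-checked:

    # Confirm the bind-mounted hive-site.xml is visible inside the hive container
    docker exec -it hive cat /opt/hive/conf/hive-site.xml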

     

    3. Create the hadoop docker-compose.yaml.

    The command below generates the docker-compose.yaml.

    The yaml file is placed in the /tmp directory.

    It creates one container each for the NameNode, DataNode, HiveServer2, ResourceManager, and NodeManager.

     

    cat <<EOF> /tmp/hive-docker-compose.yaml
    version: "2"
    services:
       namenode:
          platform: linux/amd64
          container_name: namenode
          image: apache/hadoop:3
          hostname: namenode
          command: ["hdfs", "namenode"]
          ports:
            - 9870:9870
            - 8020:8020
          env_file:
            - /tmp/hadoop_config
          environment:
              ENSURE_NAMENODE_DIR: "/tmp/hadoop-root/dfs/name"
          networks:
            - hadoop_network
       datanode:
          platform: linux/amd64   
          container_name: datanode
          depends_on:
            - namenode
          links:
            - namenode:namenode        
          image: apache/hadoop:3
          command: ["hdfs", "datanode"]
          env_file:
            - /tmp/hadoop_config 
          networks:
            - hadoop_network
       hive:
          platform: linux/amd64   
          container_name: hive
          depends_on:
            - namenode
            - datanode        
            - nodemanager        
            - resourcemanager                        
          image: apache/hive:3.1.3
          ports:
            - 10000:10000
            - 10002:10002
          volumes:
            - type: bind
              source: '/tmp/hive-site.xml'
              target: '/opt/hive/conf/hive-site.xml'
            
          environment:
              SERVICE_NAME: hiveserver2
          networks:
            - hadoop_network        
       resourcemanager:
          platform: linux/amd64   
          image: apache/hadoop:3
          hostname: resourcemanager
          command: ["yarn", "resourcemanager"]
          ports:
             - 8088:8088
          env_file:
            - /tmp/hadoop_config
          networks:
            - hadoop_network
       nodemanager:
          platform: linux/amd64   
          image: apache/hadoop:3
          command: ["yarn", "nodemanager"]
          env_file:
            - /tmp/hadoop_config
          networks:
            - hadoop_network
    
    networks:
      hadoop_network:
        name: hadoop_network
    EOF

     

    4. Run docker-compose.

    docker-compose -f /tmp/hive-docker-compose.yaml --project-name=hive up -d
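
    Once the containers are up, a quick check against the ports published in the compose file: the NameNode web UI is on localhost:9870, the YARN ResourceManager UI on localhost:8088, and the HiveServer2 web UI on localhost:10002.

    # List the containers of the hive project
    docker-compose -f /tmp/hive-docker-compose.yaml --project-name=hive ps

    # The HiveServer2 web UI should answer on its published port
    curl -s -o /dev/null -w "%{http_code}\n" http://localhost:10002/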

     

     

     

    Creating a Hive Table.

     

    1. Create an HDFS directory.

    Create a directory named commerce in the HDFS namespace through the Hadoop NameNode.

    docker exec -it namenode hdfs dfs -mkdir /commerce/

     

    Listing the entries under the HDFS root confirms that the commerce directory has been created.

    docker exec -it namenode hdfs dfs -ls -h /
    Found 1 items
    drwxr-xr-x   - hadoop supergroup          0 2023-12-17 22:49 /commerce

     

     

    Add the hive user to the ACL of the newly created /commerce directory.

    This allows an external client such as Hive to write data into it. The commands below are run inside the namenode container; an equivalent docker exec form follows the output.

    hdfs dfs -setfacl -m user:hive:rwx /commerce
    hdfs dfs -getfacl /commerce
    # file: /commerce
    # owner: hadoop
    # group: supergroup
    user::rwx
    user:hive:rwx
    group::r-x
    mask::rwx
    other::r-x
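
    Equivalently, the same commands can be run from the host using the docker exec pattern shown earlier:

    docker exec -it namenode hdfs dfs -setfacl -m user:hive:rwx /commerce
    docker exec -it namenode hdfs dfs -getfacl /commerce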

     

    2. Create the Hive Table.

     

    1. First, connect to HiveServer2 with the beeline CLI.

    docker exec -it hive beeline -u 'jdbc:hive2://hive:10000/default'
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    Connecting to jdbc:hive2://hive:10000/default
    Connected to: Apache Hive (version 3.1.3)
    Driver: Hive JDBC (version 3.1.3)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    Beeline version 3.1.3 by Apache Hive

     

    2. Create the commerce database.

    create database commerce;
    INFO  : Compiling command(queryId=hive_20231218000456_27454250-92ca-4bf4-9e37-dd80d319c289): create database commerce
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Semantic Analysis Completed (retrial = false)
    INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
    INFO  : Completed compiling command(queryId=hive_20231218000456_27454250-92ca-4bf4-9e37-dd80d319c289); Time taken: 3.21 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hive_20231218000456_27454250-92ca-4bf4-9e37-dd80d319c289): create database commerce
    INFO  : Starting task [Stage-0:DDL] in serial mode
    INFO  : Completed executing command(queryId=hive_20231218000456_27454250-92ca-4bf4-9e37-dd80d319c289); Time taken: 0.265 seconds
    INFO  : OK
    INFO  : Concurrency mode is disabled, not creating a lock manager
    No rows affected (4.254 seconds)

     

     

    3. Verify that the database was created.

    show databases;
    INFO  : Compiling command(queryId=hive_20231218000534_40e254f9-33b2-40d5-8d09-85fb650bd3b1): show databases
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Semantic Analysis Completed (retrial = false)
    INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:database_name, type:string, comment:from deserializer)], properties:null)
    INFO  : Completed compiling command(queryId=hive_20231218000534_40e254f9-33b2-40d5-8d09-85fb650bd3b1); Time taken: 0.377 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hive_20231218000534_40e254f9-33b2-40d5-8d09-85fb650bd3b1): show databases
    INFO  : Starting task [Stage-0:DDL] in serial mode
    INFO  : Completed executing command(queryId=hive_20231218000534_40e254f9-33b2-40d5-8d09-85fb650bd3b1); Time taken: 0.127 seconds
    INFO  : OK
    INFO  : Concurrency mode is disabled, not creating a lock manager
    +----------------+
    | database_name  |
    +----------------+
    | commerce       |
    | default        |
    +----------------+

     

     

    4. Create the Hive Table.

    The table is declared EXTERNAL with an explicit LOCATION, so its data files live in the /commerce directory created earlier, and dropping the table will not delete them.

    USE commerce;
    CREATE EXTERNAL TABLE product (
      id INT,
      name STRING,
      price DOUBLE
    )
    STORED AS PARQUET
    LOCATION 'hdfs://namenode:8020/commerce';
    INFO  : Compiling command(queryId=hive_20231218004148_0001ae29-8621-44ee-83c1-c7143091b910): CREATE EXTERNAL TABLE product (
    id INT,
    name STRING,
    price DOUBLE
    )
    STORED AS PARQUET
    LOCATION 'hdfs://namenode:8020/commerce'
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Semantic Analysis Completed (retrial = false)
    INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
    INFO  : Completed compiling command(queryId=hive_20231218004148_0001ae29-8621-44ee-83c1-c7143091b910); Time taken: 2.892 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hive_20231218004148_0001ae29-8621-44ee-83c1-c7143091b910): CREATE EXTERNAL TABLE product (
    id INT,
    name STRING,
    price DOUBLE
    )
    STORED AS PARQUET
    LOCATION 'hdfs://namenode:8020/commerce'
    INFO  : Starting task [Stage-0:DDL] in serial mode
    INFO  : Completed executing command(queryId=hive_20231218004148_0001ae29-8621-44ee-83c1-c7143091b910); Time taken: 1.936 seconds
    INFO  : OK
    INFO  : Concurrency mode is disabled, not creating a lock manager
    No rows affected (5.666 seconds)
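
    To confirm where the table points, DESCRIBE FORMATTED reports its location and storage format. A quick non-interactive check via beeline's -e option:

    docker exec -it hive beeline -u 'jdbc:hive2://hive:10000/default' \
      -e 'DESCRIBE FORMATTED commerce.product;'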

     

    5. Insert a row.

    The INSERT is executed as a Tez job on the YARN cluster, as the vertex progress output below shows.

    insert into product values (1, 'iPhone15', 15.50);
    INFO  : Compiling command(queryId=hive_20231218011801_0d81c744-2e60-4a55-a973-a4341bd9b156): insert into product values (1, 'iPhone15', 15.50)
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Semantic Analysis Completed (retrial = false)
    INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_col0, type:int, comment:null), FieldSchema(name:_col1, type:string, comment:null), FieldSchema(name:_col2, type:double, comment:null)], properties:null)
    INFO  : Completed compiling command(queryId=hive_20231218011801_0d81c744-2e60-4a55-a973-a4341bd9b156); Time taken: 4.186 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hive_20231218011801_0d81c744-2e60-4a55-a973-a4341bd9b156): insert into product values (1, 'iPhone15', 15.50)
    INFO  : Query ID = hive_20231218011801_0d81c744-2e60-4a55-a973-a4341bd9b156
    INFO  : Total jobs = 1
    INFO  : Launching Job 1 out of 1
    INFO  : Starting task [Stage-1:MAPRED] in serial mode
    INFO  : Subscribed to counters: [] for queryId: hive_20231218011801_0d81c744-2e60-4a55-a973-a4341bd9b156
    INFO  : Tez session hasn't been created yet. Opening session
    INFO  : Dag name: insert into product values (1, 'iPh...15.50) (Stage-1)
    INFO  : Status: Running (Executing on YARN cluster with App id application_1702862285673_0001)
    
    ----------------------------------------------------------------------------------------------
            VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
    ----------------------------------------------------------------------------------------------
    Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
    Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
    ----------------------------------------------------------------------------------------------
    VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.16 s    
    ----------------------------------------------------------------------------------------------
    INFO  : Starting task [Stage-2:DEPENDENCY_COLLECTION] in serial mode
    INFO  : Starting task [Stage-0:MOVE] in serial mode
    INFO  : Loading data to table commerce.product from hdfs://namenode:8020/commerce/.hive-staging_hive_2023-12-18_01-18-01_132_1289189002752009820-2/-ext-10000
    INFO  : Starting task [Stage-3:STATS] in serial mode
    INFO  : Completed executing command(queryId=hive_20231218011801_0d81c744-2e60-4a55-a973-a4341bd9b156); Time taken: 18.844 seconds
    INFO  : OK
    INFO  : Concurrency mode is disabled, not creating a lock manager
    No rows affected (23.554 seconds)
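
    Reading the row back confirms the full round trip through HiveServer2, Tez on YARN, and HDFS:

    docker exec -it hive beeline -u 'jdbc:hive2://hive:10000/default' \
      -e 'SELECT * FROM commerce.product;'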

     

     

    6. Verify the created data in HDFS.

    The INSERT wrote a data file (000000_0) directly under the table's /commerce location, owned by the hive user we granted ACL access to earlier.

    docker exec -it namenode hdfs dfs -ls /commerce/
    Found 1 items
    -rw-r--r--   3 hive supergroup        640 2023-12-18 01:18 /commerce/000000_0

     
