[ClickHouse] Implementing Multi-Sharding ( Docker, Shards )

Database/Clickhouse 2024. 2. 17. 16:21

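Introduction.

ClickHouse supports sharding data across multiple nodes.

The Distributed table engine does not store data itself. It sits on top of local tables (usually from the MergeTree family) on each node, and rows written to a Distributed table are split across the ClickHouse nodes according to a sharding key.

I have attached a few images below that illustrate Distributed tables well.

To explain in more detail using these images:

Multiple ClickHouse nodes can be grouped into a single cluster.

With 2 servers, you can either configure 2 shards, or configure a single shard with one primary and one replica. (ClickHouse replicas are peers, so "primary" here simply means the first replica of a shard.)

With 6 servers, you can either distribute data across 6 shards, or configure 3 primaries and 3 replicas.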

Source: https://clickhouse.com/docs/en/architecture/horizontal-scaling

Source: https://doc.hcs.huawei.com/productdesc/mrs/mrs_01_24054.html

2 Nodes and 2 Shards.

If you have 2 servers and build a ClickHouse cluster from 2 nodes, you can create 2 shards as shown below.

In this case, a Distributed table over MergeTree tables stores roughly half of the data on Shard 1 and the rest on Shard 2.

2 Nodes with a Primary - Replica Structure.

With 2 nodes available, you can also configure a primary and a replica inside a single shard, as shown below.

4 Nodes with a Primary - Replica Structure.

Replicas are generally a requirement rather than an option, because they are what lets the cluster handle failover.

So with 4 nodes, you typically configure 2 shards, each with a primary and a replica, as shown below.
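
The layouts described in these sections all follow the same rule: the nodes are cut into groups, each group is one shard, and the members of a group are replicas of each other. Here is a minimal Python sketch of that grouping. The function name plan_layout and the node names are illustrative assumptions, not part of ClickHouse.

```python
def plan_layout(nodes, replicas_per_shard):
    """Group nodes into shards; every group of `replicas_per_shard` nodes is one shard."""
    if replicas_per_shard < 1 or len(nodes) % replicas_per_shard != 0:
        raise ValueError("node count must be a multiple of replicas_per_shard")
    return [
        nodes[i:i + replicas_per_shard]
        for i in range(0, len(nodes), replicas_per_shard)
    ]


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical node names
# 4 nodes, 2 copies of each shard: 2 shards with 2 replicas each
print(plan_layout(nodes, 2))
# 2 nodes, 1 copy of each shard: 2 shards without replication
print(plan_layout(nodes[:2], 1))
```

More replicas per shard means more copies of the same data and fewer shards, so the choice trades storage capacity for fault tolerance.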

Configuring remote_servers in config.xml.

I will use the bitnami/clickhouse Docker image for this hands-on setup.

https://hub.docker.com/r/bitnami/clickhouse

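A ClickHouse cluster is configured through the config.xml configuration file.

By default, the configuration files are located under the /bitnami/clickhouse/etc directory.

This is where config.xml, users.xml, and similar files live.

You can apply additional settings by creating configuration files under the config.d and users.d directories.

The default ClickHouse configuration does not include any cluster-related settings.

So you need to add a config XML file containing the cluster settings under the config.d directory. (The Docker Compose file later in this post mounts it at /bitnami/clickhouse/etc/conf.d/override.xml.)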

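Running a ClickHouse Docker Container.

First, let's look at how to run a basic ClickHouse Docker container.

I will use the 24.8.4 image.

To briefly explain the Docker command below: the --platform option pins the image to linux/amd64, so it runs the same way on hosts with a different architecture.

Setting ALLOW_EMPTY_PASSWORD to yes lets the default user connect with an empty password. This is convenient for a local experiment but should not be used in production.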

docker run -d \
--platform=linux/amd64 \
-e ALLOW_EMPTY_PASSWORD=yes \
--name clickhouse \
bitnami/clickhouse:24.8.4

If the command runs successfully, you can see a result like the one below in Docker Desktop.

Building the Cluster.

This time, let's configure 2 shards with 2 nodes.

First, create an XML file named cluster.xml for the cluster settings.

The cluster configuration goes under the <remote_servers> tag.

The <my_cluster> tag is the name of the cluster to create.

The two <shard> tags create 2 shards.

Inside each <shard> tag, a <replica> tag holds the information for the server that makes up that shard.

In my case, the server hostnames will be clickhouse-node1 and clickhouse-node2, so I filled it in as follows.

cat <<EOF> /tmp/cluster.xml
<clickhouse>
    <remote_servers>
        <my_cluster>
            <shard>
                <replica>
                    <host>clickhouse-node1</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <replica>
                    <host>clickhouse-node2</host>
                    <port>9000</port>
                </replica>
            </shard>
        </my_cluster>
    </remote_servers>
    <!-- ZooKeeper coordinates the ON CLUSTER (distributed DDL) queries used later -->
    <zookeeper>
        <node index="1">
            <host>zookeeper</host>
            <port>2181</port>
        </node>
    </zookeeper>
</clickhouse>
EOF

Illustrated as an image, the cluster.xml configuration created above looks like this.

It is the same image shown at the top of the page: 2 nodes configured as 2 shards.
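
When the number of shards or replicas grows, writing the <remote_servers> section by hand gets repetitive. Here is a minimal Python sketch that generates the same structure from a list of shards. The function name render_remote_servers is a hypothetical helper, and the sketch deliberately leaves out the <zookeeper> section of cluster.xml.

```python
import xml.etree.ElementTree as ET


def render_remote_servers(cluster_name, shards, port=9000):
    """Build a <clickhouse><remote_servers> tree from a list of shards.

    `shards` is a list of lists: each inner list holds the hostnames
    of the replicas that belong to one shard.
    """
    root = ET.Element("clickhouse")
    cluster = ET.SubElement(ET.SubElement(root, "remote_servers"), cluster_name)
    for replica_hosts in shards:
        shard = ET.SubElement(cluster, "shard")
        for host in replica_hosts:
            replica = ET.SubElement(shard, "replica")
            ET.SubElement(replica, "host").text = host
            ET.SubElement(replica, "port").text = str(port)
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root, space="    ")
    return ET.tostring(root, encoding="unicode")


# Same layout as cluster.xml in this post: 2 shards, 1 replica each.
print(render_remote_servers("my_cluster", [["clickhouse-node1"], ["clickhouse-node2"]]))
```

Passing [["a", "b"], ["c", "d"]] instead would describe 2 shards with 2 replicas each.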

The Docker Compose file that creates the Docker containers is shown below.

cat <<EOF> /tmp/clickhouse.yaml
version: '3.8'
services:
  clickhouse-node1:
    platform: linux/amd64
    image: bitnami/clickhouse:24.8.4
    container_name: clickhouse-node1
    hostname: clickhouse-node1
    environment:
      - ALLOW_EMPTY_PASSWORD=yes
    volumes:
      - /tmp/cluster.xml:/bitnami/clickhouse/etc/conf.d/override.xml:ro
    ports:
      - "8123:8123"  # HTTP port
      - "9000:9000"  # TCP port
    networks:
      - clickhouse-network

  clickhouse-node2:
    platform: linux/amd64
    image: bitnami/clickhouse:24.8.4
    container_name: clickhouse-node2
    hostname: clickhouse-node2
    environment:
      - ALLOW_EMPTY_PASSWORD=yes    
    volumes:
      - /tmp/cluster.xml:/bitnami/clickhouse/etc/conf.d/override.xml:ro
    ports:
      - "8124:8123"  # HTTP port
      - "9001:9000"  # TCP port
    networks:
      - clickhouse-network
      
  zookeeper:
    platform: linux/amd64  
    image: bitnami/zookeeper:3.6
    hostname: zookeeper
    container_name: zookeeper
    ports:
      - "2181:2181"  # Default ZooKeeper port
    environment:
      ZOO_SERVER_ID: 1
      ZOO_SERVERS: server.1=zookeeper:2888:3888
      ALLOW_ANONYMOUS_LOGIN: "yes"
    networks:
      - clickhouse-network

networks:
  clickhouse-network:
    driver: bridge
EOF

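Each ClickHouse server (node) compares the hosts listed under the <remote_servers> tag with its own address.

Through this matching, a node can determine which shard it is a replica of; the is_local column of system.clusters reflects the result.

Each node also learns the network locations of the other nodes, which is what allows them to form a cluster.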
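
To make the matching step concrete, here is a rough Python sketch of the idea. It is a simplified model, not ClickHouse's actual implementation: it assumes an entry counts as local when its host resolves to one of the node's own addresses and its port equals the node's TCP port. The function name, the resolver table, and the addresses are all illustrative.

```python
def find_local_replica(shards, resolve, local_addresses, tcp_port=9000):
    """Return (shard_num, replica_num) of the entry that points at this node, or None.

    Simplified model: an entry is local when its host resolves to one of
    this node's own addresses and its port equals the node's TCP port.
    """
    for shard_num, replicas in enumerate(shards, start=1):
        for replica_num, (host, port) in enumerate(replicas, start=1):
            if port == tcp_port and resolve(host) in local_addresses:
                return shard_num, replica_num
    return None


# Hypothetical name resolution table standing in for Docker's internal DNS.
dns = {"clickhouse-node1": "172.26.0.3", "clickhouse-node2": "172.26.0.2"}
shards = [[("clickhouse-node1", 9000)], [("clickhouse-node2", 9000)]]

# Seen from a node that owns 172.26.0.2, the second shard entry is the local one.
print(find_local_replica(shards, dns.get, {"127.0.0.1", "172.26.0.2"}))
```

Every node reads the same configuration file, and only the result of this comparison differs from node to node.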

Finally, start Docker Compose with the command below.

docker-compose -f /tmp/clickhouse.yaml --project-name clickhouse up -d

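Once you have completed the steps above, you can see a result like the one below in Docker Desktop.

Now let's check whether the ClickHouse nodes have actually formed a cluster.

On one of the 2 nodes, enter the SQL statement below in the ClickHouse client shell.

( For reference, you can open the shell with clickhouse-client inside the container, for example docker exec -it clickhouse-node1 clickhouse-client. )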

SELECT * FROM system.clusters FORMAT Vertical;

Row 1:
──────
cluster:                 my_cluster
shard_num:               1
shard_weight:            1
internal_replication:    0
replica_num:             1
host_name:               clickhouse-node1
host_address:            172.26.0.3
port:                    9000
is_local:                1
user:                    default
default_database:        
errors_count:            0
slowdowns_count:         0
estimated_recovery_time: 0
database_shard_name:     
database_replica_name:   
is_active:               ᴺᵁᴸᴸ
replication_lag:         ᴺᵁᴸᴸ
recovery_time:           ᴺᵁᴸᴸ

Row 2:
──────
cluster:                 my_cluster
shard_num:               2
shard_weight:            1
internal_replication:    0
replica_num:             1
host_name:               clickhouse-node2
host_address:            172.26.0.2
port:                    9000
is_local:                0
user:                    default
default_database:        
errors_count:            0
slowdowns_count:         0
estimated_recovery_time: 0
database_shard_name:     
database_replica_name:   
is_active:               ᴺᵁᴸᴸ
replication_lag:         ᴺᵁᴸᴸ
recovery_time:           ᴺᵁᴸᴸ
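
The same check can be scripted instead of read by eye. Here is a small Python sketch that groups hosts by shard from the text output of a system.clusters query. It assumes the query was run with FORMAT JSONEachRow (for example through clickhouse-client --query) and that the output was captured as a string. The function name summarize_clusters and the sample text are illustrative.

```python
import json
from collections import defaultdict


def summarize_clusters(json_each_row_text):
    """Group host names by (cluster, shard_num) from JSONEachRow output."""
    summary = defaultdict(list)
    for line in json_each_row_text.splitlines():
        if not line.strip():
            continue  # skip blank lines around the payload
        row = json.loads(line)
        summary[(row["cluster"], int(row["shard_num"]))].append(row["host_name"])
    return dict(summary)


# Sample text shaped like the output of:
#   SELECT cluster, shard_num, host_name FROM system.clusters FORMAT JSONEachRow
sample = """
{"cluster":"my_cluster","shard_num":1,"host_name":"clickhouse-node1"}
{"cluster":"my_cluster","shard_num":2,"host_name":"clickhouse-node2"}
"""
print(summarize_clusters(sample))
```

For the cluster in this post, the expected shape is two keys for my_cluster, one per shard, each with a single host.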

Creating the Distributed Table.

Now that the multi-shard cluster is in place, let's create a Distributed table.

I will create a table that stores simple time-series data.

The local table uses the MergeTree engine. Let me briefly explain the CREATE TABLE DDL below.

To build a Distributed table, you first create a local table on every node.

The ON CLUSTER clause sends the CREATE TABLE DDL to every node.

Without ON CLUSTER, you would have to connect to each node and run the CREATE TABLE statement yourself.

CREATE TABLE time_series_data_local ON CLUSTER my_cluster
(
    event_time DateTime,
    metric Float32,
    device_id String,
    location String,
    tags Array(String) DEFAULT []
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (device_id, event_time)
TTL event_time + INTERVAL 1 YEAR
SETTINGS index_granularity = 8192;
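
Note that PARTITION BY only groups rows into monthly partitions inside each node's local table. It does not decide which node a row lands on; that is the job of the sharding key of the Distributed table. As a small illustration, this Python sketch mirrors the arithmetic of toYYYYMM. The helper name to_yyyymm and the sample timestamps are hypothetical.

```python
from datetime import datetime


def to_yyyymm(event_time):
    """Mirror of toYYYYMM: year * 100 + month, e.g. 202301."""
    return event_time.year * 100 + event_time.month


rows = [
    datetime(2023, 1, 11, 12, 0),
    datetime(2023, 1, 31, 23, 59),
    datetime(2023, 2, 1, 0, 0),
]
# The first two rows share a partition; the third starts a new one.
print([to_yyyymm(t) for t in rows])
```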

After this step, the time_series_data_local table exists on both clickhouse-node1 and clickhouse-node2.

clickhouse-node1 :) show tables;

SHOW TABLES

Query id: 69d464ad-6031-42c2-b407-2ccab5c1fdac

┌─name───────────────────┐
│ time_series_data_local │
└────────────────────────┘

1 rows in set. Elapsed: 0.017 sec.

clickhouse-node2 :) show tables;

SHOW TABLES

Query id: 32a5b8f7-a280-4f40-9a47-45c9b664c83f

┌─name───────────────────┐
│ time_series_data_local │
└────────────────────────┘

1 rows in set. Elapsed: 0.027 sec.

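Next, create the Distributed table with the Distributed engine as shown below.

This statement also uses the ON CLUSTER clause so that the Distributed table is created on every node. The AS time_series_data_local clause copies the column definitions from the local table, and rand() is the sharding key, so each row goes to a randomly chosen shard.

CREATE TABLE time_series_data ON CLUSTER my_cluster
AS time_series_data_local
ENGINE = Distributed(
    'my_cluster',
    'default',
    'time_series_data_local',
    rand()
);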
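
The last argument of the Distributed engine is the sharding key. As the ClickHouse documentation describes it, the value of the sharding expression is divided by the total weight of the shards, and the remainder selects the shard whose half-open interval contains it. Here is a minimal Python sketch of that rule. The function name pick_shard is a hypothetical helper, and the weights are passed in explicitly rather than read from a cluster.

```python
def pick_shard(sharding_value, weights):
    """Return the 1-based shard number for a sharding-key value.

    The remainder of the value divided by the total weight is looked up
    in consecutive half-open intervals, one interval per shard.
    """
    remainder = sharding_value % sum(weights)
    upper = 0
    for shard_num, weight in enumerate(weights, start=1):
        upper += weight
        if remainder < upper:
            return shard_num
    raise ValueError("weights must be positive integers")


# Two shards with weight 1 each (the shard_weight shown in system.clusters):
# even values go to shard 1, odd values to shard 2.
print([pick_shard(v, [1, 1]) for v in range(6)])
```

With rand() as the key the value is different for every row, so placement is random. A deterministic expression over a column would instead keep all rows with the same key value on the same shard.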

As shown below, both the local table and the Distributed table now exist on every node.

clickhouse-node1 :) show tables;

SHOW TABLES

Query id: f1f8c8bd-7358-45fc-adff-61817c11e703

┌─name───────────────────┐
│ time_series_data       │
│ time_series_data_local │
└────────────────────────┘

2 rows in set. Elapsed: 0.017 sec. 

clickhouse-node2 :) show tables;

SHOW TABLES

Query id: 6225a126-9c7b-4faa-910b-bfdd73d0a788

┌─name───────────────────┐
│ time_series_data       │
│ time_series_data_local │
└────────────────────────┘

2 rows in set. Elapsed: 0.022 sec.

Inserting Data.

Let's insert data through the Distributed table.

Insert 5 rows as shown below.

INSERT INTO time_series_data (event_time, metric, device_id, location, tags) VALUES
('2023-01-11 12:00:00', 25.4, 'device_01', 'New York', ['temperature']),
('2023-01-11 12:05:00', 26.1, 'device_02', 'San Francisco', ['temperature']),
('2023-01-11 12:10:00', 24.7, 'device_03', 'Los Angeles', ['temperature']),
('2023-01-11 12:15:00', 27.2, 'device_04', 'Chicago', ['temperature']),
('2023-01-11 12:20:00', 22.5, 'device_05', 'Houston', ['temperature']);

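After the 5 rows are inserted, checking the data on clickhouse-node1 and clickhouse-node2 shows that the rows are spread across the two shards, as below. Because the sharding key is rand(), a batch this small does not split evenly: here 1 row landed on clickhouse-node1 and 4 on clickhouse-node2.

Note that the sample timestamps are older than the 1-year TTL on the local table, so a background merge may eventually remove these rows.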

clickhouse-node1 :) select * from time_series_data_local;

SELECT *
FROM time_series_data_local

Query id: d488ed91-9989-4085-b862-7b44da8d8ea0

┌──────────event_time─┬─metric─┬─device_id─┬─location─┬─tags────────────┐
│ 2023-01-11 12:15:00 │   27.2 │ device_04 │ Chicago  │ ['temperature'] │
└─────────────────────┴────────┴───────────┴──────────┴─────────────────┘

1 rows in set. Elapsed: 0.042 sec. 

clickhouse-node2 :) select * from time_series_data_local;

SELECT *
FROM time_series_data_local

Query id: 8d00c11d-5c76-41b1-af8f-642c5806cfd2

┌──────────event_time─┬─metric─┬─device_id─┬─location──────┬─tags────────────┐
│ 2023-01-11 12:00:00 │   25.4 │ device_01 │ New York      │ ['temperature'] │
│ 2023-01-11 12:05:00 │   26.1 │ device_02 │ San Francisco │ ['temperature'] │
│ 2023-01-11 12:10:00 │   24.7 │ device_03 │ Los Angeles   │ ['temperature'] │
│ 2023-01-11 12:20:00 │   22.5 │ device_05 │ Houston       │ ['temperature'] │
└─────────────────────┴────────┴───────────┴───────────────┴─────────────────┘

4 rows in set. Elapsed: 0.040 sec.
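
A 1-to-4 split looks lopsided, but it is what random placement tends to produce for a handful of rows. The Python sketch below simulates picking a shard at random for each row and compares a tiny batch with a large one. It only models the randomness and does not call ClickHouse; the function name and the seed are arbitrary choices.

```python
import random


def simulate_random_sharding(row_count, shard_count, seed=0):
    """Count how many rows each shard receives when the shard is picked at random."""
    rng = random.Random(seed)
    counts = [0] * shard_count
    for _ in range(row_count):
        counts[rng.randrange(shard_count)] += 1
    return counts


for n in (5, 100_000):
    counts = simulate_random_sharding(n, 2)
    # Print the row count, the per-shard counts, and each shard's share.
    print(n, counts, [round(c / n, 3) for c in counts])
```

With a large number of rows the shares approach one half each, which is why a random sharding key balances well for bulk loads even though small inserts look uneven.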
