Understanding the Avro File

BigData 2023. 10. 4. 10:52

Introduction.

Avro provides two things.
The first is serialization.
As a serialization framework, Avro defines how data is serialized and deserialized.
The second is a file format.
An Avro File has the .avro extension and stores serialized records.

This post looks at Avro as a file format.

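Structure.

An Avro File has the structure below.
Let's look at each element in turn.

1. Header
2. Data Block

Note that the Avro specification does not define a Footer; the file ends with the sync marker of the last Data Block.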

Header.

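The Header is the area of an Avro File that stores the file's metadata.
As the name suggests, it sits at the very beginning of the file,
and it records how the Avro Schema is defined
and which compression codec is used.

What does the Header look like?

As mentioned above, the Header records the Avro Schema and the compression codec.
It also holds the magic number, any user-defined metadata, and the file's sync marker.
Because an Avro File is a binary file that is encoded and optionally compressed, reading it with a script is more practical than using something like cat.

Below is Python code that reads the Header.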

import fastavro

schema = {"namespace": "example.avro",
          "type": "record",
          "name": "User",
          "fields": [{"name": "name", "type": "string"}, {"name": "age", "type": "int"}]
          }
avro_file_path = 'users.avro'

with open(avro_file_path, "wb") as new_file:
    fastavro.writer(new_file, schema, [{"name": "westlife", "age": 30}], metadata={"owner" : "westlife0615"})

with open(avro_file_path, 'rb') as avro_file:
    magic_number = avro_file.read(4).hex()

with open(avro_file_path, 'rb') as avro_file:
    avro_reader = fastavro.reader(avro_file)
    avro_schema = avro_reader.writer_schema
    codec = avro_reader.codec
    metadata = avro_reader.metadata
    print(f"Avro Schema : {avro_schema}")
    print(f"Compression Codec : {codec}")
    print(f"magic number : {magic_number}")
    print(f"metadata's owner : {metadata['owner']}")

Avro Schema : {'type': 'record', 'name': 'example.avro.User', 'fields': [{'name': 'name', 'type': 'string'}, {'name': 'age', 'type': 'int'}]}
Compression Codec : null
magic number : 4f626a01
metadata's owner : westlife0615
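To see what the library does under the hood, the Header can also be decoded by hand. The following is a minimal sketch using only the standard library, assuming the layout described in the Avro specification: the 4-byte magic, a metadata map of string keys to byte values, then a 16-byte sync marker. The helper names (write_long, read_long, read_header) are illustrative and not part of fastavro, and the sample header is built in memory with a placeholder sync marker rather than read from users.avro.

```python
import io
import json

MAGIC = b"Obj\x01"


def write_long(n: int) -> bytes:
    """Encode an integer as an Avro long: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def write_bytes(data: bytes) -> bytes:
    """Length-prefixed bytes, as used for map keys and values."""
    return write_long(len(data)) + data


def read_long(stream) -> int:
    """Decode a zigzag varint from the stream."""
    shift = 0
    result = 0
    while True:
        b = stream.read(1)[0]
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def read_bytes(stream) -> bytes:
    return stream.read(read_long(stream))


def read_header(stream):
    """Return (metadata dict, sync marker) from an Avro container header."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    metadata = {}
    while True:
        count = read_long(stream)
        if count == 0:          # a zero count ends the map
            break
        if count < 0:           # negative count is followed by a byte size
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            metadata[key] = read_bytes(stream)
    return metadata, stream.read(16)


# Build a sample header in memory (placeholder sync marker, for illustration only).
schema_json = json.dumps({
    "type": "record", "name": "User",
    "fields": [{"name": "name", "type": "string"}, {"name": "age", "type": "int"}],
}).encode("utf-8")
entries = {"owner": b"westlife0615", "avro.codec": b"null", "avro.schema": schema_json}
header = MAGIC + write_long(len(entries))
for k, v in entries.items():
    header += write_bytes(k.encode("utf-8")) + write_bytes(v)
header += write_long(0) + bytes(range(16))

metadata, sync_marker = read_header(io.BytesIO(header))
print(metadata["avro.codec"])
print(json.loads(metadata["avro.schema"])["name"])
print(sync_marker.hex())
```

The metadata values are raw bytes, which is why the schema has to be passed through json.loads before use.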

magic number.

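The magic number is the data that an Avro File starts with, and it identifies the file as an Avro File.
Java class files use a magic number in a similar way.
Its only purpose is to signal that the file is an Avro File,
and Avro libraries include handling for it.

The magic number of an Avro File is Obj\x01.
In hexadecimal that is 4F 62 6A 01: the ASCII codes for O, b, and j, followed by the version byte 0x01.
Reading the first bytes of the file shows these values, as in the output above.

( Note that the magic number does not change with the compression codec. Files written with deflate or snappy also start with 4f626a01; the codec is recorded in the avro.codec entry of the Header metadata. )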
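As a small illustration, checking the magic number needs nothing beyond reading the first four bytes. The sketch below assumes a hypothetical helper named looks_like_avro and uses two throwaway files with made-up contents.

```python
import os
import tempfile

AVRO_MAGIC = b"Obj\x01"  # ASCII "Obj" followed by the version byte 0x01


def looks_like_avro(path: str) -> bool:
    """Return True if the file starts with the 4-byte Avro container magic."""
    with open(path, "rb") as f:
        return f.read(4) == AVRO_MAGIC


# Demo with two temporary files (contents are made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.avro")
    bad = os.path.join(tmp, "bad.avro")
    with open(good, "wb") as f:
        f.write(AVRO_MAGIC + b"rest of file")
    with open(bad, "wb") as f:
        f.write(b"PAR1 not avro")
    good_result = looks_like_avro(good)
    bad_result = looks_like_avro(bad)

print(AVRO_MAGIC.hex(), good_result, bad_result)  # 4f626a01 True False
```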

schema.

As mentioned, Avro is a serialization framework that supports serializing and deserializing data.
Avro uses a Schema in this process.
Using a Schema lets the file be written space-efficiently.

For example,
suppose we have the schema below.

{
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"}
    ]
}

The data to store is as follows.

{
  "name": "westlife",
  "age": 30
}

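The stored record then looks like the following.
The field names name and age are not written at all; only the values are written, in schema order.
The string is prefixed with its length (8, zigzag-encoded as 0x10), and the int 30 is zigzag-encoded as 0x3C.
It is not a human-readable format, but it clearly takes less space.
Because the schema drives serialization, the same schema is needed for deserialization,
so schema changes should be made carefully.

\x10westlife\x3C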
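The encoding of this record can be reproduced by hand. The following is a minimal sketch, assuming the Avro binary encoding rules for the two types in the Person schema: int values are zigzag varints, and strings are a zigzag varint length followed by UTF-8 bytes. The function names are illustrative only.

```python
def encode_long(n: int) -> bytes:
    """Zigzag-encode an integer, then write it as a base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Length prefix (in bytes) followed by the UTF-8 data."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_person(record: dict) -> bytes:
    # Fields are written in schema order, with no field names or separators.
    return encode_string(record["name"]) + encode_long(record["age"])


encoded = encode_person({"name": "westlife", "age": 30})
print(encoded)       # b'\x10westlife<'
print(len(encoded))  # 10
```

The trailing `<` is simply the byte 0x3C, which is 30 after zigzag encoding (30 * 2 = 60).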

Compression can reduce the size further.

compression algorithm.

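An Avro File can be written with a compression codec of your choice.
Commonly used codecs include snappy and deflate. The codec is applied to the serialized records in each Data Block, while the Header itself stays uncompressed.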
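For the deflate codec, the Avro specification calls for raw deflate (RFC 1951), without the zlib header and checksum. The sketch below shows what that looks like with Python's zlib module; the function names and the sample payload are made up for illustration and do not come from any Avro library.

```python
import zlib


def deflate_block(payload: bytes) -> bytes:
    """Compress a block payload with raw deflate (wbits=-15: no zlib header)."""
    compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
    return compressor.compress(payload) + compressor.flush()


def inflate_block(data: bytes) -> bytes:
    """Reverse of deflate_block."""
    return zlib.decompress(data, -15)


# A repetitive payload standing in for many similar serialized records.
payload = b"\x12westlife1&westlife1@naver.com" * 100
compressed = deflate_block(payload)
restored = inflate_block(compressed)

print(len(payload), len(compressed), restored == payload)
```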

Data Block.

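A Data Block is the area where Avro Records are actually stored.
Records are not stored one at a time; they are stored in bulk, a certain size at a time.
When Avro libraries write records, they do not append each record individually but buffer them and write in batches.
Each such batch write produces one Data Block (in fastavro, the writer's sync_interval parameter sets the approximate block size in bytes).
The more records there are, the more Data Blocks the Avro File contains.

Let's look at the components of a Data Block.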

Record Count.

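A Data Block starts with the number of records it contains.
Because records are written in bulk, the client writing the Avro File knows both the serialized records and how many there are.
When it writes a block, it stores the record count, followed by the size in bytes of the serialized records, ahead of the records themselves.

Below is a code example that reads the Data Blocks of a generated Avro File.

It prints the Record Count and the Records.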

import fastavro

schema = {"namespace": "example.avro",
          "type": "record",
          "name": "User",
          "fields": [{"name": "name", "type": "string"}, {"name": "age", "type": "int"}]
          }
avro_file_path = 'users.avro'

with open(avro_file_path, "wb") as new_file:
    fastavro.writer(new_file, schema, [
        {"name": "westlife1", "age": 20},
        {"name": "westlife2", "age": 30},
        {"name": "westlife3", "age": 40},
    ], metadata={"owner" : "westlife0615"})

with open(avro_file_path, 'rb') as avro_file:
    block_readers = fastavro.block_reader(avro_file)
    for block_reader in block_readers:
        print(f"num_records : {block_reader.num_records}")
        for record in block_reader:
            print(record)

<Output>

num_records : 3
{'name': 'westlife1', 'age': 20}
{'name': 'westlife2', 'age': 30}
{'name': 'westlife3', 'age': 40}

Serialized Records.

The Data Block is where the Records are actually stored.
Every Record is serialized according to the Avro binary encoding,
and the Records in a Data Block are stored in that serialized form.

The code below is an example for inspecting the serialized output.

import fastavro

schema = {"namespace": "example.avro",
          "type": "record",
          "name": "User",
          "fields": [{"name": "name", "type": "string"}, {"name": "email", "type": "string"}]
          }
avro_file_path = 'users.avro'

with open(avro_file_path, "wb") as new_file:
    fastavro.writer(new_file, schema, [
        {"name": "westlife1", "email": "westlife1@naver.com"},
        {"name": "westlife2", "email": "westlife2@naver.com"},
        {"name": "westlife3", "email": "westlife3@naver.com"},
    ], metadata={"owner" : "westlife0615"})

with open(avro_file_path, 'rb') as avro_file:
    block_readers = fastavro.block_reader(avro_file)
    for block_reader in block_readers:
        print(f"num_records : {block_reader.num_records}")
        for record in block_reader:
            print(record)

hexdump was used to check the serialized form of the Records.

hexdump -C  users.avro

00000000  4f 62 6a 01 06 0a 6f 77  6e 65 72 18 77 65 73 74  |Obj...owner.west|
00000010  6c 69 66 65 30 36 31 35  14 61 76 72 6f 2e 63 6f  |life0615.avro.co|
00000020  64 65 63 08 6e 75 6c 6c  16 61 76 72 6f 2e 73 63  |dec.null.avro.sc|
00000030  68 65 6d 61 a8 02 7b 22  6e 61 6d 65 73 70 61 63  |hema..{"namespac|
00000040  65 22 3a 20 22 65 78 61  6d 70 6c 65 2e 61 76 72  |e": "example.avr|
00000050  6f 22 2c 20 22 74 79 70  65 22 3a 20 22 72 65 63  |o", "type": "rec|
00000060  6f 72 64 22 2c 20 22 6e  61 6d 65 22 3a 20 22 55  |ord", "name": "U|
00000070  73 65 72 22 2c 20 22 66  69 65 6c 64 73 22 3a 20  |ser", "fields": |
00000080  5b 7b 22 6e 61 6d 65 22  3a 20 22 6e 61 6d 65 22  |[{"name": "name"|
00000090  2c 20 22 74 79 70 65 22  3a 20 22 73 74 72 69 6e  |, "type": "strin|
000000a0  67 22 7d 2c 20 7b 22 6e  61 6d 65 22 3a 20 22 65  |g"}, {"name": "e|
000000b0  6d 61 69 6c 22 2c 20 22  74 79 70 65 22 3a 20 22  |mail", "type": "|
000000c0  73 74 72 69 6e 67 22 7d  5d 7d 00 b4 b0 96 79 44  |string"}]}....yD|
000000d0  6a 87 af 79 08 e3 95 3d  f7 04 14 06 b4 01 12 77  |j..y...=.......w|
000000e0  65 73 74 6c 69 66 65 31  26 77 65 73 74 6c 69 66  |estlife1&westlif|
000000f0  65 31 40 6e 61 76 65 72  2e 63 6f 6d 12 77 65 73  |e1@naver.com.wes|
00000100  74 6c 69 66 65 32 26 77  65 73 74 6c 69 66 65 32  |tlife2&westlife2|
00000110  40 6e 61 76 65 72 2e 63  6f 6d 12 77 65 73 74 6c  |@naver.com.westl|
00000120  69 66 65 33 26 77 65 73  74 6c 69 66 65 33 40 6e  |ife3&westlife3@n|
00000130  61 76 65 72 2e 63 6f 6d  b4 b0 96 79 44 6a 87 af  |aver.com...yDj..|
00000140  79 08 e3 95 3d f7 04 14                           |y...=...|
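The Data Block in this dump starts at offset 0xdb, right after the 16-byte sync marker that closes the Header: 06 is the record count (3), b4 01 is the payload size (90 bytes), and each string is prefixed with its length (12 for the 9-byte name, 26 for the 19-byte email). The following sketch rebuilds those bytes and decodes them by hand, assuming the zigzag varint encoding from the Avro specification; the helper names are illustrative.

```python
import io

# Sync marker copied from the hexdump (offsets 0xcb-0xda, repeated at the end).
SYNC = bytes.fromhex("b4b09679446a87af7908e3953df70414")

# Rebuild the block bytes that begin at offset 0xdb in the hexdump.
payload = b"".join(
    b"\x12" + f"westlife{i}".encode() + b"\x26" + f"westlife{i}@naver.com".encode()
    for i in (1, 2, 3)
)
block = b"\x06" + b"\xb4\x01" + payload + SYNC


def read_long(stream) -> int:
    """Decode a zigzag varint from the stream."""
    shift = 0
    result = 0
    while True:
        b = stream.read(1)[0]
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def read_string(stream) -> str:
    return stream.read(read_long(stream)).decode("utf-8")


stream = io.BytesIO(block)
num_records = read_long(stream)            # 0x06 -> 3
byte_size = read_long(stream)              # 0xb4 0x01 -> 90
body = io.BytesIO(stream.read(byte_size))  # the serialized records
records = [
    {"name": read_string(body), "email": read_string(body)}
    for _ in range(num_records)
]
trailing_marker = stream.read(16)

print(num_records, byte_size)
for record in records:
    print(record)
print(trailing_marker == SYNC)
```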

Sync Marker.

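The sync marker separates the Data Blocks in an Avro File.
A Sync Marker is placed at the end of every Data Block,
so there is a Sync Marker between each pair of consecutive Data Blocks.
Data Blocks are physically delimited by these markers.

The Sync Marker is a randomly generated 16-byte value. It is generated once per file, recorded at the end of the Header, and repeated after each Data Block.
Thanks to the Sync Marker, a file can be read one Data Block at a time.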
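One practical use of the marker is resynchronization: a reader that starts at an arbitrary byte offset can scan forward to the next marker and resume at a block boundary. The sketch below illustrates the idea on three tiny hand-built blocks. The fixed marker value, the helper names, and the payloads are assumptions made for the example; real files use a random marker, and real readers may implement this differently.

```python
def encode_long(n: int) -> bytes:
    """Zigzag-encode an integer, then write it as a base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def make_block(count: int, payload: bytes, sync: bytes) -> bytes:
    """Record count, payload size, payload, then the sync marker."""
    return encode_long(count) + encode_long(len(payload)) + payload + sync


def next_block_start(data: bytes, sync: bytes, position: int) -> int:
    """Offset just past the first sync marker at or after position, or -1."""
    found = data.find(sync, position)
    return -1 if found == -1 else found + len(sync)


# Fixed marker for a reproducible demo; real files use 16 random bytes.
sync = bytes(range(16, 32))

# Three blocks, each holding one single-character string record.
data = b"".join(make_block(1, b"\x02" + ch, sync) for ch in (b"a", b"b", b"c"))

# Each block is 20 bytes long, so blocks begin at offsets 0, 20, and 40.
starts = [next_block_start(data, sync, p) for p in (1, 10, 50)]
print(len(data), starts)  # 60 [20, 40, -1]
```

A result of -1 means no further marker was found, so there is no later block boundary to resume from.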

Footer.

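The Avro object container file format does not actually define a Footer.
The Schema, Codec, and other metadata are stored only once, in the Header,
and the file simply ends with the Sync Marker of the last Data Block.
The hexdump above shows this: the last 16 bytes of the file are the Sync Marker, with no metadata after it.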
