[Spark] Understanding Row

Spark 2024. 3. 3. 12:33


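Introduction.

In this post, we'll take a look at Row in Spark.

A Row is an individual data record that makes up a DataFrame.

So a DataFrame can be seen as a collection of Rows.

A DataFrame is created as shown below.

We use the createDataFrame function of SparkSession to build the DataFrame.

The two key arguments here are the Rows and the Schema. The schema is optional; when it is omitted, Spark infers it from the data.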

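from pyspark.sql import Row, SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Create the rows.
rows = [Row(name="Andy", age=32), Row(name="Bob", age=43)]

# Create the schema.
schema = StructType([
    StructField("name", StringType(), False),
    StructField("age", IntegerType(), False)
])

# Create the DataFrame.
df = spark.createDataFrame(rows, schema)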

In other words, a DataFrame is created from a combination of Rows and a Schema, and it can be thought of as a group of Rows constrained by that Schema.

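Creating a Row.

A Row is the basic unit of data used to create a DataFrame.

StructType, in turn, is the DataFrame's metadata: it defines the data type and other settings of each column.

Specifically, a StructType is a list of StructFields, and each StructField consists of three parts: Name, DataType, and Nullable Flag.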

Named Arguments.

A Row can be created with Named Arguments.

In the example below, we create a Row in order to build a DataFrame with two columns, name and age.

The Row constructor is called with the Named Arguments name and age.

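from pyspark.sql import Row, SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Build a Row whose fields are labeled by keyword.
row = Row(name="Andy", age=32)
rows = [row]
schema = StructType([
    StructField("name", StringType(), False),
    StructField("age", IntegerType(), False)
])

df = spark.createDataFrame(rows, schema)
df.show()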

< df.show output >

+----+---+
|name|age|
+----+---+
|Andy| 32|
+----+---+

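Positional Arguments.

Besides Named Arguments, a Row can also be created with Positional Arguments.

Both the Row and the StructType are used when the DataFrame is created.

A positional Row carries no field names, so its values are matched to the schema by position. The order of the items in the Row must therefore match the order of the StructFields in the StructType.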

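from pyspark.sql import Row, SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Values only; they are matched to the schema fields by position.
row = Row("Andy", 32)
rows = [row]
schema = StructType([
    StructField("name", StringType(), False),
    StructField("age", IntegerType(), False)
])

df = spark.createDataFrame(rows, schema)
df.show()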

< df.show output >

+----+---+
|name|age|
+----+---+
|Andy| 32|
+----+---+

DataFrame Collect.

The collect function of a DataFrame returns the Rows of that DataFrame.

collect is an Action, so the Transformations accumulated up to that point are executed when it is called.

As a result, the Rows returned by collect may have a different shape from the original Rows.

The example below applies the withColumn Transformation, which adds a new column named "city" to the DataFrame.

After that, it runs the "collect" Action to obtain the Rows.

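from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

row = Row("Andy", 32)
rows = [row]
schema = StructType([
    StructField("name", StringType(), False),
    StructField("age", IntegerType(), False),
])

df = spark.createDataFrame(rows, schema)
# Transformation: add a constant "city" column.
df = df.withColumn("city", lit("Seoul"))
# Action: run the plan and bring the Rows back to the driver.
result = df.collect()
df.show()

print("positional access : ", result[0][0])
print("positional access : ", result[0][1])
print("positional access : ", result[0][2])

print("named access : ", result[0]["name"])
print("named access : ", result[0]["age"])
print("named access : ", result[0]["city"])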

< Output >

As the result below shows, the new city column has been added.

The values inside a Row obtained through the "collect" Action can be accessed either by name or by position.

+----+---+-----+
|name|age| city|
+----+---+-----+
|Andy| 32|Seoul|
+----+---+-----+

positional access :  Andy
positional access :  32
positional access :  Seoul
named access :  Andy
named access :  32
named access :  Seoul
