[SparkML] Implementing Linear Regression (Kaggle)
Spark/SparkML | 2021. 12. 5. 21:43
Introduction.
The link below points to the Linear Regression problem from the Conf. Interval for Inferences notebook on Kaggle.
https://www.kaggle.com/code/abdulazizdusbabaev/conf-interval-for-inferences/notebook
Below is example code that implements Linear Regression with SparkML.
import os

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, isnan
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType

if __name__ == "__main__":
    # Create a local Spark session for the example.
    spark = SparkSession.builder \
        .appName("conf_interval_for_inferences") \
        .master("local[*]") \
        .config("spark.driver.bindAddress", "localhost") \
        .getOrCreate()

    train_file_path = os.path.abspath(os.path.join("..", "resources/Conf_Interval_for_Inferences/train.csv"))
    test_file_path = os.path.abspath(os.path.join("..", "resources/Conf_Interval_for_Inferences/test.csv"))

    # The dataset has a single integer feature column x and a double label column y.
    schema = StructType([
        StructField("x", IntegerType(), True),
        StructField("y", DoubleType(), True)
    ])

    # Read the CSV files and drop rows whose label y is null or NaN.
    train_df = spark.read.csv(train_file_path, header=True, schema=schema)
    train_df = train_df.filter(col("y").isNotNull() & ~isnan(col("y")))

    test_df = spark.read.csv(test_file_path, header=True, schema=schema)
    test_df = test_df.filter(col("y").isNotNull() & ~isnan(col("y")))

    # Assemble the x column into the vector-typed "feature" column that LinearRegression expects.
    assembler = VectorAssembler(inputCols=["x"], outputCol="feature")
    train_df = assembler.transform(train_df)
    test_df = assembler.transform(test_df)

    # Fit the linear regression model on the training data and predict on the test data.
    lr = LinearRegression(featuresCol="feature", labelCol="y")
    model = lr.fit(train_df)

    prediction = model.transform(test_df)
    prediction.show()

    spark.stop()
+---+-----------+-------+------------------+
|  x|          y|feature|        prediction|
+---+-----------+-------+------------------+
| 77|79.77515201| [77.0]| 76.94327593863451|
| 21|23.17727887| [21.0]|20.906518554681234|
| 22|25.60926156| [22.0]| 21.90717493653754|
| 20|17.85738813| [20.0]|19.905862172824925|
| 36|41.84986439| [36.0]| 35.91636428252587|
| 15|9.805234876| [15.0]|14.902580263543383|
| 62|58.87465933| [62.0]|61.933430210789886|
| 95|97.61793701| [95.0]| 94.95509081204807|
| 20|18.39512747| [20.0]|19.905862172824925|
|  5|8.746747654|  [5.0]| 4.896016444980298|
|  4|2.811415826|  [4.0]|3.8953600631239897|
| 19|17.09537241| [19.0]|18.905205790968616|
| 96|95.14907176| [96.0]| 95.95574719390437|
| 62|61.38800663| [62.0]|61.933430210789886|
| 36|40.24701716| [36.0]| 35.91636428252587|
| 15|14.82248589| [15.0]|14.902580263543383|
| 65|66.95806869| [65.0]|  64.9353993563588|
| 14|16.63507984| [14.0]|13.901923881687075|
| 87|90.65513736| [87.0]|  86.9498397571976|
| 69|77.22982636| [69.0]| 68.93802488378404|
+---+-----------+-------+------------------+
only showing top 20 rows
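To check how good the fit actually is, you could additionally inspect the fitted coefficients and score the test predictions. The following is a minimal sketch of that step, assuming the model and prediction variables from the code above and that it runs before spark.stop(); RegressionEvaluator is Spark's built-in evaluator for regression metrics.

from pyspark.ml.evaluation import RegressionEvaluator

# Slope and intercept of the fitted line (prediction ≈ coefficient * x + intercept).
print("coefficients:", model.coefficients)
print("intercept:", model.intercept)

# Fit statistics on the training data, exposed via the model summary.
print("train RMSE:", model.summary.rootMeanSquaredError)
print("train r2:", model.summary.r2)

# Score the predictions made on the test DataFrame.
rmse_evaluator = RegressionEvaluator(labelCol="y", predictionCol="prediction", metricName="rmse")
r2_evaluator = RegressionEvaluator(labelCol="y", predictionCol="prediction", metricName="r2")
print("test RMSE:", rmse_evaluator.evaluate(prediction))
print("test r2:", r2_evaluator.evaluate(prediction))

Since model.transform writes its output to the default "prediction" column and the label column here is "y", both evaluators can consume the prediction DataFrame directly.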