[SparkML] Implementing Linear Regression (Kaggle)
Spark/SparkML | 2021. 12. 5. 21:43
Introduction.
The link below points to the Linear Regression problem from the Conf. Interval for Inferences notebook on Kaggle.
https://www.kaggle.com/code/abdulazizdusbabaev/conf-interval-for-inferences/notebook
Below is example code that implements Linear Regression with SparkML.
import os

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, isnan
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType

if __name__ == "__main__":
    # Create a local Spark session for the example.
    spark = SparkSession.builder \
        .appName("conf_interval_for_inferences") \
        .master("local[*]") \
        .config("spark.driver.bindAddress", "localhost") \
        .getOrCreate()

    train_file_path = os.path.abspath(os.path.join("..", "resources/Conf_Interval_for_Inferences/train.csv"))
    test_file_path = os.path.abspath(os.path.join("..", "resources/Conf_Interval_for_Inferences/test.csv"))

    # The dataset has a single integer feature column x and a double label column y.
    schema = StructType([
        StructField("x", IntegerType(), True),
        StructField("y", DoubleType(), True)
    ])

    # Read the CSV files and drop rows whose label y is null or NaN.
    train_df = spark.read.csv(train_file_path, header=True, schema=schema)
    train_df = train_df.filter(col("y").isNotNull() & ~isnan(col("y")))

    test_df = spark.read.csv(test_file_path, header=True, schema=schema)
    test_df = test_df.filter(col("y").isNotNull() & ~isnan(col("y")))

    # Assemble the x column into the vector-typed "feature" column that LinearRegression expects.
    assembler = VectorAssembler(inputCols=["x"], outputCol="feature")
    train_df = assembler.transform(train_df)
    test_df = assembler.transform(test_df)

    # Fit the linear regression model on the training data and predict on the test data.
    lr = LinearRegression(featuresCol="feature", labelCol="y")
    model = lr.fit(train_df)

    prediction = model.transform(test_df)
    prediction.show()

    spark.stop()
+---+-----------+-------+------------------+
|  x|          y|feature|        prediction|
+---+-----------+-------+------------------+
| 77|79.77515201| [77.0]| 76.94327593863451|
| 21|23.17727887| [21.0]|20.906518554681234|
| 22|25.60926156| [22.0]| 21.90717493653754|
| 20|17.85738813| [20.0]|19.905862172824925|
| 36|41.84986439| [36.0]| 35.91636428252587|
| 15|9.805234876| [15.0]|14.902580263543383|
| 62|58.87465933| [62.0]|61.933430210789886|
| 95|97.61793701| [95.0]| 94.95509081204807|
| 20|18.39512747| [20.0]|19.905862172824925|
|  5|8.746747654|  [5.0]| 4.896016444980298|
|  4|2.811415826|  [4.0]|3.8953600631239897|
| 19|17.09537241| [19.0]|18.905205790968616|
| 96|95.14907176| [96.0]| 95.95574719390437|
| 62|61.38800663| [62.0]|61.933430210789886|
| 36|40.24701716| [36.0]| 35.91636428252587|
| 15|14.82248589| [15.0]|14.902580263543383|
| 65|66.95806869| [65.0]|  64.9353993563588|
| 14|16.63507984| [14.0]|13.901923881687075|
| 87|90.65513736| [87.0]|  86.9498397571976|
| 69|77.22982636| [69.0]| 68.93802488378404|
+---+-----------+-------+------------------+
only showing top 20 rows
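To check how good the fit actually is, you could additionally inspect the fitted coefficients and score the test predictions. The following is a minimal sketch of that step, assuming the model and prediction variables from the code above and that it runs before spark.stop(); RegressionEvaluator is Spark's built-in evaluator for regression metrics.

from pyspark.ml.evaluation import RegressionEvaluator

# Slope and intercept of the fitted line (prediction ≈ coefficient * x + intercept).
print("coefficients:", model.coefficients)
print("intercept:", model.intercept)

# Fit statistics on the training data, exposed via the model summary.
print("train RMSE:", model.summary.rootMeanSquaredError)
print("train r2:", model.summary.r2)

# Score the predictions made on the test DataFrame.
rmse_evaluator = RegressionEvaluator(labelCol="y", predictionCol="prediction", metricName="rmse")
r2_evaluator = RegressionEvaluator(labelCol="y", predictionCol="prediction", metricName="r2")
print("test RMSE:", rmse_evaluator.evaluate(prediction))
print("test r2:", r2_evaluator.evaluate(prediction))

Since model.transform writes its output to the default "prediction" column and the label column here is "y", both evaluators can consume the prediction DataFrame directly.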