  • [SparkML] Implementing Linear Regression (Kaggle)
    Spark/SparkML 2021. 12. 5. 21:43


    Introduction.

    The link below is the Kaggle notebook "Conf. Interval for Inferences", a Linear Regression problem.

    https://www.kaggle.com/code/abdulazizdusbabaev/conf-interval-for-inferences/notebook

     


     

    The following example code implements Linear Regression with SparkML.

     

    import os
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, isnan
    from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType
    
    if __name__ == "__main__":
        # Run Spark locally, using all available cores.
        spark = SparkSession.builder \
            .appName("conf_interval_for_inferences") \
            .master("local[*]") \
            .config("spark.driver.bindAddress", "localhost") \
            .getOrCreate()
    
        # CSV paths, resolved relative to the current working directory.
        train_file_path = os.path.abspath(os.path.join("..", "resources/Conf_Interval_for_Inferences/train.csv"))
        test_file_path = os.path.abspath(os.path.join("..", "resources/Conf_Interval_for_Inferences/test.csv"))
    
        # Explicit schema: x is the single feature, y is the label.
        schema = StructType([
            StructField("x", IntegerType(), True),
            StructField("y", DoubleType(), True)
        ])
    
        # Load the training data and drop rows whose label is null or NaN.
        train_df = spark.read.csv(train_file_path, header=True, schema=schema)
        train_df = train_df.filter((col("y").isNotNull() & ~isnan(col("y"))))
    
        # Load the test data and apply the same label filter.
        test_df = spark.read.csv(test_file_path, header=True, schema=schema)
        test_df = test_df.filter((col("y").isNotNull() & ~isnan(col("y"))))
    
        # Pack x into the vector column that the estimator reads as features.
        assembler = VectorAssembler(inputCols=["x"], outputCol="feature")
        train_df = assembler.transform(train_df)
        test_df = assembler.transform(test_df)
    
        # Fit on the training set, then add a prediction column to the test set.
        lr = LinearRegression(featuresCol="feature", labelCol="y")
        model = lr.fit(train_df)
        prediction = model.transform(test_df)
        prediction.show()
    
        # Release Spark resources.
        spark.stop()

    Output of prediction.show():

    +---+-----------+-------+------------------+
    |  x|          y|feature|        prediction|
    +---+-----------+-------+------------------+
    | 77|79.77515201| [77.0]| 76.94327593863451|
    | 21|23.17727887| [21.0]|20.906518554681234|
    | 22|25.60926156| [22.0]| 21.90717493653754|
    | 20|17.85738813| [20.0]|19.905862172824925|
    | 36|41.84986439| [36.0]| 35.91636428252587|
    | 15|9.805234876| [15.0]|14.902580263543383|
    | 62|58.87465933| [62.0]|61.933430210789886|
    | 95|97.61793701| [95.0]| 94.95509081204807|
    | 20|18.39512747| [20.0]|19.905862172824925|
    |  5|8.746747654|  [5.0]| 4.896016444980298|
    |  4|2.811415826|  [4.0]|3.8953600631239897|
    | 19|17.09537241| [19.0]|18.905205790968616|
    | 96|95.14907176| [96.0]| 95.95574719390437|
    | 62|61.38800663| [62.0]|61.933430210789886|
    | 36|40.24701716| [36.0]| 35.91636428252587|
    | 15|14.82248589| [15.0]|14.902580263543383|
    | 65|66.95806869| [65.0]|  64.9353993563588|
    | 14|16.63507984| [14.0]|13.901923881687075|
    | 87|90.65513736| [87.0]|  86.9498397571976|
    | 69|77.22982636| [69.0]| 68.93802488378404|
    +---+-----------+-------+------------------+
    only showing top 20 rows
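
Because the model has a single feature, every value in the prediction column lies on one straight line, so the fitted slope and intercept can be read back from any two rows of the output above. The sketch below does this in plain Python (no Spark needed) and also computes an RMSE over a handful of rows. The helper names `line_through` and `predict` are illustrative and not part of SparkML, and the RMSE covers only the sampled rows, not the full test set.

```python
# A few (x, y, prediction) rows copied from the output table above.
rows = [
    (77, 79.77515201, 76.94327593863451),
    (21, 23.17727887, 20.906518554681234),
    (20, 17.85738813, 19.905862172824925),
    (4, 2.811415826, 3.8953600631239897),
]


def line_through(p, q):
    """Return (slope, intercept) of the line through two (x, prediction) points."""
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1


# Predictions of a one-feature linear model are collinear,
# so the rows for x=20 and x=21 are enough to determine the line.
slope, intercept = line_through((20, rows[2][2]), (21, rows[1][2]))


def predict(x):
    """Apply the recovered line to a new x value."""
    return slope * x + intercept


# Root-mean-squared error over the sampled rows only.
rmse = (sum((y - p) ** 2 for _, y, p in rows) / len(rows)) ** 0.5

print(f"slope={slope:.6f} intercept={intercept:.6f} rmse={rmse:.3f}")
```

The recovered line reproduces the other rows of the table (for example x=77 and x=4), which is a quick sanity check that the predictions really come from one slope and one intercept.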
