  • [SparkML] Implementing Kaggle Stars Classification (Logistic Regression)
    Spark/SparkML 2023. 2. 21. 15:36

     

    Introduction.

    The link below points to the Kaggle Stars Classification notebook.

     

    https://www.kaggle.com/code/ybifoundation/stars-classification/notebook

     


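    The goal is to implement a classifier that assigns each star to a spectral class based on its physical properties.

    Stellar spectral classification labels stars with the classes O, B, A, F, G, K, and M.

    The class is predicted from features such as the star's size, brightness, and temperature.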

     

    The columns of the provided dataset are described below.

     

    - Absolute Temperature (in K) // temperature in kelvin
    - Relative Luminosity (L/Lo) // luminosity relative to the Sun
    - Relative Radius (R/Ro) // radius relative to the Sun
    - Absolute Magnitude (Mv) // intrinsic brightness on the magnitude scale
    - Star Color (white, Red, Blue, Yellow, yellow-orange, etc.)
    - Spectral Class (O, B, A, F, G, K, M) // stellar classification
    - Star Type (Red Dwarf, Brown Dwarf, White Dwarf, Main Sequence, SuperGiants, HyperGiants)
    - Lo = 3.828 x 10^26 Watts (Avg Luminosity of Sun)
    - Ro = 6.9551 x 10^8 m (Avg Radius of Sun)
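
To make the relative units concrete, here is a minimal Python sketch that converts L/Lo and R/Ro values to SI units using the two solar constants listed above. The function and constant names are hypothetical and chosen only for illustration; the sample values 0.0024 and 0.17 are the luminosity and radius of the first row in the dataset output shown later in this post.

```python
# Solar reference constants, as listed in the column descriptions.
L_SUN_WATTS = 3.828e26   # Lo, average luminosity of the Sun
R_SUN_METERS = 6.9551e8  # Ro, average radius of the Sun


def luminosity_to_watts(relative_luminosity: float) -> float:
    """Convert a Luminosity (L/Lo) value to watts."""
    return relative_luminosity * L_SUN_WATTS


def radius_to_meters(relative_radius: float) -> float:
    """Convert a Radius (R/Ro) value to meters."""
    return relative_radius * R_SUN_METERS


if __name__ == "__main__":
    # First dataset row: L/Lo = 0.0024, R/Ro = 0.17
    print(f"{luminosity_to_watts(0.0024):.4e} W")
    print(f"{radius_to_meters(0.17):.4e} m")
```

The model itself is trained on the relative values, so this conversion is only useful for interpreting the numbers, not as a preprocessing step.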

     

    Implementation.

     

    I approached the Stars Classification model with Logistic Regression.

     

    import pandas as pd

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler, StringIndexer
    from pyspark.sql import SparkSession, functions
    from pyspark.sql.functions import col
    from pyspark.sql.types import LongType, DoubleType, StringType, StructType, StructField

    # Load the CSV with pandas, then start a local Spark session.
    star_pd_df = pd.read_csv("https://github.com/YBIFoundation/Dataset/raw/main/Stars.csv")
    spark = SparkSession.builder.master("local[*]").appName("stars-classification") \
        .config("spark.driver.bindAddress", "localhost").getOrCreate()

    # Explicit schema with the column names used in the rest of the script.
    schema = StructType([
        StructField("Temperature(K)", LongType(), True),
        StructField("Luminosity(L / Lo)", DoubleType(), True),
        StructField("Radius(R / Ro)", DoubleType(), True),
        StructField("Absolute magnitude(Mv)", DoubleType(), True),
        StructField("Star type", LongType(), True),
        StructField("Star category", StringType(), True),
        StructField("Star color", StringType(), True),
        StructField("Spectral Class", StringType(), True),
    ])
    star_df = spark.createDataFrame(star_pd_df, schema=schema)
    star_df.printSchema()
    star_df.show()
    count = star_df.count()
    print(f" count is {count} ")

    # Combine the four numeric columns into a single feature vector.
    assembler = VectorAssembler(
        inputCols=["Temperature(K)", "Luminosity(L / Lo)", "Radius(R / Ro)", "Absolute magnitude(Mv)"],
        outputCol="features_vector")
    star_df = assembler.transform(star_df)

    # Encode the Spectral Class strings as numeric label indices.
    indexer = StringIndexer(inputCol="Spectral Class", outputCol="label_vector")
    indexer_model = indexer.fit(star_df)
    star_df = indexer_model.transform(star_df)

    # Fit logistic regression, then predict on the same rows it was trained on.
    lr = LogisticRegression(maxIter=1000, regParam=0.01, featuresCol="features_vector", labelCol="label_vector")
    model = lr.fit(star_df)
    predictions = model.transform(star_df)

    # Flag each row where the predicted index equals the true label index.
    predictions_with_matching = predictions.select("label_vector", "prediction") \
        .withColumn("is_matched", col("label_vector") == col("prediction"))

    predictions_with_matching.show()
    # Count correct and incorrect predictions.
    predictions_with_matching.groupBy("is_matched").agg(functions.count("is_matched").alias("count")).show()

    Execution output:

    root
     |-- Temperature(K): long (nullable = true)
     |-- Luminosity(L / Lo): double (nullable = true)
     |-- Radius(R / Ro): double (nullable = true)
     |-- Absolute magnitude(Mv): double (nullable = true)
     |-- Star type: long (nullable = true)
     |-- Star category: string (nullable = true)
     |-- Star color: string (nullable = true)
     |-- Spectral Class: string (nullable = true)
    
    +--------------+------------------+--------------+----------------------+---------+-------------+----------+--------------+
    |Temperature(K)|Luminosity(L / Lo)|Radius(R / Ro)|Absolute magnitude(Mv)|Star type|Star category|Star color|Spectral Class|
    +--------------+------------------+--------------+----------------------+---------+-------------+----------+--------------+
    |          3068|            0.0024|          0.17|                 16.12|        0|  Brown Dwarf|       Red|             M|
    |          3042|            5.0E-4|        0.1542|                  16.6|        0|  Brown Dwarf|       Red|             M|
    |          2600|            3.0E-4|         0.102|                  18.7|        0|  Brown Dwarf|       Red|             M|
    |          2800|            2.0E-4|          0.16|                 16.65|        0|  Brown Dwarf|       Red|             M|
    |          1939|           1.38E-4|         0.103|                 20.06|        0|  Brown Dwarf|       Red|             M|
    |          2840|            6.5E-4|          0.11|                 16.98|        0|  Brown Dwarf|       Red|             M|
    |          2637|            7.3E-4|         0.127|                 17.22|        0|  Brown Dwarf|       Red|             M|
    |          2600|            4.0E-4|         0.096|                  17.4|        0|  Brown Dwarf|       Red|             M|
    |          2650|            6.9E-4|          0.11|                 17.45|        0|  Brown Dwarf|       Red|             M|
    |          2700|            1.8E-4|          0.13|                 16.05|        0|  Brown Dwarf|       Red|             M|
    |          3600|            0.0029|          0.51|                 10.69|        1|    Red Dwarf|       Red|             M|
    |          3129|            0.0122|        0.3761|                 11.79|        1|    Red Dwarf|       Red|             M|
    |          3134|            4.0E-4|         0.196|                 13.21|        1|    Red Dwarf|       Red|             M|
    |          3628|            0.0055|         0.393|                 10.48|        1|    Red Dwarf|       Red|             M|
    |          2650|            6.0E-4|          0.14|                11.782|        1|    Red Dwarf|       Red|             M|
    |          3340|            0.0038|          0.24|                 13.07|        1|    Red Dwarf|       Red|             M|
    |          2799|            0.0018|          0.16|                 14.79|        1|    Red Dwarf|       Red|             M|
    |          3692|           0.00367|          0.47|                  10.8|        1|    Red Dwarf|       Red|             M|
    |          3192|           0.00362|        0.1967|                 13.53|        1|    Red Dwarf|       Red|             M|
    |          3441|             0.039|         0.351|                 11.18|        1|    Red Dwarf|       Red|             M|
    +--------------+------------------+--------------+----------------------+---------+-------------+----------+--------------+
    only showing top 20 rows
    
    +------------+----------+----------+
    |label_vector|prediction|is_matched|
    +------------+----------+----------+
    |         0.0|       0.0|      true|
    |         0.0|       0.0|      true|
    |         0.0|       0.0|      true|
    |         0.0|       0.0|      true|
    |         0.0|       0.0|      true|
    |         0.0|       0.0|      true|
    |         0.0|       0.0|      true|
    |         0.0|       0.0|      true|
    |         0.0|       0.0|      true|
    |         0.0|       0.0|      true|
    |         0.0|       0.0|      true|
    |         0.0|       0.0|      true|
    |         0.0|       0.0|      true|
    |         0.0|       0.0|      true|
    |         0.0|       0.0|      true|
    |         0.0|       0.0|      true|
    |         0.0|       0.0|      true|
    |         0.0|       0.0|      true|
    |         0.0|       0.0|      true|
    |         0.0|       0.0|      true|
    +------------+----------+----------+
    only showing top 20 rows
    
    +----------+-----+
    |is_matched|count|
    +----------+-----+
    |      true|  190|
    |     false|   50|
    +----------+-----+
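
The counts in the last table translate directly into an accuracy figure. The sketch below does that arithmetic in plain Python; the helper name `accuracy_from_counts` is hypothetical. Note that the script calls `model.transform` on the same `star_df` it was fitted on, so this is training-set accuracy rather than an estimate of how the model would perform on unseen stars.

```python
def accuracy_from_counts(matched: int, mismatched: int) -> float:
    """Return the fraction of rows whose prediction equals the label."""
    total = matched + mismatched
    if total == 0:
        raise ValueError("no predictions to score")
    return matched / total


# Counts taken from the is_matched table above.
accuracy = accuracy_from_counts(matched=190, mismatched=50)
print(f"accuracy = {accuracy:.4f}")  # prints "accuracy = 0.7917" (190 / 240)
```

In other words, roughly 79% of the 240 rows are assigned their own spectral class by the fitted model.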

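
The `label_vector` values in the prediction table are indices produced by `StringIndexer`, not the class letters themselves. By default `StringIndexer` orders labels by descending frequency, so the most common class receives index 0.0, which is consistent with the M rows above showing 0.0. The following plain-Python sketch mimics that default ordering to illustrate the idea. `index_labels` is a hypothetical helper, the sample labels are made up, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter


def index_labels(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Made-up sample, not the real dataset.
sample = ["M", "M", "M", "B", "B", "O"]
mapping = index_labels(sample)
print(mapping)  # {'M': 0.0, 'B': 1.0, 'O': 2.0}
print([mapping[label] for label in sample])
```

To read predictions as class letters again, the index has to be mapped back through the same ordering that was learned when the indexer was fitted.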