[SparkML] Implementing Kaggle Stars Classification (Logistic Regression)
Spark/SparkML 2023. 2. 21. 15:36
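Introduction.
The link below is the Kaggle notebook for Stars Classification.
https://www.kaggle.com/code/ybifoundation/stars-classification/notebook
The task is to build a classifier that assigns each star a class based on its properties.
Stellar classification here uses the spectral classes O, B, A, F, G, K, and M.
The class is predicted from features such as the star's size, brightness, and temperature.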
The columns in the provided dataset are described below.
- Absolute Temperature (in K) // absolute temperature
- Relative Luminosity (L/Lo) // luminosity relative to the Sun
- Relative Radius (R/Ro) // radius relative to the Sun
- Absolute Magnitude (Mv) // absolute magnitude
- Star Color (white, Red, Blue, Yellow, yellow-orange etc)
- Spectral Class (O, B, A, F, G, K, M) // stellar classification
- Star Type (Red Dwarf, Brown Dwarf, White Dwarf, Main Sequence, SuperGiants, HyperGiants)
- Lo = 3.828 x 10^26 Watts (Avg Luminosity of Sun)
- Ro = 6.9551 x 10^8 m (Avg Radius of Sun)
Implementation.
In my case, I approached the Stars Classification modeling with Logistic Regression.
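Logistic Regression in Spark ML needs a numeric label column, so the string values in Spectral Class have to be turned into indices first, which is the job of StringIndexer. As far as I know, its default ordering is by descending frequency, so the most common class gets index 0.0. The plain-Python sketch below only imitates that ordering to make the idea concrete. The function name `index_labels` and the toy labels are hypothetical, and the alphabetical tie-break is an assumption based on Spark 3.x behavior.

```python
from collections import Counter


def index_labels(labels):
    """Map string labels to float indices, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; break ties alphabetically (assumed).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical toy labels, not the real Stars.csv distribution.
toy_labels = ["M", "M", "M", "B", "B", "O"]
mapping = index_labels(toy_labels)
print(mapping)  # {'M': 0.0, 'B': 1.0, 'O': 2.0}
```

Under this ordering, a label of 0.0 in the indexed column simply means "the most frequent class", so the numeric value carries no astronomical meaning by itself.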
import pandas as pd
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.sql import SparkSession, functions
from pyspark.sql.functions import col
from pyspark.sql.types import LongType, DoubleType, StringType, StructType, StructField

# Load the CSV with pandas first, then hand it over to Spark.
star_pd_df = pd.read_csv("https://github.com/YBIFoundation/Dataset/raw/main/Stars.csv")

spark = SparkSession.builder.master("local[*]").appName("stars-classification") \
    .config("spark.driver.bindAddress", "localhost").getOrCreate()

# The schema is applied by column position, so these names replace the CSV headers.
schema = StructType([
    StructField("Temperature(K)", LongType(), True),
    StructField("Luminosity(L / Lo)", DoubleType(), True),
    StructField("Radius(R / Ro)", DoubleType(), True),
    StructField("Absolute magnitude(Mv)", DoubleType(), True),
    StructField("Star type", LongType(), True),
    StructField("Star category", StringType(), True),
    StructField("Star color", StringType(), True),
    StructField("Spectral Class", StringType(), True),
])

star_df = spark.createDataFrame(star_pd_df, schema=schema)
star_df.printSchema()
star_df.show()

count = star_df.count()
print(f" count is {count} ")

# Combine the four numeric columns into a single feature vector column.
assembler = VectorAssembler(
    inputCols=["Temperature(K)", "Luminosity(L / Lo)", "Radius(R / Ro)", "Absolute magnitude(Mv)"],
    outputCol="features_vector")
star_df = assembler.transform(star_df)

# Encode the Spectral Class strings as numeric label indices.
indexer = StringIndexer(inputCol="Spectral Class", outputCol="label_vector")
indexer_model = indexer.fit(star_df)
star_df = indexer_model.transform(star_df)

lr = LogisticRegression(maxIter=1000, regParam=0.01,
                        featuresCol="features_vector", labelCol="label_vector")
model = lr.fit(star_df)

# Predict on the same DataFrame used for training and flag the rows that match.
predictions = model.transform(star_df)
predictions_with_matching = predictions.select("label_vector", "prediction") \
    .withColumn("is_matched", col("label_vector") == col("prediction"))
predictions_with_matching.show()
predictions_with_matching.groupBy("is_matched") \
    .agg(functions.count("is_matched").alias("count")).show()
root
 |-- Temperature(K): long (nullable = true)
 |-- Luminosity(L / Lo): double (nullable = true)
 |-- Radius(R / Ro): double (nullable = true)
 |-- Absolute magnitude(Mv): double (nullable = true)
 |-- Star type: long (nullable = true)
 |-- Star category: string (nullable = true)
 |-- Star color: string (nullable = true)
 |-- Spectral Class: string (nullable = true)

+--------------+------------------+--------------+----------------------+---------+-------------+----------+--------------+
|Temperature(K)|Luminosity(L / Lo)|Radius(R / Ro)|Absolute magnitude(Mv)|Star type|Star category|Star color|Spectral Class|
+--------------+------------------+--------------+----------------------+---------+-------------+----------+--------------+
|          3068|            0.0024|          0.17|                 16.12|        0|  Brown Dwarf|       Red|             M|
|          3042|            5.0E-4|        0.1542|                  16.6|        0|  Brown Dwarf|       Red|             M|
|          2600|            3.0E-4|         0.102|                  18.7|        0|  Brown Dwarf|       Red|             M|
|          2800|            2.0E-4|          0.16|                 16.65|        0|  Brown Dwarf|       Red|             M|
|          1939|           1.38E-4|         0.103|                 20.06|        0|  Brown Dwarf|       Red|             M|
|          2840|            6.5E-4|          0.11|                 16.98|        0|  Brown Dwarf|       Red|             M|
|          2637|            7.3E-4|         0.127|                 17.22|        0|  Brown Dwarf|       Red|             M|
|          2600|            4.0E-4|         0.096|                  17.4|        0|  Brown Dwarf|       Red|             M|
|          2650|            6.9E-4|          0.11|                 17.45|        0|  Brown Dwarf|       Red|             M|
|          2700|            1.8E-4|          0.13|                 16.05|        0|  Brown Dwarf|       Red|             M|
|          3600|            0.0029|          0.51|                 10.69|        1|    Red Dwarf|       Red|             M|
|          3129|            0.0122|        0.3761|                 11.79|        1|    Red Dwarf|       Red|             M|
|          3134|            4.0E-4|         0.196|                 13.21|        1|    Red Dwarf|       Red|             M|
|          3628|            0.0055|         0.393|                 10.48|        1|    Red Dwarf|       Red|             M|
|          2650|            6.0E-4|          0.14|                11.782|        1|    Red Dwarf|       Red|             M|
|          3340|            0.0038|          0.24|                 13.07|        1|    Red Dwarf|       Red|             M|
|          2799|            0.0018|          0.16|                 14.79|        1|    Red Dwarf|       Red|             M|
|          3692|           0.00367|          0.47|                  10.8|        1|    Red Dwarf|       Red|             M|
|          3192|           0.00362|        0.1967|                 13.53|        1|    Red Dwarf|       Red|             M|
|          3441|             0.039|         0.351|                 11.18|        1|    Red Dwarf|       Red|             M|
+--------------+------------------+--------------+----------------------+---------+-------------+----------+--------------+
only showing top 20 rows

+------------+----------+----------+
|label_vector|prediction|is_matched|
+------------+----------+----------+
|         0.0|       0.0|      true|
|         0.0|       0.0|      true|
|         0.0|       0.0|      true|
|         0.0|       0.0|      true|
|         0.0|       0.0|      true|
|         0.0|       0.0|      true|
|         0.0|       0.0|      true|
|         0.0|       0.0|      true|
|         0.0|       0.0|      true|
|         0.0|       0.0|      true|
|         0.0|       0.0|      true|
|         0.0|       0.0|      true|
|         0.0|       0.0|      true|
|         0.0|       0.0|      true|
|         0.0|       0.0|      true|
|         0.0|       0.0|      true|
|         0.0|       0.0|      true|
|         0.0|       0.0|      true|
|         0.0|       0.0|      true|
|         0.0|       0.0|      true|
+------------+----------+----------+
only showing top 20 rows

+----------+-----+
|is_matched|count|
+----------+-----+
|      true|  190|
|     false|   50|
+----------+-----+
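The last table can be reduced to a single accuracy figure. The sketch below does that arithmetic in plain Python with the two counts from the is_matched table; the helper name `accuracy` is hypothetical. Since the predictions were made on the same rows the model was fitted on, this should be read as training accuracy, which likely overstates how the model would do on unseen stars.

```python
def accuracy(matched, mismatched):
    """Fraction of rows whose prediction equals the label."""
    total = matched + mismatched
    if total == 0:
        raise ValueError("no predictions to score")
    return matched / total


# Counts taken from the is_matched table in the output above.
acc = accuracy(matched=190, mismatched=50)
print(f"{acc:.4f}")  # 0.7917
```

So roughly 79% of the 240 rows are classified correctly. A held-out split would be one way to get a less optimistic estimate.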