Data Preprocessing and Missing Values

Sephora Data Preprocessing and Missing-Value Handling

[Data to Merge]

• Product data

product_name

size

variation_type

ingredients

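sale_price_usd (adjusted with a function)

limited_edition (type must be changed from int64 to True/False)

new (type must be changed from int64 to True/False)

online_only (type must be changed from int64 to True/False)

out_of_stock (type must be changed from int64 to True/False)

sephora_exclusive (type must be changed from int64 to True/False)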

highlights

primary_category 8494 non-null object

secondary_category 8486 non-null object

tertiary_category 7504 non-null object

child_count 8494 non-null int64

child_max_price 2754 non-null float64

child_min_price 2754 non-null float64
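The non-null counts above can be converted into missing counts and ratios per column. The following is a minimal sketch, assuming a pandas DataFrame; the helper name `missing_summary` and the toy frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing count and missing ratio (%) per column, worst first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(2),
    })
    return summary.sort_values("n_missing", ascending=False)


# Toy data standing in for product_info.csv (values are made up).
toy = pd.DataFrame({
    "primary_category": ["Skincare", "Makeup", "Skincare", "Hair"],
    "tertiary_category": ["Serums", None, None, "Shampoo"],
    "child_max_price": [np.nan, np.nan, 30.0, np.nan],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(product_df)` would give the same kind of table for every product column, which makes it easier to decide between dropping rows and filling values.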

Remove from the review data

skin_tone

eye_color

hair_color

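Combined preprocessing

is_recommended → convert to bool

helpfulness: fill missing values with 0, round, … change to 0

review_text: drop rows with missing values

review_title: replace missing values with unknown

skin_type: replace missing values with unknown

size: drop rows with missing values

variation_type: drop rows with missing values

ingredients: drop rows with missing values

highlights: replace missing values with unknown

tertiary_category: replace missing values with unknown

child_max_price: replace missing values with 0

child_min_price: replace missing values with 0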
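The drop and fill rules in this list can also be written as data and applied in one pass. This is only a sketch under stated assumptions: the names `FILL_VALUES`, `DROP_IF_MISSING`, and `apply_missing_rules` are hypothetical, the toy frame is made up, and the type conversion for is_recommended is not covered here.

```python
import numpy as np
import pandas as pd

# Rules transcribed from the list above; these names are hypothetical.
FILL_VALUES = {
    "helpfulness": 0,
    "review_title": "unknown",
    "skin_type": "unknown",
    "highlights": "unknown",
    "tertiary_category": "unknown",
    "child_max_price": 0,
    "child_min_price": 0,
}
DROP_IF_MISSING = ["review_text", "size", "variation_type", "ingredients"]


def apply_missing_rules(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows missing a required column, then fill the remaining gaps."""
    present_drop = [c for c in DROP_IF_MISSING if c in df.columns]
    present_fill = {c: v for c, v in FILL_VALUES.items() if c in df.columns}
    out = df.dropna(subset=present_drop) if present_drop else df.copy()
    return out.fillna(value=present_fill).reset_index(drop=True)


# Toy data with only a few of the columns (values are made up).
toy = pd.DataFrame({
    "review_text": ["great", None, "ok"],
    "review_title": [None, "meh", "fine"],
    "size": ["1 oz", "2 oz", "3 oz"],
    "child_max_price": [np.nan, 10.0, 12.5],
})
cleaned = apply_missing_rules(toy)
print(cleaned)
```

Keeping the rules in a dict and a list means a rule can be changed in one place, and columns that are absent from a given frame are skipped rather than raising an error.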

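Preprocessing section

Left join with the reviews as the base table

Look at skincare products only

helpfulness is the ratio of helpful votes

1. Multiply by 100

Set NaN values to 0

Drop rows where the review text is NULL

Set review_title = UNKNOWN

Set skin_type = NONE

Drop rows with missing size.

size missing count: 43363

size missing ratio: 3.96%

Drop rows with missing ingredients (about 20,000 rows)

highlights missing values = unknown

Set null child price values to 0 throughout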
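The helpfulness steps noted above (set NaN to 0, multiply by 100) could look like the sketch below. It assumes helpfulness is stored as a ratio between 0 and 1; the output column name `helpfulness_pct` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy ratios standing in for the review data (values are made up).
reviews = pd.DataFrame({"helpfulness": [0.5, np.nan, 0.8333, 1.0]})

# Missing ratio -> 0, then scale the 0-1 ratio to a 0-100 percentage.
reviews["helpfulness_pct"] = (
    reviews["helpfulness"].fillna(0).mul(100).round(2)
)
print(reviews)
```

Writing to a separate column keeps the original ratio available, which matters if a later step expects values between 0 and 1.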

[Combined Code]

from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from scipy.stats import chi2_contingency
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import statsmodels.api as sm
from imblearn.over_sampling import SMOTE

raw_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/cjong/product_info.csv")
product_df = raw_df.copy()

review1 = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/cjong/reviews_0-250.csv")
review2 = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/cjong/reviews_250-500.csv")
review3 = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/cjong/reviews_500-750.csv")
review4 = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/cjong/reviews_750-1250.csv")
review5 = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/cjong/reviews_1250-end.csv")
reviews_df = pd.concat([review1, review2, review3, review4, review5], ignore_index=True)
# Drop unneeded columns such as "Unnamed"
reviews_df = reviews_df.loc[:, ~reviews_df.columns.str.contains('^Unnamed')]

# 1. Drop unneeded columns from the review data
reviews_df = reviews_df.drop(columns=["skin_tone", "eye_color", "hair_color"], errors="ignore")

# 2. Select only the needed columns from product_df
product_cols = [
    "product_id", "product_name", "size", "variation_type", "ingredients",
    "sale_price_usd", "limited_edition", "new", "online_only",
    "out_of_stock", "sephora_exclusive",
    "highlights", "primary_category", "secondary_category", "tertiary_category",
    "child_count", "child_max_price", "child_min_price"
]
product_sub = product_df[product_cols].copy()

# 3. Adjust sale_price_usd (fall back to price_usd when it is missing)
product_sub["sale_price_usd"] = product_df["sale_price_usd"].fillna(product_df["price_usd"])

# 4. Convert integer flags to booleans
bool_cols = ["limited_edition", "new", "online_only", "out_of_stock", "sephora_exclusive"]
for col in bool_cols:
    product_sub[col] = product_sub[col].astype(bool)

# 5. Merge with the review data (on product_id, reviews as the left table)
merged_df = pd.merge(reviews_df, product_sub, on="product_id", how="left")
# Use product_name_y (from the product table) as the final product_name
merged_df["product_name"] = merged_df["product_name_y"]

# Drop the now-redundant duplicate columns
merged_df = merged_df.drop(columns=["product_name_x", "product_name_y"], errors="ignore")

print("✅ Merge complete! shape:", merged_df.shape)
display(merged_df.head())
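A left join silently keeps reviews that have no matching product and multiplies rows if product_id repeats in the product table. One way to check both is sketched below with the `validate`, `indicator`, and `suffixes` parameters of `pd.merge`; the toy frames and their values are made up and stand in for reviews_df and product_sub.

```python
import pandas as pd

# Toy frames standing in for reviews_df and product_sub (values are made up).
reviews = pd.DataFrame({
    "product_id": ["P1", "P1", "P2", "P9"],
    "product_name": ["A", "A", "B", "Z"],
    "rating": [5, 4, 3, 1],
})
products = pd.DataFrame({
    "product_id": ["P1", "P2"],
    "product_name": ["A (product table)", "B (product table)"],
    "size": ["1 oz", None],
})

merged = pd.merge(
    reviews,
    products,
    on="product_id",
    how="left",
    suffixes=("_review", "_product"),  # clearer than the default _x / _y
    validate="many_to_one",            # raises MergeError if product_id repeats in products
    indicator=True,                    # adds a _merge column: 'both' or 'left_only'
)

# Reviews whose product_id was not found in the product table.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), "rows,", len(unmatched), "without a product match")
```

Unmatched reviews carry NaN in every product column, so they would be removed later by the size, variation_type, and ingredients drops; counting them here shows how many rows are lost for that reason.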

[Missing-Value Handling]

# Extract only products whose primary_category is "Skincare"
product_Skincare = product_df[product_df["primary_category"] == "Skincare"].copy()

# 1. is_recommended → convert to bool (missing values are treated as False)
merged_df["is_recommended"] = merged_df["is_recommended"].fillna(0).astype(bool)
# 2. helpfulness → fill missing values with 0, round to 2 decimal places
merged_df["helpfulness"] = merged_df["helpfulness"].fillna(0).round(2)
# 3. review_text → drop rows with missing values
merged_df = merged_df.dropna(subset=["review_text"])

# 4. review_title → replace missing values with "Unknown"
merged_df["review_title"] = merged_df["review_title"].fillna("Unknown")

# 5. skin_type → replace missing values with "Unknown"
merged_df["skin_type"] = merged_df["skin_type"].fillna("Unknown")

# 6. size → drop rows with missing values
merged_df = merged_df.dropna(subset=["size"])

# 7. variation_type → drop rows with missing values
merged_df = merged_df.dropna(subset=["variation_type"])

# 8. ingredients → drop rows with missing values
merged_df = merged_df.dropna(subset=["ingredients"])

# 9. highlights → replace missing values with "Unknown"
merged_df["highlights"] = merged_df["highlights"].fillna("Unknown")

# 10. tertiary_category → replace missing values with "Unknown"
merged_df["tertiary_category"] = merged_df["tertiary_category"].fillna("Unknown")

# 11. child_max_price / child_min_price → replace missing values with 0
merged_df["child_max_price"] = merged_df["child_max_price"].fillna(0)
merged_df["child_min_price"] = merged_df["child_min_price"].fillna(0)
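The code above builds product_Skincare from product_df, but the skincare filter is not applied to merged_df itself. If the intent is to keep only skincare reviews, one possible approach is sketched below; the toy frame and the name `skincare_reviews` are hypothetical.

```python
import pandas as pd

# Toy frame standing in for merged_df (values are made up).
merged = pd.DataFrame({
    "product_id": ["P1", "P2", "P3"],
    "primary_category": ["Skincare", "Makeup", "Skincare"],
    "review_text": ["nice", "ok", "good"],
})

# Keep only reviews whose product is in the Skincare category.
skincare_reviews = merged[merged["primary_category"] == "Skincare"].copy()

# Sanity check: list any columns that still contain missing values.
remaining_nulls = skincare_reviews.isna().sum()
print(remaining_nulls[remaining_nulls > 0])
```

Running the null check after all fills and drops confirms that the cleaning steps covered every column that is used downstream.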

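Data analysis for choosing a brand

Brand choice ⇒ face gym (vegan, animal rights, beauty devices)

Application (Olive Young or crawling): submit the project proposal

Data preprocessing (clean up missing values, outliers, types, etc.)

Explore and visualize key variables (distributions, correlations, etc.)

Derive initial insights for defining the problem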