
[How to Handle Missing Values & Outliers]

Archive

ETA

2025/06/16

Main Task

Sub Task

Assignee

Notes

Status

Done

Created

2025/06/16 07:00

Priority

High

Progress %

Task: Consolidate missing-value & outlier handling

• Merge the three DataFrames

import pandas as pd
# Inner joins keep only the user_ids present in all three DataFrames.
df_merged = pd.merge(df_user, df_act, on='user_id', how='inner')
df_merged = pd.merge(df_merged, df_stat, on='user_id', how='inner')
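
Because an inner merge silently drops users missing from any one table, it may help to check what is lost before merging. A minimal sketch, assuming toy data in place of the real `df_user`, `df_act`, and `df_stat`, and assuming one row per `user_id` in each frame (the `sessions` and `status` columns are hypothetical):

```python
import pandas as pd

# Toy frames standing in for df_user, df_act, and df_stat (hypothetical data).
df_user = pd.DataFrame({"user_id": [1, 2, 3], "gender": ["f", "m", "f"]})
df_act = pd.DataFrame({"user_id": [1, 2], "sessions": [5, 7]})
df_stat = pd.DataFrame({"user_id": [1, 3], "status": ["done", "dropout"]})

# An outer merge with indicator=True shows which side each user_id came from.
check = pd.merge(df_user, df_act, on="user_id", how="outer", indicator=True)
lost_in_act = check.loc[check["_merge"] == "left_only", "user_id"].tolist()

# Same chained inner merge as in the task. validate="one_to_one" is an
# assumption: it raises MergeError if user_id repeats in either frame.
df_merged = pd.merge(df_user, df_act, on="user_id", how="inner", validate="one_to_one")
df_merged = pd.merge(df_merged, df_stat, on="user_id", how="inner", validate="one_to_one")
```

If a table can legitimately hold several rows per user, `validate="one_to_many"` or dropping the argument would be the safer choice.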

user_demographics

• birthday: fill null values → mean

# Parse to datetime, then fill missing birthdays with the mean birthday.
df_user['birthday_filled'] = pd.to_datetime(df_user['birthday'])
birth_avg = df_user['birthday_filled'].mean()
df_user['birthday_filled'] = df_user['birthday_filled'].fillna(birth_avg)

df_user['birthday'] = pd.to_datetime(df_user['birthday'])  # convert in place so the checks below see datetimes
df_user['birthday'].dtype  # should be datetime64[ns]
df_user['birthday'].isna().sum()  # count NaT values -> 20

birth_avg = df_user['birthday'].mean()  # mean birthday, used to fill missing values

df_user['birthday_filled_mean'] = df_user['birthday'].fillna(birth_avg)

df_user['birthday_filled_mean'].isnull().sum()  # should be 0 after filling
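
The mean of a datetime column usually carries a time-of-day component, which looks odd in a birthday field. A minimal sketch of the same mean imputation on hypothetical sample data, with `normalize()` added as an optional extra step that is not part of the original task:

```python
import pandas as pd

# Hypothetical sample; the real df_user is assumed to have a 'birthday' column.
df_user = pd.DataFrame({"birthday": ["1990-01-01", None, "1990-01-03"]})

df_user["birthday"] = pd.to_datetime(df_user["birthday"], errors="coerce")
n_missing = int(df_user["birthday"].isna().sum())  # NaT count before filling

# Series.mean() skips NaT; normalize() resets the time-of-day to midnight.
birth_avg = df_user["birthday"].mean().normalize()
df_user["birthday_filled"] = df_user["birthday"].fillna(birth_avg)
```

Keeping the filled values in a separate column, as the task does, leaves the original missing-value pattern available for later checks.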

• gender: unify categories → lowercase

df_user['user_demographics_gender'] = df_user['gender'].str.lower().str.strip()

df_user['gender'].value_counts()  # category counts before cleaning

df_user['gender'].unique()

df_user['gender'] = df_user['gender'].str.lower().str.strip()

df_user['gender'].value_counts()  # confirm the categories are unified
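
A small illustration of why both `strip()` and `lower()` are needed, using hypothetical gender values rather than the real data:

```python
import pandas as pd

# Hypothetical raw values with mixed case and stray whitespace.
df_user = pd.DataFrame({"gender": [" Male", "female ", "FEMALE", None]})

# The .str accessor leaves missing values as missing, so no pre-filtering is needed.
cleaned = df_user["gender"].str.strip().str.lower()

# dropna=False keeps the missing entry visible in the counts.
counts = cleaned.value_counts(dropna=False)
```

Assigning the resulting Series directly, without converting it to a list, keeps the values aligned to the DataFrame index.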

• theme_mode: unify categories → lowercase

df_user['theme_mode'] = df_user['theme_mode'].replace({
    'customized' : 'custom', 
    'dark_mode' : 'dark',
    'Light' : 'light'
    })

df_user['theme_mode'].value_counts()

df_user['theme_mode'].unique()

df_user['theme_mode'] = df_user['theme_mode'].replace({
    'customized' : 'custom', 
    'dark_mode' : 'dark',
    'Light' : 'light'
    })
    
df_user['theme_mode'].value_counts()
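
An explicit mapping only covers the variants already seen. A minimal sketch, on hypothetical values, that lowercases first and then flags anything outside an assumed set of canonical labels (`light`, `dark`, `custom`):

```python
import pandas as pd

# Hypothetical raw values; ' Custom ' is a variant the mapping alone would miss.
df_user = pd.DataFrame(
    {"theme_mode": ["Light", "dark_mode", "customized", "dark", " Custom "]}
)

# Lowercasing first means the mapping only needs the lowercase variants.
theme_map = {"customized": "custom", "dark_mode": "dark"}
cleaned = df_user["theme_mode"].str.strip().str.lower().replace(theme_map)

# Assumed canonical labels; anything else is reported for manual review.
allowed = {"light", "dark", "custom"}
unexpected = sorted(set(cleaned.dropna()) - allowed)
```

An empty `unexpected` list suggests the mapping covers every variant in the column.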

book_reading_status

• dropout_reason_detail: fix typos → fill in missing data

# Fill missing details with the category, then fix the typo '금한일' -> '급한일' ("urgent matter").
df_stat['dropout_reason_detail_filled'] = df_stat['dropout_reason_detail'].fillna(df_stat['dropout_reason_category'])
df_stat['dropout_reason_detail_filled'] = df_stat['dropout_reason_detail_filled'].replace({'금한일' :'급한일'})

df_stat.isnull().sum()

df_stat['dropout_reason_detail_filled'] = df_stat['dropout_reason_detail'].fillna(df_stat['dropout_reason_category'])

df_stat['dropout_reason_detail_filled'].unique()

df_stat['dropout_reason_detail_filled'] = df_stat['dropout_reason_detail_filled'].replace({'금한일' :'급한일'})
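
The two steps above, falling back to the category when the detail is missing and then correcting the typo, can be sketched on toy data. Only the '금한일' → '급한일' pair comes from the task; the other values are hypothetical:

```python
import pandas as pd

# Toy rows; 'time', 'other', 'interest', and 'boring' are made-up placeholders.
df_stat = pd.DataFrame(
    {
        "dropout_reason_category": ["time", "other", "interest"],
        "dropout_reason_detail": ["금한일", None, "boring"],
    }
)

# fillna with another Series fills row by row, aligned on the index.
filled = df_stat["dropout_reason_detail"].fillna(df_stat["dropout_reason_category"])

# replace with a dict only touches exact matches of the whole value.
filled = filled.replace({"금한일": "급한일"})
df_stat["dropout_reason_detail_filled"] = filled
```

Since `replace` matches whole values here, a typo embedded inside a longer string would need `str.replace` instead.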

user_activity

• last_access_timestamp → mean

df_act['last_access_timestamp'] = pd.to_datetime(df_act['last_access_timestamp'], format='%Y.%m.%d %H:%M', errors='coerce')
timestamp_avg = df_act['last_access_timestamp'].mean()
df_act['last_access_timestamp_filled']= df_act['last_access_timestamp'].fillna(timestamp_avg)

df_act['last_access_timestamp'] = pd.to_datetime(df_act['last_access_timestamp'], format='%Y.%m.%d %H:%M', errors='coerce')

df_act['last_access_timestamp'].dtype  # should be datetime64[ns] after conversion

df_act['last_access_timestamp'].isna().sum()  # count NaT values -> 20

timestamp_avg = df_act['last_access_timestamp'].mean()  # mean access time, used to fill missing values
df_act['last_access_timestamp_filled'] = df_act['last_access_timestamp'].fillna(timestamp_avg)
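
With `errors='coerce'`, values that do not match the format become NaT alongside the genuinely missing ones, so the NaT count covers both. A minimal sketch on hypothetical timestamps:

```python
import pandas as pd

# Hypothetical raw values: two valid timestamps, one malformed string, one missing.
df_act = pd.DataFrame(
    {"last_access_timestamp": ["2025.06.01 10:00", "not a date", "2025.06.03 10:00", None]}
)

parsed = pd.to_datetime(
    df_act["last_access_timestamp"], format="%Y.%m.%d %H:%M", errors="coerce"
)
n_missing = int(parsed.isna().sum())  # malformed and missing values both count

# Series.mean() skips NaT, so the average uses only the valid timestamps.
timestamp_avg = parsed.mean()
df_act["last_access_timestamp_filled"] = parsed.fillna(timestamp_avg)
```

Comparing the NaT count before and after parsing would show how many values were lost to format mismatches rather than being empty in the source.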

Outlier handling methods by team member

노시현

윤수민

정준영

김승인

임문수