Task: Consolidating missing-value and outlier handling
•
Merge the three DataFrames
df_merged = pd.merge(df_user, df_act, on='user_id', how='inner')
df_merged = pd.merge(df_merged, df_stat, on='user_id', how='inner')
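An inner join silently drops any user_id that is missing from one of the tables, so it can be worth checking how many rows survive the merge. A minimal sketch, assuming each table holds at most one row per user_id; the toy frames and their extra columns are hypothetical stand-ins, not the project data:

```python
import pandas as pd

# Hypothetical toy tables standing in for df_user, df_act and df_stat.
df_user = pd.DataFrame({'user_id': [1, 2, 3], 'gender': ['f', 'm', 'f']})
df_act = pd.DataFrame({'user_id': [1, 2], 'sessions': [10, 4]})
df_stat = pd.DataFrame({'user_id': [1, 2, 3], 'books_read': [5, 0, 2]})

# validate='one_to_one' raises MergeError if user_id is duplicated on either side
# (an assumption about the data; relax it if a table has several rows per user).
df_merged = pd.merge(df_user, df_act, on='user_id', how='inner', validate='one_to_one')
df_merged = pd.merge(df_merged, df_stat, on='user_id', how='inner', validate='one_to_one')

# Users that did not survive the inner joins.
dropped_ids = sorted(set(df_user['user_id']) - set(df_merged['user_id']))
print(len(df_user), len(df_merged), dropped_ids)
```

If the dropped list is longer than expected, `how='left'` starting from df_user keeps every user and leaves the unmatched columns as NaN instead.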
user_demographics
•
birthday: fill null values → mean
df_user['birthday_filled']= pd.to_datetime(df_user['birthday'])
birth_avg = df_user['birthday_filled'].mean()
df_user['birthday_filled']= df_user['birthday_filled'].fillna(birth_avg)
# Step-by-step version with checks
df_user['birthday'] = pd.to_datetime(df_user['birthday'])  # convert in place so the checks below see datetimes
df_user['birthday'].dtype  # should be datetime64[ns]
df_user['birthday'].isna().sum()  # number of NaT values -> 20
birth_avg = df_user['birthday'].mean()  # mean birthday, used to fill the missing values
df_user['birthday_filled_mean'] = df_user['birthday'].fillna(birth_avg)
df_user['birthday_filled_mean'].isnull().sum()  # expected to be 0 after filling
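Averaging a datetime column returns a full timestamp, usually with a time-of-day component, which looks odd for a birthday. A minimal sketch of flooring the mean to a whole day before filling; the sample dates are hypothetical:

```python
import pandas as pd

# Hypothetical sample; None becomes NaT after conversion.
df_user = pd.DataFrame({'birthday': ['1990-01-01', '1990-01-04', None, '2000-01-02']})
df_user['birthday'] = pd.to_datetime(df_user['birthday'])

birth_avg = df_user['birthday'].mean()   # NaT is skipped; result may carry hours/minutes
birth_avg_day = birth_avg.floor('D')     # drop the time-of-day part

df_user['birthday_filled_mean'] = df_user['birthday'].fillna(birth_avg_day)
print(birth_avg, birth_avg_day)
print(df_user)
```

Keeping the filled values in a separate column, as the notes do, leaves the original column available for comparing imputed and observed rows later.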
•
gender: unify categories → lowercase
df_user['user_demographics_gender'] = df_user['gender'].str.lower().str.strip()
# Step-by-step version with checks
df_user['gender'].value_counts()  # category counts before cleaning
df_user['gender'].unique()
df_user['gender'] = df_user['gender'].str.lower().str.strip()
df_user['gender'].value_counts()  # confirm the categories are unified
•
theme_mode: unify categories → lowercase
df_user['theme_mode'] = df_user['theme_mode'].replace({
'customized' : 'custom',
'dark_mode' : 'dark',
'Light' : 'light'
})
# Step-by-step version with checks
df_user['theme_mode'].value_counts()  # category counts before cleaning
df_user['theme_mode'].unique()
df_user['theme_mode'] = df_user['theme_mode'].replace({
'customized' : 'custom',
'dark_mode' : 'dark',
'Light' : 'light'
})
df_user['theme_mode'].value_counts()  # confirm the categories are unified
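The replace mapping above matches exact strings only, so a variant such as `' LIGHT '` would slip through. One way to make it more robust is to strip and lowercase first and then map the remaining synonyms. A sketch under that assumption; the helper name `normalize_category` and the sample values are hypothetical:

```python
import pandas as pd

def normalize_category(series, synonyms=None):
    """Strip whitespace, lowercase, then map known synonyms to one label."""
    cleaned = series.str.strip().str.lower()   # missing values stay missing
    if synonyms:
        cleaned = cleaned.replace(synonyms)    # whole-value replacement
    return cleaned

# Hypothetical sample values.
df_user = pd.DataFrame({
    'gender': ['Male', ' female', 'FEMALE', None],
    'theme_mode': ['Light', 'dark_mode', 'customized', ' LIGHT '],
})

df_user['gender'] = normalize_category(df_user['gender'])
df_user['theme_mode'] = normalize_category(
    df_user['theme_mode'], {'customized': 'custom', 'dark_mode': 'dark'}
)
print(df_user['theme_mode'].value_counts(dropna=False))
```

Because the values are lowercased before mapping, the `'Light'` entry is no longer needed in the dictionary.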
book_reading_status
•
dropout_reason_detail: fix typos → insert data (fill missing values from dropout_reason_category)
df_stat['dropout_reason_detail_filled'] = df_stat['dropout_reason_detail'].fillna(df_stat['dropout_reason_category'])
df_stat['dropout_reason_detail_filled'] = df_stat['dropout_reason_detail_filled'].replace({'금한일' :'급한일'})
# Step-by-step version with checks
df_stat.isnull().sum()  # missing values per column
df_stat['dropout_reason_detail_filled'] = df_stat['dropout_reason_detail'].fillna(df_stat['dropout_reason_category'])
df_stat['dropout_reason_detail_filled'].unique()  # inspect the values to spot typos
df_stat['dropout_reason_detail_filled'] = df_stat['dropout_reason_detail_filled'].replace({'금한일' :'급한일'})  # fix the typo
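`Series.replace` with a dict only rewrites cells whose entire value equals the key, so the typo would survive inside a longer free-text answer. A small sketch of both variants; the sample rows and category labels are hypothetical, while the column names and the typo pair come from the code above:

```python
import pandas as pd

# Hypothetical sample rows; '금한일' is the typo for '급한일'.
df_stat = pd.DataFrame({
    'dropout_reason_category': ['no_time', 'lost_interest', 'other'],
    'dropout_reason_detail': ['금한일', None, '금한일 (work)'],
})

# Fill missing details with the category of the same row.
filled = df_stat['dropout_reason_detail'].fillna(df_stat['dropout_reason_category'])

# Whole-value fix: only cells that are exactly '금한일' change.
whole = filled.replace({'금한일': '급한일'})

# Substring fix: also corrects the typo inside longer text.
substring = filled.str.replace('금한일', '급한일', regex=False)

df_stat['dropout_reason_detail_filled'] = substring
print(whole.tolist())
print(substring.tolist())
```

Which variant fits depends on whether the detail column holds fixed labels or free text; checking `unique()` first, as in the notes, answers that.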
user_activity
•
last_access_timestamp: fill missing values → mean
df_act['last_access_timestamp'] = pd.to_datetime(df_act['last_access_timestamp'], format='%Y.%m.%d %H:%M', errors='coerce')
timestamp_avg = df_act['last_access_timestamp'].mean()
df_act['last_access_timestamp_filled']= df_act['last_access_timestamp'].fillna(timestamp_avg)
# Step-by-step version with checks
df_act['last_access_timestamp'] = pd.to_datetime(df_act['last_access_timestamp'], format='%Y.%m.%d %H:%M', errors='coerce')
df_act['last_access_timestamp'].dtype  # should be datetime64[ns] after conversion
df_act['last_access_timestamp'].isna().sum()  # number of NaT values -> 20
timestamp_avg = df_act['last_access_timestamp'].mean()  # mean access time, used to fill the missing values
df_act['last_access_timestamp_filled'] = df_act['last_access_timestamp'].fillna(timestamp_avg)
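With `errors='coerce'`, values that do not match the format become NaT without any warning, so the NaT count mixes genuinely missing entries with parse failures. A minimal sketch for telling the two apart before imputing; the sample strings are hypothetical:

```python
import pandas as pd

# Hypothetical raw strings: one missing value and one in a different format.
df_act = pd.DataFrame({'last_access_timestamp': [
    '2024.01.05 09:30', None, '2024-01-07 18:00', '2024.01.09 21:15',
]})

raw = df_act['last_access_timestamp']
parsed = pd.to_datetime(raw, format='%Y.%m.%d %H:%M', errors='coerce')

missing_before = raw.isna().sum()           # empty in the source data
failed_mask = parsed.isna() & raw.notna()   # present but not parseable
failed_to_parse = failed_mask.sum()
print(missing_before, failed_to_parse)
print(raw[failed_mask])                     # inspect the offending strings

df_act['last_access_timestamp'] = parsed
df_act['last_access_timestamp_filled'] = parsed.fillna(parsed.mean())
```

Strings that fail to parse are often recoverable with a second format, which may be preferable to replacing them with the mean.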
Outlier handling methods by team member
노시현
윤수민
정준영
김승인
임문수

