day 8 파이썬 Visualization

7 minute read

파이썬 데이터 Visualization

파이썬으로 그래프를 그리기

막대그래프 그려보기

import matplotlib.pyplot as plt
%matplotlib inline

#그래프 데이터 
subject = ['English', 'Math', 'Korean', 'Science', 'Computer']
points = [40, 90, 50, 60, 100]

# 축 그리기
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)

# 그래프 그리기
ax1.bar(subject, points)

# 라벨, 타이틀 달기
plt.xlabel('Subject')
plt.ylabel('Points')
plt.title("Yuna's Test Result")

# 보여주기
plt.savefig('./barplot.png')  # 그래프를 이미지로 출력
plt.show()                            # 그래프를 화면으로 출력

선 그래프 그려보기

from datetime import datetime
import pandas as pd
import os

#그래프 데이터 
csv_path = os.getenv("HOME") + "/aiffel/data_visualization/data/AMZN.csv"
data = pd.read_csv(csv_path ,index_col=0, parse_dates=True)
price = data['Close']

# 축 그리기 및 좌표축 설정
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
price.plot(ax=ax, style='black')
plt.ylim([1600,2200])
plt.xlim(['2019-05-01','2020-03-01'])

# 주석달기
important_data = [(datetime(2019, 6, 3), "Low Price"),(datetime(2020, 2, 19), "Peak Price")]
for d, label in important_data:
    ax.annotate(label, xy=(d, price.asof(d)+10), #주석을 달 좌표(x,y)
                xytext=(d,price.asof(d)+100), #주석 텍스트가 위차할 좌표(x,y)
                arrowprops=dict(facecolor='red')) #화살표 추가 및 색 설정

# 그리드, 타이틀 달기
plt.grid()
ax.set_title('StockPrice')

# 보여주기
plt.show()

plot 사용법

import numpy as np
x = np.linspace(0, 10, 100) #0에서 10까지 균등한 간격으로  100개의 숫자를 만들라는 뜻입니다.
plt.plot(x, np.sin(x),'o')
plt.plot(x, np.cos(x),'--', color='black') 
plt.show()

x = np.linspace(0, 10, 100) 

plt.subplot(2,1,1)
plt.plot(x, np.sin(x),'orange','o')

plt.subplot(2,1,2)
plt.plot(x, np.cos(x), 'orange') 
plt.show()

x = np.linspace(0, 10, 100) 

plt.plot(x, x + 0, linestyle='solid') 
plt.plot(x, x + 1, linestyle='dashed') 
plt.plot(x, x + 2, linestyle='dashdot') 
plt.plot(x, x + 3, linestyle='dotted')
plt.plot(x, x + 0, '-g') # solid green 
plt.plot(x, x + 1, '--c') # dashed cyan 
plt.plot(x, x + 2, '-.k') # dashdot black 
plt.plot(x, x + 3, ':r'); # dotted red
plt.plot(x, x + 4, linestyle='-') # solid 
plt.plot(x, x + 5, linestyle='--') # dashed 
plt.plot(x, x + 6, linestyle='-.') # dashdot 
plt.plot(x, x + 7, linestyle=':'); # dotted

pandas

pandas.plot메서드 인자

label: 그래프의 범례이름.
ax: 그래프를 그릴 matplotlib의 서브플롯 객체.
style: matplotlib에 전달할 ‘ko–‘같은 스타일의 문자열
alpha: 투명도 (0 ~1)
kind: 그래프의 종류: line, bar, barh, kde
logy: Y축에 대한 로그스케일
use_index: 객체의 색인을 눈금 이름으로 사용할지의 여부
rot: 눈금 이름을 로테이션(0 ~ 360)
xticks, yticks: x축, y축으로 사용할 값
xlim, ylim: x축, y축 한계
grid: 축의 그리드 표시할 지 여부

pandas의 data가 DataFrame일때 plot 메서드 인자

subplots: 각 DataFrame의 칼럼을 독립된 서브플롯에 그린다.
sharex: subplots=True면 같은 X축을 공유하고 눈금과 한계를 연결한다.
sharey: subplots=True면 같은 Y축을 공유한다.
figsize: 그래프의 크기, 튜플로 지정
title: 그래프의 제목을 문자열로 지정
sort_columns: 칼럼을 알파벳 순서로 그린다.

fig, axes = plt.subplots(2, 1)
data = pd.Series(np.random.rand(5), index=list('abcde'))
data.plot(kind='bar', ax=axes[0], color='blue', alpha=1)
data.plot(kind='barh', ax=axes[1], color='red', alpha=0.3)

df = pd.DataFrame(np.random.rand(6,4), columns=pd.Index(['A','B','C','D']))
df.plot(kind='line')

그래프 그리기 정리

fig = plt.figure(): figure 객체를 선언해 ‘도화지를 펼쳐’ 줍니다.
ax1 = fig.add_subplot(1,1,1) : 축을 그립니다.
ax1.bar(x, y) 축안에 어떤 그래프를 그릴지 메소드를 선택한 다음, 인자로 데이터를 넣어줍니다.
그래프 타이틀 축의 레이블 등을 plt의 여러 메소드 grid, xlabel, ylabel 을 이용해서 추가해주고
plt.savefig 메소드를 이용해 저장해 줍니다.

데이터 준비

import pandas as pd
import seaborn as sns

tips = sns.load_dataset("tips")
df = pd.DataFrame(tips)
df.head()
..
..

total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

df.shape
..
..
(244, 7)

df.describe()
..
..
	  total_bill	   tip	       size
count	244.000000	244.000000	244.000000
mean	19.785943	2.998279	2.569672
std	8.902412	1.383638	0.951100
min	3.070000	1.000000	1.000000
25%	13.347500	2.000000	2.000000
50%	17.795000	2.900000	2.000000
75%	24.127500	3.562500	3.000000
max	50.810000	10.000000	6.000000

df.info()
..
..
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB



print(df['sex'].value_counts())
print("===========================")


print(df['time'].value_counts())
print("===========================")


print(df['smoker'].value_counts())
print("===========================")


print(df['day'].value_counts())
print("===========================")


print(df['size'].value_counts())
print("===========================")
..
..
Male      157
Female     87
Name: sex, dtype: int64
===========================
Dinner    176
Lunch      68
Name: time, dtype: int64
===========================
No     151
Yes     93
Name: smoker, dtype: int64
===========================
Sat     87
Sun     76
Thur    62
Fri     19
Name: day, dtype: int64
===========================
2    156
3     38
4     37
5      5
1      4
6      4
Name: size, dtype: int64
===========================

범주형 데이터

범주형 데이터는 주로 막대 그래프를 사용하여 수치를 요약합니다. 일반적으로 가로, 세로, 누적, 그룹화된 막대 그래프를 사용합니다.

막대 그래프(bar graph)

#df의 첫 5행을 확인해봅시다. 
df.head()

total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

grouped = df['tip'].groupby(df['sex'])
grouped.mean() # 성별에 따른 팁의 평균
..
..
sex
Male      3.089618
Female    2.833448
Name: tip, dtype: float64

grouped.size() # 성별에 따른 데이터 량(팁 횟수)
..
..
sex
Male      157
Female     87
Name: tip, dtype: int64

import numpy as np
sex = dict(grouped.mean()) #평균 데이터를 딕셔너리 형태로 바꿔줍니다.
sex
..
..
{'Male': 3.0896178343949052, 'Female': 2.833448275862069}

x = list(sex.keys())  
x
..
..
['Male', 'Female']

y = list(sex.values())
y
..
..
[3.0896178343949052, 2.833448275862069]

import matplotlib.pyplot as plt

plt.bar(x = x, height = y)
plt.ylabel('tip[$]')
plt.title('Tip by Sex')
plt.legend()
..
..
No handles with labels found to put in legend.
<matplotlib.legend.Legend at 0x7f7b777d7250>

sns.barplot(data=df, x='sex', y='tip')
..
..
<AxesSubplot:xlabel='sex', ylabel='tip'>

plt.figure(figsize=(10,6))
sns.barplot(data=df, x='day', y='tip')
plt.ylim(0, 4)
plt.title('Tip by day')
..
..
Text(0.5, 1.0, 'Tip by day')

fig = plt.figure(figsize=(10,7))

ax1 = fig.add_subplot(2,2,1)
sns.barplot(data=df, x='day', y='tip',palette="ch:.25")

ax2 = fig.add_subplot(2,2,2)
sns.barplot(data=df, x='sex', y='tip')

ax3 = fig.add_subplot(2,2,4)
sns.violinplot(data=df, x='sex', y='tip')

ax4 = fig.add_subplot(2,2,3)
sns.violinplot(data=df, x='day', y='tip',palette="ch:.25")
..
..
<AxesSubplot:xlabel='day', ylabel='tip'>

sns.catplot(x="day", y="tip", jitter=False, data=tips)
..
..
<seaborn.axisgrid.FacetGrid at 0x7f7b6eb36110>

수치형 데이터

산점도(scatter plot)

sns.scatterplot(data=df , x='total_bill', y='tip', palette="ch:r=-.2,d=.3_r")
..
..
<AxesSubplot:xlabel='total_bill', ylabel='tip'>

선 그래프(line graph)

#np.random.randn 함수는 표준 정규분포에서 난수를 생성하는 함수입니다. 
#cumsum()은 누적합을 구하는 함수입니다.
plt.plot(np.random.randn(50).cumsum())

x = np.linspace(0, 10, 100) 
plt.plot(x, np.sin(x), 'o')
plt.plot(x, np.cos(x)) 
plt.show()

sns.lineplot(x, np.sin(x))
sns.lineplot(x, np.cos(x))

히스토그램

히스토그램의 용어는 다음과 같습니다.

↔가로축

계급: 변수의 구간, bin (or bucket)

↕세로축

도수: 빈도수, frequency

전체 총량: n

x1은 평균은 100이고 표준편차는 15인 정규분포를 따릅니다.

x2는 평균은 130이고 표준편차는 15인 정규분포를 따릅니다.

도수를 50개의 구간으로 표시하며, 확률 밀도가 아닌 빈도로 표기합니다.

#그래프 데이터 
mu1, mu2, sigma = 100, 130, 15
x1 = mu1 + sigma*np.random.randn(10000)
x2 = mu2 + sigma*np.random.randn(10000)

# 축 그리기
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)

# 그래프 그리기
patches = ax1.hist(x1, bins=50, density=False) #bins는 x값을 총 50개 구간으로 나눈다는 뜻입니다.
patches = ax1.hist(x2, bins=50, density=False, alpha=0.5)
ax1.xaxis.set_ticks_position('bottom') # x축의 눈금을 아래 표시 
ax1.yaxis.set_ticks_position('left') #y축의 눈금을 왼쪽에 표시

# 라벨, 타이틀 달기
plt.xlabel('Bins')
plt.ylabel('Number of Values in Bin')
ax1.set_title('Two Frequency Distributions')

# 보여주기
plt.show()

예제 데이터의 히스토그램

sns.distplot(df['total_bill'], label = "total_bill")
sns.distplot(df['tip'], label = "tip").legend()# legend()를 이용하여 label을 표시해 줍니다.

df['tip_pct'] = df['tip'] / df['total_bill']
df['tip_pct'].hist(bins=50)

df['tip_pct'].plot(kind='kde')

kind=’kde’로 확률 밀도 그래프로 나타내 보았습니다.

밀도 그래프: 연속된 확률분포를 나타냅니다.

요약

막대 그래프(bar graph), 꺾은선 그래프(line graph), 산점도(scatter plot), 히스토그램(histogram)은 많이 쓰이는 그래프 입니다

시계열 데이터 시각화

데이터 가져오기

csv_path = os.getenv("HOME") + "/aiffel/data_visualization/data/flights.csv"
data = pd.read_csv(csv_path)
flights = pd.DataFrame(data)
flights
..
..

year	month	passengers
0	1949	January	112
1	1949	February	118
2	1949	March	132
3	1949	April	129
4	1949	May	121
...	...	...	...
139	1960	August	606
140	1960	September	508
141	1960	October	461
142	1960	November	390
143	1960	December	432
144 rows × 3 columns

그래프 그리기

sns.barplot(data=flights, x='year', y='passengers')

sns.pointplot(data=flights, x='year', y='passengers')

sns.lineplot(data=flights, x='year', y='passengers')

sns.lineplot(data=flights, x='year', y='passengers', hue='month', palette='ch:.50')
plt.legend(bbox_to_anchor=(1.03, 1), loc=2) #legend 그래프 밖에 추가하기

sns.distplot(flights['passengers'])

Heatmap

pivot

pivot = flights.pivot(index='year', columns='month', values='passengers')
pivot
..
..
month	April	August	December	February	January	July	June	March	May	November	October	September
year												
1949	129	148	118	118	112	148	135	132	121	104	119	136
1950	135	170	140	126	115	170	149	141	125	114	133	158
1951	163	199	166	150	145	199	178	178	172	146	162	184
1952	181	242	194	180	171	230	218	193	183	172	191	209
1953	235	272	201	196	196	264	243	236	229	180	211	237
1954	227	293	229	188	204	302	264	235	234	203	229	259
1955	269	347	278	233	242	364	315	267	270	237	274	312
1956	313	405	306	277	284	413	374	317	318	271	306	355
1957	348	467	336	301	315	465	422	356	355	305	347	404
1958	348	505	337	318	340	491	435	362	363	310	359	404
1959	396	559	405	342	360	548	472	406	420	362	407	463
1960	461	606	432	391	417	622	535	419	472	390	461	508

sns.heatmap(pivot)

sns.heatmap(pivot, linewidths=.2, annot=True, fmt="d")

sns.heatmap(pivot, cmap="YlGnBu")

Share on

Twitter Facebook LinkedIn