Pandas

Installing Pandas

Pandas data structures

Series(list or dict, index=index)

DataFrame(list or dict, index=index)

Installing Pandas

Check whether Pandas is installed (if the import fails, install it with pip install pandas)

import pandas

pandas.__version__

Output

'1.0.3'

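Pandas data structures

Series: a one-dimensional labeled array, such as a list or a single column of a spreadsheet. It is commonly used for time series data.
DataFrame: a two-dimensional dataset with rows and columns, like a table. It is used for structured data such as relational database tables and CSV files.
Panel: a three-dimensional dataset with an item index, a row index, and column labels. Panel was deprecated in pandas 0.20 and removed in 0.25, so it is not available in the 1.0.3 release shown above; a DataFrame with a MultiIndex covers the same use case.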

Series(list or dict, index=index)

Data from a list

import pandas as pd

data = pd.Series([30, 15, 20])

condition = [True, False, True]

filtered = data[condition]

print(filtered)

Output

0 30

2 20

dtype: int64

import pandas as pd

data = pd.Series([30, 15, 20])

condition = data > 18

#print(condition)

filtered = data[condition]

print(filtered)

Output

0 30

2 20

dtype: int64

import pandas as pd

data = pd.Series(['ffn', 'numpy', 'pandas'])

condition = data.str.contains("p")

#print(condition)

filtered = data[condition]

print(filtered)

Output

1 numpy

2 pandas

dtype: object

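Data from a dictionary

import pandas as pd

# Named packages rather than dict so the built-in dict type is not shadowed

packages = {

'pk1': 'ffn',

'pk2': "numpy",

'pk3': "pandas"

}

#data = pd.Series(packages, index = packages.keys()) # same order as the original dictionary

data = pd.Series(packages)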

print(data)

Output

pk1 ffn

pk2 numpy

pk3 pandas

dtype: object
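The index= argument in the heading is not exercised by the examples above, so here is a minimal sketch of it. The labels a, b, c and the missing key pk9 are made up for illustration.

```python
import pandas as pd

# Build a Series from a list and give each value a label via index=.
scores = pd.Series([30, 15, 20], index=["a", "b", "c"])

# Look up by label with .loc and by position with .iloc.
by_label = scores.loc["b"]
by_position = scores.iloc[0]
print(by_label, by_position)

# With a dictionary, index= selects and orders the keys.
# A key that is not in the dictionary (the hypothetical "pk9") becomes NaN.
packages = {"pk1": "ffn", "pk2": "numpy", "pk3": "pandas"}
subset = pd.Series(packages, index=["pk3", "pk1", "pk9"])
print(subset)
```

Passing index= with a dictionary is therefore a way to reorder or filter the keys, not just to relabel them.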

DataFrame(list or dict, index=index)

Data from a dictionary

import pandas as pd

data = pd.DataFrame({

'name':['Amy', 'Bob', 'John'],

'age':[18,19,21]

})

condition = [True, False, True]

filtered = data[condition]

print(filtered)

Or

import pandas as pd

nm = ['Amy', 'Bob', 'John']

ag = [18,19,21]

data = pd.DataFrame({

'name':nm,

'age':ag

})

condition = [True, False, True]

filtered = data[condition]

print(filtered)

Output

name age

0 Amy 18

2 John 21

import pandas as pd

data = pd.DataFrame({

'name':['Amy', 'Bob', 'John'],

'age':[18,19,21]

})

#print(data)

condition = data["age"]>=19

#condition = data['name']=='Bob'

filtered = data[condition]

print(filtered)

Output

name age

1 Bob 19

2 John 21
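The filters above each use a single condition. As a sketch of how several conditions might be combined, the following uses the element-wise operators &, | and ~ on the same data. The variable names are illustrative only.

```python
import pandas as pd

data = pd.DataFrame({"name": ["Amy", "Bob", "John"], "age": [18, 19, 21]})

# & is element-wise AND, | is element-wise OR, ~ is NOT.
# Each comparison needs its own parentheses because & and | bind
# more tightly than >= and ==.
adults_with_o = data[(data["age"] >= 19) & (data["name"].str.contains("o"))]
amy_or_oldest = data[(data["name"] == "Amy") | (data["age"] == data["age"].max())]
not_bob = data[~(data["name"] == "Bob")]

print(adults_with_o)
print(amy_or_oldest)
print(not_bob)
```

Python's own and / or keywords do not work here, because they try to reduce a whole Series to a single truth value.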

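Common operations

.size: returns the total number of elements (rows × columns)
.shape: returns a (rows, columns) tuple
.index: returns the row index
.columns: returns the column labels
.describe(): returns descriptive statistics for the numeric columns
.head(n): returns the first n rows
.tail(n): returns the last n rows
.info(): prints a summary (index, columns, non-null counts, dtypes, memory usage) and returns None

Selecting rows

.iloc[1]: selects the second row by position; the result is a Series
.loc[label]: selects a row by index label; the result is a Series

Selecting columns

data[column name]: the result is a Series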

import pandas as pd

data = pd.DataFrame({

'name':['Amy', 'Bob', 'John'],

'age':[18,19,21]

}, index=['a','b','c'])

print(data)

print("------second row by position-------")

print(data.iloc[1])

print("------row by index label-------")

print(data.loc['c'])

print("------column-------")

print(data['name'])

name = data['name']

print("------data['name'] contains o-------")

print(name.str.contains('o'))

print("------data['name'] in upper case-------")

print(name.str.upper())

age = data['age']

print("------mean of data['age']-------")

print(age.mean())

Output

name age

a Amy 18

b Bob 19

c John 21

------second row by position-------

name Bob

age 19

Name: b, dtype: object

------row by index label-------

name John

age 21

Name: c, dtype: object

------column-------

a Amy

b Bob

c John

Name: name, dtype: object

------data['name'] contains o-------

a False

b True

c True

Name: name, dtype: bool

------data['name'] in upper case-------

a AMY

b BOB

c JOHN

Name: name, dtype: object

------mean of data['age']-------

19.333333333333332
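The example above selects one row or one column at a time. The sketch below, on the same data, shows a few related selections that may be useful: several columns at once, a single cell, and row slices.

```python
import pandas as pd

data = pd.DataFrame(
    {"name": ["Amy", "Bob", "John"], "age": [18, 19, 21]},
    index=["a", "b", "c"],
)

# A list of column names returns a DataFrame rather than a Series.
subset = data[["name", "age"]]

# .loc[row_label, column_label] and .iloc[row_pos, col_pos] return one value.
bob_age = data.loc["b", "age"]
first_name = data.iloc[0, 0]

# Label slices with .loc include the end label;
# positional slices with .iloc exclude the end position.
rows_a_to_b = data.loc["a":"b"]
first_two = data.iloc[0:2]

print(bob_age, first_name)
print(rows_a_to_b)
```

Both slices return rows a and b here, which is an easy place to make an off-by-one mistake when switching between .loc and .iloc.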

data = pd.DataFrame({

'name':['Amy', 'Bob', 'John'],

'age':[18,19,21]

}, index=['a','b','c'])

print("-----------size-------------")

print(data.size)

print("-----------shape-------------")

print(data.shape)

print("-----------index-------------")

print(data.index)

print("-----------columns-----------")

print(data.columns)

print("--------describe()-----------")

print(data.describe())

print("----------head(2)-------------")

print(data.head(2))

print("----------tail(2)-------------")

print(data.tail(2))

print("----------info()-------------")

print(data.info())

Output

-----------size-------------

6

-----------shape-------------

(3, 2)

-----------index-------------

Index(['a', 'b', 'c'], dtype='object')

-----------columns-----------

Index(['name', 'age'], dtype='object')

--------describe()-----------

age

count 3.000000

mean 19.333333

std 1.527525

min 18.000000

25% 18.500000

50% 19.000000

75% 20.000000

max 21.000000

----------head(2)-------------

name age

a Amy 18

b Bob 19

----------tail(2)-------------

name age

b Bob 19

c John 21

----------info()-------------

<class 'pandas.core.frame.DataFrame'>

Index: 3 entries, a to c

Data columns (total 2 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 name 3 non-null object

1 age 3 non-null int64

dtypes: int64(1), object(1)

memory usage: 72.0+ bytes

None

Sorting data

.sort_index(): sorts by index labels
.sort_values(): sorts by the values of the specified column(s)

data = pd.DataFrame({

'name':['Bob', 'Amy', 'John'],

'age':[19,18,21]

}, index=['a','b','c'])

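byindex = data.sort_index(axis = 0, ascending = True) # sort by index labels; axis=0 sorts by the row index (axis=1 by the column labels), ascending selects ascending or descending order

print(byindex)

byvalues = data.sort_values(by = 'name') # sort by the values of the specified column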

print(byvalues)

Output

name age

a Bob 19

b Amy 18

c John 21

name age

b Amy 18

a Bob 19

c John 21
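Both calls above sort in ascending order on a single key. As a sketch of the other common cases, the following sorts in descending order and by more than one column. The fourth row (a second Amy, aged 25) is an assumption added so that the tie-break on age is visible.

```python
import pandas as pd

# The row with index "d" is hypothetical; it is not part of the original data.
data = pd.DataFrame(
    {"name": ["Bob", "Amy", "John", "Amy"], "age": [19, 18, 21, 25]},
    index=["a", "b", "c", "d"],
)

# Descending order on one column.
by_age_desc = data.sort_values(by="age", ascending=False)

# Sort by name ascending, then by age descending within equal names.
by_name_then_age = data.sort_values(by=["name", "age"], ascending=[True, False])

# Descending order on the index labels.
by_index_desc = data.sort_index(ascending=False)

print(by_age_desc)
print(by_name_then_age)
print(by_index_desc)
```

Each call returns a new DataFrame and leaves data unchanged.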

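Creating a new column

data[new column name] = list
data[new column name] = Series

The example below uses the DataFrame from the Common operations section (Amy, Bob, John with index a, b, c).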

data['height']=[170,175,185]

#data['weight']=pd.Series([55,60,65],index=['a','b','c'])

#data['BMI']=data['weight']/(data['height']/100)**2

print(data)

Output

name age height

a Amy 18 170

b Bob 19 175

c John 21 185
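The two commented-out lines in the example hint at adding a column from a Series and deriving one column from others. A runnable sketch of that idea follows; the weights are the ones from the commented line, deliberately listed in a different order to show index alignment.

```python
import pandas as pd

data = pd.DataFrame(
    {"name": ["Amy", "Bob", "John"], "age": [18, 19, 21]},
    index=["a", "b", "c"],
)
data["height"] = [170, 175, 185]

# A Series is aligned on index labels, not on position,
# so the order of the values here does not matter.
data["weight"] = pd.Series([65, 55, 60], index=["c", "a", "b"])

# Arithmetic between columns is element-wise: weight (kg) / height (m) squared.
data["BMI"] = data["weight"] / (data["height"] / 100) ** 2

print(data.round({"BMI": 1}))
```

A plain list, by contrast, is assigned by position, so its length has to match the number of rows.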

Data from a list

import pandas as pd

arr = [["Amy", 18],["Bob", 19], ["John", 21]]

data = pd.DataFrame(arr, columns = ["name", "age"]) # set the column labels

print(data)

Output

name age

0 Amy 18

1 Bob 19

2 John 21

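Applying a function to the data

.apply(function): applies the function to each one-dimensional slice of the DataFrame (each column by default, axis=0); pass axis=1 to apply it to each row

.applymap(function): applies the function to every element of the DataFrame (deprecated since pandas 2.1 in favor of DataFrame.map)

.map(function): applies the function to every element of a single Series

# Maximum minus minimum for each column of the DataFrame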

df = pd.DataFrame({

'n1':[10.256, 7.256, 8.256],

'n2':[3.256, 4.256, 9.256],

'n3':[10.256, 7.256, 1.256]

})

fun = lambda x: x.max() - x.min()

# def fun(x) : return x.max() - x.min()

df = df.apply(fun)

print(df)

Output

n1 3.0

n2 6.0

n3 9.0

dtype: float64

# Maximum minus minimum for each row of the DataFrame, axis=1

df = pd.DataFrame({

'n1':[10.256, 7.256, 8.256],

'n2':[3.256, 4.256, 9.256],

'n3':[10.256, 7.256, 1.256]

})

fun = lambda x: x.max() - x.min()

# def fun(x) : return x.max() - x.min()

df = df.apply(fun, axis=1)

print(df)

Output

0 7.0

1 3.0

2 8.0

dtype: float64

# Format every element of the DataFrame to two decimal places

df = pd.DataFrame({

'n1':[10.256, 7.256, 8.256],

'n2':[3.256, 4.256, 9.256],

'n3':[10.256, 7.256, 1.256]

})

# fun = lambda x: '{:.2f}'.format(x)

def fun(x) : return '{:.2f}'.format(x)

df = df.applymap(fun)

print(df)

Output

n1 n2 n3

0 10.26 3.26 10.26

1 7.26 4.26 7.26

2 8.26 9.26 1.26
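The examples above cover .apply and .applymap but not .map on a Series. A minimal sketch of it follows. The age threshold of 20 and the lookup table are assumptions chosen only for illustration.

```python
import pandas as pd

ages = pd.Series([18, 19, 21], index=["a", "b", "c"])

# A function passed to .map is called once per element.
# The cut-off of 20 is an arbitrary value for this example.
labels = ages.map(lambda x: "adult" if x >= 20 else "minor")

# A dictionary works as a lookup table;
# values without a matching key become NaN.
names = pd.Series(["Amy", "Bob", "John"])
initials = names.map({"Amy": "A", "Bob": "B"})

print(labels)
print(initials)
```

The result keeps the index of the original Series, so it can be assigned straight back as a new column.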

Converting to numeric types

Using to_numeric(list, Series, or other one-dimensional data)

import pandas as pd

string = ['8', '9', '10']

series = pd.Series(string)

num = pd.to_numeric(series)

print(num)

Output

0 8

1 9

2 10

dtype: int64

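The errors parameter

raise: the default; an invalid value raises an exception
coerce: invalid values are converted to NaN
ignore: if any value is invalid, the input is returned unchanged (deprecated since pandas 2.2)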

import pandas as pd

string = ['8', '9', 'a']

series = pd.Series(string)

num = pd.to_numeric(series, errors='coerce')

print(num)

Output

0 8.0

1 9.0

2 NaN

dtype: float64
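Only errors='coerce' is demonstrated above. The sketch below contrasts it with the default errors='raise' and shows two possible follow-up steps after coercion. Whether to drop or fill the invalid rows is a choice that depends on the data; the fill value 0 is an assumption.

```python
import pandas as pd

series = pd.Series(["8", "9", "a"])

# errors='raise' (the default) raises ValueError for a value
# that cannot be parsed as a number.
try:
    pd.to_numeric(series)
    failed = False
except ValueError as exc:
    failed = True
    print("conversion failed:", exc)

# errors='coerce' marks invalid values as NaN, which forces a float dtype.
num = pd.to_numeric(series, errors="coerce")

# Option 1: drop the invalid rows, then convert back to integers.
cleaned = num.dropna().astype("int64")

# Option 2: replace invalid values with a placeholder (0 here, an assumption).
filled = num.fillna(0).astype("int64")

print(cleaned)
print(filled)
```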

Using astype(target type)

import pandas as pd

string = ['8', '9', '10']

series = pd.Series(string)

num = series.astype('int')

print(num)

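Output (the width of 'int' is platform-dependent: int32 in this run, typically int64 on Linux and macOS)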

0 8

1 9

2 10

dtype: int32

import pandas as pd

string = ['8', '9', '10']

series = pd.Series(string)

num = series.astype('int')

print(num)

num = series.astype('int64')

print(num)

num = series.astype('str')

print(num)

Output

0 8

1 9

2 10

dtype: int32

0 8

1 9

2 10

dtype: int64

0 8

1 9

2 10

dtype: object
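The astype examples above convert one Series at a time. As a sketch, a DataFrame can convert several columns in one call by passing a dictionary. The column names qty and price and their values are made up for this example.

```python
import pandas as pd

# Hypothetical data: both columns hold numbers stored as strings.
df = pd.DataFrame({"qty": ["8", "9", "10"], "price": ["1.5", "2.0", "3.25"]})

# A dict maps column names to target types.
# Naming the width (int64, float64) avoids platform-dependent results.
converted = df.astype({"qty": "int64", "price": "float64"})
print(converted.dtypes)

# Unlike to_numeric(errors='coerce'), astype raises ValueError
# when a value cannot be converted.
try:
    pd.Series(["8", "x"]).astype("int64")
    failed = False
except ValueError:
    failed = True
print("astype failed:", failed)
```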

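Reading CSV files

Using read_csv(file location)

File location: a local path or a URL, for example 'drive:\\path\\..\\file'
Returns a DataFrame; the row index defaults to 0, 1, 2, ... and the column labels default to the first line of the file
Parameters
- header: defaults to 0; set header=1 to use the second line as the column labels, or header=None if the first line is data, in which case the columns are labelled 0, 1, 2, ...
- index_col: defaults to None; names the column to use as the row index, for example index_col=0 uses the first column

A CSV file separates values with commas. Save the following data as 123.csv using Notepad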

c1,c2,c3

a,5,10

c,6,11

f,7,12

g,8,13

x,9,14

y,10,15

q,5,16

import pandas as pd

df = pd.read_csv('C:\\Users\\user\\Downloads\\123.csv', header=0, index_col=None)

#print(type(df))

print(df)

Output

c1 c2 c3

0 a 5 10

1 c 6 11

2 f 7 12

3 g 8 13

4 x 9 14

5 y 10 15

6 q 5 16
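The call above uses the default values of header and index_col. The sketch below tries the other settings described in the parameter list. To stay self-contained it reads the first lines of 123.csv from an in-memory string through io.StringIO instead of a file on disk.

```python
import io

import pandas as pd

# The first four lines of 123.csv, held in memory so no file is needed.
csv_text = "c1,c2,c3\na,5,10\nc,6,11\nf,7,12\n"

# index_col=0 uses the first column (c1) as the row index.
indexed = pd.read_csv(io.StringIO(csv_text), header=0, index_col=0)
print(indexed)

# header=None treats the first line as data;
# the columns are then labelled 0, 1, 2.
raw = pd.read_csv(io.StringIO(csv_text), header=None)
print(raw)
```

With index_col=0, a row can be looked up by its c1 value, for example indexed.loc['c'].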

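Reading tables from HTML

Using read_html(URL)

read_html returns a list of DataFrames, one for each table found on the page, so df[0] is the first table.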

import pandas as pd

df = pd.read_html('http://rate.bot.com.tw/xrt?Lang=zh-TW')

print(df[0])

Output (the values are printed in Chinese, as published by the site)

幣別 現金匯率 \

幣別 Unnamed: 1_level_1 本行買入 本行賣出

0 美金 (USD) 美金 (USD) 29.645 30.315 29.995

1 港幣 (HKD) 港幣 (HKD) 3.72 3.924 3.846

2 英鎊 (GBP) 英鎊 (GBP) 36.29 38.41 37.29

3 澳幣 (AUD) 澳幣 (AUD) 18.7 19.48 18.97

4 加拿大幣 (CAD) 加拿大幣 (CAD) 20.86 21.77 21.25

5 新加坡幣 (SGD) 新加坡幣 (SGD) 20.56 21.47 21.05

6 瑞士法郎 (CHF) 瑞士法郎 (CHF) 30.24 31.44 30.9

7 日圓 (JPY) 日圓 (JPY) 0.2694 0.2822 0.2767

8 南非幣 (ZAR) 南非幣 (ZAR) - - 1.557

9 瑞典幣 (SEK) 瑞典幣 (SEK) 2.62 3.14 2.96

10 紐元 (NZD) 紐元 (NZD) 17.71 18.56 18.09

11 泰幣 (THB) 泰幣 (THB) 0.798 0.988 0.9104

12 菲國比索 (PHP) 菲國比索 (PHP) 0.5176 0.6506 -

13 印尼幣 (IDR) 印尼幣 (IDR) 0.00163 0.00233 -

14 歐元 (EUR) 歐元 (EUR) 31.84 33.18 32.46

15 韓元 (KRW) 韓元 (KRW) 0.02292 0.02682 -

16 越南盾 (VND) 越南盾 (VND) 0.00091 0.00141 -

17 馬來幣 (MYR) 馬來幣 (MYR) 5.748 7.373 -

18 人民幣 (CNY) 人民幣 (CNY) 4.144 4.306 4.216

import pandas as pd

df = pd.read_html('http://www.fina.fcu.edu.tw/wSite/lp?ctNode=34257&mp=415101&idPath=34054_34255')

print(df[1])

Output (the values are printed in Chinese, as published by the site)

序號 課程名稱 授課教師 開課班級

0 1.0 微積分(二) 許佳璵 財金一甲

1 2.0 財富管理 吳仰哲 財金三合

2 3.0 統計學(二) 陳麗君 財金二乙

3 4.0 計量經濟學 陳麗君 財金三甲

4 5.0 期貨與選擇權 陳麗君 財金三丙

5 6.0 會計學(二) 吳東憲 財金一乙

6 7.0 會計學(二) 吳東憲 財金一丙

7 8.0 體育(二) 黃嘉君 財金一甲

8 9.0 班級活動 陳清和 財金一丙

9 10.0 貨幣銀行學 陳清和 財金二甲

10 11.0 貨幣銀行學 陳清和 財金二丙

11 12.0 金融APPS設計 陳清和 財金二合

12 13.0 金融機構經營與管理 陳清和 財金三乙

13 14.0 統計學(二) 李仁佑 財金二丙

14 15.0 經濟學(二) 賴宗志 財金一丙