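Content learned from:
Python for Data Analysis, 2nd Edition
This is the book.
Studying it entirely in English is tiring, so whether I got things right depends on Baidu Translate.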
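Summary: this post covers the basics of getting started with pandas.
I: pandas has two main data structures
Series, DataFrame
II: Series
1: Definition:
A Series is a one-dimensional array-like object made up of a sequence of values (of any NumPy data type) and an associated set of data labels, called its index.
2: Representation
The string representation of a Series shows the index on the left and the values on the right.
3: Creating a Series (a one-dimensional array)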
import numpy as np
import pandas as pd

obj = pd.Series([4, 5, 6, 7, 8])  # create a Series with the default integer index
print(obj)
print(obj.index)
print(obj.values)
>>>>
0    4
1    5
2    6
3    7
4    8
dtype: int64
RangeIndex(start=0, stop=5, step=1)
[4 5 6 7 8]
4: Accessing values through the index
1>: Single label
obj1 = pd.Series([4, 6, -7, -8], index=['d', 'a', 'b', 'c'])  # set custom index labels
print(obj1)
# access a value by its index label
print(obj1['d'])
>>>>
d    4
a    6
b   -7
c   -8
dtype: int64
4
2>: Multiple labels
# pass a list of labels to select several values at once
print(obj1)
print(obj1[['d', 'a', 'c']])
>>>>
d    4
a    6
b   -7
c   -8
dtype: int64
d    4
a    6
c   -8
dtype: int64
3>: Boolean filtering
print(obj1)
print(obj1[obj1 < 0])  # keep only the negative values
>>>>
d    4
a    6
b   -7
c   -8
dtype: int64
b   -7
c   -8
dtype: int64
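To see why the filter works, it may help to look at the mask on its own. The following is a minimal sketch, not from the book; the names `mask` and `between` are my own. It assumes that a comparison on a Series returns a boolean Series with the same index, and that conditions are combined with `&` and `|`, each wrapped in parentheses.

```python
import pandas as pd

obj1 = pd.Series([4, 6, -7, -8], index=['d', 'a', 'b', 'c'])

# The comparison produces a boolean Series aligned with obj1's index.
mask = obj1 < 0
print(mask)

# Combine conditions with & (and) or | (or); each condition needs parentheses.
between = obj1[(obj1 > -8) & (obj1 < 6)]
print(between)
```

Python's plain `and` / `or` do not work element by element on a Series, which is why the bitwise operators are used here.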
4>: Scalar multiplication
print(obj1)
print(obj1 * 2)  # every value is multiplied; the index is preserved
>>>>
d    4
a    6
b   -7
c   -8
dtype: int64
d     8
a    12
b   -14
c   -16
dtype: int64
5>: Applying NumPy math functions
print(obj1)
print(np.exp(obj1))  # element-wise exponential; the index is preserved
>>>>
d    4
a    6
b   -7
c   -8
dtype: int64
d     54.598150
a    403.428793
b      0.000912
c      0.000335
dtype: float64
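6>: The index-to-value mapping (a Series behaves like a fixed-length ordered dict)
print(obj1)
print('b' in obj1)  # membership is tested against the index labels
print('e' in obj1)
>>>>
d    4
a    6
b   -7
c   -8
dtype: int64
True
False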
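One thing that is easy to trip over: `in` looks at the index labels, not at the values. Here is a small sketch of the difference, using variable names I made up for illustration.

```python
import pandas as pd

obj1 = pd.Series([4, 6, -7, -8], index=['d', 'a', 'b', 'c'])

label_found = 'b' in obj1               # 'b' is an index label
value_as_label = 6 in obj1              # 6 is a value, not a label
value_found = bool(6 in obj1.values)    # test against the underlying array

print(label_found, value_as_label, value_found)
```

To test several values at once, `obj1.isin([6, -7])` returns a boolean Series instead of a single result.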
5: Creating a Series from a dict:
1:> Creating a Series from a dict
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)  # the dict keys become the index
print(obj3)
>>>>
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
2:> Passing an explicit index together with the dict values
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
print(obj3)
# pass the dict values together with an explicit index
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
print(obj4)
>>>>
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
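In the output above, 'California' has no entry in `sdata`, so it shows up as NaN, while 'Utah' is dropped because it is not in `states`. The same label alignment happens in arithmetic. A minimal sketch, assuming the same `sdata` and `states` as above (the name `total` is mine):

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=states)

# Values are matched by label; labels present in only one Series give NaN.
total = obj3 + obj4
print(total)
```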
3>: Detecting missing data
l = pd.isnull(obj4)  # True where a value is missing
print(l)
l2 = pd.notnull(obj4)  # True where a value is present
print(l2)
>>>>
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
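Once missing values are detected, the usual next step is to drop or fill them. This is a short sketch using the Series methods `isna`, `dropna` and `fillna`; the fill value 0 is only an assumption for illustration, not something the book prescribes.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

missing = obj4.isna()     # method form of pd.isnull(obj4)
cleaned = obj4.dropna()   # new Series without the missing entries
filled = obj4.fillna(0)   # new Series with NaN replaced by 0 (illustrative choice)

print(cleaned)
print(filled)
```

Both `dropna` and `fillna` return new objects here, so `obj4` itself still contains the NaN.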
4>: Assigning names
obj4.name = 'population'     # name of the Series itself
obj4.index.name = 'state'    # name of its index
print(obj4)
>>>>
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
5>: Changing the index labels by assignment
obj = pd.Series([4, 7, -6, 3])
print(obj)
obj.index = ['bob', 'Steve', 'jeff', 'Ryan']  # replace the index in place
print(obj)
>>>>
0    4
1    7
2   -6
3    3
dtype: int64
bob      4
Steve    7
jeff    -6
Ryan     3
dtype: int64
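Assigning to `index` only works when the new list has exactly as many labels as the Series has values. A rough sketch of what happens otherwise, plus `rename` as an alternative for changing single labels (the label 'Bob' and the variable names are made up for the example):

```python
import pandas as pd

obj = pd.Series([4, 7, -6, 3])
obj.index = ['bob', 'Steve', 'jeff', 'Ryan']

# A list of the wrong length is rejected with a ValueError.
length_error = None
try:
    obj.index = ['a', 'b']
except ValueError as exc:
    length_error = exc
print(length_error)

# rename returns a new Series and changes only the labels it is given.
renamed = obj.rename({'bob': 'Bob'})
print(renamed)
```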
III: DataFrame
1: Definition
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.8, 3.2],
}
frame = pd.DataFrame(data)  # build a DataFrame from a dict of equal-length lists
print(frame)
>>>>
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.8
5  Nevada  2003  3.2
print(frame.head())  # head() shows only the first five rows
>>>>
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.8
print(pd.DataFrame(data, columns=['year', 'pop', 'state']))  # columns appear in the given order
>>>>
   year  pop   state
0  2000  1.5    Ohio
1  2001  1.7    Ohio
2  2002  3.6    Ohio
3  2001  2.4  Nevada
4  2002  2.8  Nevada
5  2003  3.2  Nevada
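`head()` defaults to five rows, which is why the sixth row is missing from its output. A small sketch showing that the row count can be passed explicitly, and that `tail` does the same from the other end (variable names are mine):

```python
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.8, 3.2],
}
frame = pd.DataFrame(data)

first_two = frame.head(2)  # first two rows
last_two = frame.tail(2)   # last two rows
print(first_two)
print(last_two)
```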
2.3: A requested column that is not in the data is filled with missing values (NaN)
# a column that cannot be found in the data is filled with NaN
# columns: column labels
# index: row labels
frame2 = pd.DataFrame(data,
                      columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])
print(frame2)
>>>>
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.8  NaN
six    2003  Nevada  3.2  NaN
2.4: Getting the column labels
print(frame2.columns)
>>>>
Index(['year', 'state', 'pop', 'debt'], dtype='object')
2.5: Retrieving a column by label (dict-style) or by attribute
# retrieve a single column
print(frame2['state'])
print(frame2.year)
print('>>>>>>>>>>>>>>>>>>')
print(frame2['year'])
>>>>
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64
>>>>>>>>>>>>>>>>>>
one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64
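Attribute access is convenient, but it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute. The 'pop' column in this very example is such a clash, because `DataFrame.pop` is a method. A minimal sketch (variable names are mine):

```python
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.8, 3.2],
}
frame2 = pd.DataFrame(data,
                      columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])

year_col = frame2.year   # fine: no attribute named "year" exists on DataFrame
pop_attr = frame2.pop    # the DataFrame.pop method, NOT the 'pop' column
pop_col = frame2['pop']  # bracket notation always returns the column

print(callable(pop_attr))
print(pop_col)
```

For that reason bracket notation is the safer habit, especially for column names that come from external data.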
2.6: Retrieving a whole row with the loc attribute
print(frame2.loc['three'])
>>>>
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
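`loc` selects by label. Its counterpart `iloc` selects by integer position, and `loc` also accepts lists of row and column labels. A brief sketch under the same `frame2` setup as above (variable names are mine):

```python
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.8, 3.2],
}
frame2 = pd.DataFrame(data,
                      columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])

by_label = frame2.loc['three']   # row whose label is 'three'
by_position = frame2.iloc[2]     # third row, counting from 0
subset = frame2.loc[['one', 'six'], ['year', 'pop']]  # chosen rows and columns

print(by_position)
print(subset)
```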
2.7: Modifying a column by assignment
frame2['debt'] = 16.5  # a scalar is broadcast to every row
print(frame2)
>>>>
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.8  16.5
six    2003  Nevada  3.2  16.5
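2.8: Assigning an array of values generated from a range
# Note: 'dabt' is not an existing column (probably a typo for 'debt'),
# so this adds a new column instead of overwriting 'debt'.
frame2['dabt'] = np.arange(6.)
print(frame2)
>>>>
       year   state  pop debt  dabt
one    2000    Ohio  1.5  NaN   0.0
two    2001    Ohio  1.7  NaN   1.0
three  2002    Ohio  3.6  NaN   2.0
four   2001  Nevada  2.4  NaN   3.0
five   2002  Nevada  2.8  NaN   4.0
six    2003  Nevada  3.2  NaN   5.0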
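Because assigning to a column name that does not exist creates a new column, the misspelled 'dabt' ends up next to 'debt' rather than replacing it. A small sketch of how this could be cleaned up, assuming the intent was to fill 'debt': remove the extra column with `del`, then assign to the intended name.

```python
import numpy as np
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.8, 3.2],
}
frame2 = pd.DataFrame(data,
                      columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])

frame2['dabt'] = np.arange(6.)   # creates a NEW column called 'dabt'
del frame2['dabt']               # remove the accidental column
frame2['debt'] = np.arange(6.)   # overwrite the existing 'debt' column
print(frame2)
```

The array length must match the number of rows (six here).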
2.9: Assigning a Series
print(frame2)
print(">>>>>>>>>>>>")
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
print(val)
print(">>>>>>>>>>>>>>")
frame2['debt'] = val  # values are aligned by index label; unmatched rows become NaN
print(frame2)
>>>>
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.8  NaN
six    2003  Nevada  3.2  NaN
>>>>>>>>>>>>
two    -1.2
four   -1.5
five   -1.7
dtype: float64
>>>>>>>>>>>>>>
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.8  -1.7
six    2003  Nevada  3.2   NaN
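The Series assignment works with only three values because pandas matches them to rows by index label. A plain list has no labels, so it has to match the number of rows exactly. A rough sketch of the contrast (the name `list_error` is mine):

```python
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.8, 3.2],
}
frame2 = pd.DataFrame(data,
                      columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])

# A Series is aligned by label: rows without a matching label get NaN.
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val

# A list of the wrong length cannot be aligned and raises ValueError.
list_error = None
try:
    frame2['debt'] = [-1.2, -1.5, -1.7]
except ValueError as exc:
    list_error = exc
print(list_error)
```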
2.10: Creating a boolean column from a comparison
frame2['eastern'] = frame2.state == 'Ohio'  # True where state is 'Ohio'
print(frame2)
>>>>
       year   state  pop debt  eastern
one    2000    Ohio  1.5  NaN     True
two    2001    Ohio  1.7  NaN     True
three  2002    Ohio  3.6  NaN     True
four   2001  Nevada  2.4  NaN    False
five   2002  Nevada  2.8  NaN    False
six    2003  Nevada  3.2  NaN    False
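A boolean column like 'eastern' can also be used directly to filter rows, much like the boolean filtering on a Series earlier. A minimal sketch (the names `eastern_rows` and `western_rows` are mine):

```python
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.8, 3.2],
}
frame2 = pd.DataFrame(data,
                      columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2['eastern'] = frame2['state'] == 'Ohio'

eastern_rows = frame2[frame2['eastern']]    # rows where the flag is True
western_rows = frame2[~frame2['eastern']]   # ~ inverts the boolean column

print(eastern_rows)
print(western_rows)
```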