Pandas is a widely used data-analysis tool. This handbook summarizes its usage and works well as a quick-reference cheat sheet.
Edited on 2020-03-26 15:56:34
Pandas Handbook
pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language. By convention, numpy and pandas are imported as:
import numpy as np
import pandas as pd
The example DataFrame df used throughout this handbook:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
Notes:
1. This content is based on the official "10 minutes to pandas" tutorial: https://pandas.pydata.org/docs/getting_started/10min.html
2. Once familiar with this material, follow the links in the "Getting started" section, especially Pandas cheat sheets A and B.
Getting started
Install: conda install pandas / pip install pandas
Official site
Getting started tutorials
API reference
User guide
Pandas cheat sheet A
Pandas cheat sheet B
File I/O
Read/write CSV files: df.to_csv("file_name.csv") / pd.read_csv("file_name.csv")
Read/write HDF5 files: df.to_hdf("file_name.h5", "key") / pd.read_hdf("file_name.h5", "key"); key distinguishes the DataFrames stored in one file and can be omitted when the file contains only one.
read_hdf(path_or_buf, key=None, mode='r', errors='strict', where=None, start=None, stop=None, columns=None, iterator=False, chunksize=None, **kwargs)
Read from the store, closing it if we opened it. Retrieves the pandas object stored in the file, optionally based on a where criterion.
Parameters:
- path_or_buf: str, path object, pandas.HDFStore, or file-like object. Any valid string path is acceptable, including URLs (http, ftp, s3, file); a local file could be ``file://localhost/path/to/table.h5``. pandas also accepts any ``os.PathLike``, an open ``pandas.HDFStore``, or any object with a ``read()`` method (e.g. a file handle or ``StringIO``). Support for the __fspath__ protocol was added in 0.21.0.
- key: object, optional. The group identifier in the store; can be omitted if the HDF file contains a single pandas object.
- mode: {'r', 'r+', 'a'}, default 'r'. Mode used when opening the file; ignored if path_or_buf is an HDFStore.
- where: list, optional. A list of Term (or convertible) objects.
- start, stop: int, optional. Row numbers at which to start/stop the selection.
- columns: list, optional. Column names to return.
- iterator: bool, optional. Return an iterator object.
- chunksize: int, optional. Rows per iteration when using an iterator.
- errors: str, default 'strict'. How encoding/decoding errors are handled; see the errors argument of ``open``.
- **kwargs: additional keyword arguments passed to HDFStore.
Returns: the selected object; the return type depends on what was stored.
See also: DataFrame.to_hdf (write a DataFrame to an HDF file), HDFStore (low-level access to HDF files).
Example:
>>> df = pd.DataFrame([[1, 1.0, 'a']], columns=['x', 'y', 'z'])
>>> df.to_hdf('./store.h5', 'data')
>>> reread = pd.read_hdf('./store.h5')
Read/write Excel files: df.to_excel('foo.xlsx', sheet_name='Sheet1') / pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Object creation
Create a Series from a list of values: pd.Series(list)
Example: create a Series with the default integer index
In [3]: s = pd.Series([1, 3, 5, np.nan, 6, 8])
In [4]: s
Out[4]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
Create a DataFrame from a NumPy array: pd.DataFrame(np.array)
Example: create a DataFrame with a datetime index and columns A-D
In [5]: dates = pd.date_range('20130101', periods=6)
In [6]: dates
Out[6]: DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')
In [7]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
In [8]: df
Out[8]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
Create a DataFrame from a dict: pd.DataFrame(dict)
In [9]: df2 = pd.DataFrame({'A': 1.,
   ...:                     'B': pd.Timestamp('20130102'),
   ...:                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
   ...:                     'D': np.array([3] * 4, dtype='int32'),
   ...:                     'E': pd.Categorical(["test", "train", "test", "train"]),
   ...:                     'F': 'foo'})
In [10]: df2
Out[10]:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
Viewing data
View the first/last n rows: df.head(n) / df.tail(n); n defaults to 5
In [13]: df.head()
Out[13]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
In [14]: df.tail(3)
Out[14]:
                   A         B         C         D
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
View the index/columns: df.index / df.columns (these are attributes, not methods)
In [15]: df.index
Out[15]: DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')
In [16]: df.columns
Out[16]: Index(['A', 'B', 'C', 'D'], dtype='object')
View summary statistics: df.describe()
In [19]: df.describe()
Out[19]:
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804
Column means / row means: df.mean() / df.mean(1); these operations generally exclude missing data.
In [61]: df.mean()
Out[61]:
A   -0.004474
B   -0.383981
C   -0.687758
D    5.000000
F    3.000000
dtype: float64
In [62]: df.mean(1)
Out[62]:
2013-01-01    0.872735
2013-01-02    1.431621
2013-01-03    0.707731
2013-01-04    1.395042
2013-01-05    1.883656
2013-01-06    1.592306
Freq: D, dtype: float64
Count the occurrences of each value in a Series: Series.value_counts()
In [68]: s = pd.Series(np.random.randint(0, 7, size=10))
In [69]: s
Out[69]:
0    4
1    2
2    1
3    2
4    6
5    4
6    4
7    6
8    4
9    4
dtype: int64
In [70]: s.value_counts()
Out[70]:
4    5
6    2
2    2
1    1
dtype: int64
Transpose: df.T
In [20]: df.T
Out[20]:
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988
Convert to a NumPy array: DataFrame.to_numpy()
DataFrame.to_numpy() drops the DataFrame's index and column labels.
In [17]: df.to_numpy()
Out[17]:
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
       [ 1.2121, -0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949,  1.0718],
       [ 0.7216, -0.7068, -1.0396,  0.2719],
       [-0.425 ,  0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784,  0.525 ]])
In [18]: df2.to_numpy()
Out[18]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']], dtype=object)
Sort by column labels: df.sort_index(axis=1, ascending=False)
In [21]: df.sort_index(axis=1, ascending=False)
Out[21]:
                   D         C         B         A
2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
2013-01-02 -1.044236  0.119209 -0.173215  1.212112
2013-01-03  1.071804 -0.494929 -2.104569 -0.861849
2013-01-04  0.271860 -1.039575 -0.706771  0.721555
2013-01-05 -1.087401  0.276232  0.567020 -0.424972
2013-01-06  0.524988 -1.478427  0.113648 -0.673690
Sort by a column's values (ascending by default): df.sort_values(by="B", ascending=True)
In [22]: df.sort_values(by='B')
Out[22]:
                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
Selecting data
Select a single column by name (yields a Series): df["A"] / df.A
In [23]: df['A']
Out[23]:
2013-01-01    0.469112
2013-01-02    1.212112
2013-01-03   -0.861849
2013-01-04    0.721555
2013-01-05   -0.424972
2013-01-06   -0.673690
Freq: D, Name: A, dtype: float64
Select rows by position slice (end excluded) or by index-label slice (end included): df[0:3] / df[index1:index3]
In [24]: df[0:3]
Out[24]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
In [25]: df['20130102':'20130104']
Out[25]:
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
Select a row by index label: df.loc[row_index_label]
In [26]: df.loc[dates[0]]
Out[26]:
A    0.469112
B   -0.282863
C   -1.509059
D   -1.135632
Name: 2013-01-01 00:00:00, dtype: float64
Select multiple columns by label: df.loc[:, [column_A, column_B]]
In [27]: df.loc[:, ['A', 'B']]
Out[27]:
                   A         B
2013-01-01  0.469112 -0.282863
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
2013-01-06 -0.673690  0.113648
Select a region by index-label slice and column labels (the slice end is included): df.loc[index1:index3, [column_A, column_B]]
In [28]: df.loc['20130102':'20130104', ['A', 'B']]
Out[28]:
                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
Select a single value by index label and column label: df.loc[index1, column_A] / df.at[index1, column_A]
In [30]: df.loc[dates[0], 'A']
Out[30]: 0.4691122999071863
In [31]: df.at[dates[0], 'A']
Out[31]: 0.4691122999071863
Label-based selection: df.loc
Select a row by position: df.iloc[3]
In [32]: df.iloc[3]
Out[32]:
A    0.721555
B   -0.706771
C   -1.039575
D    0.271860
Name: 2013-01-04 00:00:00, dtype: float64
Select rows by position slice: df.iloc[1:3, :]
In [35]: df.iloc[1:3, :]
Out[35]:
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
Select columns by position slice: df.iloc[:, 1:3]
In [36]: df.iloc[:, 1:3]
Out[36]:
                   B         C
2013-01-01 -0.282863 -1.509059
2013-01-02 -0.173215  0.119209
2013-01-03 -2.104569 -0.494929
2013-01-04 -0.706771 -1.039575
2013-01-05  0.567020  0.276232
2013-01-06  0.113648 -1.478427
Select a region by row and column position slices: df.iloc[3:5, 0:2]
In [33]: df.iloc[3:5, 0:2]
Out[33]:
                   A         B
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
Select specific cells by lists of row and column positions: df.iloc[[1, 2, 4], [0, 2]]
In [34]: df.iloc[[1, 2, 4], [0, 2]]
Out[34]:
                   A         C
2013-01-02  1.212112  0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972  0.276232
Select a single value by position: df.iloc[1, 1] / df.iat[1, 1]
In [37]: df.iloc[1, 1]
Out[37]: -0.17321464905330858
In [38]: df.iat[1, 1]
Out[38]: -0.17321464905330858
Position-based selection: df.iloc
Filter rows by a column's values: df[df['A'] > 0]
In [39]: df[df['A'] > 0]
Out[39]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
Apply a condition to every value; cells that fail the test become NaN: df[df > 0]
In [40]: df[df > 0]
Out[40]:
                   A         B         C         D
2013-01-01  0.469112       NaN       NaN       NaN
2013-01-02  1.212112       NaN  0.119209       NaN
2013-01-03       NaN       NaN       NaN  1.071804
2013-01-04  0.721555       NaN       NaN  0.271860
2013-01-05       NaN  0.567020  0.276232       NaN
2013-01-06       NaN  0.113648       NaN  0.524988
Filter with isin(): df2[df2['E'].isin(['two', 'four'])]
In [41]: df2 = df.copy()
In [42]: df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
In [43]: df2
Out[43]:
                   A         B         C         D      E
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632    one
2013-01-02  1.212112 -0.173215  0.119209 -1.044236    one
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804    two
2013-01-04  0.721555 -0.706771 -1.039575  0.271860  three
2013-01-05 -0.424972  0.567020  0.276232 -1.087401   four
2013-01-06 -0.673690  0.113648 -1.478427  0.524988  three
In [44]: df2[df2['E'].isin(['two', 'four'])]
Out[44]:
                   A         B         C         D     E
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four
Boolean indexing: a True/False mask selects/excludes data
Column operations
Select one column: df["columns_A"] / df.columns_A
Select several columns: df[["columns_A", "columns_B"]]
Select columns matching a regular expression: df.filter(regex='regex')
Rename columns: df.rename(columns={"y": "year"})
Drop columns: df.drop(columns=["columns_A", "columns_B"])
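A minimal sketch tying these column operations together; the column names ("y", "val_a", "val_b") are illustrative, not from the tutorial:

```python
import pandas as pd

df = pd.DataFrame({"y": [2019, 2020], "val_a": [1, 2], "val_b": [3, 4]})

# Keep only columns whose names match a regex (here: names starting with "val")
subset = df.filter(regex="^val")

# Rename a column; returns a new DataFrame (pass inplace=True to modify df itself)
renamed = df.rename(columns={"y": "year"})

# Drop columns by name; also returns a new DataFrame
dropped = df.drop(columns=["val_a", "val_b"])

print(list(subset.columns))
print(list(renamed.columns))
print(list(dropped.columns))
```

Note that filter, rename, and drop all leave the original df untouched by default.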
Setting data
Set a single value by label: df.at[dates[0], 'A'] = 0
Set a single value by position: df.iat[0, 1] = 0
Set a whole column: df.loc[:, 'D'] = np.array([5] * len(df))
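The three assignment forms above can be combined in one small sketch (the all-ones DataFrame is illustrative, standing in for the tutorial's df):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.ones((6, 4)), index=dates, columns=list("ABCD"))

df.at[dates[0], "A"] = 0                   # one cell, addressed by labels
df.iat[0, 1] = 0                           # one cell, addressed by positions (row 0, col "B")
df.loc[:, "D"] = np.array([5] * len(df))   # replace an entire column

print(df.iloc[0].tolist())
```

Unlike the column operations above, these assignments mutate df in place.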
Missing data
pandas primarily uses the value np.nan to represent missing data.
Drop any rows containing missing values: df1.dropna(how='any')
In [57]: df1
Out[57]:
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000 -1.509059  5  NaN  1.0
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0  NaN
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0  NaN
In [58]: df1.dropna(how='any')
Out[58]:
                   A         B         C  D    F    E
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
Fill all missing values with a given value: df1.fillna(value=5)
In [59]: df1.fillna(value=5)
Out[59]:
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000 -1.509059  5  5.0  1.0
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0  5.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0  5.0
Get a boolean mask of which values are missing: pd.isna(df1)
In [60]: pd.isna(df1)
Out[60]:
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True
Miscellaneous operations
Reindexing: df.reindex
Reindexing lets you change/add/delete the index on a specified axis; it returns a copy of the data.
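A small sketch of reindex, following the tutorial's pattern; the extra column "E" and the all-ones data are illustrative:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.ones((6, 4)), index=dates, columns=list("ABCD"))

# Keep only the first 4 dates and add a new column E;
# positions that did not exist before are filled with NaN
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0]:dates[1], "E"] = 1

print(df1.shape)
print(df1["E"].tolist())
```

Because reindex returns a copy, the original df still has 6 rows and no column E.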
Subtraction: df.sub(Series, axis='index')
In [63]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
In [51]: df
Out[51]:
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059  5  NaN
2013-01-02  1.212112 -0.173215  0.119209  5  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0
2013-01-05 -0.424972  0.567020  0.276232  5  4.0
2013-01-06 -0.673690  0.113648 -1.478427  5  5.0
In [64]: s
Out[64]:
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64
In [65]: df.sub(s, axis='index')
Out[65]:
                   A         B         C    D    F
2013-01-01       NaN       NaN       NaN  NaN  NaN
2013-01-02       NaN       NaN       NaN  NaN  NaN
2013-01-03 -1.861849 -3.104569 -1.494929  4.0  1.0
2013-01-04 -2.278445 -3.706771 -4.039575  2.0  0.0
2013-01-05 -5.424972 -4.432980 -4.723768  0.0 -1.0
2013-01-06       NaN       NaN       NaN  NaN  NaN
Apply a function: df.apply(f)
In [66]: df.apply(np.cumsum)
Out[66]:
                   A         B         C   D     F
2013-01-01  0.000000  0.000000 -1.509059   5   NaN
2013-01-02  1.212112 -0.173215 -1.389850  10   1.0
2013-01-03  0.350263 -2.277784 -1.884779  15   3.0
2013-01-04  1.071818 -2.984555 -2.924354  20   6.0
2013-01-05  0.646846 -2.417535 -2.648122  25  10.0
2013-01-06 -0.026844 -2.303886 -4.126549  30  15.0
In [67]: df.apply(lambda x: x.max() - x.min())
Out[67]:
A    2.073961
B    2.671590
C    1.785291
D    0.000000
F    4.000000
dtype: float64
Series string methods: Series.str.lower()
Series is equipped with a set of string-processing methods in its str attribute that make it easy to operate on each element of the array, as in the snippet below. Note that pattern matching in str generally uses regular expressions by default (and in some cases always does).
In [71]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In [72]: s.str.lower()
Out[72]:
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
Pivot tables: pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
In [101]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
   .....:                    'B': ['A', 'B', 'C'] * 4,
   .....:                    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
   .....:                    'D': np.random.randn(12),
   .....:                    'E': np.random.randn(12)})
In [102]: df
Out[102]:
        A  B    C         D         E
0     one  A  foo -1.202872  0.047609
1     one  B  foo -1.814470 -0.136473
2     two  C  foo  1.018601 -0.561757
3   three  A  bar -0.595447 -1.623033
4     one  B  bar  1.395433  0.029399
5     one  C  bar -0.392670 -0.542108
6     two  A  foo  0.007207  0.282696
7   three  B  foo  1.928123 -0.087302
8     one  C  foo -0.055224 -1.575170
9     one  A  bar  2.395985  1.771208
10    two  B  bar  1.552825  0.816482
11  three  C  bar  0.166599  1.100230
In [103]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[103]:
C             bar       foo
A     B
one   A  2.395985 -1.202872
      B  1.395433 -1.814470
      C -0.392670 -0.055224
three A -0.595447       NaN
      B       NaN  1.928123
      C  0.166599       NaN
two   A       NaN  0.007207
      B  1.552825       NaN
      C       NaN  1.018601
Merging and grouping
Adding a column to a DataFrame is relatively fast, but adding a row requires a copy and can be expensive. Instead of building a DataFrame by iteratively appending records to it, pass a pre-built list of records to the DataFrame constructor.
How the combining operations differ:
concat: concatenates multiple objects along an axis.
merge: joins rows of different DataFrames on one or more keys.
join types: inner takes the intersection of keys, outer takes the union.
"Group by" refers to a process involving one or more of the following steps:
splitting the data into groups based on some criteria
applying a function to each group independently
combining the results into a data structure
Concatenate DataFrames row-wise: pd.concat([df1, df2, df3])
In [73]: df = pd.DataFrame(np.random.randn(10, 4))
In [74]: df
Out[74]:
          0         1         2         3
0 -0.548702  1.467327 -1.015962 -0.483075
1  1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952  0.991460 -0.919069  0.266046
3 -0.709661  1.669052  1.037882 -1.705775
4 -0.919854 -0.042379  1.247642 -0.009920
5  0.290213  0.495767  0.362949  1.548106
6 -1.131345 -0.089329  0.337863 -0.945867
7 -0.932132  1.956030  0.017587 -0.016692
8 -0.575247  0.254161 -1.143704  0.215897
9  1.193555 -0.077118 -0.408530 -0.862495
# break it into pieces
In [75]: pieces = [df[:3], df[3:7], df[7:]]
In [76]: pd.concat(pieces)
Out[76]:
          0         1         2         3
0 -0.548702  1.467327 -1.015962 -0.483075
1  1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952  0.991460 -0.919069  0.266046
3 -0.709661  1.669052  1.037882 -1.705775
4 -0.919854 -0.042379  1.247642 -0.009920
5  0.290213  0.495767  0.362949  1.548106
6 -1.131345 -0.089329  0.337863 -0.945867
7 -0.932132  1.956030  0.017587 -0.016692
8 -0.575247  0.254161 -1.143704  0.215897
9  1.193555 -0.077118 -0.408530 -0.862495
Concatenate DataFrames column-wise: pd.concat([df1, df2, df3], axis=1)
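Column-wise concatenation aligns on the index (an outer join by default); a minimal sketch with toy frames:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({"B": [3, 4]}, index=[0, 1])
df3 = pd.DataFrame({"C": [5, 6]}, index=[1, 2])  # partially overlapping index

# axis=1 places the frames side by side; rows are matched by index label,
# and labels missing from a frame produce NaN in its columns
wide = pd.concat([df1, df2, df3], axis=1)
print(wide)
```

Here row 2 exists only in df3, so its A and B cells come out as NaN.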
SQL-style join on one or more keys: pd.merge(df1, df2, on="key")
Example 1:
In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
In [78]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
In [79]: left
Out[79]:
   key  lval
0  foo     1
1  foo     2
In [80]: right
Out[80]:
   key  rval
0  foo     4
1  foo     5
In [81]: pd.merge(left, right, on='key')
Out[81]:
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
Example 2:
In [82]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
In [83]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
In [84]: left
Out[84]:
   key  lval
0  foo     1
1  bar     2
In [85]: right
Out[85]:
   key  rval
0  foo     4
1  bar     5
In [86]: pd.merge(left, right, on='key')
Out[86]:
   key  lval  rval
0  foo     1     4
1  bar     2     5
Group by one or more columns, then aggregate: df.groupby('A').sum()
In [87]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ....:                          'foo', 'bar', 'foo', 'foo'],
   ....:                    'B': ['one', 'one', 'two', 'three',
   ....:                          'two', 'two', 'one', 'three'],
   ....:                    'C': np.random.randn(8),
   ....:                    'D': np.random.randn(8)})
In [88]: df
Out[88]:
     A      B         C         D
0  foo    one  1.346061 -1.577585
1  bar    one  1.511763  0.396823
2  foo    two  1.627081 -0.105381
3  bar  three -0.990582 -0.532532
4  foo    two -0.441652  1.453749
5  bar    two  1.211526  1.208843
6  foo    one  0.268520 -0.080952
7  foo  three  0.024580 -0.264610
In [89]: df.groupby('A').sum()
Out[89]:
            C         D
A
bar  1.732707  1.073134
foo  2.824590 -0.574779
In [90]: df.groupby(['A', 'B']).sum()
Out[90]:
                  C         D
A   B
bar one    1.511763  0.396823
    three -0.990582 -0.532532
    two    1.211526  1.208843
foo one    1.614581 -1.658537
    three  0.024580 -0.264610
    two    1.185429  1.348368
Time series
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (for example, converting secondly data into 5-minutely data). This is extremely common in financial applications, but not limited to them.
Resampling: Series.resample
In [104]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
In [105]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
In [106]: ts.resample('5Min').sum()
Out[106]:
2012-01-01    24182
Freq: 5T, dtype: int64
Time-zone conversion: Series.tz_convert()
In [107]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
In [108]: ts = pd.Series(np.random.randn(len(rng)), rng)
In [109]: ts
Out[109]:
2012-03-06    1.857704
2012-03-07   -1.193545
2012-03-08    0.677510
2012-03-09   -0.153931
2012-03-10    0.520091
Freq: D, dtype: float64
In [110]: ts_utc = ts.tz_localize('UTC')
In [111]: ts_utc
Out[111]:
2012-03-06 00:00:00+00:00    1.857704
2012-03-07 00:00:00+00:00   -1.193545
2012-03-08 00:00:00+00:00    0.677510
2012-03-09 00:00:00+00:00   -0.153931
2012-03-10 00:00:00+00:00    0.520091
Freq: D, dtype: float64
In [112]: ts_utc.tz_convert('US/Eastern')
Out[112]:
2012-03-05 19:00:00-05:00    1.857704
2012-03-06 19:00:00-05:00   -1.193545
2012-03-07 19:00:00-05:00    0.677510
2012-03-08 19:00:00-05:00   -0.153931
2012-03-09 19:00:00-05:00    0.520091
Freq: D, dtype: float64
Categoricals
pandas can include categorical data in a DataFrame.
Convert to a categorical dtype: df["raw_grade"].astype("category")
In [123]: df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
   .....:                    "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
In [124]: df["grade"] = df["raw_grade"].astype("category")
In [125]: df["grade"]
Out[125]:
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
Rename the categories: df["grade"].cat.categories = ["very good", "good", "very bad"]
Rename the categories to more meaningful names (assigning to Series.cat.categories operates in place!):
In [126]: df["grade"].cat.categories = ["very good", "good", "very bad"]
Reorder the categories and add new ones: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
In [127]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium",
   .....:                                               "good", "very good"])
In [128]: df["grade"]
Out[128]:
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
Sort in category order (not lexical order): df.sort_values(by="grade")
In [129]: df.sort_values(by="grade")
Out[129]:
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good
Group by a categorical column (empty categories are shown too): df.groupby("grade").size()
In [130]: df.groupby("grade").size()
Out[130]:
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
Plotting
Line plot: df.plot()
Histogram: df.plot.hist()
Scatter plot: df.plot.scatter(x="columns_A", y="columns_B")
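A minimal, non-interactive sketch of the three plot calls. It assumes matplotlib is installed and uses the Agg backend so no display window is needed; the column names "columns_A"/"columns_B" follow the placeholders above:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render without a display
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(50, 2), columns=["columns_A", "columns_B"])

ax_line = df.plot()  # line plot: one line per column, index on the x-axis

plt.figure()  # start a fresh figure so the plots do not overlap
ax_hist = df["columns_A"].plot.hist(bins=10)  # histogram of one column

ax_scat = df.plot.scatter(x="columns_A", y="columns_B")  # scatter of A vs B

# Each call returns a matplotlib Axes, which can be customized further
ax_line.set_title("line plot")
```

Each plotting method returns the Axes it drew on, so titles, labels, and limits can be adjusted afterwards with the usual matplotlib API.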