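Series and DataFrame provide a large number of methods and operations for computing descriptive statistics. Most of them are aggregations such as sum(), mean(), and quantile(), whose result is smaller than the original data set; others, such as cumsum() and cumprod(), return a result of the same size as the original. In general these methods accept an axis argument, just like ndarray.{sum,std,…}, but here the axis can be specified by name or by integer:
Series: no axis argument needed
DataFrame:
"index" (axis=0, the default)
"columns" (axis=1)
For example: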
    In [77]: df
    Out[77]:
            one       two     three
    a  1.394981  1.772517       NaN
    b  0.343054  1.912123 -0.050390
    c  0.695246  1.478369  1.227435
    d       NaN  0.279344 -0.613172

    In [78]: df.mean(0)
    Out[78]:
    one      0.811094
    two      1.360588
    three    0.187958
    dtype: float64

    In [79]: df.mean(1)
    Out[79]:
    a    1.583749
    b    0.734929
    c    1.133683
    d   -0.166914
    dtype: float64

All of these methods support skipna, a keyword that specifies whether missing data should be excluded. It defaults to True.
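    In [80]: df.sum(0, skipna=False)
    Out[80]:
    one           NaN
    two      5.442353
    three         NaN
    dtype: float64

    In [81]: df.sum(axis=1, skipna=True)
    Out[81]:
    a    3.167498
    b    2.204786
    c    3.401050
    d   -0.333828
    dtype: float64

Combined with broadcasting and arithmetic operations, these methods can express various statistical procedures very concisely. Standardization, which rescales the data to zero mean and a standard deviation of 1, is one example: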
    In [82]: ts_stand = (df - df.mean()) / df.std()

    In [83]: ts_stand.std()
    Out[83]:
    one      1.0
    two      1.0
    three    1.0
    dtype: float64

    In [84]: xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

    In [85]: xs_stand.std(1)
    Out[85]:
    a    1.0
    b    1.0
    c    1.0
    d    1.0
    dtype: float64

Note: methods such as cumsum() and cumprod() preserve the location of NaN values. This differs somewhat from expanding() and rolling(); see the documentation for those methods for details.
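    In [86]: df.cumsum()
    Out[86]:
            one       two     three
    a  1.394981  1.772517       NaN
    b  1.738035  3.684640 -0.050390
    c  2.433281  5.163008  1.177045
    d       NaN  5.442353  0.563873

The table below summarizes the common functions. Each also takes an optional level parameter, which applies only if the object has a hierarchical index.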
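| Function | Description |
| --- | --- |
| count | Number of non-NA values |
| sum | Sum of values |
| mean | Mean of values |
| mad | Mean absolute deviation |
| median | Arithmetic median of values |
| min | Minimum |
| max | Maximum |
| mode | Mode |
| abs | Absolute value |
| prod | Product of values |
| std | Bessel-corrected sample standard deviation |
| var | Unbiased variance |
| sem | Standard error of the mean |
| skew | Sample skewness (3rd moment) |
| kurt | Sample kurtosis (4th moment) |
| quantile | Sample quantile (value at a given %) |
| cumsum | Cumulative sum |
| cumprod | Cumulative product |
| cummax | Cumulative maximum |
| cummin | Cumulative minimum |

Note: NumPy functions such as mean, std, and sum exclude missing values by default when they are given a Series, but not when they are given the underlying array: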
    In [87]: np.mean(df['one'])
    Out[87]: 0.8110935116651192

    In [88]: np.mean(df['one'].to_numpy())
    Out[88]: nan

Series.nunique() returns the number of unique non-NA values in a Series:
    In [89]: series = pd.Series(np.random.randn(500))

    In [90]: series[20:500] = np.nan

    In [91]: series[10:20] = 5

    In [92]: series.nunique()
    Out[92]: 11

The describe() function computes a variety of summary statistics for a Series or for the columns of a DataFrame. Missing values are excluded.
    In [93]: series = pd.Series(np.random.randn(1000))

    In [94]: series[::2] = np.nan

    In [95]: series.describe()
    Out[95]:
    count    500.000000
    mean      -0.021292
    std        1.015906
    min       -2.683763
    25%       -0.699070
    50%       -0.069718
    75%        0.714483
    max        3.160915
    dtype: float64

    In [96]: frame = pd.DataFrame(np.random.randn(1000, 5),
       ....:                      columns=['a', 'b', 'c', 'd', 'e'])
       ....:

    In [97]: frame.iloc[::2] = np.nan

    In [98]: frame.describe()
    Out[98]:
                    a           b           c           d           e
    count  500.000000  500.000000  500.000000  500.000000  500.000000
    mean     0.033387    0.030045   -0.043719   -0.051686    0.005979
    std      1.017152    0.978743    1.025270    1.015988    1.006695
    min     -3.000951   -2.637901   -3.303099   -3.159200   -3.188821
    25%     -0.647623   -0.576449   -0.712369   -0.691338   -0.691115
    50%      0.047578   -0.021499   -0.023888   -0.032652   -0.025363
    75%      0.729907    0.775880    0.618896    0.670047    0.649748
    max      2.740139    2.752332    3.004229    2.728702    3.240991

You can also choose which percentiles appear in the output:
    In [99]: series.describe(percentiles=[.05, .25, .75, .95])
    Out[99]:
    count    500.000000
    mean      -0.021292
    std        1.015906
    min       -2.683763
    5%        -1.645423
    25%       -0.699070
    50%       -0.069718
    75%        0.714483
    95%        1.711409
    max        3.160915
    dtype: float64

By default, the median is included in the output.
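The percentile rows reported by describe() can be cross-checked against quantile(). The following is a minimal sketch under stated assumptions: the seeded sample data and the variable names `values` and `summary` are illustrative and do not come from the session above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 1000 standard-normal draws, every other value missing.
rng = np.random.default_rng(0)
values = pd.Series(rng.standard_normal(1000))
values[::2] = np.nan

summary = values.describe(percentiles=[.05, .95])

# describe() skips NaN, so only the 500 remaining values are counted.
print(summary["count"])

# Each requested percentile row should agree with the matching quantile() call.
print(np.isclose(summary["5%"], values.quantile(0.05)))
print(np.isclose(summary["95%"], values.quantile(0.95)))
```

Reading the rows by label ("5%", "95%") keeps the check independent of how many other rows describe() happens to return.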
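For a non-numeric Series, describe() returns the number of non-null values, the number of unique values, the most frequent value, and how often that value occurs.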
    In [100]: s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])

    In [101]: s.describe()
    Out[101]:
    count     9
    unique    4
    top       a
    freq      5
    dtype: object

Note: on a mixed-type DataFrame, describe() restricts the summary to the numeric columns. If there are no numeric columns, it summarizes only the categorical columns.
    In [102]: frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})

    In [103]: frame.describe()
    Out[103]:
                  b
    count  4.000000
    mean   1.500000
    std    1.290994
    min    0.000000
    25%    0.750000
    50%    1.500000
    75%    2.250000
    max    3.000000

The include and exclude arguments each take a list of data types and control which types are summarized. The special value all is also accepted:
    In [104]: frame.describe(include=['object'])
    Out[104]:
              a
    count     4
    unique    2
    top     Yes
    freq      2

    In [105]: frame.describe(include=['number'])
    Out[105]:
                  b
    count  4.000000
    mean   1.500000
    std    1.290994
    min    0.000000
    25%    0.750000
    50%    1.500000
    75%    2.250000
    max    3.000000

    In [106]: frame.describe(include='all')
    Out[106]:
              a         b
    count     4  4.000000
    unique    2       NaN
    top     Yes       NaN
    freq      2       NaN
    mean    NaN  1.500000
    std     NaN  1.290994
    min     NaN  0.000000
    25%     NaN  0.750000
    50%     NaN  1.500000
    75%     NaN  2.250000
    max     NaN  3.000000

This feature relies on select_dtypes; see its documentation for the inputs it accepts.
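Since the text says this feature relies on select_dtypes, calling select_dtypes directly is one way to preview which columns a given include or exclude list would pick. This is only a sketch: the variable names `numeric` and `non_numeric` are illustrative, and `exclude=['number']` is used here as an assumption-free way to get the remaining columns without naming their dtype.

```python
import pandas as pd

frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})

# Columns whose dtype counts as numeric.
numeric = frame.select_dtypes(include=['number'])

# Everything else, selected by excluding the numeric dtypes.
non_numeric = frame.select_dtypes(exclude=['number'])

print(list(numeric.columns))      # ['b']
print(list(non_numeric.columns))  # ['a']

# describe() with the same include list summarizes the same columns.
print(list(frame.describe(include=['number']).columns))  # ['b']
```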
The idxmax() and idxmin() functions on Series and DataFrame return the index labels of the maximum and minimum values.
    In [107]: s1 = pd.Series(np.random.randn(5))

    In [108]: s1
    Out[108]:
    0    1.118076
    1   -0.352051
    2   -1.242883
    3   -1.277155
    4   -0.641184
    dtype: float64

    In [109]: s1.idxmin(), s1.idxmax()
    Out[109]: (3, 0)

    In [110]: df1 = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])

    In [111]: df1
    Out[111]:
              A         B         C
    0 -0.327863 -0.946180 -0.137570
    1 -0.186235 -0.257213 -0.486567
    2 -0.507027 -0.871259 -0.111110
    3  2.000339 -2.430505  0.089759
    4 -0.321434 -0.033695  0.096271

    In [112]: df1.idxmin(axis=0)
    Out[112]:
    A    2
    B    3
    C    1
    dtype: int64

    In [113]: df1.idxmax(axis=1)
    Out[113]:
    0    C
    1    A
    2    C
    3    A
    4    C
    dtype: object

When several rows (or columns) share the maximum or minimum value, idxmax() and idxmin() return the index of the first match:
    In [114]: df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))

    In [115]: df3
    Out[115]:
         A
    e  2.0
    d  1.0
    c  1.0
    b  3.0
    a  NaN

    In [116]: df3['A'].idxmin()
    Out[116]: 'd'

::: tip Note
idxmin and idxmax correspond to argmin and argmax in NumPy.
:::
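The correspondence with NumPy can be sketched as follows: the NumPy function returns an integer position, while idxmin() returns the index label at that position. The example reuses the values of df3 as a Series; `np.nanargmin` (the NaN-skipping variant of argmin) is chosen here as an assumption so that the result mirrors the default skipna behavior of pandas.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 1, 1, 3, np.nan], index=list('edcba'))

# pandas returns the label of the first minimum.
label = s.idxmin()

# NumPy returns the integer position of the first minimum, ignoring NaN.
position = int(np.nanargmin(s.to_numpy()))

print(label)              # d
print(position)           # 1
print(s.index[position])  # d
```

Looking up `s.index[position]` converts the NumPy position back into the label that idxmin() reports.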
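The value_counts() Series method and the top-level function of the same name compute a histogram of a one-dimensional array of values. The function can also be applied to regular arrays: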
    In [117]: data = np.random.randint(0, 7, size=50)

    In [118]: data
    Out[118]:
    array([6, 6, 2, 3, 5, 3, 2, 5, 4, 5, 4, 3, 4, 5, 0, 2, 0, 4, 2, 0, 3, 2,
           2, 5, 6, 5, 3, 4, 6, 4, 3, 5, 6, 4, 3, 6, 2, 6, 6, 2, 3, 4, 2, 1,
           6, 2, 6, 1, 5, 4])

    In [119]: s = pd.Series(data)

    In [120]: s.value_counts()
    Out[120]:
    6    10
    2    10
    4     9
    5     8
    3     8
    0     3
    1     2
    dtype: int64

    In [121]: pd.value_counts(data)
    Out[121]:
    6    10
    2    10
    4     9
    5     8
    3     8
    0     3
    1     2
    dtype: int64

Similarly, you can get the mode, that is, the most frequently occurring value or values, of a Series or DataFrame:
    In [122]: s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])

    In [123]: s5.mode()
    Out[123]:
    0    3
    1    7
    dtype: int64

    In [124]: df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
       .....:                     "B": np.random.randint(-10, 15, size=50)})
       .....:

    In [125]: df5.mode()
    Out[125]:
         A   B
    0  1.0  -9
    1  NaN  10
    2  NaN  13

Continuous values can be discretized with the cut() function (bins based on values) and the qcut() function (bins based on sample quantiles):
    In [126]: arr = np.random.randn(20)

    In [127]: factor = pd.cut(arr, 4)

    In [128]: factor
    Out[128]:
    [(-0.251, 0.464], (-0.968, -0.251], (0.464, 1.179], (-0.251, 0.464], (-0.968, -0.251], ..., (-0.251, 0.464], (-0.968, -0.251], (-0.968, -0.251], (-0.968, -0.251], (-0.968, -0.251]]
    Length: 20
    Categories (4, interval[float64]): [(-0.968, -0.251] < (-0.251, 0.464] < (0.464, 1.179] < (1.179, 1.893]]

    In [129]: factor = pd.cut(arr, [-5, -1, 0, 1, 5])

    In [130]: factor
    Out[130]:
    [(0, 1], (-1, 0], (0, 1], (0, 1], (-1, 0], ..., (-1, 0], (-1, 0], (-1, 0], (-1, 0], (-1, 0]]
    Length: 20
    Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]

qcut() computes sample quantiles. For example, the following code splits normally distributed data at evenly spaced quantiles:
    In [131]: arr = np.random.randn(30)

    In [132]: factor = pd.qcut(arr, [0, .25, .5, .75, 1])

    In [133]: factor
    Out[133]:
    [(0.569, 1.184], (-2.278, -0.301], (-2.278, -0.301], (0.569, 1.184], (0.569, 1.184], ..., (-0.301, 0.569], (1.184, 2.346], (1.184, 2.346], (-0.301, 0.569], (-2.278, -0.301]]
    Length: 30
    Categories (4, interval[float64]): [(-2.278, -0.301] < (-0.301, 0.569] < (0.569, 1.184] < (1.184, 2.346]]

    In [134]: pd.value_counts(factor)
    Out[134]:
    (1.184, 2.346]      8
    (-2.278, -0.301]    8
    (0.569, 1.184]      7
    (-0.301, 0.569]     7
    dtype: int64

Infinite values can also be passed when defining the bins:
    In [135]: arr = np.random.randn(20)

    In [136]: factor = pd.cut(arr, [-np.inf, 0, np.inf])

    In [137]: factor
    Out[137]:
    [(-inf, 0.0], (0.0, inf], (0.0, inf], (-inf, 0.0], (-inf, 0.0], ..., (-inf, 0.0], (-inf, 0.0], (-inf, 0.0], (0.0, inf], (0.0, inf]]
    Length: 20
    Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]
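To make the binning rules easier to follow, here is a small deterministic sketch. The input values, the variable names, and the use of the `labels` argument of cut() are illustrative assumptions and are not part of the session above. Because the bins are closed on the right by default, 0.0 falls into the first bin.

```python
import numpy as np
import pandas as pd

# cut(): bins defined by values, with infinite outer edges.
arr = np.array([-1.5, 0.0, 0.3, 2.0])
signs = pd.cut(arr, [-np.inf, 0, np.inf], labels=['non-positive', 'positive'])
print(list(signs))  # ['non-positive', 'non-positive', 'positive', 'positive']

# qcut(): bins defined by sample quantiles, so each bin receives
# the same number of observations when the values are distinct.
quartiles = pd.qcut(np.arange(1, 13), 4)
counts = pd.Series(quartiles).value_counts()
print(counts.tolist())  # [3, 3, 3, 3]
```

The contrast is the point of the example: cut() fixes the bin edges and lets the counts vary, while qcut() fixes the share of observations per bin and lets the edges vary.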