Pandas Chinese Official Docs ~ Essential Basic Functionality, Part 4

mac2026-02-21 19

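Reindexing and altering labels

reindex() is the fundamental data alignment method in pandas. Nearly every other feature that relies on label alignment is built on it. To reindex means to conform the data to match a given set of labels along a particular axis. This accomplishes several things:

Reorders the existing data to match a new set of labels;

Inserts missing value (NA) markers in label locations where no data for that label existed;

If specified, fills data for missing labels using logic (highly relevant to working with time series data).

Here is a simple example:

In [196]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [197]: s
Out[197]:
a    1.695148
b    1.328614
c    1.234686
d   -0.385845
e   -1.326508
dtype: float64

In [198]: s.reindex(['e', 'b', 'f', 'd'])
Out[198]:
e   -1.326508
b    1.328614
f         NaN
d   -0.385845
dtype: float64

Here, the f label was not contained in the original Series, so it appears as NaN in the result.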
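The same behavior can be reproduced with fixed values, which makes the result easier to check by eye. This is a minimal sketch; the variable name reindexed is chosen here for illustration and is not part of the original example.

```python
import pandas as pd

# Fixed values (instead of random ones) so the result is easy to verify.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=["a", "b", "c", "d", "e"])

# Reorder to a new label set; "f" has no data in s, so it becomes NaN.
reindexed = s.reindex(["e", "b", "f", "d"])
print(reindexed)

# reindex returns a new object; the original Series keeps its five labels.
print(s.index.tolist())
```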

With a DataFrame, you can reindex the index and the columns simultaneously:

In [199]: df
Out[199]:
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [200]: df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one'])
Out[200]:
      three       two       one
c  1.227435  1.478369  0.695246
f       NaN       NaN       NaN
b -0.050390  1.912123  0.343054

reindex also accepts the axis keyword:

In [201]: df.reindex(['c', 'f', 'b'], axis='index')
Out[201]:
        one       two     three
c  0.695246  1.478369  1.227435
f       NaN       NaN       NaN
b  0.343054  1.912123 -0.050390

Note that the Index objects containing the actual axis labels can be shared between objects. So if we have a Series and a DataFrame, the following can be done:

In [202]: rs = s.reindex(df.index)

In [203]: rs
Out[203]:
a    1.695148
b    1.328614
c    1.234686
d   -0.385845
dtype: float64

In [204]: rs.index is df.index
Out[204]: True

This means that the index of the reindexed Series is the same Python object as the index of the DataFrame.

New in version 0.21.0.

DataFrame.reindex() also supports an "axis-style" calling convention, where you specify a single labels argument and the axis it applies to.

In [205]: df.reindex(['c', 'f', 'b'], axis='index')
Out[205]:
        one       two     three
c  0.695246  1.478369  1.227435
f       NaN       NaN       NaN
b  0.343054  1.912123 -0.050390

In [206]: df.reindex(['three', 'two', 'one'], axis='columns')
Out[206]:
      three       two       one
a       NaN  1.772517  1.394981
b -0.050390  1.912123  0.343054
c  1.227435  1.478369  0.695246
d -0.613172  0.279344       NaN

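::: tip Note

The MultiIndex / Advanced Indexing section describes an even more concise way of reindexing.

:::

::: tip Note

When writing performance-sensitive code, it pays to understand reindex in depth: many operations are faster on pre-aligned data. Adding two unaligned DataFrames triggers a reindexing step internally. During exploratory analysis you will hardly notice the difference, because reindex has been heavily optimized, but when CPU cycles matter, a few explicit reindex calls can have an impact.

:::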
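To make the point about pre-alignment concrete, the sketch below compares implicit alignment during addition with an explicit reindex done once up front. The frames left and right and their values are made up for illustration; the sketch only shows that both routes give the same result, it does not measure performance.

```python
import pandas as pd

left = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])
right = pd.DataFrame({"x": [10.0, 20.0, 30.0]}, index=["b", "c", "d"])

# Implicit alignment: pandas reindexes both operands to the union of labels.
implicit = left + right

# Explicit alignment: reindex once, then operate on already aligned data.
labels = left.index.union(right.index)
left_aligned = left.reindex(labels)
right_aligned = right.reindex(labels)
explicit = left_aligned + right_aligned

print(implicit.equals(explicit))
```

If the same pair of objects is combined many times, aligning them once and reusing the aligned copies avoids repeating that work inside every operation.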

Reindexing to align with another object

You may wish to take an object and reindex its axes so that they carry the same labels as another object. The syntax for this is straightforward but verbose, and the operation is common enough that the reindex_like() method is available to make it simpler:

In [207]: df2
Out[207]:
        one       two
a  1.394981  1.772517
b  0.343054  1.912123
c  0.695246  1.478369

In [208]: df3
Out[208]:
        one       two
a  0.583888  0.051514
b -0.468040  0.191120
c -0.115848 -0.242634

In [209]: df.reindex_like(df2)
Out[209]:
        one       two
a  1.394981  1.772517
b  0.343054  1.912123
c  0.695246  1.478369
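As a rough illustration of what reindex_like() saves, the sketch below compares it with the longer reindex call that passes the other object's index and columns. The frame template is a hypothetical stand-in for df2, with made-up values.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, 4.0],
     "two": [5.0, 6.0, 7.0, 8.0],
     "three": [9.0, 10.0, 11.0, 12.0]},
    index=["a", "b", "c", "d"],
)

# Hypothetical object whose labels we want to copy; its values are irrelevant.
template = pd.DataFrame(0.0, index=["a", "b", "c"], columns=["one", "two"])

# Short form.
like = df.reindex_like(template)

# Verbose equivalent.
verbose = df.reindex(index=template.index, columns=template.columns)

print(like)
print(like.equals(verbose))
```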

Aligning objects with each other with align

The align() method is the fastest way to align two objects simultaneously. It supports a join argument (see joining and merging):

join='outer': take the union of the indexes (default)

join='left': use the calling object's index

join='right': use the passed object's index

join='inner': intersect the indexes

It returns a tuple containing both of the reindexed Series:

In [210]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [211]: s1 = s[:4]

In [212]: s2 = s[1:]

In [213]: s1.align(s2)
Out[213]:
(a   -0.186646
 b   -1.692424
 c   -0.303893
 d   -1.425662
 e         NaN
 dtype: float64,
 a         NaN
 b   -1.692424
 c   -0.303893
 d   -1.425662
 e    1.114285
 dtype: float64)

In [214]: s1.align(s2, join='inner')
Out[214]:
(b   -1.692424
 c   -0.303893
 d   -1.425662
 dtype: float64,
 b   -1.692424
 c   -0.303893
 d   -1.425662
 dtype: float64)

In [215]: s1.align(s2, join='left')
Out[215]:
(a   -0.186646
 b   -1.692424
 c   -0.303893
 d   -1.425662
 dtype: float64,
 a         NaN
 b   -1.692424
 c   -0.303893
 d   -1.425662
 dtype: float64)
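The four join options can be compared side by side by looking only at the labels each one produces. This is a small sketch with fixed values; the dictionary joined_labels is a helper introduced here for illustration.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=["a", "b", "c", "d", "e"])
s1 = s.iloc[:4]   # labels a, b, c, d
s2 = s.iloc[1:]   # labels b, c, d, e

# Collect the shared index that each join strategy produces.
joined_labels = {}
for how in ["outer", "inner", "left", "right"]:
    left, right = s1.align(s2, join=how)
    # Both returned Series always carry the same index.
    assert left.index.equals(right.index)
    joined_labels[how] = left.index.tolist()

for how, labels in joined_labels.items():
    print(how, labels)
```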

For DataFrames, the join method is applied to both the index and the columns by default:

In [216]: df.align(df2, join='inner')
Out[216]:
(        one       two
 a  1.394981  1.772517
 b  0.343054  1.912123
 c  0.695246  1.478369,
         one       two
 a  1.394981  1.772517
 b  0.343054  1.912123
 c  0.695246  1.478369)

You can also pass an axis option to align only on the specified axis:

In [217]: df.align(df2, join='inner', axis=0)
Out[217]:
(        one       two     three
 a  1.394981  1.772517       NaN
 b  0.343054  1.912123 -0.050390
 c  0.695246  1.478369  1.227435,
         one       two
 a  1.394981  1.772517
 b  0.343054  1.912123
 c  0.695246  1.478369)

If you pass a Series to DataFrame.align(), you can use the axis argument to choose whether the two objects are aligned on the DataFrame's index or on its columns:

In [218]: df.align(df2.iloc[0], axis=1)
Out[218]:
(        one     three       two
 a  1.394981       NaN  1.772517
 b  0.343054 -0.050390  1.912123
 c  0.695246  1.227435  1.478369
 d       NaN -0.613172  0.279344,
 one      1.394981
 three         NaN
 two      1.772517
 Name: a, dtype: float64)

reindex() takes an optional method parameter, a filling method chosen from the following table:

Method              Action
pad / ffill         Fill values forward
bfill / backfill    Fill values backward
nearest             Fill from the nearest index value

We illustrate these fill methods on a simple Series:

In [219]: rng = pd.date_range('1/3/2000', periods=8)

In [220]: ts = pd.Series(np.random.randn(8), index=rng)

In [221]: ts2 = ts.iloc[[0, 3, 6]]

In [222]: ts
Out[222]:
2000-01-03    0.183051
2000-01-04    0.400528
2000-01-05   -0.015083
2000-01-06    2.395489
2000-01-07    1.414806
2000-01-08    0.118428
2000-01-09    0.733639
2000-01-10   -0.936077
Freq: D, dtype: float64

In [223]: ts2
Out[223]:
2000-01-03    0.183051
2000-01-06    2.395489
2000-01-09    0.733639
dtype: float64

In [224]: ts2.reindex(ts.index)
Out[224]:
2000-01-03    0.183051
2000-01-04         NaN
2000-01-05         NaN
2000-01-06    2.395489
2000-01-07         NaN
2000-01-08         NaN
2000-01-09    0.733639
2000-01-10         NaN
Freq: D, dtype: float64

In [225]: ts2.reindex(ts.index, method='ffill')
Out[225]:
2000-01-03    0.183051
2000-01-04    0.183051
2000-01-05    0.183051
2000-01-06    2.395489
2000-01-07    2.395489
2000-01-08    2.395489
2000-01-09    0.733639
2000-01-10    0.733639
Freq: D, dtype: float64

In [226]: ts2.reindex(ts.index, method='bfill')
Out[226]:
2000-01-03    0.183051
2000-01-04    2.395489
2000-01-05    2.395489
2000-01-06    2.395489
2000-01-07    0.733639
2000-01-08    0.733639
2000-01-09    0.733639
2000-01-10         NaN
Freq: D, dtype: float64

In [227]: ts2.reindex(ts.index, method='nearest')
Out[227]:
2000-01-03    0.183051
2000-01-04    0.183051
2000-01-05    2.395489
2000-01-06    2.395489
2000-01-07    2.395489
2000-01-08    0.733639
2000-01-09    0.733639
2000-01-10    0.733639
Freq: D, dtype: float64

These methods require that the index is ordered, either increasing or decreasing.

Note that the same result could have been achieved with fillna (except for method='nearest') or interpolate. Recent pandas releases deprecate fillna(method=...), so the example below calls ffill() directly:

In [228]: ts2.reindex(ts.index).ffill()
Out[228]:
2000-01-03    0.183051
2000-01-04    0.183051
2000-01-05    0.183051
2000-01-06    2.395489
2000-01-07    2.395489
2000-01-08    2.395489
2000-01-09    0.733639
2000-01-10    0.733639
Freq: D, dtype: float64

reindex() raises a ValueError if the index is not monotonically increasing or decreasing. fillna() and interpolate() do not check the order of the index.
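The difference in index checking can be seen on a tiny Series with an unsorted integer index. This is a sketch with made-up values; the names unsorted, plain, and filled are illustrative only.

```python
import pandas as pd

# Index 3, 1, 2 is neither increasing nor decreasing.
unsorted = pd.Series([1.0, 2.0, 3.0], index=[3, 1, 2])

# Filling during reindex needs an ordered index, so this is rejected.
try:
    unsorted.reindex([1, 2, 3, 4], method="ffill")
    raised = False
except ValueError as exc:
    raised = True
    print("reindex refused:", exc)

# Reindexing without a fill method works regardless of index order...
plain = unsorted.reindex([1, 2, 3, 4])

# ...and ffill() then fills by position, without checking index order.
filled = plain.ffill()
print(filled)
```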

Limits on filling while reindexing

The limit and tolerance arguments provide additional control over filling while reindexing. limit specifies the maximum number of consecutive matches:

In [229]: ts2.reindex(ts.index, method='ffill', limit=1)
Out[229]:
2000-01-03    0.183051
2000-01-04    0.183051
2000-01-05         NaN
2000-01-06    2.395489
2000-01-07    2.395489
2000-01-08         NaN
2000-01-09    0.733639
2000-01-10    0.733639
Freq: D, dtype: float64

In contrast, tolerance specifies the maximum distance between the index and the indexer values:

In [230]: ts2.reindex(ts.index, method='ffill', tolerance='1 day')
Out[230]:
2000-01-03    0.183051
2000-01-04    0.183051
2000-01-05         NaN
2000-01-06    2.395489
2000-01-07    2.395489
2000-01-08         NaN
2000-01-09    0.733639
2000-01-10    0.733639
Freq: D, dtype: float64

Note: when used on a DatetimeIndex, TimedeltaIndex, or PeriodIndex, tolerance is coerced into a Timedelta if possible. This lets you specify tolerance with an appropriate string.
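The two controls can be checked on a deterministic version of the example above. In this sketch the values 0.0 to 7.0 replace the random data, so every filled value reveals which original row it came from.

```python
import pandas as pd

rng = pd.date_range("2000-01-03", periods=8)
ts = pd.Series(range(8), index=rng, dtype="float64")
ts2 = ts.iloc[[0, 3, 6]]   # keep only Jan 3, Jan 6 and Jan 9

# Forward-fill at most one consecutive missing label.
limited = ts2.reindex(ts.index, method="ffill", limit=1)

# Forward-fill only when the source label is at most one day away.
tolerant = ts2.reindex(ts.index, method="ffill", tolerance="1 day")

print(limited)
print(limited.equals(tolerant))
```

With daily data the two settings happen to agree; on an irregular index, limit counts positions while tolerance measures distance between labels, so they can give different results.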

Dropping labels from an axis

A method closely related to reindex is the drop() function. It removes a set of labels from an axis:

In [231]: df
Out[231]:
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [232]: df.drop(['a', 'd'], axis=0)
Out[232]:
        one       two     three
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435

In [233]: df.drop(['one'], axis=1)
Out[233]:
        two     three
a  1.772517       NaN
b  1.912123 -0.050390
c  1.478369  1.227435
d  0.279344 -0.613172

Note that the following also works, but is less obvious:

In [234]: df.reindex(df.index.difference(['a', 'd']))
Out[234]:
        one       two     three
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
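A short sketch, using made-up values, confirms that the two spellings give the same frame and shows one practical difference: with its default settings drop() raises a KeyError for a label that does not exist, whereas Index.difference() simply ignores it.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, 4.0], "two": [5.0, 6.0, 7.0, 8.0]},
    index=["a", "b", "c", "d"],
)

dropped = df.drop(["a", "d"], axis=0)
via_reindex = df.reindex(df.index.difference(["a", "d"]))
print(dropped.equals(via_reindex))

# "z" is not a label of df: drop() complains, difference() does not.
try:
    df.drop(["z"], axis=0)
    drop_raised = False
except KeyError:
    drop_raised = True

lenient = df.reindex(df.index.difference(["z"]))
print(drop_raised, lenient.equals(df))
```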

Renaming / mapping labels

The rename() method lets you relabel an axis based on a mapping (a dict or Series) or an arbitrary function.

In [235]: s
Out[235]:
a   -0.186646
b   -1.692424
c   -0.303893
d   -1.425662
e    1.114285
dtype: float64

In [236]: s.rename(str.upper)
Out[236]:
A   -0.186646
B   -1.692424
C   -0.303893
D   -1.425662
E    1.114285
dtype: float64

If you pass a function, it must return a value when called with any of the labels, and it must produce a set of unique values. A dict or Series can also be used:

In [237]: df.rename(columns={'one': 'foo', 'two': 'bar'},
   .....:           index={'a': 'apple', 'b': 'banana', 'd': 'durian'})
   .....:
Out[237]:
             foo       bar     three
apple   1.394981  1.772517       NaN
banana  0.343054  1.912123 -0.050390
c       0.695246  1.478369  1.227435
durian       NaN  0.279344 -0.613172

If the mapping does not include a column or index label, that label is not renamed. Note that extra labels in the mapping do not raise an error.
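Both rules can be verified on a small frame. In the sketch below, the key "missing" is deliberately absent from the frame, and all names and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1, 2], "two": [3, 4], "three": [5, 6]},
    index=["a", "b"],
)

# "missing" is not a column of df; it is ignored rather than raising.
renamed = df.rename(
    columns={"one": "foo", "missing": "ignored"},
    index={"a": "apple"},
)
print(renamed)

# Labels that the mapping does not mention keep their names,
# and the original frame is left untouched.
print(renamed.columns.tolist(), renamed.index.tolist())
print(df.columns.tolist())

# A function is applied to every label on the chosen axis.
upper = df.rename(columns=str.upper)
print(upper.columns.tolist())
```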

New in version 0.21.0.

DataFrame.rename() also supports an "axis-style" calling convention, where you specify a single mapper and the axis to apply that mapping to.

In [238]: df.rename({'one': 'foo', 'two': 'bar'}, axis='columns')
Out[238]:
        foo       bar     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [239]: df.rename({'a': 'apple', 'b': 'banana', 'd': 'durian'}, axis='index')
Out[239]:
             one       two     three
apple   1.394981  1.772517       NaN
banana  0.343054  1.912123 -0.050390
c       0.695246  1.478369  1.227435
durian       NaN  0.279344 -0.613172

The rename() method also provides an inplace named parameter. It defaults to False, in which case the underlying data is copied. Pass inplace=True to rename the data in place.

New in version 0.18.0.

Finally, rename() also accepts a scalar or list-like for altering the Series.name attribute.

In [240]: s.rename("scalar-name")
Out[240]:
a   -0.186646
b   -1.692424
c   -0.303893
d   -1.425662
e    1.114285
Name: scalar-name, dtype: float64
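Since Series.rename() does two different jobs depending on the argument, a side-by-side sketch may help. The values and the names named and relabeled are illustrative only.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# A scalar sets Series.name and leaves the index labels alone.
named = s.rename("scalar-name")
print(named.name, named.index.tolist())

# A mapping (or function) changes index labels and leaves the name alone.
relabeled = s.rename({"a": "A"})
print(relabeled.name, relabeled.index.tolist())

# By default neither call modifies the original Series.
print(s.name, s.index.tolist())
```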

New in version 0.24.0.

The rename_axis() method allows specific names of a MultiIndex to be changed (as opposed to the labels).

In [241]: df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
   .....:                    'y': [10, 20, 30, 40, 50, 60]},
   .....:                   index=pd.MultiIndex.from_product([['a', 'b', 'c'], [1, 2]],
   .....:                   names=['let', 'num']))
   .....:

In [242]: df
Out[242]:
         x   y
let num
a   1    1  10
    2    2  20
b   1    3  30
    2    4  40
c   1    5  50
    2    6  60

In [243]: df.rename_axis(index={'let': 'abc'})
Out[243]:
         x   y
abc num
a   1    1  10
    2    2  20
b   1    3  30
    2    4  40
c   1    5  50
    2    6  60

In [244]: df.rename_axis(index=str.upper)
Out[244]:
         x   y
LET NUM
a   1    1  10
    2    2  20
b   1    3  30
    2    4  40
c   1    5  50
    2    6  60

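Iteration

The behavior of basic iteration over pandas objects depends on the type. A Series is treated as array-like, and basic iteration produces its values. A DataFrame follows the dict-like convention of iterating over the keys of the object.

In short, basic iteration (for i in object) produces:

Series: values

DataFrame: column labels

For example, iterating over a DataFrame gives you the column names:

In [245]: df = pd.DataFrame({'col1': np.random.randn(3),
   .....:                    'col2': np.random.randn(3)}, index=['a', 'b', 'c'])
   .....:

In [246]: for col in df:
   .....:     print(col)
   .....:
col1
col2

pandas objects also have the dict-like items() method to iterate over (key, value) pairs.

To iterate over the rows of a DataFrame, you can use the following methods:

iterrows(): iterate over the rows of a DataFrame as (index, Series) pairs. This converts each row to a Series, which can change the dtypes and has performance implications.

`itertuples()`: iterate over the rows of a DataFrame as namedtuples of the values. This is much faster than `iterrows()`, and is in most cases the preferable way to iterate over the values of a DataFrame.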

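::: danger Warning

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed and can be avoided with one of the following approaches:

Look for a vectorized solution: many operations can be performed using built-in methods or NumPy functions, (boolean) indexing, …

When you have a function that cannot work on the full DataFrame / Series at once, it is better to use `apply()` instead of iterating over the values. See the docs on function application.

If you need to do iterative manipulations on the values but performance is important, consider writing the inner loop with cython or numba. See the enhancing performance section for some examples of this approach.

:::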
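As a rough illustration of the first two alternatives, the sketch below computes the same total three ways. The columns price and qty are hypothetical, and the sketch shows equivalence of results only, not timings.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Row-by-row loop: works, but slow on large frames.
loop_total = 0.0
for row in df.itertuples(index=False):
    loop_total += row.price * row.qty

# Vectorized: one column-wise multiplication followed by a sum.
vector_total = (df["price"] * df["qty"]).sum()

# apply(): for logic that needs a whole row and has no vectorized form.
# (This particular function is simple enough that it would not need apply.)
apply_total = df.apply(lambda r: r["price"] * r["qty"], axis=1).sum()

print(loop_total, vector_total, apply_total)
```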

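::: danger Warning

You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect!

For example, in the following case setting the value has no effect:

In [247]: df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

In [248]: for index, row in df.iterrows():
   .....:     row['a'] = 10
   .....:

In [249]: df
Out[249]:
   a  b
0  1  a
1  2  b
2  3  c

:::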
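When the goal is to change values, assigning through the DataFrame itself avoids the problem. The sketch below repeats the failed loop and then uses .loc with a boolean mask, one of several possible ways to write such an update; the condition on column b is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})

# Writing to the row yielded by iterrows() changes a copy, not df.
for index, row in df.iterrows():
    row["a"] = 10
unchanged = df["a"].tolist()
print(unchanged)

# Assign through the DataFrame instead: here only where column b equals "b".
df.loc[df["b"] == "b", "a"] = 10
print(df)
```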

items

Consistent with the dict-like interface, items() iterates through key-value pairs:

Series: (Index, scalar value) pairs

DataFrame: (column, Series) pairs

For example:

In [250]: for label, ser in df.items():
   .....:     print(label)
   .....:     print(ser)
   .....:
a
0    1
1    2
2    3
Name: a, dtype: int64
b
0    a
1    b
2    c
Name: b, dtype: object

iterrows

iterrows() lets you iterate through the rows of a DataFrame as Series objects. It returns an iterator that yields each index value along with a Series containing the data in that row:

In [251]: for row_index, row in df.iterrows():
   .....:     print(row_index, row, sep='\n')
   .....:
0
a    1
b    a
Name: 0, dtype: object
1
a    2
b    b
Name: 1, dtype: object
2
a    3
b    c
Name: 2, dtype: object

::: tip Note

Because iterrows() returns a Series for each row, it does not preserve dtypes across the rows (in a DataFrame, dtypes are preserved across columns).

For example:

In [252]: df_orig = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])

In [253]: df_orig.dtypes
Out[253]:
int        int64
float    float64
dtype: object

In [254]: row = next(df_orig.iterrows())[1]

In [255]: row
Out[255]:
int      1.0
float    1.5
Name: 0, dtype: float64

All values in row, returned as a Series, are now upcast to floats, including the original integer value in column int:

In [256]: row['int'].dtype
Out[256]: dtype('float64')

In [257]: df_orig['int'].dtype
Out[257]: dtype('int64')

To preserve dtypes while iterating over the rows, it is better to use itertuples(), which returns namedtuples of the values and is generally much faster than iterrows().

:::
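The dtype difference can be observed directly by pulling the first row out of each iterator. This is a minimal sketch reusing the df_orig frame from the note; the names row and tup are illustrative.

```python
import pandas as pd

df_orig = pd.DataFrame([[1, 1.5]], columns=["int", "float"])

# iterrows(): the row is a Series, so the integer is upcast to float.
_, row = next(df_orig.iterrows())
print(type(row["int"]).__name__, row["int"])

# itertuples(): each field keeps the type implied by its column.
tup = next(df_orig.itertuples())
print(type(tup.int).__name__, tup.int)
```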

For instance, a contrived way to transpose a DataFrame would be:

In [258]: df2 = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

In [259]: print(df2)
   x  y
0  1  4
1  2  5
2  3  6

In [260]: print(df2.T)
   0  1  2
x  1  2  3
y  4  5  6

In [261]: df2_t = pd.DataFrame({idx: values for idx, values in df2.iterrows()})

In [262]: print(df2_t)
   0  1  2
x  1  2  3
y  4  5  6

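itertuples

The itertuples() method returns an iterator that yields a namedtuple for each row in the DataFrame. The first element of the tuple is the row's index value, and the remaining elements are the row's values.

For example:

In [263]: for row in df.itertuples():
   .....:     print(row)
   .....:
Pandas(Index=0, a=1, b='a')
Pandas(Index=1, a=2, b='b')
Pandas(Index=2, a=3, b='c')

This method does not convert the row to a Series; it just returns the values inside a namedtuple. Therefore, itertuples() preserves the data type of the values and is generally faster than iterrows().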

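::: tip Note

Column names are renamed to positional names if they are invalid Python identifiers, are repeated, or start with an underscore. With a large number of columns (more than 255), regular tuples are returned.

:::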
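The renaming rule is easiest to see by inspecting the field names of a yielded row. In the sketch below the column names are invented to trigger the rule, and the last lines show that passing name=None asks for plain tuples instead.

```python
import pandas as pd

# "my col" is not a valid identifier and "_hidden" starts with an underscore.
df = pd.DataFrame({"my col": [1], "_hidden": [2], "ok": [3]})

row = next(df.itertuples())
fields = row._fields
print(fields)

# The renamed fields are reachable by their positional names.
print(row._1, row._2, row.ok)

# name=None yields regular tuples; index=False leaves out the index value.
plain = next(df.itertuples(index=False, name=None))
print(plain)
```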

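Dainiao says: "Translating is hard work: five days of translating, four rounds of fact-checking, three passes of proofreading, and two rounds of typesetting, all for one second of your like."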
