Dask DataFrame API with Logical Query Planning

DataFrame

`DataFrame`(expr)	DataFrame-like Expr collection.
`DataFrame.abs`()	Return a Series/DataFrame with the absolute numeric value of each element.
`DataFrame.add`(other[, axis, level, fill_value])
`DataFrame.align`(other[, join, axis, fill_value])	Align two objects on their axes with the specified join method.
`DataFrame.all`([axis, skipna, split_every])	Return whether all elements are True, potentially over an axis.
`DataFrame.any`([axis, skipna, split_every])	Return whether any element is True, potentially over an axis.
`DataFrame.apply`(function, *args[, meta, axis])	Parallel version of pandas.DataFrame.apply.
`DataFrame.assign`(**pairs)	Assign new columns to a DataFrame.
`DataFrame.astype`(dtypes)	Cast a pandas object to a specified dtype `dtype`.
`DataFrame.bfill`([axis, limit])	Fill NA/NaN values by using the next valid observation.
`DataFrame.categorize`([columns, index, ...])	Convert columns of the DataFrame to category dtype.
`DataFrame.columns`
`DataFrame.compute`(**kwargs)	Compute this dask collection.
`DataFrame.copy`([deep])	Make a copy of the dataframe.
`DataFrame.corr`([method, min_periods, ...])	Compute pairwise correlation of columns, excluding NA/null values.
`DataFrame.count`([axis, numeric_only, ...])	Count non-NA cells for each column or row.
`DataFrame.cov`([min_periods, numeric_only, ...])	Compute pairwise covariance of columns, excluding NA/null values.
`DataFrame.cummax`([axis, skipna])	Return cumulative maximum over a DataFrame or Series axis.
`DataFrame.cummin`([axis, skipna])	Return cumulative minimum over a DataFrame or Series axis.
`DataFrame.cumprod`([axis, skipna])	Return cumulative product over a DataFrame or Series axis.
`DataFrame.cumsum`([axis, skipna])	Return cumulative sum over a DataFrame or Series axis.
`DataFrame.describe`([split_every, ...])	Generate descriptive statistics.
`DataFrame.diff`([periods, axis])	First discrete difference of element.
`DataFrame.div`(other[, axis, level, fill_value])
`DataFrame.divide`(other[, axis, level, ...])
`DataFrame.divisions`	Tuple of `npartitions + 1` values, in ascending order, marking the lower/upper bounds of each partition's index.
`DataFrame.drop`([labels, axis, columns, errors])	Drop specified labels from rows or columns.
`DataFrame.drop_duplicates`([subset, ...])	Return DataFrame with duplicate rows removed.
`DataFrame.dropna`([how, subset, thresh])	Remove missing values.
`DataFrame.dtypes`	Return data types.
`DataFrame.eq`(other[, level, axis])
`DataFrame.eval`(expr, **kwargs)	Evaluate a string describing operations on DataFrame columns.
`DataFrame.explode`(column)	Transform each element of a list-like to a row, replicating index values.
`DataFrame.ffill`([axis, limit])	Fill NA/NaN values by propagating the last valid observation to the next valid position.
`DataFrame.fillna`([value, axis])	Fill NA/NaN values using the specified method.
`DataFrame.floordiv`(other[, axis, level, ...])
`DataFrame.ge`(other[, level, axis])
`DataFrame.get_partition`(n)	Get a dask DataFrame/Series representing the nth partition.
`DataFrame.groupby`(by[, group_keys, sort, ...])	Group DataFrame using a mapper or by a Series of columns.
`DataFrame.gt`(other[, level, axis])
`DataFrame.head`([n, npartitions, compute])	First n rows of the dataset.
`DataFrame.idxmax`([axis, skipna, ...])	Return index of first occurrence of maximum over requested axis.
`DataFrame.idxmin`([axis, skipna, ...])	Return index of first occurrence of minimum over requested axis.
`DataFrame.iloc`	Purely integer-location based indexing for selection by position.
`DataFrame.index`	Return dask Index instance.
`DataFrame.info`([buf, verbose, memory_usage])	Concise summary of a Dask DataFrame.
`DataFrame.isin`(values)	Whether each element in the DataFrame is contained in values.
`DataFrame.isna`()	Detect missing values.
`DataFrame.isnull`()	DataFrame.isnull is an alias for DataFrame.isna.
`DataFrame.items`()	Iterate over (column name, Series) pairs.
`DataFrame.iterrows`()	Iterate over DataFrame rows as (index, Series) pairs.
`DataFrame.itertuples`([index, name])	Iterate over DataFrame rows as namedtuples.
`DataFrame.join`(other[, on, how, lsuffix, ...])	Join columns of another DataFrame.
`DataFrame.known_divisions`	Whether the divisions are known.
`DataFrame.le`(other[, level, axis])
`DataFrame.loc`	Purely label-location based indexer for selection by label.
`DataFrame.lt`(other[, level, axis])
`DataFrame.map_partitions`(func, *args[, ...])	Apply a Python function to each partition.
`DataFrame.mask`(cond[, other])	Replace values where the condition is True.
`DataFrame.max`([axis, skipna, numeric_only, ...])	Return the maximum of the values over the requested axis.
`DataFrame.mean`([axis, skipna, numeric_only, ...])	Return the mean of the values over the requested axis.
`DataFrame.median`([axis, numeric_only])	Return the median of the values over the requested axis.
`DataFrame.median_approximate`([axis, method, ...])	Return the approximate median of the values over the requested axis.
`DataFrame.melt`([id_vars, value_vars, ...])	Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
`DataFrame.memory_usage`([deep, index])	Return the memory usage of each column in bytes.
`DataFrame.memory_usage_per_partition`([...])	Return the memory usage of each partition.
`DataFrame.merge`(right[, how, on, left_on, ...])	Merge the DataFrame with another DataFrame.
`DataFrame.min`([axis, skipna, numeric_only, ...])	Return the minimum of the values over the requested axis.
`DataFrame.mod`(other[, axis, level, fill_value])
`DataFrame.mode`([dropna, split_every, ...])	Get the mode(s) of each element along the selected axis.
`DataFrame.mul`(other[, axis, level, fill_value])
`DataFrame.ndim`	Return dimensionality.
`DataFrame.ne`(other[, level, axis])
`DataFrame.nlargest`([n, columns, split_every])	Return the first n rows ordered by columns in descending order.
`DataFrame.npartitions`	Return number of partitions.
`DataFrame.nsmallest`([n, columns, split_every])	Return the first n rows ordered by columns in ascending order.
`DataFrame.partitions`	Slice dataframe by partitions.
`DataFrame.persist`([fuse])	Persist this dask collection into memory.
`DataFrame.pivot_table`(index, columns, values)	Create a spreadsheet-style pivot table as a DataFrame.
`DataFrame.pop`(item)	Return item and drop it from the frame.
`DataFrame.pow`(other[, axis, level, fill_value])
`DataFrame.prod`([axis, skipna, numeric_only, ...])	Return the product of the values over the requested axis.
`DataFrame.quantile`([q, axis, numeric_only, ...])	Approximate row-wise and precise column-wise quantiles of DataFrame.
`DataFrame.query`(expr, **kwargs)	Filter dataframe with a complex expression.
`DataFrame.radd`(other[, axis, level, fill_value])
`DataFrame.random_split`(frac[, random_state, ...])	Pseudorandomly split dataframe into different pieces row-wise.
`DataFrame.rdiv`(other[, axis, level, fill_value])
`DataFrame.rename`([index, columns])	Rename columns or index labels.
`DataFrame.rename_axis`([mapper, index, ...])	Set the name of the axis for the index or columns.
`DataFrame.repartition`([divisions, ...])	Repartition a collection.
`DataFrame.replace`([to_replace, value, regex])	Replace values given in to_replace with value.
`DataFrame.resample`(rule[, closed, label])	Resample time-series data.
`DataFrame.reset_index`([drop])	Reset the index to the default index.
`DataFrame.rfloordiv`(other[, axis, level, ...])
`DataFrame.rmod`(other[, axis, level, fill_value])
`DataFrame.rmul`(other[, axis, level, fill_value])
`DataFrame.round`([decimals])	Round a DataFrame to a variable number of decimal places.
`DataFrame.rpow`(other[, axis, level, fill_value])
`DataFrame.rsub`(other[, axis, level, fill_value])
`DataFrame.rtruediv`(other[, axis, level, ...])
`DataFrame.sample`([n, frac, replace, ...])	Random sample of items.
`DataFrame.select_dtypes`([include, exclude])	Return a subset of the DataFrame's columns based on the column dtypes.
`DataFrame.sem`([axis, skipna, ddof, ...])	Return unbiased standard error of the mean over requested axis.
`DataFrame.set_index`(other[, drop, sorted, ...])	Set the DataFrame index (row labels) using an existing column.
`DataFrame.shape`
`DataFrame.shuffle`([on, ignore_index, ...])	Rearrange DataFrame into new partitions.
`DataFrame.size`	Size of the Series or DataFrame as a Delayed object.
`DataFrame.sort_values`(by[, npartitions, ...])	Sort the dataset by a single column.
`DataFrame.squeeze`([axis])	Squeeze 1-dimensional axis objects into scalars.
`DataFrame.std`([axis, skipna, ddof, ...])	Return sample standard deviation over requested axis.
`DataFrame.sub`(other[, axis, level, fill_value])
`DataFrame.sum`([axis, skipna, numeric_only, ...])	Return the sum of the values over the requested axis.
`DataFrame.tail`([n, compute])	Last n rows of the dataset.
`DataFrame.to_backend`([backend])	Move to a new DataFrame backend.
`DataFrame.to_bag`([index, format])	Create a Dask Bag from a Series.
`DataFrame.to_csv`(filename, **kwargs)	See the dd.to_csv docstring for more information.
`DataFrame.to_dask_array`([lengths, meta, ...])	Convert a Dask DataFrame to a Dask array.
`DataFrame.to_delayed`([optimize_graph])	Convert into a list of `dask.delayed` objects, one per partition.
`DataFrame.to_hdf`(path_or_buf, key[, mode, ...])	See the dd.to_hdf docstring for more information.
`DataFrame.to_html`([max_rows])	Render a DataFrame as an HTML table.
`DataFrame.to_json`(filename, *args, **kwargs)	See the dd.to_json docstring for more information.
`DataFrame.to_orc`(path, *args, **kwargs)	See the dd.to_orc docstring for more information.
`DataFrame.to_parquet`(path, **kwargs)
`DataFrame.to_records`([index, lengths])
`DataFrame.to_string`([max_rows])	Render a DataFrame to a console-friendly tabular output.
`DataFrame.to_sql`(name, uri[, schema, ...])
`DataFrame.to_timestamp`([freq, how])	Cast to DatetimeIndex of timestamps, at the beginning of the period.
`DataFrame.truediv`(other[, axis, level, ...])
`DataFrame.values`	Return a Dask array of the values of this DataFrame.
`DataFrame.var`([axis, skipna, ddof, ...])	Return unbiased variance over requested axis.
`DataFrame.visualize`([tasks])	Visualize the expression or task graph.
`DataFrame.where`(cond[, other])	Replace values where the condition is False.

Series

`Series`(expr)	Series-like Expr collection.
`Series.add`(other[, level, fill_value, axis])
`Series.align`(other[, join, axis, fill_value])	Align two objects on their axes with the specified join method.
`Series.all`([axis, skipna, split_every])	Return whether all elements are True, potentially over an axis.
`Series.any`([axis, skipna, split_every])	Return whether any element is True, potentially over an axis.
`Series.apply`(function, *args[, meta, axis])	Parallel version of pandas.Series.apply.
`Series.astype`(dtypes)	Cast a pandas object to a specified dtype `dtype`.
`Series.autocorr`([lag, split_every])	Compute the lag-N autocorrelation.
`Series.between`(left, right[, inclusive])	Return boolean Series equivalent to left <= series <= right.
`Series.bfill`([axis, limit])	Fill NA/NaN values by using the next valid observation.
`Series.clear_divisions`()	Forget division information.
`Series.clip`([lower, upper, axis])	Trim values at input threshold(s).
`Series.compute`(**kwargs)	Compute this dask collection.
`Series.copy`([deep])	Make a copy of the dataframe.
`Series.corr`(other[, method, min_periods, ...])	Compute correlation with other Series, excluding missing values.
`Series.count`([axis, numeric_only, split_every])	Count non-NA cells for each column or row.
`Series.cov`(other[, min_periods, split_every])	Compute covariance with Series, excluding missing values.
`Series.cummax`([axis, skipna])	Return cumulative maximum over a DataFrame or Series axis.
`Series.cummin`([axis, skipna])	Return cumulative minimum over a DataFrame or Series axis.
`Series.cumprod`([axis, skipna])	Return cumulative product over a DataFrame or Series axis.
`Series.cumsum`([axis, skipna])	Return cumulative sum over a DataFrame or Series axis.
`Series.describe`([split_every, percentiles, ...])	Generate descriptive statistics.
`Series.diff`([periods, axis])	First discrete difference of element.
`Series.div`(other[, level, fill_value, axis])
`Series.drop_duplicates`([ignore_index, ...])
`Series.dropna`()	Return a new Series with missing values removed.
`Series.dtype`
`Series.eq`(other[, level, fill_value, axis])
`Series.explode`()	Transform each element of a list-like to a row.
`Series.ffill`([axis, limit])	Fill NA/NaN values by propagating the last valid observation to the next valid position.
`Series.fillna`([value, axis])	Fill NA/NaN values using the specified method.
`Series.floordiv`(other[, level, fill_value, axis])
`Series.ge`(other[, level, fill_value, axis])
`Series.get_partition`(n)	Get a dask DataFrame/Series representing the nth partition.
`Series.groupby`(by, **kwargs)	Group Series using a mapper or by a Series of columns.
`Series.gt`(other[, level, fill_value, axis])
`Series.head`([n, npartitions, compute])	First n rows of the dataset.
`Series.idxmax`([axis, skipna, numeric_only, ...])	Return index of first occurrence of maximum over requested axis.
`Series.idxmin`([axis, skipna, numeric_only, ...])	Return index of first occurrence of minimum over requested axis.
`Series.isin`(values)	Whether each element in the DataFrame is contained in values.
`Series.isna`()	Detect missing values.
`Series.isnull`()	DataFrame.isnull is an alias for DataFrame.isna.
`Series.known_divisions`	Whether the divisions are known.
`Series.le`(other[, level, fill_value, axis])
`Series.loc`	Purely label-location based indexer for selection by label.
`Series.lt`(other[, level, fill_value, axis])
`Series.map`(arg[, na_action, meta])	Map values of Series according to an input mapping or function.
`Series.map_overlap`(func, before, after, *args)	Apply a function to each partition, sharing rows with adjacent partitions.
`Series.map_partitions`(func, *args[, meta, ...])	Apply a Python function to each partition.
`Series.mask`(cond[, other])	Replace values where the condition is True.
`Series.max`([axis, skipna, numeric_only, ...])	Return the maximum of the values over the requested axis.
`Series.mean`([axis, skipna, numeric_only, ...])	Return the mean of the values over the requested axis.
`Series.median`()	Return the median of the values over the requested axis.
`Series.median_approximate`([method])	Return the approximate median of the values over the requested axis.
`Series.memory_usage`([deep, index])	Return the memory usage of the Series.
`Series.memory_usage_per_partition`([index, deep])	Return the memory usage of each partition.
`Series.min`([axis, skipna, numeric_only, ...])	Return the minimum of the values over the requested axis.
`Series.mod`(other[, level, fill_value, axis])
`Series.mul`(other[, level, fill_value, axis])
`Series.nbytes`	Number of bytes.
`Series.ndim`	Return dimensionality.
`Series.ne`(other[, level, fill_value, axis])
`Series.nlargest`([n, split_every])	Return the largest n elements.
`Series.notnull`()	DataFrame.notnull is an alias for DataFrame.notna.
`Series.nsmallest`([n, split_every])	Return the smallest n elements.
`Series.nunique`([dropna, split_every, split_out])	Return number of unique elements in the object.
`Series.nunique_approx`([split_every])	Approximate number of unique elements.
`Series.persist`([fuse])	Persist this dask collection into memory.
`Series.pipe`(func, *args, **kwargs)	Apply chainable functions that expect Series or DataFrames.
`Series.pow`(other[, level, fill_value, axis])
`Series.prod`([axis, skipna, numeric_only, ...])	Return the product of the values over the requested axis.
`Series.quantile`([q, method])	Approximate quantiles of Series.
`Series.radd`(other[, level, fill_value, axis])
`Series.random_split`(frac[, random_state, ...])	Pseudorandomly split dataframe into different pieces row-wise.
`Series.rdiv`(other[, level, fill_value, axis])
`Series.repartition`([divisions, npartitions, ...])	Repartition a collection.
`Series.replace`([to_replace, value, regex])	Replace values given in to_replace with value.
`Series.rename`(index[, sorted_index])	Alter Series index labels or name.
`Series.resample`(rule[, closed, label])	Resample time-series data.
`Series.reset_index`([drop])	Reset the index to the default index.
`Series.rolling`(window, **kwargs)	Provides rolling transformations.
`Series.round`([decimals])	Round a DataFrame to a variable number of decimal places.
`Series.sample`([n, frac, replace, random_state])	Random sample of items.
`Series.sem`([axis, skipna, ddof, ...])	Return unbiased standard error of the mean over requested axis.
`Series.shape`	Return a tuple representing the dimensionality of the DataFrame.
`Series.shift`([periods, freq, axis])	Shift index by the desired number of periods, with an optional time freq.
`Series.size`	Size of the Series or DataFrame as a Delayed object.
`Series.std`([axis, skipna, ddof, ...])	Return sample standard deviation over requested axis.
`Series.sub`(other[, level, fill_value, axis])
`Series.sum`([axis, skipna, numeric_only, ...])	Return the sum of the values over the requested axis.
`Series.to_backend`([backend])	Move to a new DataFrame backend.
`Series.to_bag`([index, format])	Create a Dask Bag from a Series.
`Series.to_csv`(filename, **kwargs)	See the dd.to_csv docstring for more information.
`Series.to_dask_array`([lengths, meta, optimize])	Convert a Dask DataFrame to a Dask array.
`Series.to_delayed`([optimize_graph])	Convert into a list of `dask.delayed` objects, one per partition.
`Series.to_frame`([name])	Convert Series to DataFrame.
`Series.to_hdf`(path_or_buf, key[, mode, append])	See the dd.to_hdf docstring for more information.
`Series.to_string`([max_rows])	Render a string representation of the Series.
`Series.to_timestamp`([freq, how])	Cast to DatetimeIndex of timestamps, at the beginning of the period.
`Series.truediv`(other[, level, fill_value, axis])
`Series.unique`([split_every, split_out, ...])	Return Series of unique values in the object.
`Series.value_counts`([sort, ascending, ...])	Return a Series containing counts of unique values.
`Series.values`	Return a Dask array of the values of this DataFrame.
`Series.var`([axis, skipna, ddof, ...])	Return unbiased variance over requested axis.
`Series.visualize`([tasks])	Visualize the expression or task graph.
`Series.where`(cond[, other])	Replace values where the condition is False.

Index

`Index`(expr)	Index-like Expr collection.
`Index.add`(other[, level, fill_value, axis])
`Index.align`(other[, join, axis, fill_value])	Align two objects on their axes with the specified join method.
`Index.all`([axis, skipna, split_every])	Return whether all elements are True, potentially over an axis.
`Index.any`([axis, skipna, split_every])	Return whether any element is True, potentially over an axis.
`Index.apply`(function, *args[, meta, axis])	Parallel version of pandas.Series.apply.
`Index.astype`(dtypes)	Cast a pandas object to a specified dtype `dtype`.
`Index.autocorr`([lag, split_every])	Compute the lag-N autocorrelation.
`Index.between`(left, right[, inclusive])	Return boolean Series equivalent to left <= series <= right.
`Index.bfill`([axis, limit])	Fill NA/NaN values by using the next valid observation.
`Index.clear_divisions`()	Forget division information.
`Index.clip`([lower, upper, axis])	Trim values at input threshold(s).
`Index.compute`(**kwargs)	Compute this dask collection.
`Index.copy`([deep])	Make a copy of the dataframe.
`Index.corr`(other[, method, min_periods, ...])	Compute correlation with other Series, excluding missing values.
`Index.count`([split_every])	Count non-NA cells for each column or row.
`Index.cov`(other[, min_periods, split_every])	Compute covariance with Series, excluding missing values.
`Index.cummax`([axis, skipna])	Return cumulative maximum over a DataFrame or Series axis.
`Index.cummin`([axis, skipna])	Return cumulative minimum over a DataFrame or Series axis.
`Index.cumprod`([axis, skipna])	Return cumulative product over a DataFrame or Series axis.
`Index.cumsum`([axis, skipna])	Return cumulative sum over a DataFrame or Series axis.
`Index.describe`([split_every, percentiles, ...])	Generate descriptive statistics.
`Index.diff`([periods, axis])	First discrete difference of element.
`Index.div`(other[, level, fill_value, axis])
`Index.drop_duplicates`([ignore_index, ...])
`Index.dropna`()	Return a new Series with missing values removed.
`Index.dtype`
`Index.eq`(other[, level, fill_value, axis])
`Index.explode`()	Transform each element of a list-like to a row.
`Index.ffill`([axis, limit])	Fill NA/NaN values by propagating the last valid observation to the next valid position.
`Index.fillna`([value, axis])	Fill NA/NaN values using the specified method.
`Index.floordiv`(other[, level, fill_value, axis])
`Index.ge`(other[, level, fill_value, axis])
`Index.get_partition`(n)	Get a dask DataFrame/Series representing the nth partition.
`Index.groupby`(by, **kwargs)	Group Series using a mapper or by a Series of columns.
`Index.gt`(other[, level, fill_value, axis])
`Index.head`([n, npartitions, compute])	First n rows of the dataset.
`Index.is_monotonic_decreasing`	Return a boolean indicating whether the values in the object are monotonically decreasing.
`Index.is_monotonic_increasing`	Return a boolean indicating whether the values in the object are monotonically increasing.
`Index.isin`(values)	Whether each element in the DataFrame is contained in values.
`Index.isna`()	Detect missing values.
`Index.isnull`()	DataFrame.isnull is an alias for DataFrame.isna.
`Index.known_divisions`	Whether the divisions are known.
`Index.le`(other[, level, fill_value, axis])
`Index.loc`	Purely label-location based indexer for selection by label.
`Index.lt`(other[, level, fill_value, axis])
`Index.map`(arg[, na_action, meta, is_monotonic])	Map values using an input mapping or function.
`Index.map_overlap`(func, before, after, *args)	Apply a function to each partition, sharing rows with adjacent partitions.
`Index.map_partitions`(func, *args[, meta, ...])	Apply a Python function to each partition.
`Index.mask`(cond[, other])	Replace values where the condition is True.
`Index.max`([axis, skipna, numeric_only, ...])	Return the maximum of the values over the requested axis.
`Index.median`()	Return the median of the values over the requested axis.
`Index.median_approximate`([method])	Return the approximate median of the values over the requested axis.
`Index.memory_usage`([deep])	Memory usage of the values.
`Index.memory_usage_per_partition`([index, deep])	Return the memory usage of each partition.
`Index.min`([axis, skipna, numeric_only, ...])	Return the minimum of the values over the requested axis.
`Index.mod`(other[, level, fill_value, axis])
`Index.mul`(other[, level, fill_value, axis])
`Index.nbytes`	Number of bytes.
`Index.ndim`	Return dimensionality.
`Index.ne`(other[, level, fill_value, axis])
`Index.nlargest`([n, split_every])	Return the largest n elements.
`Index.notnull`()	DataFrame.notnull is an alias for DataFrame.notna.
`Index.nsmallest`([n, split_every])	Return the smallest n elements.
`Index.nunique`([dropna, split_every, split_out])	Return number of unique elements in the object.
`Index.nunique_approx`([split_every])	Approximate number of unique elements.
`Index.persist`([fuse])	Persist this dask collection into memory.
`Index.pipe`(func, *args, **kwargs)	Apply chainable functions that expect Series or DataFrames.
`Index.pow`(other[, level, fill_value, axis])
`Index.quantile`([q, method])	Approximate quantiles of Series.
`Index.radd`(other[, level, fill_value, axis])
`Index.random_split`(frac[, random_state, shuffle])	Pseudorandomly split dataframe into different pieces row-wise.
`Index.rdiv`(other[, level, fill_value, axis])
`Index.rename`(index[, sorted_index])	Alter Series index labels or name.
`Index.repartition`([divisions, npartitions, ...])	Repartition a collection.
`Index.replace`([to_replace, value, regex])	Replace values given in to_replace with value.
`Index.resample`(rule[, closed, label])	Resample time-series data.
`Index.reset_index`([drop])	Reset the index to the default index.
`Index.rolling`(window, **kwargs)	Provides rolling transformations.
`Index.round`([decimals])	Round a DataFrame to a variable number of decimal places.
`Index.sample`([n, frac, replace, random_state])	Random sample of items.
`Index.sem`([axis, skipna, ddof, split_every, ...])	Return unbiased standard error of the mean over requested axis.
`Index.shape`	Return a tuple representing the dimensionality of the DataFrame.
`Index.shift`([periods, freq])	Shift index by the desired number of periods, with an optional time freq.
`Index.size`	Size of the Series or DataFrame as a Delayed object.
`Index.sub`(other[, level, fill_value, axis])
`Index.to_backend`([backend])	Move to a new DataFrame backend.
`Index.to_bag`([index, format])	Create a Dask Bag from a Series.
`Index.to_csv`(filename, **kwargs)	See the dd.to_csv docstring for more information.
`Index.to_dask_array`([lengths, meta, optimize])	Convert a Dask DataFrame to a Dask array.
`Index.to_delayed`([optimize_graph])	Convert into a list of `dask.delayed` objects, one per partition.
`Index.to_frame`([index, name])	Create a DataFrame with a column containing the Index.
`Index.to_hdf`(path_or_buf, key[, mode, append])	See the dd.to_hdf docstring for more information.
`Index.to_series`([index, name])	Create a Series with both index and values equal to the index keys.
`Index.to_string`([max_rows])	Render a string representation of the Series.
`Index.to_timestamp`([freq, how])	Cast to DatetimeIndex of timestamps, at the beginning of the period.
`Index.truediv`(other[, level, fill_value, axis])
`Index.unique`([split_every, split_out, ...])	Return Series of unique values in the object.
`Index.value_counts`([sort, ascending, ...])	Return a Series containing counts of unique values.
`Index.values`	Return a Dask array of the values of this DataFrame.
`Index.visualize`([tasks])	Visualize the expression or task graph.
`Index.where`(cond[, other])	Replace values where the condition is False.

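Accessors

Similar to pandas, Dask provides dtype-specific methods under various accessors. These are separate namespaces within Series that only apply to specific data types.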

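Datetime Accessor

Methods

`Series.dt.ceil`(*args, **kwargs)	Perform a ceil operation on the data to the specified freq.
`Series.dt.floor`(*args, **kwargs)	Perform a floor operation on the data to the specified freq.
`Series.dt.isocalendar`()	Calculate year, week, and day according to the ISO 8601 standard.
`Series.dt.normalize`(*args, **kwargs)	Convert times to midnight.
`Series.dt.round`(*args, **kwargs)	Perform a round operation on the data to the specified freq.
`Series.dt.strftime`(*args, **kwargs)	Convert to Index using the specified date_format.

Attributes

`Series.dt.date`	Returns a numpy array of Python `datetime.date` objects.
`Series.dt.day`	The day of the datetime.
`Series.dt.dayofweek`	The day of the week, with Monday=0 and Sunday=6.
`Series.dt.dayofyear`	The ordinal day of the year.
`Series.dt.daysinmonth`	The number of days in the month.
`Series.dt.freq`
`Series.dt.hour`	The hours of the datetime.
`Series.dt.microsecond`	The microseconds of the datetime.
`Series.dt.minute`	The minutes of the datetime.
`Series.dt.month`	The month, with January=1 and December=12.
`Series.dt.nanosecond`	The nanoseconds of the datetime.
`Series.dt.quarter`	The quarter of the date.
`Series.dt.second`	The seconds of the datetime.
`Series.dt.time`	Returns a numpy array of `datetime.time` objects.
`Series.dt.timetz`	Returns a numpy array of `datetime.time` objects with timezones.
`Series.dt.tz`	Return the timezone.
`Series.dt.week`	The week ordinal of the year.
`Series.dt.weekday`	The day of the week, with Monday=0 and Sunday=6.
`Series.dt.weekofyear`	The week ordinal of the year.
`Series.dt.year`	The year of the datetime.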

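String Accessor

Methods

`Series.str.capitalize`()	Convert strings in the Series/Index to be capitalized.
`Series.str.casefold`()	Convert strings in the Series/Index to be casefolded.
`Series.str.cat`([others, sep, na_rep])
`Series.str.center`(width[, fillchar])	Pad the left and right sides of strings in the Series/Index.
`Series.str.contains`(pat[, case, flags, na, ...])	Test if a pattern or regex is contained within a string of a Series or Index.
`Series.str.count`(pat[, flags])	Count occurrences of the pattern in each string of the Series/Index.
`Series.str.decode`(encoding[, errors])	Decode strings in the Series/Index using the indicated encoding.
`Series.str.encode`(encoding[, errors])	Encode strings in the Series/Index using the indicated encoding.
`Series.str.endswith`(pat[, na])	Test if the end of each string element matches a pattern.
`Series.str.extract`(pat[, flags, expand])	Extract capture groups in the regex pat as columns in a DataFrame.
`Series.str.extractall`(pat[, flags])	Extract capture groups in the regex pat as columns in a DataFrame.
`Series.str.find`(sub[, start, end])	Return the lowest index of the substring in each string of the Series/Index.
`Series.str.findall`(pat[, flags])	Find all occurrences of a pattern or regular expression in the Series/Index.
`Series.str.fullmatch`(pat[, case, flags, na])	Determine if each string entirely matches a regular expression.
`Series.str.get`(i)	Extract the element from each component at the specified position or with the specified key.
`Series.str.index`(sub[, start, end])	Return the lowest index of the substring in each string of the Series/Index.
`Series.str.isalnum`()	Check whether all characters in each string are alphanumeric.
`Series.str.isalpha`()	Check whether all characters in each string are alphabetic.
`Series.str.isdecimal`()	Check whether all characters in each string are decimal.
`Series.str.isdigit`()	Check whether all characters in each string are digits.
`Series.str.islower`()	Check whether all characters in each string are lowercase.
`Series.str.isnumeric`()	Check whether all characters in each string are numeric.
`Series.str.isspace`()	Check whether all characters in each string are whitespace.
`Series.str.istitle`()	Check whether all characters in each string are titlecase.
`Series.str.isupper`()	Check whether all characters in each string are uppercase.
`Series.str.join`(sep)	Join lists contained as elements in the Series/Index with the passed delimiter.
`Series.str.len`()	Compute the length of each element in the Series/Index.
`Series.str.ljust`(width[, fillchar])	Pad the right side of strings in the Series/Index.
`Series.str.lower`()	Convert strings in the Series/Index to lowercase.
`Series.str.lstrip`([to_strip])	Remove leading characters.
`Series.str.match`(pat[, case, flags, na])	Determine if each string starts with a match of a regular expression.
`Series.str.normalize`(form)	Return the Unicode normal form for the strings in the Series/Index.
`Series.str.pad`(width[, side, fillchar])	Pad strings in the Series/Index up to the given width.
`Series.str.partition`([sep, expand])	Split the string at the first occurrence of sep.
`Series.str.repeat`(repeats)	Duplicate each string in the Series or Index.
`Series.str.replace`(pat, repl[, n, case, ...])	Replace each occurrence of the pattern/regex in the Series/Index.
`Series.str.rfind`(sub[, start, end])	Return the highest index at which the substring occurs in each string of the Series/Index.
`Series.str.rindex`(sub[, start, end])	Return the highest index at which the substring occurs in each string of the Series/Index.
`Series.str.rjust`(width[, fillchar])	Pad the left side of strings in the Series/Index.
`Series.str.rpartition`([sep, expand])	Split the string at the last occurrence of sep.
`Series.str.rsplit`([pat, n, expand])
`Series.str.rstrip`([to_strip])	Remove trailing characters.
`Series.str.slice`([start, stop, step])	Slice substrings from each element in the Series or Index.
`Series.str.split`([pat, n, expand])	Known inconsistency: `expand=True` with unknown `n` will raise a `NotImplementedError`.
`Series.str.startswith`(pat[, na])	Test if the start of each string element matches a pattern.
`Series.str.strip`([to_strip])	Remove leading and trailing characters.
`Series.str.swapcase`()	Convert strings in the Series/Index to be swapcased.
`Series.str.title`()	Convert strings in the Series/Index to titlecase.
`Series.str.translate`(table)	Map all characters in the string through the given mapping table.
`Series.str.upper`()	Convert strings in the Series/Index to uppercase.
`Series.str.wrap`(width, **kwargs)	Wrap strings in the Series/Index at the specified line width.
`Series.str.zfill`(width)	Pad strings in the Series/Index by prepending '0' characters.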

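Categorical Accessor

Methods

`Series.cat.add_categories`(*args, **kwargs)	Add new categories.
`Series.cat.as_known`(**kwargs)	Ensure the categories in this series are known.
`Series.cat.as_ordered`(*args, **kwargs)	Set the Categorical to be ordered.
`Series.cat.as_unknown`()	Ensure the categories in this series are unknown.
`Series.cat.as_unordered`(*args, **kwargs)	Set the Categorical to be unordered.
`Series.cat.remove_categories`(*args, **kwargs)	Remove the specified categories.
`Series.cat.remove_unused_categories`()	Remove categories which are not used.
`Series.cat.rename_categories`(*args, **kwargs)	Rename categories.
`Series.cat.reorder_categories`(*args, **kwargs)	Reorder categories as specified in new_categories.
`Series.cat.set_categories`(*args, **kwargs)	Set the categories to the specified new categories.

Attributes

`Series.cat.categories`	The categories of this categorical.
`Series.cat.codes`	The codes of this categorical.
`Series.cat.known`	Whether the categories are fully known.
`Series.cat.ordered`	Whether the categories have an ordered relationship.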

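Groupby Operations

DataFrame Groupby

`GroupBy.aggregate`([arg, split_every, ...])	Aggregate using one or more specified operations.
`GroupBy.apply`(func, *args[, meta, ...])	Parallel version of pandas GroupBy.apply.
`GroupBy.bfill`([limit, shuffle_method])	Backward fill the values.
`GroupBy.count`(**kwargs)	Compute count of group, excluding missing values.
`GroupBy.cumcount`()	Number each item in each group from 0 to the length of that group - 1.
`GroupBy.cumprod`([numeric_only])	Cumulative product for each group.
`GroupBy.cumsum`([numeric_only])	Cumulative sum for each group.
`GroupBy.ffill`([limit, shuffle_method])	Forward fill the values.
`GroupBy.get_group`(key)	Construct a DataFrame from the group with the provided name.
`GroupBy.max`([numeric_only])	Compute max of group values.
`GroupBy.mean`([numeric_only, split_out])	Compute mean of groups, excluding missing values.
`GroupBy.min`([numeric_only])	Compute min of group values.
`GroupBy.size`(**kwargs)	Compute group sizes.
`GroupBy.std`([ddof, split_every, split_out, ...])	Compute standard deviation of groups, excluding missing values.
`GroupBy.sum`([numeric_only, min_count])	Compute sum of group values.
`GroupBy.var`([ddof, split_every, split_out, ...])	Compute variance of groups, excluding missing values.
`GroupBy.cov`([ddof, split_every, split_out, ...])	Compute pairwise covariance of columns, excluding NA/null values.
`GroupBy.corr`([split_every, split_out, ...])	Compute pairwise correlation of columns, excluding NA/null values.
`GroupBy.first`([numeric_only, sort])	Compute the first entry of each column within each group.
`GroupBy.last`([numeric_only, sort])	Compute the last entry of each column within each group.
`GroupBy.idxmin`([split_every, split_out, ...])	Return index of first occurrence of minimum over requested axis.
`GroupBy.idxmax`([split_every, split_out, ...])	Return index of first occurrence of maximum over requested axis.
`GroupBy.rolling`(window[, min_periods, ...])	Provides rolling transformations.
`GroupBy.transform`(func[, meta, shuffle_method])	Parallel version of pandas GroupBy.transform.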

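Series Groupby

`SeriesGroupBy.aggregate`([arg, split_every, ...])	Aggregate using one or more specified operations.
`SeriesGroupBy.apply`(func, *args[, meta, ...])	Parallel version of pandas GroupBy.apply.
`SeriesGroupBy.bfill`([limit, shuffle_method])	Backward fill the values.
`SeriesGroupBy.count`(**kwargs)	Compute count of group, excluding missing values.
`SeriesGroupBy.cumcount`()	Number each item in each group from 0 to the length of that group - 1.
`SeriesGroupBy.cumprod`([numeric_only])	Cumulative product for each group.
`SeriesGroupBy.cumsum`([numeric_only])	Cumulative sum for each group.
`SeriesGroupBy.ffill`([limit, shuffle_method])	Forward fill the values.
`SeriesGroupBy.get_group`(key)	Construct a DataFrame from the group with the provided name.
`SeriesGroupBy.max`([numeric_only])	Compute max of group values.
`SeriesGroupBy.mean`([numeric_only, split_out])	Compute mean of groups, excluding missing values.
`SeriesGroupBy.min`([numeric_only])	Compute min of group values.
`SeriesGroupBy.nunique`([split_every, ...])	Return number of unique elements in the group.
`SeriesGroupBy.size`(**kwargs)	Compute group sizes.
`SeriesGroupBy.std`([ddof, split_every, ...])	Compute standard deviation of groups, excluding missing values.
`SeriesGroupBy.sum`([numeric_only, min_count])	Compute sum of group values.
`SeriesGroupBy.var`([ddof, split_every, ...])	Compute variance of groups, excluding missing values.
`SeriesGroupBy.first`([numeric_only, sort])	Compute the first entry of each column within each group.
`SeriesGroupBy.last`([numeric_only, sort])	Compute the last entry of each column within each group.
`SeriesGroupBy.idxmin`([split_every, ...])	Return index of first occurrence of minimum over requested axis.
`SeriesGroupBy.idxmax`([split_every, ...])	Return index of first occurrence of maximum over requested axis.
`SeriesGroupBy.rolling`(window[, min_periods, ...])	Provides rolling transformations.
`SeriesGroupBy.transform`(func[, meta, ...])	Parallel version of pandas GroupBy.transform.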

Custom Aggregation

`Aggregation`(name, chunk, agg[, finalize])	User-defined groupby aggregation.

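Rolling Operations

`Series.rolling`(window, **kwargs)	Provides rolling transformations.
`DataFrame.rolling`(window, **kwargs)	Provides rolling transformations.

`Rolling.apply`(func, *args, **kwargs)	Calculate the rolling custom aggregation function.
`Rolling.count`(*args, **kwargs)	Calculate the rolling count of non-NaN observations.
`Rolling.kurt`(*args, **kwargs)	Calculate the rolling Fisher's definition of kurtosis without bias.
`Rolling.max`(*args, **kwargs)	Calculate the rolling maximum.
`Rolling.mean`(*args, **kwargs)	Calculate the rolling mean.
`Rolling.median`(*args, **kwargs)	Calculate the rolling median.
`Rolling.min`(*args, **kwargs)	Calculate the rolling minimum.
`Rolling.quantile`(q, *args, **kwargs)	Calculate the rolling quantile.
`Rolling.skew`(*args, **kwargs)	Calculate the rolling unbiased skewness.
`Rolling.std`(*args, **kwargs)	Calculate the rolling standard deviation.
`Rolling.sum`(*args, **kwargs)	Calculate the rolling sum.
`Rolling.var`(*args, **kwargs)	Calculate the rolling variance.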

Create DataFrames

`read_csv`(urlpath[, blocksize, ...])	Read CSV files into a Dask DataFrame.
`read_table`(urlpath[, blocksize, ...])	Read delimited files into a Dask DataFrame.
`read_fwf`(urlpath[, blocksize, ...])	Read fixed-width files into a Dask DataFrame.
`read_parquet`([path, columns, filters, ...])	Read a Parquet file into a Dask DataFrame.
`read_hdf`(pattern, key[, start, stop, ...])	Read HDF files into a Dask DataFrame.
`read_json`(url_path[, orient, lines, ...])	Create a DataFrame from a set of JSON files.
`read_orc`(path[, engine, columns, index, ...])	Read a DataFrame from ORC file(s).
`read_sql_table`(table_name, con, index_col[, ...])	Read a SQL database table into a DataFrame.
`read_sql_query`(sql, con, index_col[, ...])	Read a SQL query into a DataFrame.
`read_sql`(sql, con, index_col, **kwargs)	Read a SQL query or database table into a DataFrame.
`from_array`(arr[, chunksize, columns, meta])	Read any sliceable array into a Dask DataFrame.
`from_dask_array`(x[, columns, index, meta])	Create a Dask DataFrame from a Dask Array.
`from_delayed`(dfs[, meta, divisions, prefix, ...])	Create a Dask DataFrame from many Dask Delayed objects.
`from_map`(func, *iterables[, args, meta, ...])	Create a DataFrame collection from a custom function map.
`from_pandas`(data[, npartitions, sort, chunksize])	Construct a Dask DataFrame from a pandas DataFrame.
`DataFrame.from_dict`(data, *[, npartitions, ...])	Construct a Dask DataFrame from a Python dictionary.

Store DataFrames

`to_csv`(df, filename[, single_file, ...])	Store a Dask DataFrame to CSV files.
`to_parquet`(df, path[, compression, ...])	Store a Dask DataFrame to Parquet files.
`to_hdf`(df, path, key[, mode, append, ...])	Store a Dask DataFrame to Hierarchical Data Format (HDF) files.
`to_records`(df)	Create a Dask Array from a Dask DataFrame.
`to_sql`(df, name, uri[, schema, if_exists, ...])	Store a Dask DataFrame to a SQL table.
`to_json`(df, url_path[, orient, lines, ...])	Write a DataFrame into JSON text files.
`to_orc`(df, path[, engine, write_index, ...])	Store a Dask DataFrame to ORC files.

Convert DataFrames

`DataFrame.to_bag`([index, format])	Create a Dask Bag from a Series.
`DataFrame.to_dask_array`([lengths, meta, ...])	Convert a Dask DataFrame to a Dask array.
`DataFrame.to_delayed`([optimize_graph])	Convert into a list of `dask.delayed` objects, one per partition.

Reshape DataFrames

`get_dummies`(data[, prefix, prefix_sep, ...])	Convert a categorical variable into dummy/indicator variables.
`pivot_table`(df, index, columns, values[, ...])	Create a spreadsheet-style pivot table as a DataFrame.
`melt`(frame[, id_vars, value_vars, var_name, ...])

Concatenate DataFrames

`DataFrame.merge`(right[, how, on, left_on, ...])	Merge the DataFrame with another DataFrame.
`concat`(dfs[, axis, join, ...])	Concatenate DataFrames along rows.
`merge`(left, right[, how, on, left_on, ...])	Merge DataFrame or named Series objects with a database-style join.
`merge_asof`(left, right[, on, left_on, ...])	Perform a merge by key distance.

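Resampling

`Resampler`(obj, rule, **kwargs)	Aggregate using one or more operations.
`Resampler.agg`(func, *args, **kwargs)	Aggregate using one or more operations over the specified axis.
`Resampler.count`()	Compute count of group, excluding missing values.
`Resampler.first`()	Compute the first entry of each column within each group.
`Resampler.last`()	Compute the last entry of each column within each group.
`Resampler.max`()	Compute max value of group.
`Resampler.mean`()	Compute mean of groups, excluding missing values.
`Resampler.median`()	Compute median of groups, excluding missing values.
`Resampler.min`()	Compute min value of group.
`Resampler.nunique`()	Return number of unique elements in the group.
`Resampler.ohlc`()	Compute open, high, low, and close values of a group, excluding missing values.
`Resampler.prod`()	Compute product of group values.
`Resampler.quantile`()	Return the value at the given quantile.
`Resampler.sem`()	Compute standard error of the mean of groups, excluding missing values.
`Resampler.size`()	Compute group sizes.
`Resampler.std`()	Compute standard deviation of groups, excluding missing values.
`Resampler.sum`()	Compute sum of group values.
`Resampler.var`()	Compute variance of groups, excluding missing values.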

Dask Metadata

`make_meta`(x[, index, parent_meta])	Creates metadata based on the type of `x`, and on `parent_meta` if supplied.

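Query Planning and Optimization

`DataFrame.explain`([stage, format])	Create a graph representation of the Expression.
`DataFrame.visualize`([tasks])	Visualize the expression or task graph.
`DataFrame.analyze`([filename, format])	Output statistics about every node in the expression.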

Other functions

`compute`(*args[, traverse, optimize_graph, ...])	Compute several dask collections at once.
`map_partitions`(func, *args[, meta, ...])	Apply a Python function to each DataFrame partition.
`map_overlap`(func, df, before, after, *args)	Apply a function to each partition, sharing rows with adjacent partitions.
`to_datetime`()	Convert argument to datetime.
`to_numeric`(arg[, errors, downcast, meta])	Convert argument to a numeric type.
`to_timedelta`()	Convert argument to timedelta.
