Dask DataFrame API with Logical Query Planning

DataFrame

`DataFrame`(expr)	DataFrame-like Expr Collection.
`DataFrame.abs`()	Return a Series/DataFrame with the absolute numeric value of each element.
`DataFrame.add`(other[, axis, level, fill_value])
`DataFrame.align`(other[, join, axis, fill_value])	Align two objects on their axes with the specified join method.
`DataFrame.all`([axis, skipna, split_every])	Return whether all elements are True, potentially over an axis.
`DataFrame.any`([axis, skipna, split_every])	Return whether any element is True, potentially over an axis.
`DataFrame.apply`(function, *args[, meta, axis])	Parallel version of pandas.DataFrame.apply
`DataFrame.assign`(**pairs)	Assign new columns to a DataFrame.
`DataFrame.astype`(dtypes)	Cast a pandas object to a specified dtype `dtype`.
`DataFrame.bfill`([axis, limit])	Fill NA/NaN values by using the next valid observation to fill the gap.
`DataFrame.categorize`([columns, index, ...])	Convert columns of the DataFrame to category dtype.
`DataFrame.columns`
`DataFrame.compute`(**kwargs)	Compute this dask collection
`DataFrame.copy`([deep])	Make a copy of the dataframe
`DataFrame.corr`([method, min_periods, ...])	Compute pairwise correlation of columns, excluding NA/null values.
`DataFrame.count`([axis, numeric_only, ...])	Count non-NA cells for each column or row.
`DataFrame.cov`([min_periods, numeric_only, ...])	Compute pairwise covariance of columns, excluding NA/null values.
`DataFrame.cummax`([axis, skipna])	Return cumulative maximum over a DataFrame or Series axis.
`DataFrame.cummin`([axis, skipna])	Return cumulative minimum over a DataFrame or Series axis.
`DataFrame.cumprod`([axis, skipna])	Return cumulative product over a DataFrame or Series axis.
`DataFrame.cumsum`([axis, skipna])	Return cumulative sum over a DataFrame or Series axis.
`DataFrame.describe`([split_every, ...])	Generate descriptive statistics.
`DataFrame.diff`([periods, axis])	First discrete difference of element.
`DataFrame.div`(other[, axis, level, fill_value])
`DataFrame.divide`(other[, axis, level, ...])
`DataFrame.divisions`	A tuple of `npartitions + 1` values, in ascending order, marking the lower/upper bounds of each partition's index.
`DataFrame.drop`([labels, axis, columns, errors])	Drop specified labels from rows or columns.
`DataFrame.drop_duplicates`([subset, ...])	Return DataFrame with duplicate rows removed.
`DataFrame.dropna`([how, subset, thresh])	Remove missing values.
`DataFrame.dtypes`	Return data types
`DataFrame.eq`(other[, level, axis])
`DataFrame.eval`(expr, **kwargs)	Evaluate a string describing operations on DataFrame columns.
`DataFrame.explode`(column)	Transform each element of a list-like to a row, replicating index values.
`DataFrame.ffill`([axis, limit])	Fill NA/NaN values by propagating the last valid observation to the next valid position.
`DataFrame.fillna`([value, axis])	Fill NA/NaN values using the specified method.
`DataFrame.floordiv`(other[, axis, level, ...])
`DataFrame.ge`(other[, level, axis])
`DataFrame.get_partition`(n)	Get a dask DataFrame/Series representing the nth partition.
`DataFrame.groupby`(by[, group_keys, sort, ...])	Group DataFrame using a mapper or by a Series of columns.
`DataFrame.gt`(other[, level, axis])
`DataFrame.head`([n, npartitions, compute])	First n rows of the dataset
`DataFrame.idxmax`([axis, skipna, ...])	Return index of first occurrence of maximum over requested axis.
`DataFrame.idxmin`([axis, skipna, ...])	Return index of first occurrence of minimum over requested axis.
`DataFrame.iloc`	Purely integer-location based indexing for selection by position.
`DataFrame.index`	Return dask Index instance
`DataFrame.info`([buf, verbose, memory_usage])	Concise summary of a Dask DataFrame
`DataFrame.isin`(values)	Whether each element in the DataFrame is contained in values.
`DataFrame.isna`()	Detect missing values.
`DataFrame.isnull`()	DataFrame.isnull is an alias for DataFrame.isna.
`DataFrame.items`()	Iterate over (column name, Series) pairs.
`DataFrame.iterrows`()	Iterate over DataFrame rows as (index, Series) pairs.
`DataFrame.itertuples`([index, name])	Iterate over DataFrame rows as namedtuples.
`DataFrame.join`(other[, on, how, lsuffix, ...])	Join columns of another DataFrame.
`DataFrame.known_divisions`	Whether the divisions are known.
`DataFrame.le`(other[, level, axis])
`DataFrame.loc`	Purely label-location based indexer for selection by label.
`DataFrame.lt`(other[, level, axis])
`DataFrame.map_partitions`(func, *args[, ...])	Apply a Python function to each partition
`DataFrame.mask`(cond[, other])	Replace values where the condition is True.
`DataFrame.max`([axis, skipna, numeric_only, ...])	Return the maximum of the values over the requested axis.
`DataFrame.mean`([axis, skipna, numeric_only, ...])	Return the mean of the values over the requested axis.
`DataFrame.median`([axis, numeric_only])	Return the median of the values over the requested axis.
`DataFrame.median_approximate`([axis, method, ...])	Return the approximate median of the values over the requested axis.
`DataFrame.melt`([id_vars, value_vars, ...])	Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
`DataFrame.memory_usage`([deep, index])	Return the memory usage of each column in bytes.
`DataFrame.memory_usage_per_partition`([...])	Return the memory usage of each partition
`DataFrame.merge`(right[, how, on, left_on, ...])	Merge the DataFrame with another DataFrame
`DataFrame.min`([axis, skipna, numeric_only, ...])	Return the minimum of the values over the requested axis.
`DataFrame.mod`(other[, axis, level, fill_value])
`DataFrame.mode`([dropna, split_every, ...])	Get the mode(s) of each element along the selected axis.
`DataFrame.mul`(other[, axis, level, fill_value])
`DataFrame.ndim`	Return dimensionality
`DataFrame.ne`(other[, level, axis])
`DataFrame.nlargest`([n, columns, split_every])	Return the first n rows ordered by columns in descending order.
`DataFrame.npartitions`	Return number of partitions
`DataFrame.nsmallest`([n, columns, split_every])	Return the first n rows ordered by columns in ascending order.
`DataFrame.partitions`	Slice dataframe by partitions
`DataFrame.persist`([fuse])	Persist this dask collection into memory
`DataFrame.pivot_table`(index, columns, values)	Create a spreadsheet-style pivot table as a DataFrame.
`DataFrame.pop`(item)	Return item and drop from frame.
`DataFrame.pow`(other[, axis, level, fill_value])
`DataFrame.prod`([axis, skipna, numeric_only, ...])	Return the product of the values over the requested axis.
`DataFrame.quantile`([q, axis, numeric_only, ...])	Approximate row-wise and precise column-wise quantiles of DataFrame
`DataFrame.query`(expr, **kwargs)	Filter dataframe with complex expression
`DataFrame.radd`(other[, axis, level, fill_value])
`DataFrame.random_split`(frac[, random_state, ...])	Pseudorandomly split dataframe into different pieces row-wise
`DataFrame.rdiv`(other[, axis, level, fill_value])
`DataFrame.rename`([index, columns])	Rename columns or index labels.
`DataFrame.rename_axis`([mapper, index, ...])	Set the name of the axis for the index or columns.
`DataFrame.repartition`([divisions, ...])	Repartition a collection
`DataFrame.replace`([to_replace, value, regex])	Replace values given in to_replace with value.
`DataFrame.resample`(rule[, closed, label])	Resample time-series data.
`DataFrame.reset_index`([drop])	Reset the index to the default index.
`DataFrame.rfloordiv`(other[, axis, level, ...])
`DataFrame.rmod`(other[, axis, level, fill_value])
`DataFrame.rmul`(other[, axis, level, fill_value])
`DataFrame.round`([decimals])	Round a DataFrame to a variable number of decimal places.
`DataFrame.rpow`(other[, axis, level, fill_value])
`DataFrame.rsub`(other[, axis, level, fill_value])
`DataFrame.rtruediv`(other[, axis, level, ...])
`DataFrame.sample`([n, frac, replace, ...])	Random sample of items
`DataFrame.select_dtypes`([include, exclude])	Return a subset of the DataFrame's columns based on the column dtypes.
`DataFrame.sem`([axis, skipna, ddof, ...])	Return unbiased standard error of the mean over requested axis.
`DataFrame.set_index`(other[, drop, sorted, ...])	Set the DataFrame index (row labels) using an existing column.
`DataFrame.shape`
`DataFrame.shuffle`([on, ignore_index, ...])	Rearrange DataFrame into new partitions
`DataFrame.size`	Size of the Series or DataFrame as a Delayed object.
`DataFrame.sort_values`(by[, npartitions, ...])	Sort the dataset by a single column.
`DataFrame.squeeze`([axis])	Squeeze 1 dimensional axis objects into scalars.
`DataFrame.std`([axis, skipna, ddof, ...])	Return sample standard deviation over requested axis.
`DataFrame.sub`(other[, axis, level, fill_value])
`DataFrame.sum`([axis, skipna, numeric_only, ...])	Return the sum of the values over the requested axis.
`DataFrame.tail`([n, compute])	Last n rows of the dataset
`DataFrame.to_backend`([backend])	Move to a new DataFrame backend
`DataFrame.to_bag`([index, format])	Create a Dask Bag from a Series
`DataFrame.to_csv`(filename, **kwargs)	See dd.to_csv docstring for more information
`DataFrame.to_dask_array`([lengths, meta, ...])	Convert a dask DataFrame to a dask array.
`DataFrame.to_delayed`([optimize_graph])	Convert into a list of `dask.delayed` objects, one per partition.
`DataFrame.to_hdf`(path_or_buf, key[, mode, ...])	See dd.to_hdf docstring for more information
`DataFrame.to_html`([max_rows])	Render a DataFrame as an HTML table.
`DataFrame.to_json`(filename, *args, **kwargs)	See dd.to_json docstring for more information
`DataFrame.to_orc`(path, *args, **kwargs)	See dd.to_orc docstring for more information
`DataFrame.to_parquet`(path, **kwargs)
`DataFrame.to_records`([index, lengths])
`DataFrame.to_string`([max_rows])	Render a DataFrame to a console-friendly tabular output.
`DataFrame.to_sql`(name, uri[, schema, ...])
`DataFrame.to_timestamp`([freq, how])	Cast to DatetimeIndex of timestamps, at beginning of period.
`DataFrame.truediv`(other[, axis, level, ...])
`DataFrame.values`	Return a dask.array of the values of this dataframe
`DataFrame.var`([axis, skipna, ddof, ...])	Return unbiased variance over requested axis.
`DataFrame.visualize`([tasks])	Visualize the expression or task graph
`DataFrame.where`(cond[, other])	Replace values where the condition is False.
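
The `DataFrame.divisions` entry above describes a tuple of `npartitions + 1` ascending boundary values. The sketch below is a plain-Python model of how such a tuple can map an index label to a partition number. It is not Dask's implementation: the helper name `partition_for` and the sample boundaries are hypothetical, and it assumes each boundary is the inclusive lower bound of its partition, with the last value acting as an inclusive upper bound for the final partition.

```python
from bisect import bisect_right


def partition_for(divisions, key):
    """Return the number of the partition whose index range contains key.

    Hypothetical helper: models divisions as inclusive lower bounds, with the
    last value being the inclusive upper bound of the final partition.
    """
    npartitions = len(divisions) - 1
    if not divisions[0] <= key <= divisions[-1]:
        raise KeyError(key)
    # bisect_right counts the boundaries <= key; clamp so the final
    # boundary still maps to the last partition.
    return min(bisect_right(divisions, key) - 1, npartitions - 1)


divisions = (0, 10, 20, 30)  # assumed example: 3 partitions
print(partition_for(divisions, 0))   # 0
print(partition_for(divisions, 10))  # 1
print(partition_for(divisions, 30))  # 2
```

Known divisions are what make this kind of lookup possible without scanning every partition; when `known_divisions` is False, no such shortcut is available.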

Series

`Series`(expr)	Series-like Expr Collection.
`Series.add`(other[, level, fill_value, axis])
`Series.align`(other[, join, axis, fill_value])	Align two objects on their axes with the specified join method.
`Series.all`([axis, skipna, split_every])	Return whether all elements are True, potentially over an axis.
`Series.any`([axis, skipna, split_every])	Return whether any element is True, potentially over an axis.
`Series.apply`(function, *args[, meta, axis])	Parallel version of pandas.Series.apply
`Series.astype`(dtypes)	Cast a pandas object to a specified dtype `dtype`.
`Series.autocorr`([lag, split_every])	Compute the lag-N autocorrelation.
`Series.between`(left, right[, inclusive])	Return boolean Series equivalent to left <= series <= right.
`Series.bfill`([axis, limit])	Fill NA/NaN values by using the next valid observation to fill the gap.
`Series.clear_divisions`()	Forget division information.
`Series.clip`([lower, upper, axis])	Trim values at input threshold(s).
`Series.compute`(**kwargs)	Compute this dask collection
`Series.copy`([deep])	Make a copy of the dataframe
`Series.corr`(other[, method, min_periods, ...])	Compute correlation with other Series, excluding missing values.
`Series.count`([axis, numeric_only, split_every])	Count non-NA cells for each column or row.
`Series.cov`(other[, min_periods, split_every])	Compute covariance with Series, excluding missing values.
`Series.cummax`([axis, skipna])	Return cumulative maximum over a DataFrame or Series axis.
`Series.cummin`([axis, skipna])	Return cumulative minimum over a DataFrame or Series axis.
`Series.cumprod`([axis, skipna])	Return cumulative product over a DataFrame or Series axis.
`Series.cumsum`([axis, skipna])	Return cumulative sum over a DataFrame or Series axis.
`Series.describe`([split_every, percentiles, ...])	Generate descriptive statistics.
`Series.diff`([periods, axis])	First discrete difference of element.
`Series.div`(other[, level, fill_value, axis])
`Series.drop_duplicates`([ignore_index, ...])
`Series.dropna`()	Return a new Series with missing values removed.
`Series.dtype`
`Series.eq`(other[, level, fill_value, axis])
`Series.explode`()	Transform each element of a list-like to a row.
`Series.ffill`([axis, limit])	Fill NA/NaN values by propagating the last valid observation to the next valid position.
`Series.fillna`([value, axis])	Fill NA/NaN values using the specified method.
`Series.floordiv`(other[, level, fill_value, axis])
`Series.ge`(other[, level, fill_value, axis])
`Series.get_partition`(n)	Get a dask DataFrame/Series representing the nth partition.
`Series.groupby`(by, **kwargs)	Group Series using a mapper or by a Series of columns.
`Series.gt`(other[, level, fill_value, axis])
`Series.head`([n, npartitions, compute])	First n rows of the dataset
`Series.idxmax`([axis, skipna, numeric_only, ...])	Return index of first occurrence of maximum over requested axis.
`Series.idxmin`([axis, skipna, numeric_only, ...])	Return index of first occurrence of minimum over requested axis.
`Series.isin`(values)	Whether each element in the DataFrame is contained in values.
`Series.isna`()	Detect missing values.
`Series.isnull`()	DataFrame.isnull is an alias for DataFrame.isna.
`Series.known_divisions`	Whether the divisions are known.
`Series.le`(other[, level, fill_value, axis])
`Series.loc`	Purely label-location based indexer for selection by label.
`Series.lt`(other[, level, fill_value, axis])
`Series.map`(arg[, na_action, meta])	Map values of Series according to an input mapping or function.
`Series.map_overlap`(func, before, after, *args)	Apply a function to each partition, sharing rows with adjacent partitions.
`Series.map_partitions`(func, *args[, meta, ...])	Apply a Python function to each partition
`Series.mask`(cond[, other])	Replace values where the condition is True.
`Series.max`([axis, skipna, numeric_only, ...])	Return the maximum of the values over the requested axis.
`Series.mean`([axis, skipna, numeric_only, ...])	Return the mean of the values over the requested axis.
`Series.median`()	Return the median of the values over the requested axis.
`Series.median_approximate`([method])	Return the approximate median of the values over the requested axis.
`Series.memory_usage`([deep, index])	Return the memory usage of the Series.
`Series.memory_usage_per_partition`([index, deep])	Return the memory usage of each partition
`Series.min`([axis, skipna, numeric_only, ...])	Return the minimum of the values over the requested axis.
`Series.mod`(other[, level, fill_value, axis])
`Series.mul`(other[, level, fill_value, axis])
`Series.nbytes`	Number of bytes
`Series.ndim`	Return dimensionality
`Series.ne`(other[, level, fill_value, axis])
`Series.nlargest`([n, split_every])	Return the largest n elements.
`Series.notnull`()	DataFrame.notnull is an alias for DataFrame.notna.
`Series.nsmallest`([n, split_every])	Return the smallest n elements.
`Series.nunique`([dropna, split_every, split_out])	Return number of unique elements in the object.
`Series.nunique_approx`([split_every])	Approximate number of unique rows.
`Series.persist`([fuse])	Persist this dask collection into memory
`Series.pipe`(func, *args, **kwargs)	Apply chainable functions that expect Series or DataFrames.
`Series.pow`(other[, level, fill_value, axis])
`Series.prod`([axis, skipna, numeric_only, ...])	Return the product of the values over the requested axis.
`Series.quantile`([q, method])	Approximate quantiles of Series
`Series.radd`(other[, level, fill_value, axis])
`Series.random_split`(frac[, random_state, ...])	Pseudorandomly split dataframe into different pieces row-wise
`Series.rdiv`(other[, level, fill_value, axis])
`Series.repartition`([divisions, npartitions, ...])	Repartition a collection
`Series.replace`([to_replace, value, regex])	Replace values given in to_replace with value.
`Series.rename`(index[, sorted_index])	Alter Series index labels or name
`Series.resample`(rule[, closed, label])	Resample time-series data.
`Series.reset_index`([drop])	Reset the index to the default index.
`Series.rolling`(window, **kwargs)	Provides rolling transformations.
`Series.round`([decimals])	Round a DataFrame to a variable number of decimal places.
`Series.sample`([n, frac, replace, random_state])	Random sample of items
`Series.sem`([axis, skipna, ddof, ...])	Return unbiased standard error of the mean over requested axis.
`Series.shape`	Return a tuple representing the dimensionality of the DataFrame.
`Series.shift`([periods, freq, axis])	Shift index by desired number of periods with an optional time freq.
`Series.size`	Size of the Series or DataFrame as a Delayed object.
`Series.std`([axis, skipna, ddof, ...])	Return sample standard deviation over requested axis.
`Series.sub`(other[, level, fill_value, axis])
`Series.sum`([axis, skipna, numeric_only, ...])	Return the sum of the values over the requested axis.
`Series.to_backend`([backend])	Move to a new DataFrame backend
`Series.to_bag`([index, format])	Create a Dask Bag from a Series
`Series.to_csv`(filename, **kwargs)	See dd.to_csv docstring for more information
`Series.to_dask_array`([lengths, meta, optimize])	Convert a dask DataFrame to a dask array.
`Series.to_delayed`([optimize_graph])	Convert into a list of `dask.delayed` objects, one per partition.
`Series.to_frame`([name])	Convert Series to DataFrame.
`Series.to_hdf`(path_or_buf, key[, mode, append])	See dd.to_hdf docstring for more information
`Series.to_string`([max_rows])	Render a string representation of the Series.
`Series.to_timestamp`([freq, how])	Cast to DatetimeIndex of timestamps, at beginning of period.
`Series.truediv`(other[, level, fill_value, axis])
`Series.unique`([split_every, split_out, ...])	Return Series of unique values in the object.
`Series.value_counts`([sort, ascending, ...])	Return a Series containing counts of unique values.
`Series.values`	Return a dask.array of the values of this dataframe
`Series.var`([axis, skipna, ddof, ...])	Return unbiased variance over requested axis.
`Series.visualize`([tasks])	Visualize the expression or task graph
`Series.where`(cond[, other])	Replace values where the condition is False.

Index

`Index`(expr)	Index-like Expr Collection.
`Index.add`(other[, level, fill_value, axis])
`Index.align`(other[, join, axis, fill_value])	Align two objects on their axes with the specified join method.
`Index.all`([axis, skipna, split_every])	Return whether all elements are True, potentially over an axis.
`Index.any`([axis, skipna, split_every])	Return whether any element is True, potentially over an axis.
`Index.apply`(function, *args[, meta, axis])	Parallel version of pandas.Series.apply
`Index.astype`(dtypes)	Cast a pandas object to a specified dtype `dtype`.
`Index.autocorr`([lag, split_every])	Compute the lag-N autocorrelation.
`Index.between`(left, right[, inclusive])	Return boolean Series equivalent to left <= series <= right.
`Index.bfill`([axis, limit])	Fill NA/NaN values by using the next valid observation to fill the gap.
`Index.clear_divisions`()	Forget division information.
`Index.clip`([lower, upper, axis])	Trim values at input threshold(s).
`Index.compute`(**kwargs)	Compute this dask collection
`Index.copy`([deep])	Make a copy of the dataframe
`Index.corr`(other[, method, min_periods, ...])	Compute correlation with other Series, excluding missing values.
`Index.count`([split_every])	Count non-NA cells for each column or row.
`Index.cov`(other[, min_periods, split_every])	Compute covariance with Series, excluding missing values.
`Index.cummax`([axis, skipna])	Return cumulative maximum over a DataFrame or Series axis.
`Index.cummin`([axis, skipna])	Return cumulative minimum over a DataFrame or Series axis.
`Index.cumprod`([axis, skipna])	Return cumulative product over a DataFrame or Series axis.
`Index.cumsum`([axis, skipna])	Return cumulative sum over a DataFrame or Series axis.
`Index.describe`([split_every, percentiles, ...])	Generate descriptive statistics.
`Index.diff`([periods, axis])	First discrete difference of element.
`Index.div`(other[, level, fill_value, axis])
`Index.drop_duplicates`([ignore_index, ...])
`Index.dropna`()	Return a new Series with missing values removed.
`Index.dtype`
`Index.eq`(other[, level, fill_value, axis])
`Index.explode`()	Transform each element of a list-like to a row.
`Index.ffill`([axis, limit])	Fill NA/NaN values by propagating the last valid observation to the next valid position.
`Index.fillna`([value, axis])	Fill NA/NaN values using the specified method.
`Index.floordiv`(other[, level, fill_value, axis])
`Index.ge`(other[, level, fill_value, axis])
`Index.get_partition`(n)	Get a dask DataFrame/Series representing the nth partition.
`Index.groupby`(by, **kwargs)	Group Series using a mapper or by a Series of columns.
`Index.gt`(other[, level, fill_value, axis])
`Index.head`([n, npartitions, compute])	First n rows of the dataset
`Index.is_monotonic_decreasing`	Return boolean if values in the object are monotonically decreasing.
`Index.is_monotonic_increasing`	Return boolean if values in the object are monotonically increasing.
`Index.isin`(values)	Whether each element in the DataFrame is contained in values.
`Index.isna`()	Detect missing values.
`Index.isnull`()	DataFrame.isnull is an alias for DataFrame.isna.
`Index.known_divisions`	Whether the divisions are known.
`Index.le`(other[, level, fill_value, axis])
`Index.loc`	Purely label-location based indexer for selection by label.
`Index.lt`(other[, level, fill_value, axis])
`Index.map`(arg[, na_action, meta, is_monotonic])	Map values using an input mapping or function.
`Index.map_overlap`(func, before, after, *args)	Apply a function to each partition, sharing rows with adjacent partitions.
`Index.map_partitions`(func, *args[, meta, ...])	Apply a Python function to each partition
`Index.mask`(cond[, other])	Replace values where the condition is True.
`Index.max`([axis, skipna, numeric_only, ...])	Return the maximum of the values over the requested axis.
`Index.median`()	Return the median of the values over the requested axis.
`Index.median_approximate`([method])	Return the approximate median of the values over the requested axis.
`Index.memory_usage`([deep])	Memory usage of the values.
`Index.memory_usage_per_partition`([index, deep])	Return the memory usage of each partition
`Index.min`([axis, skipna, numeric_only, ...])	Return the minimum of the values over the requested axis.
`Index.mod`(other[, level, fill_value, axis])
`Index.mul`(other[, level, fill_value, axis])
`Index.nbytes`	Number of bytes
`Index.ndim`	Return dimensionality
`Index.ne`(other[, level, fill_value, axis])
`Index.nlargest`([n, split_every])	Return the largest n elements.
`Index.notnull`()	DataFrame.notnull is an alias for DataFrame.notna.
`Index.nsmallest`([n, split_every])	Return the smallest n elements.
`Index.nunique`([dropna, split_every, split_out])	Return number of unique elements in the object.
`Index.nunique_approx`([split_every])	Approximate number of unique rows.
`Index.persist`([fuse])	Persist this dask collection into memory
`Index.pipe`(func, *args, **kwargs)	Apply chainable functions that expect Series or DataFrames.
`Index.pow`(other[, level, fill_value, axis])
`Index.quantile`([q, method])	Approximate quantiles of Series
`Index.radd`(other[, level, fill_value, axis])
`Index.random_split`(frac[, random_state, shuffle])	Pseudorandomly split dataframe into different pieces row-wise
`Index.rdiv`(other[, level, fill_value, axis])
`Index.rename`(index[, sorted_index])	Alter Series index labels or name
`Index.repartition`([divisions, npartitions, ...])	Repartition a collection
`Index.replace`([to_replace, value, regex])	Replace values given in to_replace with value.
`Index.resample`(rule[, closed, label])	Resample time-series data.
`Index.reset_index`([drop])	Reset the index to the default index.
`Index.rolling`(window, **kwargs)	Provides rolling transformations.
`Index.round`([decimals])	Round a DataFrame to a variable number of decimal places.
`Index.sample`([n, frac, replace, random_state])	Random sample of items
`Index.sem`([axis, skipna, ddof, split_every, ...])	Return unbiased standard error of the mean over requested axis.
`Index.shape`	Return a tuple representing the dimensionality of the DataFrame.
`Index.shift`([periods, freq])	Shift index by desired number of periods with an optional time freq.
`Index.size`	Size of the Series or DataFrame as a Delayed object.
`Index.sub`(other[, level, fill_value, axis])
`Index.to_backend`([backend])	Move to a new DataFrame backend
`Index.to_bag`([index, format])	Create a Dask Bag from a Series
`Index.to_csv`(filename, **kwargs)	See dd.to_csv docstring for more information
`Index.to_dask_array`([lengths, meta, optimize])	Convert a dask DataFrame to a dask array.
`Index.to_delayed`([optimize_graph])	Convert into a list of `dask.delayed` objects, one per partition.
`Index.to_frame`([index, name])	Create a DataFrame with a column containing the Index.
`Index.to_hdf`(path_or_buf, key[, mode, append])	See dd.to_hdf docstring for more information
`Index.to_series`([index, name])	Create a Series with both index and values equal to the index keys.
`Index.to_string`([max_rows])	Render a string representation of the Series.
`Index.to_timestamp`([freq, how])	Cast to DatetimeIndex of timestamps, at beginning of period.
`Index.truediv`(other[, level, fill_value, axis])
`Index.unique`([split_every, split_out, ...])	Return Series of unique values in the object.
`Index.value_counts`([sort, ascending, ...])	Return a Series containing counts of unique values.
`Index.values`	Return a dask.array of the values of this dataframe
`Index.visualize`([tasks])	Visualize the expression or task graph
`Index.where`(cond[, other])	Replace values where the condition is False.

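Accessors

Similar to pandas, Dask provides dtype-specific methods under various accessors. These are separate namespaces within Series that only apply to specific data types.

Datetime Accessor

Methods

`Series.dt.ceil`(*args, **kwargs)	Perform ceil operation on the data to the specified freq.
`Series.dt.floor`(*args, **kwargs)	Perform floor operation on the data to the specified freq.
`Series.dt.isocalendar`()	Calculate year, week, and day according to the ISO 8601 standard.
`Series.dt.normalize`(*args, **kwargs)	Convert times to midnight.
`Series.dt.round`(*args, **kwargs)	Perform round operation on the data to the specified freq.
`Series.dt.strftime`(*args, **kwargs)	Convert to Index using specified date_format.

Attributes

`Series.dt.date`	Returns numpy array of python `datetime.date` objects.
`Series.dt.day`	The day of the datetime.
`Series.dt.dayofweek`	The day of the week with Monday=0, Sunday=6.
`Series.dt.dayofyear`	The ordinal day of the year.
`Series.dt.daysinmonth`	The number of days in the month.
`Series.dt.freq`
`Series.dt.hour`	The hours of the datetime.
`Series.dt.microsecond`	The microseconds of the datetime.
`Series.dt.minute`	The minutes of the datetime.
`Series.dt.month`	The month as January=1, December=12.
`Series.dt.nanosecond`	The nanoseconds of the datetime.
`Series.dt.quarter`	The quarter of the date.
`Series.dt.second`	The seconds of the datetime.
`Series.dt.time`	Returns numpy array of `datetime.time` objects.
`Series.dt.timetz`	Returns numpy array of `datetime.time` objects with timezones.
`Series.dt.tz`	Return the timezone.
`Series.dt.week`	The week ordinal of the year.
`Series.dt.weekday`	The day of the week with Monday=0, Sunday=6.
`Series.dt.weekofyear`	The week ordinal of the year.
`Series.dt.year`	The year of the datetime.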

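String Accessor

Methods

`Series.str.capitalize`()	Convert strings in the Series/Index to be capitalized.
`Series.str.casefold`()	Convert strings in the Series/Index to be casefolded.
`Series.str.cat`([others, sep, na_rep])
`Series.str.center`(width[, fillchar])	Pad left and right side of strings in the Series/Index.
`Series.str.contains`(pat[, case, flags, na, ...])	Test if pattern or regex is contained within a string of a Series or Index.
`Series.str.count`(pat[, flags])	Count occurrences of pattern in each string of the Series/Index.
`Series.str.decode`(encoding[, errors])	Decode character string in the Series/Index using indicated encoding.
`Series.str.encode`(encoding[, errors])	Encode character string in the Series/Index using indicated encoding.
`Series.str.endswith`(pat[, na])	Test if the end of each string element matches a pattern.
`Series.str.extract`(pat[, flags, expand])	Extract capture groups in the regex pat as columns in a DataFrame.
`Series.str.extractall`(pat[, flags])	Extract capture groups in the regex pat as columns in a DataFrame.
`Series.str.find`(sub[, start, end])	Return lowest indexes in each string in the Series/Index.
`Series.str.findall`(pat[, flags])	Find all occurrences of pattern or regular expression in the Series/Index.
`Series.str.fullmatch`(pat[, case, flags, na])	Determine if each string entirely matches a regular expression.
`Series.str.get`(i)	Extract element from each component at specified position or with specified key.
`Series.str.index`(sub[, start, end])	Return lowest indexes in each string in the Series/Index.
`Series.str.isalnum`()	Check whether all characters in each string are alphanumeric.
`Series.str.isalpha`()	Check whether all characters in each string are alphabetic.
`Series.str.isdecimal`()	Check whether all characters in each string are decimal.
`Series.str.isdigit`()	Check whether all characters in each string are digits.
`Series.str.islower`()	Check whether all characters in each string are lowercase.
`Series.str.isnumeric`()	Check whether all characters in each string are numeric.
`Series.str.isspace`()	Check whether all characters in each string are whitespace.
`Series.str.istitle`()	Check whether all characters in each string are titlecase.
`Series.str.isupper`()	Check whether all characters in each string are uppercase.
`Series.str.join`(sep)	Join lists contained as elements in the Series/Index with passed delimiter.
`Series.str.len`()	Compute the length of each element in the Series/Index.
`Series.str.ljust`(width[, fillchar])	Pad right side of strings in the Series/Index.
`Series.str.lower`()	Convert strings in the Series/Index to lowercase.
`Series.str.lstrip`([to_strip])	Remove leading characters.
`Series.str.match`(pat[, case, flags, na])	Determine if each string starts with a match of a regular expression.
`Series.str.normalize`(form)	Return the Unicode normal form for the strings in the Series/Index.
`Series.str.pad`(width[, side, fillchar])	Pad strings in the Series/Index up to width.
`Series.str.partition`([sep, expand])	Split the string at the first occurrence of sep.
`Series.str.repeat`(repeats)	Duplicate each string in the Series or Index.
`Series.str.replace`(pat, repl[, n, case, ...])	Replace each occurrence of pattern/regex in the Series/Index.
`Series.str.rfind`(sub[, start, end])	Return highest indexes in each string in the Series/Index.
`Series.str.rindex`(sub[, start, end])	Return highest indexes in each string in the Series/Index.
`Series.str.rjust`(width[, fillchar])	Pad left side of strings in the Series/Index.
`Series.str.rpartition`([sep, expand])	Split the string at the last occurrence of sep.
`Series.str.rsplit`([pat, n, expand])
`Series.str.rstrip`([to_strip])	Remove trailing characters.
`Series.str.slice`([start, stop, step])	Slice substrings from each element in the Series or Index.
`Series.str.split`([pat, n, expand])	Known inconsistencies: `expand=True` with unknown `n` will raise a `NotImplementedError`.
`Series.str.startswith`(pat[, na])	Test if the start of each string element matches a pattern.
`Series.str.strip`([to_strip])	Remove leading and trailing characters.
`Series.str.swapcase`()	Convert strings in the Series/Index to be swapcased.
`Series.str.title`()	Convert strings in the Series/Index to titlecase.
`Series.str.translate`(table)	Map all characters in the string through the given mapping table.
`Series.str.upper`()	Convert strings in the Series/Index to uppercase.
`Series.str.wrap`(width, **kwargs)	Wrap strings in Series/Index at specified line width.
`Series.str.zfill`(width)	Pad strings in the Series/Index by prepending '0' characters.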

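Categorical Accessor

Methods

`Series.cat.add_categories`(*args, **kwargs)	Add new categories.
`Series.cat.as_known`(**kwargs)	Ensure the categories in this series are known.
`Series.cat.as_ordered`(*args, **kwargs)	Set the Categorical to be ordered.
`Series.cat.as_unknown`()	Ensure the categories in this series are unknown
`Series.cat.as_unordered`(*args, **kwargs)	Set the Categorical to be unordered.
`Series.cat.remove_categories`(*args, **kwargs)	Remove the specified categories.
`Series.cat.remove_unused_categories`()	Remove categories which are not used
`Series.cat.rename_categories`(*args, **kwargs)	Rename categories.
`Series.cat.reorder_categories`(*args, **kwargs)	Reorder categories as specified in new_categories.
`Series.cat.set_categories`(*args, **kwargs)	Set the categories to the specified new categories.

Attributes

`Series.cat.categories`	The categories of this categorical.
`Series.cat.codes`	The codes of this categorical.
`Series.cat.known`	Whether the categories are fully known
`Series.cat.ordered`	Whether the categories have an ordered relationship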

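Groupby Operations

DataFrame Groupby

`GroupBy.aggregate`([arg, split_every, ...])	Aggregate using one or more specified operations
`GroupBy.apply`(func, *args[, meta, ...])	Parallel version of pandas GroupBy.apply
`GroupBy.bfill`([limit, shuffle_method])	Backward fill the values.
`GroupBy.count`(**kwargs)	Compute count of group, excluding missing values.
`GroupBy.cumcount`()	Number each item in each group from 0 to the length of that group - 1.
`GroupBy.cumprod`([numeric_only])	Cumulative product for each group.
`GroupBy.cumsum`([numeric_only])	Cumulative sum for each group.
`GroupBy.ffill`([limit, shuffle_method])	Forward fill the values.
`GroupBy.get_group`(key)	Construct DataFrame from group with provided name.
`GroupBy.max`([numeric_only])	Compute max of group values.
`GroupBy.mean`([numeric_only, split_out])	Compute mean of groups, excluding missing values.
`GroupBy.min`([numeric_only])	Compute min of group values.
`GroupBy.size`(**kwargs)	Compute group sizes.
`GroupBy.std`([ddof, split_every, split_out, ...])	Compute standard deviation of groups, excluding missing values.
`GroupBy.sum`([numeric_only, min_count])	Compute sum of group values.
`GroupBy.var`([ddof, split_every, split_out, ...])	Compute variance of groups, excluding missing values.
`GroupBy.cov`([ddof, split_every, split_out, ...])	Compute pairwise covariance of columns, excluding NA/null values.
`GroupBy.corr`([split_every, split_out, ...])	Compute pairwise correlation of columns, excluding NA/null values.
`GroupBy.first`([numeric_only, sort])	Compute the first entry of each column within each group.
`GroupBy.last`([numeric_only, sort])	Compute the last entry of each column within each group.
`GroupBy.idxmin`([split_every, split_out, ...])	Return index of first occurrence of minimum over requested axis.
`GroupBy.idxmax`([split_every, split_out, ...])	Return index of first occurrence of maximum over requested axis.
`GroupBy.rolling`(window[, min_periods, ...])	Provides rolling transformations.
`GroupBy.transform`(func[, meta, shuffle_method])	Parallel version of pandas GroupBy.transform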

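Series Groupby

`SeriesGroupBy.aggregate`([arg, split_every, ...])	Aggregate using one or more specified operations
`SeriesGroupBy.apply`(func, *args[, meta, ...])	Parallel version of pandas GroupBy.apply
`SeriesGroupBy.bfill`([limit, shuffle_method])	Backward fill the values.
`SeriesGroupBy.count`(**kwargs)	Compute count of group, excluding missing values.
`SeriesGroupBy.cumcount`()	Number each item in each group from 0 to the length of that group - 1.
`SeriesGroupBy.cumprod`([numeric_only])	Cumulative product for each group.
`SeriesGroupBy.cumsum`([numeric_only])	Cumulative sum for each group.
`SeriesGroupBy.ffill`([limit, shuffle_method])	Forward fill the values.
`SeriesGroupBy.get_group`(key)	Construct DataFrame from group with provided name.
`SeriesGroupBy.max`([numeric_only])	Compute max of group values.
`SeriesGroupBy.mean`([numeric_only, split_out])	Compute mean of groups, excluding missing values.
`SeriesGroupBy.min`([numeric_only])	Compute min of group values.
`SeriesGroupBy.nunique`([split_every, ...])	Return number of unique elements in the group.
`SeriesGroupBy.size`(**kwargs)	Compute group sizes.
`SeriesGroupBy.std`([ddof, split_every, ...])	Compute standard deviation of groups, excluding missing values.
`SeriesGroupBy.sum`([numeric_only, min_count])	Compute sum of group values.
`SeriesGroupBy.var`([ddof, split_every, ...])	Compute variance of groups, excluding missing values.
`SeriesGroupBy.first`([numeric_only, sort])	Compute the first entry of each column within each group.
`SeriesGroupBy.last`([numeric_only, sort])	Compute the last entry of each column within each group.
`SeriesGroupBy.idxmin`([split_every, ...])	Return index of first occurrence of minimum over requested axis.
`SeriesGroupBy.idxmax`([split_every, ...])	Return index of first occurrence of maximum over requested axis.
`SeriesGroupBy.rolling`(window[, min_periods, ...])	Provides rolling transformations.
`SeriesGroupBy.transform`(func[, meta, ...])	Parallel version of pandas GroupBy.transform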

Custom Aggregation

`Aggregation`(name, chunk, agg[, finalize])

User-defined groupby aggregation.
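
The `chunk`, `agg`, and `finalize` parameters of `Aggregation` suggest a three-stage reduction: a per-partition step, a combine step, and an optional final transformation. The sketch below models that pattern for a grouped mean in plain Python. It is only a conceptual illustration under stated assumptions: partitions are modeled as lists of `(key, value)` pairs, and the three functions here take plain lists and dicts, which is not the calling convention of the real Dask class.

```python
from collections import defaultdict


def chunk(rows):
    """Per-partition step: partial (sum, count) for each group key."""
    partial = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        partial[key][0] += value
        partial[key][1] += 1
    return {key: tuple(acc) for key, acc in partial.items()}


def agg(partials):
    """Combine step: merge the partial results from every partition."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (s, n) in partial.items():
            total[key][0] += s
            total[key][1] += n
    return {key: tuple(acc) for key, acc in total.items()}


def finalize(combined):
    """Final step: turn the combined (sum, count) pairs into means."""
    return {key: s / n for key, (s, n) in combined.items()}


# Two hypothetical partitions of (group key, value) rows.
partitions = [
    [("a", 1.0), ("b", 4.0)],
    [("a", 3.0), ("b", 6.0), ("a", 5.0)],
]
group_means = finalize(agg([chunk(p) for p in partitions]))
print(group_means)  # {'a': 3.0, 'b': 5.0}
```

A mean cannot be combined directly from per-partition means, which is why the intermediate state carries sums and counts and the division is deferred to the final step.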

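Rolling Operations

`Series.rolling`(window, **kwargs)	Provides rolling transformations.
`DataFrame.rolling`(window, **kwargs)	Provides rolling transformations.

`Rolling.apply`(func, *args, **kwargs)	Calculate the rolling custom aggregation function.
`Rolling.count`(*args, **kwargs)	Calculate the rolling count of non NaN observations.
`Rolling.kurt`(*args, **kwargs)	Calculate the rolling Fisher's definition of kurtosis without bias.
`Rolling.max`(*args, **kwargs)	Calculate the rolling maximum.
`Rolling.mean`(*args, **kwargs)	Calculate the rolling mean.
`Rolling.median`(*args, **kwargs)	Calculate the rolling median.
`Rolling.min`(*args, **kwargs)	Calculate the rolling minimum.
`Rolling.quantile`(q, *args, **kwargs)	Calculate the rolling quantile.
`Rolling.skew`(*args, **kwargs)	Calculate the rolling unbiased skewness.
`Rolling.std`(*args, **kwargs)	Calculate the rolling standard deviation.
`Rolling.sum`(*args, **kwargs)	Calculate the rolling sum.
`Rolling.var`(*args, **kwargs)	Calculate the rolling variance.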

Create DataFrames

`read_csv`(urlpath[, blocksize, ...])	Read CSV files into a Dask.DataFrame
`read_table`(urlpath[, blocksize, ...])	Read delimited files into a Dask.DataFrame
`read_fwf`(urlpath[, blocksize, ...])	Read fixed-width files into a Dask.DataFrame
`read_parquet`([path, columns, filters, ...])	Read a Parquet file into a Dask DataFrame
`read_hdf`(pattern, key[, start, stop, ...])	Read HDF files into a Dask DataFrame
`read_json`(url_path[, orient, lines, ...])	Create a dataframe from a set of JSON files
`read_orc`(path[, engine, columns, index, ...])	Read dataframe from ORC file(s)
`read_sql_table`(table_name, con, index_col[, ...])	Read SQL database table into a DataFrame.
`read_sql_query`(sql, con, index_col[, ...])	Read SQL query into a DataFrame.
`read_sql`(sql, con, index_col, **kwargs)	Read SQL query or database table into a DataFrame.
`from_array`(arr[, chunksize, columns, meta])	Read any sliceable array into a Dask Dataframe
`from_dask_array`(x[, columns, index, meta])	Create a Dask DataFrame from a Dask Array.
`from_delayed`(dfs[, meta, divisions, prefix, ...])	Create Dask DataFrame from many Dask Delayed objects
`from_map`(func, *iterables[, args, meta, ...])	Create a DataFrame collection from a custom function map.
`from_pandas`(data[, npartitions, sort, chunksize])	Construct a Dask DataFrame from a Pandas DataFrame
`DataFrame.from_dict`(data, *[, npartitions, ...])	Construct a Dask DataFrame from a Python Dictionary

Store DataFrames

`to_csv`(df, filename[, single_file, ...])	Store Dask DataFrame to CSV files
`to_parquet`(df, path[, compression, ...])	Store Dask.dataframe to Parquet files
`to_hdf`(df, path, key[, mode, append, ...])	Store Dask Dataframe to Hierarchical Data Format (HDF) files
`to_records`(df)	Create Dask Array from a Dask Dataframe
`to_sql`(df, name, uri[, schema, if_exists, ...])	Store Dask Dataframe to a SQL table
`to_json`(df, url_path[, orient, lines, ...])	Write dataframe into JSON text files
`to_orc`(df, path[, engine, write_index, ...])	Store Dask.dataframe to ORC files

Convert DataFrames

`DataFrame.to_bag`([index, format])	Create a Dask Bag from a Series
`DataFrame.to_dask_array`([lengths, meta, ...])	Convert a dask DataFrame to a dask array.
`DataFrame.to_delayed`([optimize_graph])	Convert into a list of `dask.delayed` objects, one per partition.

Reshape DataFrames

`get_dummies`(data[, prefix, prefix_sep, ...])	Convert categorical variable into dummy/indicator variables.
`pivot_table`(df, index, columns, values[, ...])	Create a spreadsheet-style pivot table as a DataFrame.
`melt`(frame[, id_vars, value_vars, var_name, ...])

Concatenate DataFrames

`DataFrame.merge`(right[, how, on, left_on, ...])	Merge the DataFrame with another DataFrame
`concat`(dfs[, axis, join, ...])	Concatenate DataFrames along rows.
`merge`(left, right[, how, on, left_on, ...])	Merge DataFrame or named Series objects with a database-style join.
`merge_asof`(left, right[, on, left_on, ...])	Perform a merge by key distance.

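"Merge by key distance" in the `merge_asof` entry means matching each left row with the nearest right row rather than requiring equal keys. The following plain-Python sketch shows one variant of that idea, a backward match in which each left key is paired with the last right key that is less than or equal to it. The function name `merge_asof_backward` and the sample rows are hypothetical, both inputs are assumed to be sorted by key, and other matching directions or tolerances are not modeled.

```python
from bisect import bisect_right


def merge_asof_backward(left, right):
    """Pair each left (key, value) with the last right row whose key <= left key.

    Both inputs are lists of (key, value) tuples, assumed sorted by key.
    Left rows with no earlier right row get None as the matched value.
    """
    right_keys = [key for key, _ in right]
    merged = []
    for key, value in left:
        pos = bisect_right(right_keys, key)  # number of right keys <= key
        match = right[pos - 1][1] if pos else None
        merged.append((key, value, match))
    return merged


trades = [(1, "t1"), (5, "t2"), (10, "t3")]   # assumed sample data
quotes = [(2, "q1"), (5, "q2"), (7, "q3")]
print(merge_asof_backward(trades, quotes))
# [(1, 't1', None), (5, 't2', 'q2'), (10, 't3', 'q3')]
```
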
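Resampling

`Resampler`(obj, rule, **kwargs)	Aggregate using one or more operations
`Resampler.agg`(func, *args, **kwargs)	Aggregate using one or more operations over the specified axis.
`Resampler.count`()	Compute count of group, excluding missing values.
`Resampler.first`()	Compute the first entry of each column within each group.
`Resampler.last`()	Compute the last entry of each column within each group.
`Resampler.max`()	Compute max value of group.
`Resampler.mean`()	Compute mean of groups, excluding missing values.
`Resampler.median`()	Compute median of groups, excluding missing values.
`Resampler.min`()	Compute min value of group.
`Resampler.nunique`()	Return number of unique elements in the group.
`Resampler.ohlc`()	Compute open, high, low and close values of a group, excluding missing values.
`Resampler.prod`()	Compute prod of group values.
`Resampler.quantile`()	Return value at the given quantile.
`Resampler.sem`()	Compute standard error of the mean of groups, excluding missing values.
`Resampler.size`()	Compute group sizes.
`Resampler.std`()	Compute standard deviation of groups, excluding missing values.
`Resampler.sum`()	Compute sum of group values.
`Resampler.var`()	Compute variance of groups, excluding missing values.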

Dask Metadata

`make_meta`(x[, index, parent_meta])

This method creates metadata based on the type of `x`, and on `parent_meta` if supplied.

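Query Planning and Optimization

`DataFrame.explain`([stage, format])	Create a graph representation of the Expression.
`DataFrame.visualize`([tasks])	Visualize the expression or task graph
`DataFrame.analyze`([filename, format])	Output statistics about every node in the expression.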

Other functions

`compute`(*args[, traverse, optimize_graph, ...])	Compute several dask collections at once.
`map_partitions`(func, *args[, meta, ...])	Apply a Python function on each DataFrame partition.
`map_overlap`(func, df, before, after, *args)	Apply a function to each partition, sharing rows with adjacent partitions.
`to_datetime`()	Convert argument to datetime.
`to_numeric`(arg[, errors, downcast, meta])	Convert argument to a numeric type.
`to_timedelta`()	Convert argument to timedelta.
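
The `map_overlap` entry above mentions sharing rows with adjacent partitions. The sketch below models that idea in plain Python for list-based partitions: each partition borrows `before` rows from its predecessor and `after` rows from its successor, the function runs on the extended block, and the borrowed positions are trimmed from the output. This local `map_overlap` is a hypothetical stand-in with its own argument order, it only borrows from the immediately neighboring partition, and it assumes the function returns one output per input row.

```python
def map_overlap(func, partitions, before, after):
    """Apply func to each partition extended with neighboring rows, then trim."""
    results = []
    for i, part in enumerate(partitions):
        # Borrow rows only from the immediate neighbors (a simplification).
        head = partitions[i - 1][-before:] if i > 0 and before else []
        tail = partitions[i + 1][:after] if i + 1 < len(partitions) and after else []
        out = func(head + part + tail)
        # Keep only the outputs that correspond to this partition's own rows.
        results.append(out[len(head):len(head) + len(part)])
    return results


def diff(values):
    """Difference with the previous element; None where there is none."""
    return [None] + [b - a for a, b in zip(values, values[1:])]


partitions = [[1, 2, 4], [7, 11], [16]]  # assumed sample partitions
result = map_overlap(diff, partitions, before=1, after=0)
print(result)  # [[None, 1, 2], [3, 4], [5]]
```

Without the borrowed row, the first element of every partition would have no predecessor and the per-partition results would differ from a `diff` over the whole sequence.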
