Using pandas to fill only the gaps, not the NaNs at either end

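I have some housing price data covering about 8 months that tracks each house's price from the time it is listed until it sells. There are a few gaps in the middle of the data that I would like to fill, but I want to leave the NaNs at each end untouched.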

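As a simple example, say 'house1' comes on the market on 'day 4' at 200,000 and sells on 'day 9' for 190,000. 'house2' stays at 180,000 from day 1 through day 12 and does not sell within that window. But something went wrong on days 6 and 7, and I lost the data: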

house1 = [NaN, NaN, NaN, 200000, 200000, NaN, NaN, 200000, 190000, NaN, NaN, NaN]
house2 = [180000, 180000, 180000, 180000, 180000, NaN, NaN, 180000, 180000, 180000, 180000, 180000]

Now imagine that, instead of regular arrays, these are columns in a pandas DataFrame indexed by date.

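The trouble is that the function I would normally use to fill gaps is DataFrame.fillna(), with either the backfill or the ffill method. If I use ffill, house1 comes back as: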

house1 = [NaN, NaN, NaN, 200000, 200000, 200000, 200000, 200000, 190000, 190000, 190000, 190000]

This fills the gap, but it also incorrectly fills in data after the day of the sale. If I use backfill, I get this:

house1 = [200000, 200000, 200000, 200000, 200000, 200000, 200000, 200000, 190000, NaN, NaN, NaN]

Again, it fills the gap, but this time it also fills the front of the data. If I use 'limit = 2' with ffill, what I get is:

house1 = [NaN, NaN, NaN, 200000, 200000, 200000, 200000, 200000, 190000, 190000, 190000, NaN]

Once more it fills the gap, but it then also starts filling past the point where the "real" data ends.
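
For anyone who wants to reproduce the three behaviours described so far, here is a minimal sketch. It assumes a reasonably recent pandas where Series.ffill() and Series.bfill() are available, and it uses a plain integer index rather than dates to keep things short.

```python
import numpy as np
import pandas as pd

nan = np.nan
house1 = pd.Series([nan, nan, nan, 200000, 200000, nan, nan,
                    200000, 190000, nan, nan, nan])

forward = house1.ffill()          # fills the gap, but also the trailing NaNs
backward = house1.bfill()         # fills the gap, but also the leading NaNs
limited = house1.ffill(limit=2)   # fills the gap, plus two of the trailing NaNs

# Count the NaNs that survive each strategy
print(forward.isna().sum(), backward.isna().sum(), limited.isna().sum())
# prints: 3 3 4
```

None of the three leaves exactly the six outer NaNs in place, which is the behaviour the question is after.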

So far my solution has been to write the following function:

import pandas as pd


def isANumber(value):
    """Helper not shown in the original post; assumed to mean 'is not NaN'."""
    return not pd.isnull(value)


def fillGaps(houseDF):
    """Fills up holes in the housing data"""

    def fillColumns(column):
        # Work on a copy so the caller's column is not modified in place
        filled_col = column.copy()
        lastValue = None
        # Keeps track of if we are dealing with a gap in numbers
        gap = False
        i = 0
        for currentValue in column:
            # Loops over all the nans before the numbers begin
            if not isANumber(currentValue) and lastValue is None:
                pass
            # Keeps track of the last number we encountered before a gap
            elif isANumber(currentValue) and (gap is False):
                lastIndex = i
                lastValue = currentValue
            # Notes when we encounter a gap in numbers
            elif not isANumber(currentValue):
                gap = True
            # Fills in the gap
            elif isANumber(currentValue):
                gapIndicies = range(lastIndex + 1, i)
                for j in gapIndicies:
                    # Positional assignment, so a date index works too
                    filled_col.iloc[j] = lastValue
                gap = False
                # The current value is the reference point for the next gap
                lastIndex = i
                lastValue = currentValue
            i += 1
        return filled_col

    filled_df = houseDF.apply(fillColumns, axis=0)
    return filled_df

It simply skips all the leading NaNs, fills the gaps (defined as groups of NaNs between real values), and does not fill the NaNs at the end.

Is there a cleaner way to do this, or a built-in pandas function I am not aware of?

3 Answers

You can apply the fill to just part of the series. Based on your description, only the NaNs after the first non-NaN and before the last non-NaN should be filled:

import numpy as np
import pandas as pd


def fill_column(house):
    house = house.copy()
    # Keep only the observed (non-NaN) prices to locate the first and last one
    non_nans = house[house.notna()]
    start, end = non_nans.index[0], non_nans.index[-1]
    # Forward-fill only between the first and last observed prices
    house.loc[start:end] = house.loc[start:end].ffill()
    return house


house1 = pd.Series([np.nan, np.nan, np.nan, 200000, 200000, np.nan, np.nan, 200000, 190000, np.nan, np.nan, np.nan])
print(fill_column(house1))

Output:

0        NaN
1        NaN
2        NaN
3     200000
4     200000
5     200000
6     200000
7     200000
8     190000
9        NaN
10       NaN
11       NaN

Note that this assumes the series contains at least two non-NaN values, corresponding to the prices on the first and last days.
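
If a column might contain no prices at all, the assumption above can be relaxed. The sketch below is one possible variant, not part of the answer: the name fill_inner_gaps is hypothetical, and it relies on Series.first_valid_index() and Series.last_valid_index(), which return None when every value is NaN.

```python
import numpy as np
import pandas as pd


def fill_inner_gaps(series):
    """Forward-fill only between the first and last valid observations."""
    start = series.first_valid_index()
    end = series.last_valid_index()
    if start is None:
        # Nothing was ever observed, so there is no gap to fill
        return series.copy()
    filled = series.copy()
    # .loc slicing is label-based and includes both endpoints
    filled.loc[start:end] = series.loc[start:end].ffill()
    return filled


nan = np.nan
house1 = pd.Series([nan, nan, nan, 200000, 200000, nan, nan,
                    200000, 190000, nan, nan, nan])
filled = fill_inner_gaps(house1)
empty = fill_inner_gaps(pd.Series([nan, nan, nan]))
```

Like the answer's version, this assumes the index is unique and sorted, so that the label slice start:end covers exactly the rows between the two observations.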

Answered 2024-04-28T11:34:09+08:00

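I found this answer a year later, but needed it to work on a DataFrame with multiple columns, so I wanted to leave my solution here in case anyone else needs the same thing. My function is just a modified version of YS-L's: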

def fillna_downbet(df):
    df = df.copy()
    for col in df:
        # Locate the first and last observed values in this column
        non_nans = df[col].dropna()
        start, end = non_nans.index[0], non_nans.index[-1]
        # Assign through df.loc to avoid chained assignment on df[col]
        df.loc[start:end, col] = df.loc[start:end, col].ffill()
    return df

Thanks!
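
To see the per-column approach on the data from the question, here is a small usage sketch. The function is restated so the snippet runs on its own, and the dates are made up for illustration, since the question does not give the real ones.

```python
import numpy as np
import pandas as pd


def fillna_downbet(df):
    """Forward-fill each column between its first and last observed values."""
    df = df.copy()
    for col in df:
        non_nans = df[col].dropna()
        start, end = non_nans.index[0], non_nans.index[-1]
        df.loc[start:end, col] = df.loc[start:end, col].ffill()
    return df


nan = np.nan
# Hypothetical dates: 12 consecutive days standing in for "day 1" to "day 12"
dates = pd.date_range("2024-01-01", periods=12, freq="D")
houses = pd.DataFrame(
    {
        "house1": [nan, nan, nan, 200000, 200000, nan, nan,
                   200000, 190000, nan, nan, nan],
        "house2": [180000] * 5 + [nan, nan] + [180000] * 5,
    },
    index=dates,
)

filled = fillna_downbet(houses)
print(filled)
```

After the call, house2 has no NaNs left, while house1 keeps its three leading and three trailing NaNs. The input frame is left unchanged because the function works on a copy.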

Answered 2024-04-28T11:34:09+08:00

Another solution for a DataFrame with multiple columns:
```
df.ffill() + (df.bfill() * 0)
```
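How does it work?

The first term forward-fills the values. This is almost what we want, except that it leaves a trail of filled values at the end of each series.

The second term back-fills the values, which we then multiply by zero. The result is that the trailing values we do not want become NaN and everything else becomes 0.

Finally, we add the two together, using the fact that x + 0 = x and x + NaN = NaN.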
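
The steps can be traced on a tiny column. The sketch below uses a shortened, made-up version of house1. The last line shows a masking variant based on DataFrame.where, which is not from the answer but gives the same result without arithmetic, so it should also work for non-numeric columns.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({"house1": [nan, 200000, nan, 190000, nan]})

forward = df.ffill()       # [NaN, 200000, 200000, 190000, 190000]
mask = df.bfill() * 0      # [0, 0, 0, 0, NaN]
result = forward + mask    # [NaN, 200000, 200000, 190000, NaN]

# Alternative (an assumption, not the answer's method): keep the
# forward-filled value only where a later observation exists
alt = df.ffill().where(df.bfill().notna())

print(result.equals(alt))
```

The leading NaN survives because the forward fill has nothing to carry into it, and the trailing NaN survives because the back fill has nothing to carry into it.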
Answered 2024-04-28T11:34:09+08:00
