
pd.get_dummies() slow on a large number of levels


I'm not sure whether this is already as fast as it can be, or whether I'm doing something inefficient.

I want to one-hot encode a specific categorical column that has 27k possible levels. The column has different values in 2 different datasets, so I first combine the levels across both before using get_dummies():

import pandas as pd

def hot_encode_column_in_both_datasets(column_name, df, df2, sparse=True):
    # take the union of levels so both frames produce the same dummy columns
    col1b = set(df2[column_name].unique())
    col1a = set(df[column_name].unique())
    combined_cats = list(col1a.union(col1b))
    cat_dtype = pd.api.types.CategoricalDtype(categories=combined_cats)
    df[column_name] = df[column_name].astype(cat_dtype)
    df2[column_name] = df2[column_name].astype(cat_dtype)

    df = pd.get_dummies(df, columns=[column_name], sparse=sparse)
    df2 = pd.get_dummies(df2, columns=[column_name], sparse=sparse)
    try:
        # get_dummies already drops the encoded column, so this is just a guard
        del df[column_name]
        del df2[column_name]
    except KeyError:
        pass
    return df, df2
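
For reference, a minimal call looks like this (the column name here is hypothetical):

df_enc, df2_enc = hot_encode_column_in_both_datasets('col1', df, df2, sparse=True)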

However, it has now been running for over 2 hours and it is still encoding.

Am I doing something wrong here? Or is this just the nature of running it on large datasets?

Df has 6.8 million rows and 27 columns, and Df2 has 19,990 rows and 27 columns, before one-hot encoding the column I want.

Any advice is appreciated, thank you! :)

1 Answer


    I briefly reviewed the get_dummies source code, and I think it may not be taking full advantage of the sparsity for your use case. The following approach may be faster, but I did not attempt to scale it all the way up to the 19M records you have:

    import numpy as np
    import pandas as pd
    import scipy.sparse as ssp
    
    np.random.seed(1)
    N = 10000
    
    dfa = pd.DataFrame.from_dict({
        'col1': np.random.randint(0, 27000, N)
        , 'col2b': np.random.choice([1, 2, 3], N)
        , 'target': np.random.choice([1, 2, 3], N)
        })
    
    # construct an array of the unique values of the column to be encoded
    vals = np.array(dfa.col1.unique())
    # extract an array of values to be encoded from the dataframe
    col1 = dfa.col1.values
    # construct a sparse matrix of the appropriate size and an appropriate,
    # memory-efficient dtype
    spmtx = ssp.dok_matrix((N, len(vals)), dtype=np.uint8)
    # do the encoding. NB: This is only vectorized in one of the two dimensions.
    # Finding a way to vectorize the second dimension may yield a large speed up
    for idx, val in enumerate(vals):
        spmtx[np.argwhere(col1 == val), idx] = 1
    
    # Construct a SparseDataFrame from the sparse matrix and apply the index
    # from the original dataframe and column names.
    dfnew = pd.SparseDataFrame(spmtx, index=dfa.index,
                               columns=['col1_' + str(el) for el in vals])
    dfnew.fillna(0, inplace=True)
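
    As a quick sanity check (my addition, not part of the original answer), every row of spmtx should contain exactly one nonzero entry, since each row of dfa carries exactly one col1 value:

    # each encoded row should have exactly one 1
    assert (spmtx.sum(axis=1) == 1).all()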
    

    UPDATE

    Borrowing insights from other answers here and here, I was able to vectorize the solution in both dimensions. In my limited testing, I noticed that constructing the SparseDataFrame seems to multiply the execution time several-fold, so if you don't need to return a DataFrame-like object, you can save a lot of time. This solution also handles the case where you need to encode 2 DataFrames into 2-d arrays with equal numbers of columns.

    import numpy as np
    import pandas as pd
    import scipy.sparse as ssp
    
    np.random.seed(1)
    N1 = 10000
    N2 = 100000
    
    dfa = pd.DataFrame.from_dict({
        'col1': np.random.randint(0, 27000, N1)
        , 'col2a': np.random.choice([1, 2, 3], N1)
        , 'target': np.random.choice([1, 2, 3], N1)
        })
    
    dfb = pd.DataFrame.from_dict({
        'col1': np.random.randint(0, 27000, N2)
        , 'col2b': np.random.choice(['foo', 'bar', 'baz'], N2)
        , 'target': np.random.choice([1, 2, 3], N2)
        })
    
    # construct an array of the unique values of the column to be encoded
    # taking the union of the values from both dataframes.
    valsa = set(dfa.col1.unique())
    valsb = set(dfb.col1.unique())
    vals = np.array(list(valsa.union(valsb)), dtype=np.uint16)
    
    
    def sparse_ohe(df, col, vals):
        """One-hot encoder using a sparse ndarray."""
        colaray = df[col].values
        # construct a sparse matrix of the appropriate size and an appropriate,
        # memory-efficient dtype
        spmtx = ssp.dok_matrix((df.shape[0], vals.shape[0]), dtype=np.uint8)
        # do the encoding
        spmtx[np.where(colaray.reshape(-1, 1) == vals.reshape(1, -1))] = 1
    
        # Construct a SparseDataFrame from the sparse matrix
        dfnew = pd.SparseDataFrame(spmtx, dtype=np.uint8, index=df.index,
                                   columns=[col + '_' + str(el) for el in vals])
        dfnew.fillna(0, inplace=True)
        return dfnew
    
    dfanew = sparse_ohe(dfa, 'col1', vals)
    dfbnew = sparse_ohe(dfb, 'col1', vals)
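
    Picking up the note above that constructing the SparseDataFrame dominates the runtime: below is a minimal sketch (my addition; sparse_ohe_matrix is a hypothetical name) of a variant that skips the DataFrame step entirely and returns a scipy CSR matrix instead:

    import numpy as np
    import scipy.sparse as ssp

    def sparse_ohe_matrix(df, col, vals):
        """One-hot encode df[col] against the vocabulary vals as a CSR matrix."""
        # row/column indices of every match between the column values and the
        # vocabulary; each row matches exactly one entry of vals
        rows, cols = np.where(df[col].values.reshape(-1, 1) == vals.reshape(1, -1))
        data = np.ones(rows.shape[0], dtype=np.uint8)
        return ssp.csr_matrix((data, (rows, cols)),
                              shape=(df.shape[0], vals.shape[0]))

    mtxa = sparse_ohe_matrix(dfa, 'col1', vals)
    mtxb = sparse_ohe_matrix(dfb, 'col1', vals)

    Both matrices share the same column order, so they can be stacked with ssp.vstack or passed to downstream libraries directly.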
    
