
Slicing dataframe columns into deciles and appending them to a matrix


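tl;dr: My code does what I want except for one element: the line that defines tick_slice in each code block does not return the desired slice. If you want to skip the details of my data, jump to the fourth paragraph below, the one that starts with "My Question:".

I have the code below, in which I read a csv file into a dataframe. The data in the csv file is organized as follows: the first row contains dates in YYYYMMDD format, in descending order across ~350 columns. In each column (below each date) are unique identifiers (somewhat like stock tickers). Most columns have a different number of rows, ranging from a few hundred to a few thousand, and the identifiers are sorted from "best" to "worst".

My goal is to create a matrix (and then output it to a new csv file). The matrix is built as follows: starting with the first column of the input csv data, take the top decile (10%) of identifiers and place them in the first column of the output matrix. Next, move to the second column of the input csv data, take the top decile of identifiers from that column, and place them in the first column of the output matrix, below the ones placed there in the first step. Repeat this the number of times given by the variable hld_per. Then move one column to the right and repeat these steps. So in my current code, the nested for loops first cover columns 1:12 of the input csv data, then columns 2:13, and so on.

I currently have this block of code repeated ten times to create 10 output files, one per decile, best to worst (the code below shows only the first two of the ten blocks to save space; also, I know that repeating the block 10 times is inefficient, but I will address that once the current code runs the way I need).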

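My Question: The code below accomplishes most of what I want, but the line that defines tick_slice in each nested for loop does not return the slice I need, and I do not know how to apply something like qcut() to it.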

Any help is greatly appreciated. Also, if there is a completely different way to do this in pandas, I am certainly open to suggestions.

Partial code:

import pandas as pd

hld_per = 12
quantiles = 10
permnos = pd.read_csv('Ranks.csv')
my_headers = list(permnos.columns)
total_cols = len(permnos.columns)

ports1 = []

for i in range(total_cols-(hld_per-1)):
    permlist = []
    for j in range(hld_per):
        tick_slice = int(permnos.iloc[:,i+j].count()/quantiles)
        col_slice = permnos.iloc[0:tick_slice, i+j].tolist()
        permlist = permlist + col_slice
    ports1.append(permlist)

matrix = pd.DataFrame(ports1)
matrix = matrix.T
matrix.columns = my_headers[0:len(matrix.columns)]
matrix.to_csv('ports1.csv', sep=',', index=False, header=True)

ports2 = []

for i in range(total_cols-(hld_per-1)):
    permlist = []
    for j in range(hld_per):
        tick_slice = int(permnos.iloc[:,i+j].count()/quantiles)
        col_slice = permnos.iloc[tick_slice:tick_slice*2, i+j].tolist()
        permlist = permlist + col_slice
    ports2.append(permlist)

matrix = pd.DataFrame(ports2)
matrix = matrix.T
matrix.columns = my_headers[0:len(matrix.columns)]
matrix.to_csv('ports2.csv', sep=',', index=False, header=True)

From this code, permnos.head() yields:

   20131231  20131130  20131031  20130930  20130831  20130731  20130630  \
0   93044.0   93044.0     13264     13264   89169.0   82486.0   91274.0   
1   79702.0   91515.0     90710     81148   47387.0   88359.0   93353.0   
2   85751.0   85724.0     88810     11513   85576.0   47387.0   85576.0   
3   85576.0   89169.0     81562     81562   81148.0   10294.0   10294.0   
4   13264.0   90710.0     82281     47387   11285.0   90710.0   47387.0

matrix.head() for ports1 yields:

   20131231  20131130  20131031  20130930  20130831  20130731  20130630  \
0   93044.0   93044.0   13264.0   13264.0   89169.0   82486.0   91274.0   
1   79702.0   91515.0   90710.0   81148.0   47387.0   88359.0   93353.0   
2   85751.0   85724.0   88810.0   11513.0   85576.0   47387.0   85576.0   
3   93044.0   13264.0   13264.0   89169.0   82486.0   91274.0   85653.0   
4   91515.0   90710.0   81148.0   47387.0   88359.0   93353.0   91274.0

matrix.head() for ports2 yields:

   20131231  20131130  20131031  20130930  20130831  20130731  20130630  \
0   85576.0   89169.0   81562.0   81562.0   81148.0   10294.0   10294.0   
1   13264.0   90710.0   82281.0   47387.0   11285.0   90710.0   47387.0   
2   90539.0   47387.0   93044.0   92805.0   82281.0   89169.0   66852.0   
3   89169.0   81562.0   81562.0   81148.0   10294.0   10294.0   89169.0   
4   90710.0   82281.0   47387.0   11285.0   90710.0   47387.0   93353.0
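
The arithmetic behind the fixed-width slices in the code above can be sketched as follows. This is only an illustration: it assumes the eight blocks not shown follow the same pattern as the two that are shown (slice k runs from k*tick_slice to (k+1)*tick_slice), and `fixed_width_slices` is a hypothetical helper that does not appear in the original code.

```python
def fixed_width_slices(col_len, quantiles):
    """Row ranges produced by the tick_slice pattern in the question.

    Assumption: block k slices rows k*tick_slice to (k+1)*tick_slice,
    as the first two blocks (ports1, ports2) do.
    """
    tick_slice = int(col_len / quantiles)  # same truncation as the question
    return [(k * tick_slice, (k + 1) * tick_slice) for k in range(quantiles)]


# A hypothetical column with 25 valid identifiers, split into 10 slices.
slices = fixed_width_slices(25, 10)
covered = slices[-1][1]   # last row position reached by any slice
uncovered = 25 - covered  # rows at the bottom that no slice picks up

print(slices[:2])  # [(0, 2), (2, 4)]
print(uncovered)   # 5
```

Under that assumption, whenever the column length is not a multiple of the number of quantiles, the truncation leaves the remainder rows at the bottom of the column out of every slice.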

2 Answers

  • 1

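    If I understand correctly, then yes, qcut() can find the splits you are looking for.

    We'll start by building an example permnos dataframe. It is based on the OP's permnos.head(), plus a few extra rows to illustrate the heterogeneity in column lengths.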

    import numpy as np
    import pandas as pd
    
    data = {'20130630': {0: 91274.0, 1: 93353.0, 2: 85576.0, 3: 10294.0, 4: 47387.0, 5: np.nan, 6: np.nan},
            '20130731': {0: 82486.0, 1: 88359.0, 2: 47387.0, 3: 10294.0, 4: 90710.0},
            '20130831': {0: 89169.0, 1: 47387.0, 2: 85576.0, 3: 81148.0, 4: 11285.0},
            '20130930': {0: 13264, 1: 81148, 2: 11513, 3: 81562, 4: np.nan},
            '20131031': {0: 13264, 1: 90710, 2: 88810, 3: 81562, 4: 82281},
            '20131130': {0: 93044.0, 1: 91515.0, 2: 85724.0, 3: 89169.0, 4: 90710.0, 5: 80000., 6: 900000.},
            '20131231': {0: 93044.0, 1: 79702.0, 2: 85751.0, 3: 85576.0, 4: 13264.0, 5: np.nan}}
    
    permnos = pd.DataFrame(data)
    permnos
       20130630  20130731  20130831  20130930  20131031  20131130  20131231
    0   91274.0   82486.0   89169.0   13264.0   13264.0   93044.0   93044.0
    1   93353.0   88359.0   47387.0   81148.0   90710.0   91515.0   79702.0
    2   85576.0   47387.0   85576.0   11513.0   88810.0   85724.0   85751.0
    3   10294.0   10294.0   81148.0   81562.0   81562.0   89169.0   85576.0
    4   47387.0   90710.0   11285.0       NaN   82281.0   90710.0   13264.0
    5       NaN       NaN       NaN       NaN       NaN   80000.0       NaN
    6       NaN       NaN       NaN       NaN       NaN  900000.0       NaN
    

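    Consider what qcut() returns:

    out : Categorical or Series or array of integers if labels is False. The return type (Categorical or Series) depends on the input: a Series of type category if input is a Series else Categorical.

    We are passing in Series data, so we will get a Series of type category as output. For example: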

    n_bin = 3
    out = pd.qcut(permnos["20130630"].dropna(), n_bin)
    out
    0      (89374.667, 93353.0]
    1      (89374.667, 93353.0]
    2    (60116.667, 89374.667]
    3    (10293.999, 60116.667]
    4    (10293.999, 60116.667]
    Name: 20130630, dtype: category
    Categories (3, interval[float64]): [(10293.999, 60116.667] < (60116.667, 89374.667] < (89374.667, 93353.0]]
    

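    We can see how it handles our uneven split: the three categories produce buckets of 2, 1, and 2 data points, respectively. Now we can construct a boolean mask that picks out only the entries in the bucket we care about: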

    out == out.cat.categories[2] # categories are ordered small to large
    0     True
    1     True
    2    False
    3    False
    4    False
    Name: 20130630, dtype: bool
    

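    With this approach, we should be able to get the slice we want from each column by taking only the entries that fall into the target quantile on each iteration.

    We can condense things a bit by wrapping the core operation in a function.

    UPDATED
    (generalized to multiple matrices, per the comments)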

    hld_per = 5
    my_headers = list(permnos.columns)
    total_cols = len(permnos.columns)
    
    def construct_matrix(df, hld_per, total_cols, n_bin, colnames, start_pos, end_cat):
        ports = []
        started = start_pos
    
        for i in range(total_cols-(hld_per-1)):
            permlist = []
            for j in range(hld_per):
                out = pd.qcut(df.iloc[:,i+j].dropna(), n_bin)
                if not bool(started):
                    end_pos = (out == out.cat.categories[start_pos]).sum()
                else:
                    end_pos = start_pos + (out == out.cat.categories[end_cat]).sum()
                col_slice = df.iloc[start_pos:end_pos, i+j].dropna().tolist()
                permlist += col_slice
            ports.append(permlist)
    
        matrix = pd.DataFrame(ports).T
        matrix.columns = colnames[0:len(matrix.columns)]
    
        return matrix, end_pos
    

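    Given that we may want to build an arbitrary number of matrices based on n_bin, we set up a loop around construct_matrix() in which each new start_pos after the first is the end_pos that was set inside the function on the previous iteration. We store the resulting matrices in a list, matrices.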

    matrices = []
    start_pos = 0
    # number of qcut bins
    n_bin = 3
    
    for i in range(1, n_bin):
    
        end_cat = n_bin - i
    
        print("matrix: {}, start_pos: {}, end_cat: {}".format(i, start_pos, end_cat))
    
        matrix, start_pos = construct_matrix(permnos, hld_per, total_cols, n_bin, my_headers, 
                                             start_pos=start_pos, end_cat=end_cat)
        matrices.append(matrix)
    
        print(matrix)
        print()
    

    Output:

    matrix: 1, start_pos: 0, end_cat: 2
        20130630  20130731  20130831
    0    91274.0   82486.0   89169.0
    1    93353.0   88359.0   47387.0
    2    82486.0   89169.0   13264.0
    3    88359.0   47387.0   81148.0
    4    89169.0   13264.0   13264.0
    5    47387.0   81148.0   90710.0
    6    13264.0   13264.0   93044.0
    7    81148.0   90710.0   91515.0
    8    13264.0   93044.0   85724.0
    9    90710.0   91515.0   93044.0
    10       NaN   85724.0   79702.0
    
    matrix: 2, start_pos: 2, end_cat: 1
       20130630  20130731  20130831
    0   85576.0   47387.0   85576.0
    1   47387.0   85576.0   11513.0
    2   85576.0   11513.0   88810.0
    3   11513.0   88810.0   85724.0
    4   88810.0   85724.0   89169.0
    5       NaN   89169.0   85751.0
    

    This should get you the (roughly) even splits across columns and quantiles that you are looking for.
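
One point about the qcut() step in this answer may be worth spelling out: qcut() bins by value, whereas the question ranks identifiers by row position, so the answer uses only the number of entries in each bin and then slices by position. A minimal sketch of that idea, using `labels=False` (the "array of integers" case in the quoted documentation), might look like the following. The helper name `bin_sizes` is hypothetical, and the sample column is copied from the example dataframe.

```python
import numpy as np
import pandas as pd

# The "20130630" column from the example permnos dataframe.
col = pd.Series(
    [91274.0, 93353.0, 85576.0, 10294.0, 47387.0, np.nan, np.nan],
    name="20130630",
)


def bin_sizes(series, n_bin):
    """Count how many non-null entries qcut() puts in each bin.

    Sizes are listed from the lowest-valued bin to the highest-valued bin.
    """
    valid = series.dropna()
    codes = pd.qcut(valid, n_bin, labels=False)  # integer bin codes 0..n_bin-1
    return codes.value_counts().sort_index().tolist()


sizes = bin_sizes(col, 3)

# Use the size of the highest-valued bin as the length of the first
# positional slice, so the slice follows row order rather than value order.
top_by_position = col.dropna().iloc[: sizes[-1]].tolist()

print(sizes)            # [2, 1, 2]
print(top_by_position)  # [91274.0, 93353.0]
```

The sizes match the 2, 1, and 2 data points described above; the slice is positional, so it would still return the first rows even if the identifiers in them were not the largest values.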

  • 0

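    I was never able to build a solution for this with qcut, but I have come up with an alternative solution that meets my specifications. Hopefully someone else will find it useful too.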

    import pandas as pd
    
    hld_per = 12
    quantiles = 10
    permnos = pd.read_csv('Ranks.csv')
    my_headers = list(permnos.columns)
    total_cols = len(permnos.columns)
    
    def slice_range(col_length, quantile):
        increment = col_length // 10
        remainder = col_length % 10 + 1
        addon = 0
        for i in range(quantile-1):
            remainder = max(0, remainder - 1)
            if remainder > 0:
                addon += 1
        start = (quantile - 1) * increment + addon  
        return start
    
    for i in range(quantiles):
        ports = []
        for j in range(total_cols-(hld_per-1)):
            permlist = []
            for k in range(hld_per):
                col_len = permnos.iloc[:,j+k].count()
                start = slice_range(col_len, i+1)
                end = slice_range(col_len, (i+2))
                col_slice = permnos.iloc[start:end, j+k].tolist()
                permlist += col_slice
            ports.append(permlist)
    
        matrix = pd.DataFrame(ports).T
        matrix.columns = my_headers[0:len(matrix.columns)]
        matrix.to_csv("portstst5_" + str(i+1) + ".csv", sep=',', index=False, header=True)
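
As a possible alternative to the hand-written slice_range() above, the same kind of position-based split can be sketched with numpy.array_split, which gives the first (length % n) chunks one extra row. This is a sketch under assumptions, not the author's method: `split_positions` and `rolling_chunk_matrix` are hypothetical helper names, and the toy dataframe stands in for Ranks.csv.

```python
import numpy as np
import pandas as pd


def split_positions(col_len, n_chunks):
    """Return (start, end) row positions of n_chunks near-equal chunks."""
    # np.array_split gives the first (col_len % n_chunks) chunks one extra row.
    pieces = np.array_split(np.arange(col_len), n_chunks)
    sizes = np.array([len(p) for p in pieces])
    ends = np.cumsum(sizes)
    starts = ends - sizes
    return list(zip(starts.tolist(), ends.tolist()))


def rolling_chunk_matrix(df, hld_per, n_chunks, chunk):
    """Stack chunk number `chunk` (0 = best) from hld_per consecutive columns."""
    ports = []
    for i in range(df.shape[1] - (hld_per - 1)):
        stacked = []
        for j in range(hld_per):
            col = df.iloc[:, i + j].dropna()
            start, end = split_positions(len(col), n_chunks)[chunk]
            stacked += col.iloc[start:end].tolist()
        ports.append(stacked)
    matrix = pd.DataFrame(ports).T  # shorter windows are padded with NaN
    matrix.columns = list(df.columns)[: matrix.shape[1]]
    return matrix


# Toy data (made up for illustration): columns of unequal length.
toy = pd.DataFrame({
    "20131231": [1, 2, 3, 4, 5],
    "20131130": [11, 12, 13, 14, np.nan],
    "20131031": [21, 22, 23, np.nan, np.nan],
})

top = rolling_chunk_matrix(toy, hld_per=2, n_chunks=2, chunk=0)
print(top)
```

Calling rolling_chunk_matrix() once per chunk index (0 through 9 for deciles) and writing each result with to_csv() would produce one file per decile, in the same spirit as the outer loop in this answer.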
    
