
Concatenate numpy arrays into two arrays using pandas, numpy, or other tools


I have a series of numpy arrays that are generated, for example:

import random
import pandas as pd
N = 5
data = [[random.random() for i in range(N)] for j in range(N)]
names = ['a','b','c','d','e']
df = pd.DataFrame(data)
df = df.transpose()
df.columns = names

That is:

a    b    c    d    e
0.01 0.03 0.01 0.2  0.04
0.2  0.01 0.02 0.01 0.1
...

I would like to reformat it so that it looks like this:

name    value
a       0.01
b       0.03
c       0.01
d       0.2
e       0.04
a       0.2
b       0.01
....

(The order of the data does not matter.)

I tried transposing the pandas DataFrame:

df = pd.DataFrame(data)
df = df.transpose()
df.columns = names

But the result looks like this:

a    0.1   0.2  0.01 0.2
b    0.3   0.1  0.2  0.01
....

Any ideas on how to reshape the numpy array / pandas DataFrame to get two columns of data?

3 Answers

  • 2

    Is this what you want?

    In [11]: df
    Out[11]:
              a         b         c         d         e
    0  0.791796  0.428642  0.887860  0.803709  0.860545
    1  0.230401  0.105232  0.617007  0.557678  0.590459
    2  0.448462  0.314422  0.207188  0.785642  0.022271
    3  0.075631  0.707029  0.111538  0.769387  0.174297
    4  0.707566  0.299966  0.197642  0.145841  0.231135
    
    In [12]: df.stack().reset_index(level=0, drop=True).reset_index()
    Out[12]:
       index         0
    0      a  0.791796
    1      b  0.428642
    2      c  0.887860
    3      d  0.803709
    4      e  0.860545
    5      a  0.230401
    6      b  0.105232
    7      c  0.617007
    8      d  0.557678
    9      e  0.590459
    10     a  0.448462
    11     b  0.314422
    12     c  0.207188
    13     d  0.785642
    14     e  0.022271
    15     a  0.075631
    16     b  0.707029
    17     c  0.111538
    18     d  0.769387
    19     e  0.174297
    20     a  0.707566
    21     b  0.299966
    22     c  0.197642
    23     d  0.145841
    24     e  0.231135
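
    The result above carries the default column labels `index` and `0`. A minimal sketch of one way to get the `name` / `value` labels from the question, assuming a small random frame (the seed, shape, and the variable name `long_df` are illustrative, not from the original post):

```python
import numpy as np
import pandas as pd

# Small example frame; the seed and the 3x5 shape are illustrative assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((3, 5)), columns=list("abcde"))

long_df = (
    df.stack()                          # Series indexed by (row label, column name)
      .reset_index(level=0, drop=True)  # drop the row label, keep the column name
      .rename_axis("name")              # label the remaining index level
      .reset_index(name="value")        # produce the columns 'name' and 'value'
)
print(long_df.head())
```

    Naming the index before the final `reset_index` avoids a separate `rename(columns=...)` step; either route should give the same two-column frame.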
    
  • 1

    You can use numpy.tile to repeat the column names and numpy.ravel to flatten the values of the DataFrame:

    import numpy as np
    import pandas as pd

    # random dataframe
    np.random.seed(100)
    df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
    print (df)
       A  B  C  D  E
    0  8  8  3  7  7
    1  0  4  2  5  2
    2  2  2  1  0  8
    3  4  0  9  6  2
    4  4  1  5  3  4
    
    df2 = pd.DataFrame({
            "name": np.tile(df.columns, len(df.index)),
            "value": df.values.ravel()})
    print (df2)        
       name  value
    0     A      8
    1     B      8
    2     C      3
    3     D      7
    4     E      7
    5     A      0
    6     B      4
    7     C      2
    8     D      5
    9     E      2
    10    A      2
    11    B      2
    12    C      1
    13    D      0
    14    E      8
    15    A      4
    16    B      0
    17    C      9
    18    D      6
    19    E      2
    20    A      4
    21    B      1
    22    C      5
    23    D      3
    24    E      4
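
    This approach relies on `ravel` flattening the values row by row, so that the tiled column names line up with them. A small sanity check of that alignment, as a sketch that rebuilds the same frame (the loop and the variables `n_cols`, `row`, `col` are added here for illustration):

```python
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5, 5)), columns=list("ABCDE"))

names = np.tile(df.columns, len(df.index))  # A B C D E A B C D E ...
values = df.to_numpy().ravel()              # row-major: row 0 first, then row 1, ...
df2 = pd.DataFrame({"name": names, "value": values})

# Entry k of the long frame should come from row k // n_cols, column k % n_cols.
n_cols = df.shape[1]
for k in range(len(df2)):
    row, col = divmod(k, n_cols)
    assert df2.at[k, "name"] == df.columns[col]
    assert df2.at[k, "value"] == df.iat[row, col]
```

    If the check passes silently, every (name, value) pair matches the cell it came from.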
    

    Timings (len(df) = 1M):

    #random dataframe
    np.random.seed(100)
    N = 1000000
    df = pd.DataFrame(np.random.randint(10, size=(N,5)), columns=list('abcde'))
    print (df)
    
    In [86]: %timeit (pd.DataFrame({"name": np.tile(df.columns, len(df.index)),"value": df.values.ravel()}))
    10 loops, best of 3: 84.8 ms per loop
    
    In [87]: %timeit (pd.DataFrame(np.column_stack((np.tile(df.columns, df.shape[0]), df.values.reshape(-1,1))), columns=['name', 'value']))
    10 loops, best of 3: 196 ms per loop
    
    In [88]: %timeit (df.stack().reset_index(level=0, drop=True).reset_index(name='value').rename(columns={'index':'name'}))
    1 loop, best of 3: 344 ms per loop
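
    The `%timeit` magic only works inside IPython. A rough sketch of a similar comparison with the standard `timeit` module, assuming a much smaller `N` to keep the run short (the helper names `with_tile` and `with_stack` are made up here, and absolute numbers will differ by machine and library version):

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(100)
N = 10_000  # assumption: far smaller than the 1M rows timed above
df = pd.DataFrame(np.random.randint(10, size=(N, 5)), columns=list("abcde"))

def with_tile():
    # tile the column names, flatten the values row by row
    return pd.DataFrame({"name": np.tile(df.columns, len(df.index)),
                         "value": df.to_numpy().ravel()})

def with_stack():
    # stack, drop the row level, then label the two resulting columns
    return (df.stack()
              .reset_index(level=0, drop=True)
              .reset_index(name="value")
              .rename(columns={"index": "name"}))

timings = {}
for fn in (with_tile, with_stack):
    # best of 3 repeats, 3 calls each, reported per call
    timings[fn.__name__] = min(timeit.repeat(fn, number=3, repeat=3)) / 3
    print(f"{fn.__name__}: {timings[fn.__name__] * 1e3:.2f} ms per call")
```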
    

    If you need a numpy array as output, add numpy.column_stack:

    print (np.column_stack((np.tile(df.columns, len(df.index)), df.values.ravel())))
    [['a' 8]
     ['b' 8]
     ['c' 3]
     ['d' 7]
     ['e' 7]
     ['a' 0]
     ['b' 4]
     ['c' 2]
     ['d' 5]
     ['e' 2]
     ['a' 2]
     ['b' 2]
     ['c' 1]
     ['d' 0]
     ['e' 8]
     ['a' 4]
     ['b' 0]
     ['c' 9]
     ['d' 6]
     ['e' 2]
     ['a' 4]
     ['b' 1]
     ['c' 5]
     ['d' 3]
     ['e' 4]]
    
  • 1

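    You just need to concat all the columns of df. Because the columns have different names, you first have to give them the same name. Otherwise, pandas will add new columns to the concat result.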

    import random
    import pandas as pd
    
    N = 5
    data = [[random.random() for i in range(N)] for j in range(N)]
    names = ['a','b','c','d','e']
    
    df = pd.DataFrame(data)
    df.columns = names
    df = df.transpose()
    print(df)
    
    #           0         1         2         3         4
    # a  0.643042  0.061476  0.415979  0.209272  0.394414
    # b  0.175363  0.580336  0.056173  0.468121  0.388956
    # c  0.096257  0.570860  0.516667  0.892087  0.956790
    # d  0.082906  0.340805  0.466074  0.010123  0.293006
    # e  0.430240  0.759413  0.083779  0.442159  0.434603
    
    df_col = [df[[c]].copy() for c in df.columns]  # one single-column frame per column of df
    for col in df_col:
        col.columns=['value']                   # change the columns' name
    
    res = pd.concat(df_col)                     # concat them all together
    res.index.names=['name']
    
    print(res)
    
    #          value
    # name          
    # a     0.643042
    # b     0.175363
    # c     0.096257
    # d     0.082906
    # e     0.430240
    # a     0.061476
    # b     0.580336
    # c     0.570860
    # d     0.340805
    # e     0.759413
    # a     0.415979
    # b     0.056173
    # c     0.516667
    # d     0.466074
    # e     0.083779
    # a     0.209272
    # b     0.468121
    # c     0.892087
    # d     0.010123
    # e     0.442159
    # a     0.394414
    # b     0.388956
    # c     0.956790
    # d     0.293006
    # e     0.434603
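
    To see why the renaming step matters, here is a tiny sketch comparing `concat` with and without a shared column name. The 2x2 frame and the variable names `pieces`, `wide`, and `renamed` are assumptions chosen to keep the example short:

```python
import pandas as pd

# Tiny transposed frame like the one above: names on the index, positions as columns.
df = pd.DataFrame({0: [1.0, 2.0], 1: [3.0, 4.0]}, index=["a", "b"])

pieces = [df[[c]] for c in df.columns]  # one single-column frame per column

# Without renaming, concat aligns on the differing column labels and pads with NaN.
wide = pd.concat(pieces)
print(wide)

# With a shared column label, the pieces stack into a single column.
renamed = [p.rename(columns={p.columns[0]: "value"}) for p in pieces]
res = pd.concat(renamed).rename_axis("name")
print(res)
```

    Here `wide` keeps both original columns and is half NaN, while `res` has a single `value` column indexed by name.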
    
