
Combine two columns of text in a pandas DataFrame


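I have a 20 x 4000 dataframe in Python using pandas. Two of its columns are named Year and quarter. I'd like to create a variable called period that turns Year = 2000 and quarter = q2 into 2000q2.

Can anyone help with that?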

16 Answers

  • 255

    A more efficient approach:

    def concat_df_str1(df):
        """ run time: 1.3416s """
        return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)
    

    Here is a timing test:

    import numpy as np
    import pandas as pd
    
    from time import time
    
    
    def concat_df_str1(df):
        """ run time: 1.3416s """
        return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)
    
    
    def concat_df_str2(df):
        """ run time: 5.2758s """
        return df.astype(str).sum(axis=1)
    
    
    def concat_df_str3(df):
        """ run time: 5.0076s """
        df = df.astype(str)
        return df[0] + df[1] + df[2] + df[3] + df[4] + \
               df[5] + df[6] + df[7] + df[8] + df[9]
    
    
    def concat_df_str4(df):
        """ run time: 7.8624s """
        return df.astype(str).apply(lambda x: ''.join(x), axis=1)
    
    
    def main():
        df = pd.DataFrame(np.zeros(1000000).reshape(100000, 10))
        df = df.astype(int)
    
        time1 = time()
        df_en = concat_df_str4(df)
        print('run time: %.4fs' % (time() - time1))
        print(df_en.head(10))
    
    
    if __name__ == '__main__':
        main()
    

    Finally, note that when sum (concat_df_str2) is used, the result is not simply the concatenation: it is converted to an integer.
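
To see what the row-wise join returns before timing it, here is a minimal sketch on a tiny frame. The column names `a` and `b` are made up for this illustration; the expression itself is the one used in concat_df_str1.

```python
import pandas as pd

# Tiny integer frame; the column names are invented for this illustration.
df = pd.DataFrame({"a": [1, 3], "b": [2, 4]})

# Same expression as concat_df_str1: turn each row array into strings, then join it.
joined = pd.Series(["".join(row.astype(str)) for row in df.values], index=df.index)

print(joined.tolist())
```

Each row becomes one string, so the result stays text no matter which digits the columns contain.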

  • 10

    My answer is a little late, but better late than never. Let's assume your dataframe is df, with columns Year and Quarter.

    import pandas as pd
    df = pd.DataFrame({'Quarter':'q1 q2 q3 q4'.split(), 'Year':'2000'})
    

    Suppose we want to look at the dataframe:

    df
    >>>  Quarter    Year
       0    q1      2000
       1    q2      2000
       2    q3      2000
       3    q4      2000
    

    Finally, concatenate Year and Quarter as follows.

    df['Period'] = df['Year'] + ' ' + df['Quarter']
    

    You can now print df to see the resulting dataframe.

    df
    >>>  Quarter    Year    Period
        0   q1      2000    2000 q1
        1   q2      2000    2000 q2
        2   q3      2000    2000 q3
        3   q4      2000    2000 q4
    

    If you do not want the space between the year and the quarter, simply remove it:

    df['Period'] = df['Year'] + df['Quarter']
    

    I hope this helps.
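
The example above stores Year as strings. As a hedged sketch, assuming instead that Year holds integers (as the question's Year = 2000 may suggest), the column can be converted with `astype(str)` before the `+`:

```python
import pandas as pd

# Assumed variant of the example frame: Year holds integers, not strings.
df = pd.DataFrame({"Quarter": ["q1", "q2", "q3", "q4"], "Year": [2000] * 4})

# Convert Year to strings first so that + concatenates element-wise.
df["Period"] = df["Year"].astype(str) + df["Quarter"]

print(df["Period"].tolist())
```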

  • 0

    Yet another way to do this:

    df['period'] = df['Year'].astype(str) + df['quarter']
    

    or, a bit slower:

    df['period'] = df[['Year','quarter']].astype(str).sum(axis=1)
    

    Let's test it on a 200K-row DF:

    In [250]: df
    Out[250]:
       Year quarter
    0  2014      q1
    1  2015      q2
    
    In [251]: df = pd.concat([df] * 10**5)
    
    In [252]: df.shape
    Out[252]: (200000, 2)
    

    UPDATE: timing graph for Pandas 0.23.0 (image not included)

    UPDATE: new timings using Pandas 0.19.0

    Timing without CPU/GPU optimization (sorted from fastest to slowest):

    In [107]: %timeit df['Year'].astype(str) + df['quarter']
    10 loops, best of 3: 131 ms per loop
    
    In [106]: %timeit df['Year'].map(str) + df['quarter']
    10 loops, best of 3: 161 ms per loop
    
    In [108]: %timeit df.Year.str.cat(df.quarter)
    10 loops, best of 3: 189 ms per loop
    
    In [109]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
    1 loop, best of 3: 567 ms per loop
    
    In [110]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
    1 loop, best of 3: 584 ms per loop
    
    In [111]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
    1 loop, best of 3: 24.7 s per loop
    

    Timing using CPU/GPU optimization:

    In [113]: %timeit df['Year'].astype(str) + df['quarter']
    10 loops, best of 3: 53.3 ms per loop
    
    In [114]: %timeit df['Year'].map(str) + df['quarter']
    10 loops, best of 3: 65.5 ms per loop
    
    In [115]: %timeit df.Year.str.cat(df.quarter)
    10 loops, best of 3: 79.9 ms per loop
    
    In [116]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
    1 loop, best of 3: 230 ms per loop
    
    In [117]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
    1 loop, best of 3: 230 ms per loop
    
    In [118]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
    1 loop, best of 3: 9.38 s per loop
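
The timings above come from IPython's `%timeit`. For a comparison that runs as a plain script, here is a minimal sketch using the standard `timeit` module. The frame size and repeat count are arbitrary choices for this illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Small synthetic frame; the row count is arbitrary and chosen for this sketch.
df = pd.DataFrame({"Year": [2014, 2015] * 1000, "quarter": ["q1", "q2"] * 1000})

candidates = {
    "astype(str) +": lambda: df["Year"].astype(str) + df["quarter"],
    "map(str) +": lambda: df["Year"].map(str) + df["quarter"],
    "astype(str).str.cat": lambda: df["Year"].astype(str).str.cat(df["quarter"]),
}

# Build each result once so the candidates can be checked against each other.
results = {name: func().tolist() for name, func in candidates.items()}

repeats = 20
for name, func in candidates.items():
    seconds = timeit.timeit(func, number=repeats)
    print(f"{name}: {seconds / repeats * 1e3:.3f} ms per call")
```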
    
  • 2

    Once your data is loaded into a dataframe, this command should solve your problem:

    df['period'] = df[['Year', 'quarter']].apply(lambda x: ' '.join(x.astype(str)), axis=1)
    
  • 1

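    Use .combine_first. Note that this fills missing values in Year with the corresponding values from Quarter; it does not concatenate the two columns.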

    df['Period'] = df['Year'].combine_first(df['Quarter'])
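
To check what `combine_first` returns, here is a small sketch with made-up values. It keeps each Year value that is present and falls back to Quarter only where Year is missing.

```python
import pandas as pd

# Made-up values for illustration; the last Year is missing on purpose.
year = pd.Series(["2000", "2001", None])
quarter = pd.Series(["q1", "q2", "q3"])

# Keep year where it is present, take quarter where year is missing.
filled = year.combine_first(quarter)

print(filled.tolist())  # ['2000', '2001', 'q3']
```

Rows where Year is present come back unchanged, so nothing like 2000q1 is produced.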
    
  • 11
    dataframe["period"] = dataframe["Year"].astype(str).add(dataframe["quarter"])
    

    or, if the values are like [2000][4] and you want to make [2000q4]:

    dataframe["period"] = dataframe["Year"].astype(str).add('q').add(dataframe["quarter"].astype(str))
    

    Replacing .astype(str) with .map(str) works too.

  • 0

    The method cat() of the .str accessor works really well for this:

    >>> import pandas as pd
    >>> df = pd.DataFrame([["2014", "q1"], 
    ...                    ["2015", "q3"]],
    ...                   columns=('Year', 'Quarter'))
    >>> print(df)
       Year Quarter
    0  2014      q1
    1  2015      q3
    >>> df['Period'] = df.Year.str.cat(df.Quarter)
    >>> print(df)
       Year Quarter  Period
    0  2014      q1  2014q1
    1  2015      q3  2015q3
    

    cat() even lets you add a separator. For example, if you only have integers for the year and the quarter, you can do this:

    >>> import pandas as pd
    >>> df = pd.DataFrame([[2014, 1],
    ...                    [2015, 3]],
    ...                   columns=('Year', 'Quarter'))
    >>> print(df)
       Year Quarter
    0  2014       1
    1  2015       3
    >>> df['Period'] = df.Year.astype(str).str.cat(df.Quarter.astype(str), sep='q')
    >>> print(df)
       Year Quarter  Period
    0  2014       1  2014q1
    1  2015       3  2015q3
    

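    Joining multiple columns is just a matter of passing either a list of Series or a dataframe containing all but the first column as the argument to str.cat() invoked on the first column (Series):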

    >>> df = pd.DataFrame(
    ...     [['USA', 'Nevada', 'Las Vegas'],
    ...      ['Brazil', 'Pernambuco', 'Recife']],
    ...     columns=['Country', 'State', 'City'],
    ... )
    >>> df['AllTogether'] = df['Country'].str.cat(df[['State', 'City']], sep=' - ')
    >>> print(df)
      Country       State       City                   AllTogether
    0     USA      Nevada  Las Vegas      USA - Nevada - Las Vegas
    1  Brazil  Pernambuco     Recife  Brazil - Pernambuco - Recife
    

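    Note that if your pandas dataframe/series has null values, you need to include the parameter na_rep to replace the NaN values with a string; otherwise the combined value defaults to NaN for those rows.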
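
The effect of `na_rep` can be sketched with a two-row example. The data and the `"?"` placeholder are made up for this illustration.

```python
import pandas as pd

# Made-up data: the second Year is missing.
year = pd.Series(["2014", None])
quarter = pd.Series(["q1", "q3"])

# Without na_rep, a missing value on either side makes the result missing.
default = year.str.cat(quarter)

# With na_rep, the missing value is replaced by the given string before joining.
patched = year.str.cat(quarter, na_rep="?")

print(default.isna().tolist())
print(patched.tolist())
```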

  • 24
    df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
    df['period'] = df[['Year', 'quarter']].apply(lambda x: ''.join(x), axis=1)
    

    which yields this dataframe:

       Year quarter  period
    0  2014      q1  2014q1
    1  2015      q2  2015q2
    

    This method generalizes to an arbitrary number of string columns by replacing df[['Year', 'quarter']] with any column slice of your dataframe, e.g. df.iloc[:,0:2].apply(lambda x: ''.join(x), axis=1).

    You can find more information about the apply() method in the pandas documentation.
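
As an illustration of the column-slice generalization, here is a minimal sketch with a third string column. The `month` column and its values are hypothetical and not part of the question.

```python
import pandas as pd

# Hypothetical three-column frame of strings, used only for illustration.
df = pd.DataFrame({
    "Year": ["2014", "2015"],
    "quarter": ["q1", "q2"],
    "month": ["m01", "m04"],
})

# Join every column in the positional slice 0:3, row by row.
df["label"] = df.iloc[:, 0:3].apply(lambda x: "".join(x), axis=1)

print(df["label"].tolist())
```

The slice is taken before the new column is assigned, so `label` is not included in its own input.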

  • 107

    This time using a lambda function with string.format().

    import pandas as pd
    df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': ['q1', 'q2']})
    print(df)
    df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
    print(df)
    
      Quarter  Year
    0      q1  2014
    1      q2  2015
      Quarter  Year YearQuarter
    0      q1  2014      2014q1
    1      q2  2015      2015q2
    

    This allows you to work with non-strings and reformat the values as needed.

    import pandas as pd
    df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': [1, 2]})
    print(df.dtypes)
    print(df)
    
    df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x : '{}q{}'.format(x[0],x[1]), axis=1)
    print(df)
    
    Quarter     int64
    Year       object
    dtype: object
       Quarter  Year
    0        1  2014
    1        2  2015
       Quarter  Year YearQuarter
    0        1  2014      2014q1
    1        2  2015      2015q2
    
  • 170
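    from functools import reduce
    
    import numpy as np
    
    
    def madd(x):
        """Performs element-wise string concatenation with multiple input arrays.
    
        Args:
            x: iterable of np.array.
    
        Returns: np.array.
        """
        for i, arr in enumerate(x):
            # Convert any non-string array to strings before concatenating.
            if type(arr.item(0)) is not str:
                x[i] = x[i].astype(str)
        # np.char.add is the public alias of np.core.defchararray.add.
        return reduce(np.char.add, x)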
    

    For example:

    data = list(zip([2000]*4, ['q1', 'q2', 'q3', 'q4']))
    df = pd.DataFrame(data=data, columns=['Year', 'quarter'])
    df['period'] = madd([df[col].values for col in ['Year', 'quarter']])
    
    df
    
        Year    quarter period
    0   2000    q1  2000q1
    1   2000    q2  2000q2
    2   2000    q3  2000q3
    3   2000    q4  2000q4
    
  • 8

    Although the answer from @silvado is good, it will be faster if you change df.map(str) to df.astype(str):

    import pandas as pd
    df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
    
    In [131]: %timeit df["Year"].map(str)
    10000 loops, best of 3: 132 us per loop
    
    In [132]: %timeit df["Year"].astype(str)
    10000 loops, best of 3: 82.2 us per loop
    
  • 5

    Here is an implementation that I find very versatile:

    In [1]: import pandas as pd 
    
    In [2]: df = pd.DataFrame([[0, 'the', 'quick', 'brown'],
       ...:                    [1, 'fox', 'jumps', 'over'], 
       ...:                    [2, 'the', 'lazy', 'dog']],
       ...:                   columns=['c0', 'c1', 'c2', 'c3'])
    
    In [3]: def str_join(df, sep, *cols):
       ...:     from functools import reduce
       ...:     return reduce(lambda x, y: x.astype(str).str.cat(y.astype(str), sep=sep), 
       ...:                   [df[col] for col in cols])
       ...: 
    
    In [4]: df['cat'] = str_join(df, '-', 'c0', 'c1', 'c2', 'c3')
    
    In [5]: df
    Out[5]: 
       c0   c1     c2     c3                cat
    0   0  the  quick  brown  0-the-quick-brown
    1   1  fox  jumps   over   1-fox-jumps-over
    2   2  the   lazy    dog     2-the-lazy-dog
    
  • 133

    Using zip can be even faster:

    dataframe["period"] = ([''.join(i) for i in 
                            zip(dataframe["Year"].map(str),dataframe["quarter"])])
    

    On the dataset below, zip() is the cheapest: https://stackoverflow.com/a/50316945/7386332

    import pandas as pd
    
    data = '''\
    ID,Host,Protocol,Port
    1,10.0.0.10,tcp,445
    1,10.0.0.10,tcp,445
    1,10.0.0.10,tcp,445
    1,10.0.0.10,tcp,445
    1,10.0.0.10,tcp,445
    1,10.0.0.10,tcp,445
    1,10.0.0.10,tcp,445
    1,10.0.0.10,tcp,49707
    1,10.0.0.10,tcp,49672
    1,10.0.0.10,tcp,49670'''
    
    from io import StringIO  # pd.compat.StringIO is not available in recent pandas versions
    df = pd.read_csv(StringIO(data)) # Recreates a sample dataframe
    
    df = pd.concat([df]*10000)
    
    %timeit df['Host'] + "/" + df['Protocol'] + "/" + df['Port'].map(str)
    %timeit ['/'.join(i) for i in zip(df['Host'],df['Protocol'],df['Port'].map(str))]
    %timeit ['/'.join(i) for i in df[['Host','Protocol','Port']].astype(str).values]
    
    10 loops, best of 3: 39.7 ms per loop  
    10 loops, best of 3: 35.9 ms per loop  
    10 loops, best of 3: 162 ms per loop
    
  • 3

    It is possible to use the assign method of DataFrame:

    df= (pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']}).
      assign(period=lambda x: x.Year+x.quarter ))
    
  • 0
    dataframe["period"] = dataframe["Year"].map(str) + dataframe["quarter"]
    
  • 11

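    As many have mentioned previously, you must convert each column to string and then use the plus operator to combine two string columns. You can get a large performance improvement by using NumPy.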

    %timeit df['Year'].values.astype(str) + df.quarter
    71.1 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    %timeit df['Year'].astype(str) + df['quarter']
    565 ms ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
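
A related NumPy-only variant, not the exact expression timed above, converts both columns to NumPy string arrays and joins them with `np.char.add`. This is a minimal sketch on made-up data and makes no claim about relative speed.

```python
import numpy as np
import pandas as pd

# Made-up two-row frame for illustration.
df = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", "q2"]})

# Convert both columns to NumPy unicode arrays, then concatenate element-wise.
years = df["Year"].to_numpy().astype(str)
quarters = df["quarter"].to_numpy().astype(str)
df["period"] = np.char.add(years, quarters)

print(df["period"].tolist())
```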
    
