首页 文章

pandas将字符串列转换为datetime,允许丢失但不是无效

提问于
浏览
3

我有一个 pandas 数据框,其中多列字符串表示日期,空字符串表示缺少日期 . 例如

import numpy as np
import pandas as pd

# expected date format is 'm/%d/%Y'

custId = np.array(list(range(1,6)))
eventDate = np.array(["06/10/1992","08/24/2012","04/24/2015","","10/14/2009"])
registerDate = np.array(["06/08/2002","08/20/2012","04/20/2015","","10/10/2009"])

# both date columns of dfGood should convert to datetime without error
dfGood = pd.DataFrame({'custId':custId, 'eventDate':eventDate, 'registerDate':registerDate})

我在尝试着:

  • 有效地将所有字符串都是有效日期的列或空白转换为 datetime64 类型的列(空白时为 NaT

  • 当任何非空字符串不符合预期格式时,引发 ValueError

应该引发 ValueError 的示例:

# 2nd string invalid
registerDate = np.array(["06/08/2002","20/08/2012","04/20/2015","","10/10/2009"]) 
# eventDate column should convert, registerDate column should raise ValueError
dfBad = pd.DataFrame({'custId':custId, 'eventDate':eventDate, 'registerDate':registerDate})

这个函数在元素级别做我想要的:

from datetime import datetime

def parseStrToDt(s, format = '%m/%d/%Y'):
    """Parse a string to datetime with the supplied format."""
    return pd.NaT if s=='' else datetime.strptime(s, format)

print(parseStrToDt("")) # correctly returns NaT
print(parseStrToDt("12/31/2011")) # correctly returns 2011-12-31 00:00:00
print(parseStrToDt("12/31/11")) # correctly raises ValueError

但是,我有read字符串操作不应该 np.vectorize -d . 我认为这可以使用 pandas.DataFrame.apply 有效地完成,如:

dfGood[['eventDate','registerDate']].applymap(lambda s: parseStrToDt(s)) # raises TypeError

dfGood.loc[:,'eventDate'].apply(lambda s: parseStrToDt(s)) # raises same TypeError

我猜 TypeError 与我的函数有关,返回不同的 dtype ,但我确实想利用动态类型并用日期时间替换字符串(除非引发ValueError)...所以我该怎么办这个?

2 回答

  • 1

    pandas 不是一种方法,这应该是相对有效的 .

    In [4]: dfBad
    Out[4]: 
       custId   eventDate registerDate
    0       1  06/10/1992   06/08/2002
    1       2  08/24/2012   20/08/2012
    2       3  04/24/2015   04/20/2015
    3       4                         
    4       5  10/14/2009   10/10/2009
    
    In [7]: cols
    Out[7]: ['eventDate', 'registerDate']
    
    In [9]: dts = dfBad[cols].apply(lambda x: pd.to_datetime(x, errors='coerce', format='%m/%d/%Y'))
    
    In [10]: dts
    Out[10]: 
       eventDate registerDate
    0 1992-06-10   2002-06-08
    1 2012-08-24          NaT
    2 2015-04-24   2015-04-20
    3        NaT          NaT
    4 2009-10-14   2009-10-10
    
    In [11]: mask = pd.isnull(dts) & (dfBad[cols] != '')
    
    In [12]: mask
    Out[12]: 
      eventDate registerDate
    0     False        False
    1     False         True
    2     False        False
    3     False        False
    4     False        False
    
    
    In [13]: mask.any()
    Out[13]: 
    eventDate       False
    registerDate     True
    dtype: bool
    
    In [14]: is_bad = mask.any()
    
    In [23]: if is_bad.any():
        ...:     raise ValueError("bad dates in col(s) {0}".format(is_bad[is_bad].index.tolist()))
        ...: else:
        ...:     df[cols] = dts
        ...:     
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-23-579c06ce3c77> in <module>()
          1 if is_bad.any():
    ----> 2     raise ValueError("bad dates in col(s) {0}".format(is_bad[is_bad].index.tolist()))
          3 else:
          4     df[cols] = dts
          5 
    
    ValueError: bad dates in col(s) ['registerDate']
    
  • 2

    为了进一步理解接受的答案,我用解析的日期时间替换了所有有效或缺失字符串的列,然后为剩余的未解析列引发了错误:

    dtCols = ['eventDate', 'registerDate']
    dts = dfBad[dtCols].apply(lambda x: pd.to_datetime(x, errors='coerce', format='%m/%d/%Y'))
    
    mask = pd.isnull(dts) & (dfBad[dtCols] != '')
    colHasError = mask.any()
    
    invalidCols = colHasError[colHasError].index.tolist() 
    validCols = list(set(dtCols) - set(invalidCols))
    
    dfBad[validCols] = dts[validCols] # replace the completely valid/empty string cols with dates
    if colHasError.any():
        raise ValueError("bad dates in col(s) {0}".format(invalidCols))
    # raises:  ValueError: bad dates in col(s) ['registerDate']
    
    print(dfBad) # eventDate got converted, registerDate didn't
    

    然而,接受的答案包含主要的洞察力,即继续并将错误强制转换为 NaT ,然后使用掩码将非空但无效的字符串与空字符串区分开来 .

相关问题