Create dictionary keys using a substring of another column

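I have a DataFrame containing the names of New York City boroughs (Manhattan, Brooklyn, etc.). I want to create another column, 'borough_num', that assigns a number to each borough (Manhattan -> 1, Brooklyn -> 2, Queens -> 3, Staten Island -> 4, Bronx -> 5, other -> 0).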

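However, in the Borough column some rows have a number in front of the borough name (for example, "07 Bronx" instead of "Bronx"). Since "07 Bronx" is still part of the Bronx, it should be assigned the same value, 5. So I need a dictionary that assigns the number 5 to any string that contains the word "Bronx", and likewise for each borough. Any clues on how to do this? I'm new to Python!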

This is what I had before noticing the cells with numbers:

df['Borough'] = df['Borough'].fillna('OTHER')
borough_dict = {'MANHATTAN':1, 'BROOKLYN':2, 'QUEENS': 3, 'STATEN ISLAND': 4, 'BRONX': 5, 'OTHER':6}
df['borough_num'] = df['Borough'].apply(lambda x:0 if borough_dict.get(x) == None else borough_dict.get(x))

8 Answers

  • -1

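    Since there is only a small set of borough names to assign integer codes to, it is perfectly acceptable to do this as a series of explicit logical-index assignments, as with the sample data below.

    Specifically, in this case there is no need to wrap the borough-to-code mapping in a dict or a helper function, or to use any kind of applymap operation on the DataFrame.

    Just a set of 5 boring, straightforward logical assignments.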

    In [13]: df = pandas.DataFrame({
        'Borough': ["Manhattan", "Brooklyn", "Bronx", "07 Bronx", 
                    "109 Staten Island", "03 Brooklyn", "04 Queens"], 
        'Value':[1, 2, 3, 4, 5, 6, 7]
    })
    
    In [14]: df
    Out[14]:
                 Borough  Value
    0          Manhattan      1
    1           Brooklyn      2
    2              Bronx      3
    3           07 Bronx      4
    4  109 Staten Island      5
    5        03 Brooklyn      6
    6          04 Queens      7
    
    In [15]: df['Borough_num'] = 6  # everything defaults to the 'other' case
    
    In [16]: df.loc[df.Borough.str.contains("Manhattan"), 'Borough_num'] = 1
    
    In [17]: df.loc[df.Borough.str.contains("Brooklyn"), 'Borough_num'] = 2
    
    In [18]: df.loc[df.Borough.str.contains("Queens"), 'Borough_num'] = 3
    
    In [19]: df.loc[df.Borough.str.contains("Staten Island"), 'Borough_num'] = 4
    
    In [20]: df.loc[df.Borough.str.contains("Bronx"), 'Borough_num'] = 5
    
    In [21]: df
    Out[21]: 
                 Borough  Value  Borough_num
    0          Manhattan      1            1
    1           Brooklyn      2            2
    2              Bronx      3            5
    3           07 Bronx      4            5
    4  109 Staten Island      5            4
    5        03 Brooklyn      6            2
    6          04 Queens      7            3
    

    If for any reason you do want to encapsulate the borough-to-code mapping, you can do it with a simple dict and a loop:

    In [30]: borough_code = {'Manhattan': 1, 'Brooklyn': 2, 'Queens': 3,
                             'Staten Island': 4, 'Bronx': 5}
    
    In [31]: for borough, code in borough_code.items():
        ...:     df.loc[df.Borough.str.contains(borough), 'Borough_num'] = code
    

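    Unless the DataFrame is huge, the repeated vectorized str.contains computations will be indistinguishable in speed from mapping a function over the column, but will be easier to understand.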
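
The question's data appears to be upper-case and to contain missing values, which the str.contains calls above would not handle as written. A minimal sketch, assuming pandas is imported as pd and using made-up sample rows, passes case=False and na=False so that matching ignores case and missing entries keep the default code (0 here, following the question):

```python
import pandas as pd

# Hypothetical sample data: mixed case, numeric prefixes, and a missing value.
df = pd.DataFrame(
    {"Borough": ["MANHATTAN", "07 Bronx", "109 STATEN ISLAND", None, "Unknown"]}
)

borough_code = {"Manhattan": 1, "Brooklyn": 2, "Queens": 3,
                "Staten Island": 4, "Bronx": 5}

df["Borough_num"] = 0  # default code for anything unmatched
for borough, code in borough_code.items():
    # case=False ignores case; na=False treats missing values as "no match"
    mask = df["Borough"].str.contains(borough, case=False, na=False)
    df.loc[mask, "Borough_num"] = code
```

The loop is the same as in the answer; only the two keyword arguments are new.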

  • 4

    Maybe write a simple helper function:

    def find_borough_id(name):
        for k, v in borough_dict.items():
            if k in name:
                return v
        return 0
    
    df['borough_num'] = df['Borough'].apply(find_borough_id)
    
  • 1

    One solution is to find which key of borough_dict is a substring of x and return its associated value:

    def get_borough_num(x):
        for key, val in borough_dict.items():
            if key in x:
                return val
        return 0

    df['borough_num'] = df['Borough'].apply(get_borough_num)
    

    Another solution assumes that every row either is the borough name or ends with the borough name. Under that assumption, you can get the borough name with:

    x.rsplit(' ')[-1]
    

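    If the string contains a space, this returns the part after the last space; otherwise it returns the whole string. Note that a multi-word name such as "Staten Island" therefore yields only "Island":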

    "Manhattan".rsplit(' ')[-1] => "Manhattan"
    "blah Manhattan".rsplit(' ')[-1] => "Manhattan"
    

    So in the end:

    get_borough_num = lambda x: borough_dict.get(x.rsplit(' ')[-1], 0)
    df['borough_num'] = df['Borough'].apply(get_borough_num)
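
A variation that also copes with multi-word names is to strip an optional leading numeric prefix with a regular expression before the dictionary lookup. This is a hedged sketch rather than part of the original answer; strip_prefix and the sample strings are hypothetical names introduced for illustration:

```python
import re

borough_dict = {'MANHATTAN': 1, 'BROOKLYN': 2, 'QUEENS': 3,
                'STATEN ISLAND': 4, 'BRONX': 5}

def strip_prefix(name):
    # Remove leading digits plus following whitespace, e.g. "07 BRONX" -> "BRONX",
    # then normalize case so the result can match the upper-case keys.
    return re.sub(r'^\d+\s*', '', name).strip().upper()

def get_borough_num(name):
    # Unmatched names fall back to 0, as in the question.
    return borough_dict.get(strip_prefix(name), 0)

samples = ["07 BRONX", "109 Staten Island", "Manhattan", "Unknown"]
codes = [get_borough_num(s) for s in samples]
```

On a DataFrame this would be applied the same way as before, with df['Borough'].apply(get_borough_num).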
    
  • 1

    Let's use the pandas str accessor with string functions such as extract, join, and upper, and the map method.

    Using the setup from @alexlaval's question:

    borough_dict = {'MANHATTAN':1, 'BROOKLYN':2, 'QUEENS': 3, 'STATEN ISLAND': 4, 'BRONX': 5, 'OTHER':6}
    

    And the setup from @ely:

    df = pd.DataFrame({
        'Borough': ["Manhattan", "Brooklyn", "Bronx", "07 Bronx", 
                    "109 Staten Island", "03 Brooklyn", "04 Queens","Unknown"], 
        'Value':[1, 2, 3, 4, 5, 6, 7, 8]
    })
    

    Let's create a regular expression to extract the borough from the DataFrame column:

    x = '(' + '|'.join(borough_dict.keys()) + ')'
    

    Now let's use extract and map to get the borough number:

    df['Borough_number'] = df.Borough.str.upper()\
                             .str.extract(x, expand=False).fillna('OTHER')\
                             .map(borough_dict)
    

    Output:

                 Borough  Value  Borough_number
    0          Manhattan      1               1
    1           Brooklyn      2               2
    2              Bronx      3               5
    3           07 Bronx      4               5
    4  109 Staten Island      5               4
    5        03 Brooklyn      6               2
    6          04 Queens      7               3
    7            Unknown      8               6
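
If the keys could ever contain regex metacharacters, escaping them before joining is safer. The following sketch, assuming pandas is available as pd and using a shortened version of the sample rows, builds the pattern with re.escape and passes flags=re.IGNORECASE instead of upper-casing the column first; the extracted text is still upper-cased before the dictionary lookup:

```python
import re
import pandas as pd

borough_dict = {'MANHATTAN': 1, 'BROOKLYN': 2, 'QUEENS': 3,
                'STATEN ISLAND': 4, 'BRONX': 5, 'OTHER': 6}

df = pd.DataFrame(
    {'Borough': ["Manhattan", "07 Bronx", "109 Staten Island", "Unknown"]}
)

# One capture group containing every escaped key as an alternative
pattern = '(' + '|'.join(re.escape(k) for k in borough_dict) + ')'

df['Borough_number'] = (
    df['Borough']
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)  # matched name or NaN
    .str.upper()            # normalize the match to the dict's key format
    .fillna('OTHER')        # rows with no match use the 'OTHER' key
    .map(borough_dict)
)
```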
    
  • -1

    A naive approach is to define a function that searches the string and returns the expected id if a borough name is found.

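    def borough_id(borough):
        # Check each borough name in turn; codes follow the mapping in the question
        if 'Bronx' in borough:
            return 5
        elif 'Manhattan' in borough:
            return 1
        elif 'Brooklyn' in borough:
            return 2
        elif 'Queens' in borough:
            return 3
        elif 'Staten Island' in borough:
            return 4
        else:
            return None

    df['borough_num'] = df['Borough'].apply(borough_id)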
    
  • -2

    Functional approach

    Use a custom function with a generator expression, relying on the default value of dict.get to return 'OTHER' if no match is found.

    Then apply the function via pd.Series.apply.

    df = pd.DataFrame({'Borough': ['07 BRONX', '01 MANHATTAN', 'STATEN ISLAND', '12 QUEENS', 'UNKNOWN']})
    
    d = {'MANHATTAN':1, 'BROOKLYN':2, 'QUEENS': 3, 'STATEN ISLAND': 4, 'BRONX': 5, 'OTHER':6}
    
    def map_borough(x, mapping):
        return mapping.get(next((k for k in mapping if x.endswith(k)), None), 'OTHER')
    
    df['borough_num'] = df['Borough'].apply(map_borough, mapping=d)
    
    print(df)
    
    #          Borough borough_num
    # 0       07 BRONX           5
    # 1   01 MANHATTAN           1
    # 2  STATEN ISLAND           4
    # 3      12 QUEENS           3
    # 4        UNKNOWN       OTHER
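
The version above puts the string 'OTHER' into an otherwise numeric column. If a numeric fallback is preferred (an assumption about the desired behavior, not something the answer states), one possible variation resolves the key first and falls back to the code stored under 'OTHER':

```python
d = {'MANHATTAN': 1, 'BROOKLYN': 2, 'QUEENS': 3,
     'STATEN ISLAND': 4, 'BRONX': 5, 'OTHER': 6}

def map_borough(x, mapping):
    # First key that the string ends with, or 'OTHER' when none matches
    key = next((k for k in mapping if x.endswith(k)), 'OTHER')
    return mapping[key]

names = ['07 BRONX', '01 MANHATTAN', 'STATEN ISLAND', '12 QUEENS', 'UNKNOWN']
codes = [map_borough(n, d) for n in names]
```

It can be applied to the column exactly as before, with df['Borough'].apply(map_borough, mapping=d).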
    
  • -2

    Object-oriented approach

    You can subclass dict and then use pd.Series.map.

    class dict_endswith(dict):
        def __getitem__(self, value):
            key = next((k for k in self.keys() if value.endswith(k)), None)
            return self.get(key)
    
    df = pd.DataFrame({'Borough': ['07 BRONX', '01 MANHATTAN', 'STATEN ISLAND', '12 QUEENS', 'UNKNOWN']})
    
    d = dict_endswith({'MANHATTAN':1, 'BROOKLYN':2, 'QUEENS': 3, 'STATEN ISLAND': 4, 'BRONX': 5, 'OTHER':6})
    
    df['borough_num'] = df['Borough'].map(lambda x: d[x]).fillna('OTHER')
    
    print(df)
    
    #          Borough borough_num
    # 0       07 BRONX           5
    # 1   01 MANHATTAN           1
    # 2  STATEN ISLAND           4
    # 3      12 QUEENS           3
    # 4        UNKNOWN       OTHER
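
An alternative to overriding __getitem__ is to define __missing__, which dict calls only when an exact key lookup fails. The sketch below is a hypothetical variation (the class name DictEndswith is not from the answer) that returns the numeric 'OTHER' code as the fallback:

```python
class DictEndswith(dict):
    def __missing__(self, value):
        # Reached only when `value` is not an exact key of the dict
        for key, code in self.items():
            if value.endswith(key):
                return code
        return self['OTHER']  # exact key, so this does not recurse

d = DictEndswith({'MANHATTAN': 1, 'BROOKLYN': 2, 'QUEENS': 3,
                  'STATEN ISLAND': 4, 'BRONX': 5, 'OTHER': 6})

names = ['07 BRONX', '01 MANHATTAN', 'STATEN ISLAND', '12 QUEENS', 'UNKNOWN']
codes = [d[name] for name in names]
```

The pandas documentation states that Series.map uses __missing__ on dict subclasses for values not found as keys, so df['Borough'].map(d) should give the same codes without the lambda wrapper.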
    
  • 1

    Loopy solution

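    If you want to use a loop, you can make the code more readable by separating the data structure from the logic.

    In this example, you iterate over the dictionary items in order. @ely's version is better for larger DataFrames.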

    df = pd.DataFrame({'Borough': ['07 BRONX', '01 MANHATTAN', 'STATEN ISLAND', '12 QUEENS', 'UNKNOWN']})
    
    d = {'MANHATTAN':1, 'BROOKLYN':2, 'QUEENS': 3, 'STATEN ISLAND': 4, 'BRONX': 5, 'OTHER':6}
    
    def map_borough(x, mapping):
        for k, v in mapping.items():
            if k in x:
                return v
        else:
            return 'OTHER'
    
    df['borough_num'] = df['Borough'].apply(map_borough, mapping=d)
    
    #          Borough borough_num
    # 0       07 BRONX           5
    # 1   01 MANHATTAN           1
    # 2  STATEN ISLAND           4
    # 3      12 QUEENS           3
    # 4        UNKNOWN       OTHER
    
