
Parsing single- or double-quoted values with escaped characters using a regular expression (in Python)

My input looks like a list of arguments:

input1 = '''
title="My First Blog" author='John Doe'
'''

Values can be enclosed in single or double quotes, but escaping is also allowed:

input2 = '''
title='John\'s First Blog' author="John Doe"
'''

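Is there a way to use a regular expression to extract the key/value pairs while accounting for single or double quotes and for escaped quotes?

In Python, I can use the following regular expression, which handles non-escaped quotes: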

rex = r"(\w+)\=(?P<quote>['\"])(.*?)(?P=quote)"

which returns:

import re
re.findall(rex, input1)
[('title', '"', 'My First Blog'), ('author', "'", 'John Doe')]

import re
re.findall(rex, input2)
[('title', "'", 'John'), ('author', '"', 'John Doe')]

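The latter is incorrect. I can't figure out how to handle escaped quotes, presumably in the (.*?) section. I've tried the solutions from the posted answers to "Python regex to match text in single quotes, ignoring escaped quotes (and tabs/newlines)" to no avail.

Technically, I don't need findall to return the quote character, only the key/value pairs, but that is easy to deal with.

Any help would be appreciated. Thanks!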

2 Answers

  • 4

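    I think Tim's use of backreferences overcomplicates the expression and (I'm guessing here) also makes it slower. The standard approach (used in the owl book) is to match single-quoted and double-quoted strings separately: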

    rx = r'''(?x)
        (\w+) = (
            ' (?: \\. | [^'] )* '
            |
            " (?: \\. | [^"] )* "
            |
            [^'"\s]+
        )
    '''
    

    Add some post-processing and you're good:

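    import re

    input2 = r'''
    title='John\'s First Blog' author="John Doe"
    '''
    
    # Python 3: str has no 'string-escape' codec, so round-trip through
    # bytes with 'unicode_escape' instead (assumes ASCII-only values).
    data = {k: v.strip("\"'").encode().decode('unicode_escape') for k, v in re.findall(rx, input2)}
    print(data)
    # {'title': "John's First Blog", 'author': 'John Doe'}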
    

    As a bonus, this also matches unquoted attributes such as weight=150.
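
    To illustrate the unquoted case, here is a minimal sketch that runs the same pattern over a line mixing quoted and bare values. The `sample` string is made up for the example; it is not from the question.

```python
import re

# Same pattern as in the answer: quoted values are matched per quote type,
# and the last alternative accepts a bare (unquoted) value.
rx = r'''(?x)
    (\w+) = (
        ' (?: \\. | [^'] )* '
        |
        " (?: \\. | [^"] )* "
        |
        [^'"\s]+
    )
'''

# Hypothetical sample input mixing quoted and unquoted values.
sample = r'''title='John\'s First Blog' weight=150 author="John Doe"'''

pairs = re.findall(rx, sample)
print(pairs)
# [('title', "'John\\'s First Blog'"), ('weight', '150'), ('author', '"John Doe"')]
```

    Note that the quoted values still carry their surrounding quotes and backslashes at this point, which is why the post-processing step is needed.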

    Addendum: here is a cleaner way that doesn't use regular expressions:

    input2 = r'''
    title='John\'s First Blog' author="John Doe"
    '''
    
    import shlex
    
    lex = shlex.shlex(input2, posix=True)
    lex.escapedquotes = '\"\''
    lex.whitespace = ' \n\t='
    for token in lex:
        print(token)
    
    # title
    # John's First Blog
    # author
    # John Doe
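
    If a dictionary is wanted rather than a flat token stream, the tokens can probably be paired up as below. This is only a sketch: it assumes the tokens strictly alternate key, value, key, value, which holds for the sample input but is not checked.

```python
import shlex

input2 = r'''
title='John\'s First Blog' author="John Doe"
'''

lex = shlex.shlex(input2, posix=True)
lex.escapedquotes = '"\''   # allow backslash escapes inside both quote types
lex.whitespace = ' \n\t='   # treat '=' as a separator as well

tokens = list(lex)
# Assumption: tokens alternate key, value, key, value, ...
data = dict(zip(tokens[0::2], tokens[1::2]))
print(data)
# {'title': "John's First Blog", 'author': 'John Doe'}
```

    Because '=' is treated as whitespace here, a missing value or a bare '=' in a value would shift the pairing, so the regex approach may be safer for untrusted input.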
    
  • 5

    EDIT

    My initial regex solution had a bug. That bug masked an error in the input string: input2 is not what you think it is:

    >>> input2 = '''
    ... title='John\'s First Blog' author="John Doe"
    ... '''
    >>> input2      # See - the apostrophe is not correctly escaped!
    '\ntitle=\'John\'s First Blog\' author="John Doe"\n'
    

    You need to make input2 a raw string (or use double backslashes):

    >>> input2 = r'''
    ... title='John\'s First Blog' author="John Doe"
    ... '''
    >>> input2
    '\ntitle=\'John\\\'s First Blog\' author="John Doe"\n'
    

    Now you can use a regex that handles escaped quotes correctly:

    >>> rex = re.compile(
        r"""(\w+)# Match an identifier (group 1)
        =        # Match =
        (['"])   # Match an opening quote (group 2)
        (        # Match and capture into group 3:
         (?:     # the following regex:
          \\.    # Either an escaped character
         |       # or
          (?!\2) # (as long as we're not right at the matching quote)
          .      # any other character.
         )*      # Repeat as needed
        )        # End of capturing group
        \2       # Match the corresponding closing quote.""", 
        re.DOTALL | re.VERBOSE)
    >>> rex.findall(input2)
    [('title', "'", "John\\'s First Blog"), ('author', '"', 'John Doe')]
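
    Since the question only needs key/value pairs, one possible follow-up is to drop the quote group and remove the backslashes afterwards. The helper name `parse_pairs` and the unescaping rule (a backslash simply protects the next character) are assumptions made for this sketch, not part of the answer above.

```python
import re

rex = re.compile(r"""
    (\w+)                 # key (group 1)
    =
    (['"])                # opening quote (group 2)
    ((?:\\.|(?!\2).)*)    # value (group 3): escaped chars, or anything but the quote
    \2                    # matching closing quote
    """, re.DOTALL | re.VERBOSE)

def parse_pairs(text):
    """Return {key: value}, dropping the quote group and the escaping backslashes."""
    result = {}
    for key, _quote, value in rex.findall(text):
        # Assumption: a backslash only protects the character that follows it.
        result[key] = re.sub(r'\\(.)', r'\1', value, flags=re.DOTALL)
    return result

input2 = r'''
title='John\'s First Blog' author="John Doe"
'''
print(parse_pairs(input2))
# {'title': "John's First Blog", 'author': 'John Doe'}
```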
    
