Regex to retrieve a quoted string and the quoting character

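I have a language that defines a string as delimited by either single or double quotes, where the delimiter is escaped within the string by doubling it. For example, all of the following are legal strings: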

'This isn''t easy to parse.'
'Then John said, "Hello Tim!"'
"This isn't easy to parse."
"Then John said, ""Hello Tim!"""

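I have a collection of strings (as defined above), separated by something that contains no quotes. What I am trying to do with a regex is parse out each string in the list. For example, here is an input: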

"Some String #1" OR 'Some String #2' AND "Some 'String' #3" XOR 'Some "String" #4' HOWDY "Some ""String"" #5" FOO 'Some ''String'' #6'

The regex to determine whether an input has this form is trivial:

^(?:"(?:[^"]|"")*"|'(?:[^']|'')*')(?:\s+[^"'\s]+\s+(?:"(?:[^"]|"")*"|'(?:[^']|'')*'))*
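
As a rough illustration, the check could be run as a whole-input match. The sketch below assumes the pattern is meant to cover the entire input, with the separator group closed before the final `*`. The names `ListValidator` and `isWellFormed` are made up for this example.

```java
import java.util.regex.Pattern;

class ListValidator {
    // One quoted string: either "..." with "" as an escape, or '...' with '' as an escape.
    private static final String QUOTED =
            "(?:\"(?:[^\"]|\"\")*\"|'(?:[^']|'')*')";

    // A quoted string, then zero or more (separator, quoted string) pairs.
    // The separator is a run of non-quote, non-space characters surrounded by whitespace.
    private static final Pattern LIST =
            Pattern.compile(QUOTED + "(?:\\s+[^\"'\\s]+\\s+" + QUOTED + ")*");

    static boolean isWellFormed(String input) {
        // matches() requires the whole input to fit the pattern.
        return LIST.matcher(input).matches();
    }
}
```

Because `matches()` anchors at both ends, an unterminated string or a missing separator makes the check return false rather than matching a prefix.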

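After running the expression above to test that the input has this form, I need another regex to pull each delimited string out of the input. I plan to do that as follows: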

Pattern pattern = Pattern.compile("What REGEX goes here?");
Matcher matcher = pattern.matcher(inputString);
int startIndex = 0;
while (matcher.find(startIndex))
{
    String quote        = matcher.group(1);
    String quotedString = matcher.group(2);
    ...
    startIndex = matcher.end();
}

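I want a regex that captures the quote character in group #1 and the text inside the quotes in group #2 (I am using Java regex). So, for the input above, I am looking for a regex that produces the following output on each loop iteration: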

Loop 1: matcher.group(1) = "
        matcher.group(2) = Some String #1
Loop 2: matcher.group(1) = '
        matcher.group(2) = Some String #2
Loop 3: matcher.group(1) = "
        matcher.group(2) = Some 'String' #3
Loop 4: matcher.group(1) = '
        matcher.group(2) = Some "String" #4
Loop 5: matcher.group(1) = "
        matcher.group(2) = Some ""String"" #5
Loop 6: matcher.group(1) = '
        matcher.group(2) = Some ''String'' #6

Patterns I have tried so far (unescaped, then escaped for Java code):

(["'])((?:[^\1]|\1\1)*)\1
"([\"'])((?:[^\\1]|\\1\\1)*)\\1"

(?<quot>")(?<val>(?:[^"]|"")*)"|(?<quot>')(?<val>(?:[^']|'')*)'
"(?<quot>\")(?<val>(?:[^\"]|\"\")*)\"|(?<quot>')(?<val>(?:[^']|'')*)'"

Both of these fail when the pattern is compiled.

Is a regex like this possible?
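
For what it is worth, a backreference cannot be used inside a character class, and Java does not allow two groups to share a name, which would explain both compile failures. One possible workaround, sketched below, replaces `[^\1]` with a negative lookahead, `(?!\1).`. The class name `QuotedStringFinder` and its `find` helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedStringFinder {
    // Group 1: the opening quote character.
    // Group 2: any run of "a character that is not that quote" or "that quote doubled".
    // The final \1 requires the same quote character to close the string.
    private static final Pattern QUOTED =
            Pattern.compile("([\"'])((?:(?!\\1).|\\1\\1)*)\\1", Pattern.DOTALL);

    // Returns {quote, text} pairs in the order they appear in the input.
    static List<String[]> find(String input) {
        List<String[]> result = new ArrayList<>();
        Matcher matcher = QUOTED.matcher(input);
        while (matcher.find()) {   // find() resumes after the previous match
            result.add(new String[] { matcher.group(1), matcher.group(2) });
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "\"Some String #1\" OR 'Some String #2' HOWDY \"Some \"\"String\"\" #5\"";
        for (String[] pair : find(input)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Group 2 keeps the doubled quotes as written, in line with the output listed above. The sketch assumes the input has already passed the validation step. On malformed input, backtracking may pair quotes differently than intended.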

5 Answers

  • 2

    Create a utility class that does the matching for you:

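    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class test {
        // Double-quoted string; "" inside is an escaped quote.
        private static Pattern pd = Pattern.compile("(\")((?:[^\"]|\"\")*)\"");
        // Single-quoted string; '' inside is an escaped quote.
        private static Pattern ps = Pattern.compile("(')((?:[^']|'')*)'");
        // Returns a matcher for s. Only the double-quoted matcher has been
        // matched already, so call matches() on the result before reading groups.
        public static Matcher match(String s) {
            Matcher md = pd.matcher(s);
            if (md.matches()) return md;
            else return ps.matcher(s);
        }
    }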
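
    To scan a whole list rather than test a single string, the two patterns could also be joined into one alternation with distinctly named groups, since Java rejects duplicate group names. This is only a sketch. The group names `dq`, `dval`, `sq`, `sval` and the class `AlternationFinder` are invented here.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class AlternationFinder {
        private static final Pattern QUOTED = Pattern.compile(
                "(?<dq>\")(?<dval>(?:[^\"]|\"\")*)\""      // double-quoted branch
                + "|(?<sq>')(?<sval>(?:[^']|'')*)'");       // single-quoted branch

        // Returns {quote, text} pairs; only one branch's groups are non-null per match.
        static List<String[]> find(String input) {
            List<String[]> result = new ArrayList<>();
            Matcher m = QUOTED.matcher(input);
            while (m.find()) {
                boolean isDouble = m.group("dq") != null;
                String quote = isDouble ? m.group("dq") : m.group("sq");
                String text = isDouble ? m.group("dval") : m.group("sval");
                result.add(new String[] { quote, text });
            }
            return result;
        }
    }
    ```

    The caller picks whichever branch matched, which avoids backreferences at the cost of four groups instead of two.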
    
  • 0

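    I'm not sure whether this is what you are asking for, but instead of using a regex you could write some code that parses the string and gives you the desired result (the quote character and the inner text).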

    class Parser {
    
      public static ParseResult parse(String str)
      throws ParseException {
    
        if(str == null || (str.length() < 2)){
          throw new ParseException();
        }
    
        Character delimiter = getDelimiter(str);
    
        // Remove delimiters
        str = str.substring(1, str.length() -1);
    
        // Unescape escaped quotes in inner string
        String escapedDelim = "" + delimiter + delimiter;
        str = str.replace(escapedDelim, "" + delimiter); // literal replace, no regex
    
        return new ParseResult(delimiter, str);
      }
    
      private static Character getDelimiter(String str)
      throws ParseException {
        Character firstChar = str.charAt(0);
        Character lastChar = str.charAt(str.length() -1);
    
        if(!firstChar.equals(lastChar)){
          throw new ParseException(String.format(
                "First char (%s) doesn't match last char (%s) for string %s",
               firstChar, lastChar, str
          ));
        }
    
        return firstChar;
      }
    
    }
    
    class ParseResult {
    
      public final Character delimiter;
      public final String contents;
    
      public ParseResult(Character delimiter, String contents){
        this.delimiter = delimiter;
        this.contents = contents;
      }
    
    }
    
    class ParseException extends Exception {
    
      public ParseException(){
        super();
      }
    
      public ParseException(String msg){
        super(msg);
      }
    
    }
    
  • 0

    Use this regex:

    "^('|\")(.*)\\1$"
    

    Some test code:

    public static void main(String[] args) {
        String[] tests = {
                "'This isn''t easy to parse.'",
                "'Then John said, \"Hello Tim!\"'",
                "\"This isn't easy to parse.\"",
                "\"Then John said, \"\"Hello Tim!\"\"\""};
        Pattern pattern = Pattern.compile("^('|\")(.*)\\1$");
        Arrays.stream(tests).map(pattern::matcher).filter(Matcher::find).forEach(m -> System.out.println("1=" + m.group(1) + ", 2=" + m.group(2)));
    }
    

    Output:

    1=', 2=This isn''t easy to parse.
    1=', 2=Then John said, "Hello Tim!"
    1=", 2=This isn't easy to parse.
    1=", 2=Then John said, ""Hello Tim!""
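
    Group 2 here still carries the doubled delimiters, and the pattern does not check that inner quotes are actually doubled. If the unescaped text is wanted, one option is a literal replace of the doubled quote with a single one, sketched below. `Unquoter` and `unquote` are names invented for this example.

    ```java
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class Unquoter {
        // Same pattern as above: an opening quote, anything, then the same quote at the end.
        private static final Pattern QUOTED = Pattern.compile("^('|\")(.*)\\1$");

        // Returns the unescaped content, or null when the input is not wrapped in matching quotes.
        static String unquote(String s) {
            Matcher m = QUOTED.matcher(s);
            if (!m.matches()) {
                return null;
            }
            String quote = m.group(1);
            // String.replace is literal (no regex), so the quote needs no escaping.
            return m.group(2).replace(quote + quote, quote);
        }
    }
    ```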
    

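    If you are interested in how to capture quoted text within the text:

    This regex matches all variants and captures the quote in group 1 and the quoted text in group 6: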

    ^((')|("))(.*?("\3|")(.*)\5)?.*\1$
    

    Here is some test code:

    public static void main(String[] args) {
        String[] tests = {
                "'This isn''t easy to parse.'",
                "'Then John said, \"Hello Tim!\"'",
                "\"This isn't easy to parse.\"",
                "\"Then John said, \"\"Hello Tim!\"\"\""};
        Pattern pattern = Pattern.compile("^((')|(\"))(.*?(\"\\3|\")(.*)\\5)?.*\\1$");
        Arrays.stream(tests).map(pattern::matcher).filter(Matcher::find)
          .forEach(m -> System.out.println("quote=" + m.group(1) + ", quoted=" + m.group(6)));
    }
    

    Output:

    quote=', quoted=null
    quote=', quoted=Hello Tim!
    quote=", quoted=null
    quote=", quoted=Hello Tim!
    
  • 0

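    Using regular expressions for this kind of problem is quite challenging. A simple parser that does not use regular expressions is easier to implement, understand, and maintain.

    Also, such a simple parser can easily be extended to support backslash escapes and to convert backslash sequences into characters (for example, "\n" into a newline).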
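
    As an illustration of that approach, a hand-written scanner for the quoted-string list might look like the sketch below. It handles only the doubled-quote escape from the question, not backslash escapes, and it returns the content with doubled delimiters collapsed. `QuoteScanner`, `Token`, and `scan` are names made up here.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    class QuoteScanner {
        static final class Token {
            final char quote;     // the delimiter that opened the string
            final String text;    // the content with doubled delimiters collapsed
            Token(char quote, String text) { this.quote = quote; this.text = text; }
        }

        static List<Token> scan(String input) {
            List<Token> tokens = new ArrayList<>();
            int i = 0;
            while (i < input.length()) {
                char c = input.charAt(i++);
                if (c != '"' && c != '\'') {
                    continue;                       // separator text between strings
                }
                StringBuilder text = new StringBuilder();
                boolean closed = false;
                while (i < input.length()) {
                    char d = input.charAt(i++);
                    if (d != c) {
                        text.append(d);
                    } else if (i < input.length() && input.charAt(i) == c) {
                        text.append(c);             // doubled delimiter: keep one, skip the other
                        i++;
                    } else {
                        closed = true;              // lone delimiter ends the string
                        break;
                    }
                }
                if (!closed) {
                    throw new IllegalArgumentException("Unterminated string starting with " + c);
                }
                tokens.add(new Token(c, text.toString()));
            }
            return tokens;
        }
    }
    ```

    Adding backslash escapes would mean one more branch in the inner loop, which is the kind of extension that is awkward to express in a single regex.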

  • 0

    This can be done easily with the simple regex below:

    private static Object[] checkPattern(String name, String regex) {
        List<String> matchedString = new ArrayList<>();
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(name);
        while (matcher.find()) {
            if (matcher.group().length() > 0) {
                matchedString.add(matcher.group());
            }
        }
        return matchedString.toArray();
    }
    
    
    @Test
    public void quotedtextMultipleQuotedLines() {
        String text = "He said, \"I am Tom\". She said, \"I am Lisa\".";
        String quoteRegex = "(\"[^\"]+\")";
        String[] strArray = {"\"I am Tom\"", "\"I am Lisa\""};
        assertArrayEquals(strArray, checkPattern(text, quoteRegex));
    }
    

    Here we get the strings as array elements.
