
Regex - I need to find something and then copy what comes after it


I have something like this:

<A NAME=speech26><b>SIR HUGH EVANS</b></a>
<blockquote>
<A NAME=1.1.58>Shall I tell you a lie? I do despise a liar as I do</A><br>
<A NAME=1.1.59>despise one that is false, or as I despise one that</A><br>
<A NAME=1.1.60>is not true. The knight, Sir John, is there; and, I</A><br>
<A NAME=1.1.61>beseech you, be ruled by your well-willers. I will</A><br>
<A NAME=1.1.62>peat the door for Master Page.</A><br>
<p><i>Knocks</i></p>
<A NAME=1.1.63>What, hoa! Got pless your house here!</A><br>
</blockquote>

I want to find all of the text and turn it into something like this:

Shall I tell you a lie? I do despise a liar as I do
despise one that is false, or as I despise one that
is not true. The knight, Sir John, is there; and, I
beseech you, be ruled by your well-willers. I will
peat the door for Master Page.
What, hoa! Got pless your house here!

I tried <A NAME=[0-9]+\\.[0-9]+\\.[0-9]+> , but it doesn't do what I want. Can anyone help?

4 Answers

  • -1

    You can try this:

    <A NAME=\d+\.\d+\.\d+>(.*)(?=</A>)
    

    Explanation

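    • <A NAME=\d+\.\d+\.\d+> - matches an opening tag such as <A NAME=1.1.112>.

    • (.*) - matches any character except a newline, zero or more times, and captures it as group 1.

    • (?=</A>) - positive lookahead. Asserts that </A> comes next without including it in the match.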
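    To see how the three parts work together, here is a minimal sketch in Python (the answer itself does not name a language, so Python's re module is an assumption, and the input is shortened to three lines from the question). With a single capture group, re.findall returns only what group 1 captured.

```python
import re

# Shortened sample from the question: one speaker anchor and two line anchors.
html = (
    "<A NAME=speech26><b>SIR HUGH EVANS</b></a>\n"
    "<A NAME=1.1.62>peat the door for Master Page.</A><br>\n"
    "<A NAME=1.1.63>What, hoa! Got pless your house here!</A><br>\n"
)

# Same pattern as above; (.*) is greedy but cannot cross a newline,
# so it backs off to the last </A> on the same line.
pattern = re.compile(r"<A NAME=\d+\.\d+\.\d+>(.*)(?=</A>)")

lines = pattern.findall(html)
for line in lines:
    print(line)
```

    The speech26 anchor is skipped because \d+ cannot match the letter s.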


  • 0

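    Parsing HTML / XML / JSON with regular expressions is like writing poor-quality code. HTML can contain repeated, nested structures, and parsing those with a regex can give unexpected results.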
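    As a rough illustration of how brittle this can be, the sketch below runs one regex against four spellings of the same line. Only the first spelling comes from the question; the other three are hypothetical variations made up for this example.

```python
import re

# A pattern of the kind suggested elsewhere in this thread.
pattern = re.compile(r"<A NAME=\d+\.\d+\.\d+>(.*?)</A>")

# The first entry is from the question; the rest are hypothetical variants.
variants = {
    "as in the question": "<A NAME=1.1.58>Shall I tell you a lie?</A>",
    "lowercase tag": "<a name=1.1.58>Shall I tell you a lie?</a>",
    "quoted attribute": '<A NAME="1.1.58">Shall I tell you a lie?</A>',
    "nested markup": "<A NAME=1.1.58>Shall I <i>tell</i> you a lie?</A>",
}

results = {label: pattern.findall(html) for label, html in variants.items()}
for label, found in results.items():
    print(f"{label}: {found}")
```

    The lowercase and quoted variants are missed entirely, and the nested variant matches but leaks the <i> tags into the extracted text.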

    In Python you can use the Beautiful Soup library to parse the given HTML and extract the output you need.

    Here is sample Python code that uses Beautiful Soup:

    import re
    from bs4 import BeautifulSoup
    
    data = """<A NAME=speech26><b>SIR HUGH EVANS</b>
    </a><blockquote>
    <A NAME=1.1.58>Shall I tell you a lie? I do despise a liar as I do</A><br>
    <A NAME=1.1.59>despise one that is false, or as I despise one that</A><br>
    <A NAME=1.1.60>is not true. The knight, Sir John, is there; and, I</A><br>
    <A NAME=1.1.61>beseech you, be ruled by your well-willers. I will</A><br>
    <A NAME=1.1.62>peat the door for Master Page.</A><br>
    <p><i>Knocks</i></p>
    <A NAME=1.1.63>What, hoa! Got pless your house here!</A><br>
    </blockquote>"""
    
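    # Name the parser explicitly so the result does not depend on which parsers are installed.
    soup = BeautifulSoup(data, "html.parser")
    
    # Select only <a> tags whose name attribute contains three dot-separated numbers, e.g. 1.1.58.
    for a_tag in soup.find_all('a', {'name': re.compile(r'\d+\.\d+\.\d+')}):
        print(a_tag.get_text())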
    

    This gives the following output, as required:

    Shall I tell you a lie? I do despise a liar as I do
    despise one that is false, or as I despise one that
    is not true. The knight, Sir John, is there; and, I
    beseech you, be ruled by your well-willers. I will
    peat the door for Master Page.
    What, hoa! Got pless your house here!
    

    Note that I've used a regex here as well, but only in a limited way: to say that I am interested in all 'a' tags whose name attribute value matches the pattern \d+\.\d+\.\d+ .
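    If installing Beautiful Soup is not an option, a similar result can be sketched with the standard library's html.parser module instead. This is a swapped-in alternative, not the code from this answer; the class name LineCollector and the shortened input are made up for the example.

```python
import re
from html.parser import HTMLParser

# Matches name values made of three dot-separated numbers, e.g. 1.1.58.
LINE_ID = re.compile(r"\d+\.\d+\.\d+")


class LineCollector(HTMLParser):
    """Hypothetical helper: collects the text of <a> tags whose name matches LINE_ID."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._buffer = None  # None while outside a matching <a> tag

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            name = dict(attrs).get("name") or ""
            if LINE_ID.fullmatch(name):
                self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.lines.append("".join(self._buffer))
            self._buffer = None


html = (
    "<A NAME=speech26><b>SIR HUGH EVANS</b></a>\n"
    "<blockquote>\n"
    "<A NAME=1.1.62>peat the door for Master Page.</A><br>\n"
    "<p><i>Knocks</i></p>\n"
    "<A NAME=1.1.63>What, hoa! Got pless your house here!</A><br>\n"
    "</blockquote>"
)

parser = LineCollector()
parser.feed(html)
parser.close()
for line in parser.lines:
    print(line)
```

    As in the Beautiful Soup version, the regex is only used to test the attribute value, not to parse the markup.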

  • -1

    You can try the code below.

    import re
    
    text = """<A NAME=speech26><b>SIR HUGH EVANS</b>
    </a><blockquote>
    <A NAME=1.1.58>Shall I tell you a lie? I do despise a liar as I do</A><br>
    <A NAME=1.1.59>despise one that is false, or as I despise one that</A><br>
    <A NAME=1.1.60>is not true. The knight, Sir John, is there; and, I</A><br>
    <A NAME=1.1.61>beseech you, be ruled by your well-willers. I will</A><br>
    <A NAME=1.1.62>peat the door for Master Page.</A><br>
    <p><i>Knocks</i></p>
    <A NAME=1.1.63>What, hoa! Got pless your house here!</A><br>
    </blockquote>"""
    
    # \d+ on every part, so multi-digit act/scene/line numbers such as 1.1.112 also match.
    output = re.findall(r'<A NAME=\d+\.\d+\.\d+>(.*?)(?=</A>)', text, re.DOTALL)
    print(output)
    

    Output

    ['Shall I tell you a lie? I do despise a liar as I do', 'despise one that is false, or as I despise one that', 'is not true. The knight, Sir John, is there; and, I', 'beseech you, be ruled by your well-willers. I will', 'peat the door for Master Page.', 'What, hoa! Got pless your house here!']
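    re.findall returns a list, while the question asks for one line of text per row. A minimal sketch of that last step, assuming a shortened three-line input and joining the matches with newlines (the variable names matches and plain are made up for this example):

```python
import re

# Shortened sample from the question, including the stage direction that must be skipped.
text = (
    "<A NAME=1.1.62>peat the door for Master Page.</A><br>\n"
    "<p><i>Knocks</i></p>\n"
    "<A NAME=1.1.63>What, hoa! Got pless your house here!</A><br>\n"
)

# The lazy (.*?) stops at the first </A> after each opening tag.
matches = re.findall(r"<A NAME=\d+\.\d+\.\d+>(.*?)</A>", text, re.DOTALL)

# One extracted line per row, as in the desired output.
plain = "\n".join(matches)
print(plain)
```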
    
  • -1

    Here is one option, using re.findall:

    import re
    
    text = "<A NAME=1.1.58>Shall I tell you a lie? " # ... your input from above
    output = re.findall(r'<A NAME=\d+\.\d+\.\d+>(.*?)(?=</A>)', text, re.MULTILINE|re.DOTALL)
    print(output)
    
    ['Shall I tell you a lie? I do despise a liar as I do',
     'despise one that is false, or as I despise one that',
     'is not true. The knight, Sir John, is there; and, I',
     'beseech you, be ruled by your well-willers. I will',
     'peat the door for Master Page.',
     'What, hoa! Got pless your house here!']
    

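    Note, however, that using regex to parse HTML / XML content is generally not a good idea. If you are sure that the target content only ever appears inside <A> tags of the kind shown above, then a regex may be acceptable.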
