
Getting a date from a span tag


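Using Beautiful Soup, I want to extract the date from each page listed in a text file of URLs. The date sits in a span tag inside a div with class="update". When I run the code below, the result is <span id="time"></span> rather than the actual time. Please help. For example, the links in sabah_url.txt look like "http://www.dailysabah.com/world/2012/02/20/seeking-international-support-to-block-assad".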

from cookielib import CookieJar
import urllib2
from bs4 import BeautifulSoup
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
try:
    url_file = open('sabah_url.txt', 'r')
    for line in url_file:
        url = line.strip()  # drop the trailing newline before opening the URL
        print url
        # Opens each extracted URL with the urllib2 library
        data = urllib2.urlopen(url).read()
        soup = BeautifulSoup(data)
        # Extracts every span with id="time", which is expected to hold the date
        date = soup.find_all('span', {'id': 'time'})
        for item in date:
            print item
except BaseException, e:
    print 'failed', str(e)

1 Answer

  • 1

    Assuming you want the publication date, you can get it from the meta tag instead:

    import urllib2
    from bs4 import BeautifulSoup
    
    url = "http://www.dailysabah.com/world/2012/02/20/seeking-international-support-to-block-assad"
    
    data = urllib2.urlopen(url)
    soup = BeautifulSoup(data)
    
    print soup.find('meta', itemprop='datePublished', content=True)['content']
    

    This prints 2012-02-20T17:41:01Z.
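The question loops over a file of URLs while the snippet above handles a single page, so here is a minimal Python 3 sketch of how the two might be combined. It assumes every page exposes the same `<meta itemprop="datePublished">` tag. The helper names `published_date` and `dates_from_file` and the `sample_html` string are hypothetical and only for illustration.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def published_date(html):
    """Return the content of the first datePublished meta tag, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", itemprop="datePublished", content=True)
    return tag["content"] if tag else None


def dates_from_file(path):
    """Yield (url, date) pairs for every non-empty line in a URL list file."""
    with open(path) as url_file:
        for line in url_file:
            url = line.strip()  # remove the trailing newline
            if not url:
                continue
            yield url, published_date(urlopen(url).read())


# Hypothetical stand-in for a downloaded page, so the helper can be
# tried without network access.
sample_html = """
<html><head>
<meta itemprop="datePublished" content="2012-02-20T17:41:01Z">
</head><body><span id="time"></span></body></html>
"""

print(published_date(sample_html))
```

With a real URL list, something like `for url, date in dates_from_file('sabah_url.txt'): print(url, date)` would print one date per page; pages without the meta tag yield `None` instead of raising.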

    To make it look like "February 20, 2012", you can use the python-dateutil module:

    >>> from dateutil import parser
    >>> s = "2012-02-20T17:41:01Z"
    >>> parser.parse(s)
    datetime.datetime(2012, 2, 20, 17, 41, 1, tzinfo=tzutc())
    >>> parser.parse(s).strftime('%B %d, %Y')
    'February 20, 2012'
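If installing python-dateutil is not an option, the same reformatting can probably be done with the standard library alone. This is a Python 3 sketch that assumes the timestamp always has the exact `YYYY-MM-DDTHH:MM:SSZ` shape shown above; the function name `format_published` is hypothetical.

```python
from datetime import datetime, timezone


def format_published(stamp):
    """Turn an ISO-style UTC timestamp into a 'Month DD, YYYY' string."""
    # The trailing "Z" is matched literally, so the UTC zone is attached by hand.
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.strftime("%B %d, %Y")


print(format_published("2012-02-20T17:41:01Z"))
```

Unlike `dateutil.parser.parse`, `strptime` raises `ValueError` for any input that does not match the format string, so this variant is stricter about what it accepts.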
    
