BeautifulSoup gives garbage for HTML conversion

I am trying to scrape this URL: url = 'http://www.jmlr.org/proceedings/papers/v36/li14.pdf'. This is my code:

    html = requests.get(url)
    htmlText = html.text
    soup = BeautifulSoup(htmlText)
    print soup  # gives garbage

However, it gives weird symbols that I assume are garbage. It is an HTML file, so it should not be trying to parse it as a PDF, should it?

I tried the following, from "How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?":

    request = urllib2.Request(url)
    request.add_header('Accept-Encoding', 'utf-8')  # also tried with 'latin-1'
    response = urllib2.urlopen(request)
    soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))

And this as well, from "Python and BeautifulSoup encoding issues":

    html = requests.get(url)
    htmlText = html.text
    soup = BeautifulSoup(htmlText)
    print soup.prettify('utf-8')

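Both gave me garbage, i.e. the HTML tags were not parsed correctly. The last link also suggested that the encoding might be different even though the meta charset is 'utf8', so I tried the above with 'latin-1' as well, but nothing seems to work.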

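Any suggestions on how to scrape the data from the given link? Please don't suggest downloading the file and using pdfminer on it. Feel free to ask for more info!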

1 Answer

1

This is because the URL points to a document in PDF format, so interpreting it as HTML makes no sense at all.
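One way to catch this before handing a response to BeautifulSoup is to look at what the server actually sent. The sketch below is only an illustration: `looks_like_pdf` is a hypothetical helper, not part of requests or BeautifulSoup, and it is written for Python 3 rather than the Python 2 used in the question. It assumes you pass in the `Content-Type` header value and the raw response bytes, and it relies on the fact that PDF files begin with the bytes `%PDF-`.

```python
def looks_like_pdf(content_type, body):
    """Return True if a response appears to be a PDF rather than HTML.

    content_type: value of the Content-Type header, or None if absent.
    body: raw response bytes (not decoded text).
    Hypothetical helper for illustration only.
    """
    # Header check: drop parameters such as "; charset=..." and compare
    # the media type case-insensitively.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type == "application/pdf":
        return True
    # Fallback: PDF files start with the magic bytes "%PDF-".
    return body.startswith(b"%PDF-")
```

With requests, this could be called as `looks_like_pdf(response.headers.get("Content-Type"), response.content)`. Note that it takes `response.content` (bytes), not `response.text`, since decoding binary PDF data as text is what produces the strange symbols in the first place.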

Answered 2024-04-29T03:52:09+08:00
