
Scraping a table from a page with BeautifulSoup: table not found


I have been trying to scrape the table from here, but it seems to me that BeautifulSoup can't find any table.

I wrote:

import requests
import pandas as pd
from bs4 import BeautifulSoup
import csv

url = "http://www.payscale.com/college-salary-report/bachelors?page=65"
r = requests.get(url)
data = r.text

soup = BeautifulSoup(data, 'xml')
table = soup.find_all('table')
print(table)   # prints nothing..

Based on other similar questions, I assumed the HTML was broken in some way, but I couldn't find the answer in (Beautiful soup missing some html table tags), (Extracting a table from a website), (Scraping a table using BeautifulSoup), or even (Python+BeautifulSoup: scraping a particular table from a webpage).

Thank you!

3 Answers

  • 2

    You are parsing HTML, but you used an XML parser.
    You should use soup = BeautifulSoup(data, "html.parser")
    The data you need is inside a script tag; there is actually no table tag on the page at all, so you have to find the text within that script.
    N.B: If you are using Python 2.x then use "HTMLParser" instead of "html.parser".

    Here is the code.

    import csv
    import requests
    from bs4 import BeautifulSoup

    # The keys appear in this order within each record of the embedded JSON.
    FIELDS = ["School Name", "Early Career Median Pay", "Mid-Career Median Pay",
              "Rank", "% High Job Meaning", "School Type", "% STEM"]

    def read_value(text, key, start):
        """Return the value following '"key":"' at or after start, plus the new offset."""
        marker = '"%s":"' % key
        pos = text.find(marker, start)
        if pos == -1:
            return None, -1
        pos += len(marker)
        end = text.find('"', pos)
        return text[pos:end], end

    url = "http://www.payscale.com/college-salary-report/bachelors?page=65"
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")

    rows = [["Rank", "School Name", "School Type", "Early Career Median Pay",
             "Mid-Career Median Pay", "% High Job Meaning", "% STEM"]]

    for script in soup.find_all("script"):
        text = script.text
        if len(text) <= 10000:        # the data sits in one very large script block
            continue
        start = 0
        while True:
            record = {}
            for key in FIELDS:
                value, start = read_value(text, key, start)
                if value is None:
                    break
                record[key] = value
            if len(record) < len(FIELDS):
                break
            rows.append([record["Rank"], record["School Name"], record["School Type"],
                         record["Early Career Median Pay"], record["Mid-Career Median Pay"],
                         record["% High Job Meaning"], record["% STEM"]])

    with open("table.csv", "w", newline="") as file_name:
        csv.writer(file_name).writerows(rows)
    

    This writes the required table to a csv file. Make sure the file is closed when you are done.
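    As a side note, the length check on the script text is fragile. A more robust option is to select the script by its content; here is a minimal sketch on a toy document (the HTML and variable name below are stand-ins for the live page, not taken from it):

```python
from bs4 import BeautifulSoup

# Toy stand-in for the fetched page: one irrelevant script, one data script.
html = """
<html><head>
<script>var analytics = 1;</script>
<script>var collegeSalaryReportData = [{"School Name":"Shaw University","Rank":"963"}];</script>
</head></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Pick the script(s) that actually contain the data marker instead of
# guessing by length.
data_scripts = [s.string for s in soup.find_all("script")
                if s.string and '"School Name"' in s.string]
print(len(data_scripts))  # 1
```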

  • 1

    It isn't that BeautifulSoup can't find the table in r.text; rather, you asked BeautifulSoup to use the xml parser instead of html.parser, so I suggest changing that line to:

    soup=BeautifulSoup(data,'html.parser')

    One of the problems you will run into with web scraping is the difference between so-called "client-rendered" sites and server-rendered ones. Essentially, it means that the page you get from a plain HTML request via the requests module, or via curl, differs from the content rendered in a web browser. Some common frameworks for this are React and Angular. If you inspect the source of the page you want to scrape, you will see data-react-id attributes on several of its HTML elements. A common tell for Angular pages is similar element attributes prefixed with ng, e.g. ng-if or ng-bind. You can view a page's rendered source in Chrome or Firefox through their respective developer tools, which can be opened in either browser with the keyboard shortcut Ctrl+Shift+I. It is worth noting that not all React and Angular pages are exclusively client-rendered.
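    The framework fingerprints above can be checked programmatically. A minimal sketch, using hand-made snippets as stand-ins for real pages (the heuristic and fingerprint list are illustrative, not exhaustive):

```python
# Toy stand-ins: one server-rendered page, one client-rendered shell.
server_rendered = '<table><tr><td>Shaw University</td></tr></table>'
client_rendered = '<div id="root" data-react-id=".0"></div><script src="app.js"></script>'

def looks_client_rendered(html):
    """Heuristic: no table markup, but framework fingerprints are present."""
    fingerprints = ("data-react-id", "ng-if", "ng-bind", 'id="root"')
    return "<table" not in html and any(f in html for f in fingerprints)

print(looks_client_rendered(server_rendered))  # False
print(looks_client_rendered(client_rendered))  # True
```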

    To get this kind of content, you need a headless browser tool such as Selenium. There are many resources on web scraping with Selenium and Python.

  • 2

    The data is stored in a JavaScript variable; you should find the script's text and extract it with a regular expression. The data you get is a JSON list object containing 900+ school dictionaries, which you should load into a Python list object with the json module.

    import requests, bs4, re, json
    from pprint import pprint
    
    url = "http://www.payscale.com/college-salary-report/bachelors?page=65"
    r = requests.get(url)
    data = r.text
    soup = bs4.BeautifulSoup(data, 'lxml')
    var = soup.find(text=re.compile('collegeSalaryReportData'))
    table_text = re.search(r'collegeSalaryReportData = (\[.+\]);\n    var', var, re.DOTALL).group(1)
    table_data = json.loads(table_text)
    pprint(table_data)
    print('The number of school', len(table_data))
    

    Output:

    {'% Female': '0.57',
      '% High Job Meaning': 'N/A',
      '% Male': '0.43',
      '% Pell': 'N/A',
      '% STEM': '0.1',
      '% who Recommend School': 'N/A',
      'Division 1 Basketball Classifications': 'Not Division 1 Basketball',
      'Division 1 Football Classifications': 'Not Division 1 Football',
      'Early Career Median Pay': '36200',
      'IPEDS ID': '199643',
      'ImageUrl': '/content/school_logos/Shaw University_50px.png',
      'Mid-Career Median Pay': '45600',
      'Rank': '963',
      'School Name': 'Shaw University',
      'School Sector': 'Private not-for-profit',
      'School Type': 'Private School, Religious',
      'State': 'North Carolina',
      'Undergraduate Enrollment': '1664',
      'Url': '/research/US/School=Shaw_University/Salary',
      'Zip Code': '27601'}]
    The number of school 963
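    Once json.loads has produced the list, it can be queried like any Python data. A minimal sketch with a hand-made sample in the same shape as table_data (note the values are strings, so convert them before sorting or arithmetic):

```python
import json

# Hand-made sample in the same shape as the scraped table_data.
sample = json.loads('''[
  {"School Name": "Shaw University", "Rank": "963", "Early Career Median Pay": "36200"},
  {"School Name": "Example College", "Rank": "12", "Early Career Median Pay": "70000"}
]''')

# Convert the rank strings to integers for a numeric sort.
sample.sort(key=lambda row: int(row["Rank"]))
print([row["School Name"] for row in sample])  # ['Example College', 'Shaw University']
```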
    
