
Reading an HTML page with LibreOffice Basic


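I am trying to write a macro in LibreOffice Calc that reads the name of a noble House of Westeros from a cell (e.g. Stark) and then outputs that House's words by looking up the relevant page on A Wiki of Ice and Fire. It should work like this:

[Two screenshots in the original post; images not available]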

Here is the pseudocode:

Read HouseName from column A
Open HtmlFile at "http://www.awoiaf.westeros.org/index.php/House_" & HouseName
Iterate through HtmlFile to find line which begins "<table class="infobox infobox-body"" // Finds the info box for the page.
Read Each Row in the table until Row begins Words
Read the contents of the next <td> tag, and return this as a string.

My problem is the second line: I don't know how to read an HTML file. How should I do this in LibreOffice Basic?

1 Answer


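    There are two main problems here.

    1. Performance: your UDF has to fetch the HTTP resource for every cell in which it is used.
    2. HTML: unfortunately, neither OpenOffice nor LibreOffice has an HTML parser, only an XML parser. That is why the UDF cannot parse the HTML directly.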

    The following works, but it is slow and not very general:

    Public Function FETCHHOUSE(sHouse as String) as String
       ' Returns the words of the given House, or a short error text.
    
       sURL = "http://awoiaf.westeros.org/index.php/House_" & sHouse
    
       ' Open the page over HTTP and wrap the raw stream in a text stream.
       oSimpleFileAccess = createUNOService ("com.sun.star.ucb.SimpleFileAccess")
       oInpDataStream = createUNOService ("com.sun.star.io.TextInputStream")
       on error goto falseHouseName
       oInpDataStream.setInputStream(oSimpleFileAccess.openFileRead(sURL))
       on error goto 0
       ' With no delimiters given, readString reads up to the end of the stream.
       dim delimiters() as long
       sContent = oInpDataStream.readString(delimiters(), false)
       oInpDataStream.closeInput()
    
       ' Cut out the infobox table.
       lStartPos = instr(1, sContent, "<table class=" & chr(34) & "infobox infobox-body" )
       if lStartPos = 0 then
         FETCHHOUSE = "no infobox on page"
         exit function
       end if   
       lEndPos = instr(lStartPos, sContent, "</table>")
       sTable = mid(sContent, lStartPos, lEndPos-lStartPos + 8)
    
       ' Cut out the remainder of the table row that contains "Words".
       lStartPos = instr(1, sTable, "Words" )
       if lStartPos = 0 then
         FETCHHOUSE = "no Words on page"
         exit function
       end if        
       lEndPos = instr(lStartPos, sTable, "</tr>")
       sRow = mid(sTable, lStartPos, lEndPos-lStartPos + 5)
    
       ' Find the opening <td ...> tag of the cell that follows the header.
       oTextSearch = CreateUnoService("com.sun.star.util.TextSearch")
       oOptions = CreateUnoStruct("com.sun.star.util.SearchOptions")
       oOptions.algorithmType = com.sun.star.util.SearchAlgorithms.REGEXP
       oOptions.searchString = "<td[^<]*>"
       oTextSearch.setOptions(oOptions)
       oFound = oTextSearch.searchForward(sRow, 0, Len(sRow))
       If  oFound.subRegExpressions = 0 then 
         FETCHHOUSE = "Words header but no Words content on page"
         exit function   
       end if
       ' endOffset is 0-based, Basic strings are 1-based, hence the + 1.
       lStartPos = oFound.endOffset(0) + 1
       lEndPos = instr(lStartPos, sRow, "</td>")
       sWords = mid(sRow, lStartPos, lEndPos-lStartPos)
    
       FETCHHOUSE = sWords
       exit function
    
       ' Reached when the page could not be opened.
       falseHouseName:
       FETCHHOUSE = "House name does not exist"
    
    End Function
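
    One way to soften the performance problem is to remember results, so that each House is fetched at most once per session. The following is only a sketch under assumptions: the wrapper name FETCHHOUSE_CACHED and the variable oHouseCache are hypothetical, the Global declaration has to sit at the top of the module, and it relies on the UNO service com.sun.star.container.EnumerableMap being available in your installation.

    ```basic
    ' Hypothetical session-wide cache: House name -> words.
    Global oHouseCache As Object

    Public Function FETCHHOUSE_CACHED(sHouse As String) As String
       ' Create the map on first use.
       If IsNull(oHouseCache) Then
          oHouseCache = com.sun.star.container.EnumerableMap.create("string", "string")
       End If
       ' Only go to the network for names not seen before.
       If Not oHouseCache.containsKey(sHouse) Then
          oHouseCache.put(sHouse, FETCHHOUSE(sHouse))
       End If
       FETCHHOUSE_CACHED = oHouseCache.get(sHouse)
    End Function
    ```

    In a cell this would be used as =FETCHHOUSE_CACHED(A1). Note that error texts such as "House name does not exist" are cached as well, so a misspelt name stays cached until the session ends.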
    

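    A better approach would be to get the information you need from a web API provided by the Wiki. Do you know the people behind the Wiki? If so, you could put that to them as a suggestion.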
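
    If the Wiki did offer such an API, the fetch itself would look much the same; only the URL and the parsing would change. The sketch below assumes, without having verified it, that the site exposes the standard MediaWiki api.php at the path shown; the function name FETCHWIKITEXT is hypothetical, and it simply returns the raw JSON response without parsing it.

    ```basic
    Public Function FETCHWIKITEXT(sPage As String) As String
       ' Assumption: a standard MediaWiki api.php exists at this path.
       sURL = "http://awoiaf.westeros.org/api.php?action=parse&prop=wikitext&format=json&page=" & sPage

       oSimpleFileAccess = createUNOService("com.sun.star.ucb.SimpleFileAccess")
       oInpDataStream = createUNOService("com.sun.star.io.TextInputStream")
       On Error Goto fetchFailed
       oInpDataStream.setInputStream(oSimpleFileAccess.openFileRead(sURL))
       On Error Goto 0

       ' No delimiters: read the whole response.
       Dim delimiters() As Long
       FETCHWIKITEXT = oInpDataStream.readString(delimiters(), False)
       oInpDataStream.closeInput()
       Exit Function

       fetchFailed:
       FETCHWIKITEXT = "API request failed"
    End Function
    ```

    The response would be much smaller than the full HTML page, but extracting the words from it still requires string handling, since Basic has no JSON parser either.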

    Regards

    Axel
