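I have a folder with about 150 Word and PDF documents (the same text in both formats). The data is here: http://www.sicgen.pt/antigen_folder/data_sheet/AB0003_ERP57_AB_data_sheet2003.pdf
The text always looks like this (after loading pdftools):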
library(pdftools)
u <- pdf_text("AB0003_ERP57_AB_data_sheet200.pdf")
[1] " Product Data Sheet\r\n 001 Rev1 Jan 2012 by JR\r\nCatalogue No. AB0003-200\r\nQty: 400 µg (2 mg/ml)\r\n ERp57 Polyclonal Antibody\r\nSource: Goat phospholipase C alpha, PI PLC, protein disulfide\r\n isomerase A3 antibody.\r\nGeneral description: Goat polyclonal to ERp57 -\r\nendoplasmic reticulum lumen marker. This Form: Polyclonal antibody supplied as a 200 µl\r\nendoplasmic reticulum protein interacts with lectin (2 mg/ml) aliquot in PBS, 20% glycerol and 0.05%\r\nchaperones calreticulin and calnexin to modulate sodium azide. This antibody is epitope-affinity\r\nfolding of newly synthesized glycoproteins. It has purified from goat antiserum.\r\ndisulfide isomerase activity and complexes of\r\nlectins and this protein mediate protein folding by Immunogen: Recombinant peptide derived from\r\npromoting formation of disulfide bonds in their within residues 300 aa to the C-terminus of human\r\nglycoprotein substrates. ERp57 produced in E. coli.\r\nAlternative names: 58 kDa glucose regulated Specificity: Detects a band of 60 kDa by Western\r\nprotein, 58 kDa microsomal protein, disulfide blot in the following canine, human, monkey,\r\nisomerase ER 60, endoplasmic reticulum resident mouse, rat whole cell lysates.\r\nprotein 57, endoplasmic reticulum resident protein\r\n60, ER protein 57, ER protein 60, ER protein 61,\r\nERP57, ERp60, ERp61, glucose regulated protein\r\n58 Kd, GRP57, GRP58, HsT17083, P58, PDIA3,\r\nReactivity: Reacts against human, rat, mouse, canine and monkey proteins.\r\nSample Western blot Immuno- Histochemistry (paraffin) Histochemistry (frozen)\r\n fluorescence\r\nhuman +++ +++ +++ +++\r\nrat +++ +++ +++ +++\r\nmouse +++ +++ +++ +++\r\ncanine +++ +++ +++ +++\r\nmonkey +++ +++ +++ +++\r\n+++ excellent, ++ good, + poor, ND not determined\r\nUsage: Western blot 1:500-1:2,000 Storage: Store at -20 C for long-term storage. 
Store\r\nImmunofluorescence 1:50-1:500 at 2-8 C for up to one month.\r\nImmunohistochemistry (paraffin) 1:200-1:1,000\r\nImmunohistochemistry (frozen) 1:200-1:1,000 Special instructions: Avoid freeze/thaw cycles.\r\nSICGEN - Research and Development in Biotechnology Ltd\r\nEstrada do Pombalinho, Rabaçal, 3230-544 PENELA – PORTUGAL\r\nwww.sicgen.pt information@sicgen.pt\r\n"
[2] " Product Data Sheet\r\n 001 Rev1 Jan 2012 by JR\r\nReferences:\r\n For research use only, not for diagnostic use\r\nSICGEN's Proprietary Immunogen Policy\r\nIn order to produce high specific antibodies SICGEN has invested a lot of time and effort into selecting immunogen\r\nsequences. SICGEN has decided to protect this information by not publishing it on the website. However, these sequences\r\nare available on request.\r\nSICGEN - Research and Development in Biotechnology Ltd\r\nEstrada do Pombalinho, Rabaçal, 3230-544 PENELA – PORTUGAL\r\nwww.sicgen.pt information@sicgen.pt\r\n"
I would like to convert this into a data frame or table, in R or Excel:
Catalogue.No. Name Source.
1 AB0003-200 ERp57 Goat
2 AB0004-500 (...) (...)
General.Description
1 Goat polyclonal to ERp57 - endoplasmic reticulum lumen marker. This endoplasmic reticulum protein interacts (...)
2 (...)
Alternative.names.
1 58 kDa glucose regulated protein, (...)
2 (...)
Form.
1 Polyclonal antibody supplied as a 200 µl (2 mg/ml) aliquot in PBS
2 (...)
Immunogen
1 Recombinant peptide derived from within residues 300 aa (...)
2 (...)
Specificity. Reactivity.
1 Detects a band of 60 kDa by(...) Reacts against human, rat, ...
2 (...) (...)
Usage.
1 Western blot 1:500-1:2,000 Immunofluorescence
2 (...)
I would like to get this into tabular form. The text shown above is exactly what pdf_text() returns when the PDF file is imported.
If you have any suggestions, please let me know.
1 Answer
I can't post code in comments, so here's a possible approach using pdftools and regular expressions.
DATA
I used the same data you provided and saved it to a PDF named "pdf_catalogue.pdf".
CODE
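A minimal sketch of a helper like the get_string described below, assuming it takes the text plus a "before" and an "after" marker (both treated as regular expressions). The argument names, the lazy switch, and the shortened stand-in for the pdf_text() output are assumptions for illustration, not the exact code of this answer.

```r
# Hypothetical helper: return whatever sits between two marker strings.
# `before` and `after` are regular expressions; `lazy` is an assumed switch
# between (.*?) (stop at the first `after`) and (.*) (stop at the last one).
get_string <- function(text, before, after, lazy = TRUE) {
  middle <- if (lazy) "(.*?)" else "(.*)"
  # (?s) lets "." match line breaks, since a field can span several lines
  pattern <- paste0("(?s)", before, middle, after)
  hit <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
  if (length(hit) < 2) return(NA_character_)  # no match found
  trimws(hit[2])                               # first capture group
}

# Shortened stand-in for one page of pdf_text() output.
# With the real file this would be: page <- pdf_text("pdf_catalogue.pdf")[1]
page <- paste0(
  " Product Data Sheet\r\n",
  "Catalogue No. AB0003-200\r\n",
  "Reactivity: Reacts against human, rat, mouse, canine and monkey proteins.\r\n",
  "Sample Western blot\r\n"
)

catalogue_no <- get_string(page, "Catalogue No\\.", "\r\n")
reactivity   <- get_string(page, "Reactivity:", "Sample")

# One row of the table asked for in the question
sheet <- data.frame(Catalogue.No = catalogue_no,
                    Reactivity   = reactivity,
                    stringsAsFactors = FALSE)
print(sheet)
```

One caveat with this data sheet: pdf_text() returns the two-column part of the page line by line, so fields such as General description and Form come out interleaved and will likely need extra cleaning before a simple before/after match is enough.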
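The get_string function returns whatever (.*) matches between the "before" and "after" strings. This relies on the assumption that the file structure is consistent, as your question implies. If needed, you may want to use (.*?) instead for a "lazy search". If you're not familiar with regular expressions, Roger Peng has a nice video explaining them.
OUTPUT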
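You may want to split the output further depending on its structure. For example, in Alternative names the names all appear to be separated by commas, so you could try splitting on those.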
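Note that splitting on a comma followed by a space (", ") leaves the second element holding two names, presumably because a comma at the end of a line in the extracted text is followed by a line break rather than a space. You need to split on the comma alone (",") to avoid this kind of error. This is especially important for .pdf files. You can also easily divide multi-line text into separate fields by defining the breaks appropriately (a period followed by a capital letter). Regular expressions should let you handle all of these use cases.
This is a fairly minimal example, but you can easily build on it to cover other fields/combinations you may need.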
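To make the comma pitfall and the period-plus-capital break concrete, here is a small sketch in base R. The two strings are short excerpts of the data sheet text, and the variable names are made up for illustration.

```r
# Excerpt from the "Alternative names" field; the line break after
# "ER protein 61," comes from the PDF layout
alt <- "ER protein 60, ER protein 61,\r\nERP57, ERp60"

# Splitting on comma + space misses the comma at the end of the line,
# so two names stay glued together in one element
glued <- strsplit(alt, ", ", fixed = TRUE)[[1]]

# Splitting on the comma alone and trimming whitespace separates all names
alt_names <- trimws(strsplit(alt, ",", fixed = TRUE)[[1]])

# Break on whitespace that follows a period and precedes a capital letter
desc <- paste("Goat polyclonal to ERp57 - endoplasmic reticulum lumen marker.",
              "This endoplasmic reticulum protein interacts")
sentences <- strsplit(desc, "(?<=\\.)\\s+(?=[A-Z])", perl = TRUE)[[1]]

print(glued)
print(alt_names)
print(sentences)
```

The period-plus-capital rule is deliberately crude: text such as "Catalogue No. AB0003-200" would be split by it as well, so it is best applied only to fields that contain running prose.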
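For multiple files, I'd recommend enclosing all of this in a function (once you've finalized your code) and using lapply to loop through the directory. I use something similar to go through .txt and .csv files.
Hope this helps. Cheers!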
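For the multi-file step, a rough sketch of the function-plus-lapply idea. The names parse_sheet, parse_folder and grab, the folder name "data_sheets", and the two patterns are assumptions for illustration. The reader is passed in as an argument so the demo can run on small text files without any PDFs; with the real data the default pdftools::pdf_text would be used.

```r
# Hypothetical per-file parser: read one file, return a one-row data frame.
# read_fun defaults to pdftools::pdf_text; it is an argument so that plain
# text files can be read the same way.
parse_sheet <- function(path, read_fun = pdftools::pdf_text) {
  text <- paste(read_fun(path), collapse = "\n")  # join pages into one string
  # grab(): first capture group of a pattern, or NA if there is no match
  grab <- function(pattern) {
    hit <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
    if (length(hit) < 2) NA_character_ else trimws(hit[2])
  }
  data.frame(
    File         = basename(path),
    Catalogue.No = grab("Catalogue No\\.\\s*(\\S+)"),
    Reactivity   = grab("Reactivity:\\s*([^\r\n]*)"),
    stringsAsFactors = FALSE
  )
}

# Loop over every matching file in a directory and stack the rows
parse_folder <- function(dir, pattern = "\\.pdf$", ...) {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  do.call(rbind, lapply(files, parse_sheet, ...))
}

# With the real PDFs (folder name assumed):
#   catalogue <- parse_folder("data_sheets")
#   write.csv(catalogue, "catalogue.csv", row.names = FALSE)  # opens in Excel

# Demo on two small text files; the second one deliberately lacks a
# Reactivity line to show that missing fields come back as NA
demo_dir <- file.path(tempdir(), "sheets_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("Catalogue No. AB0003-200",
             "Reactivity: Reacts against human, rat, mouse, canine and monkey proteins."),
           file.path(demo_dir, "a.txt"))
writeLines("Catalogue No. AB0004-500", file.path(demo_dir, "b.txt"))

read_txt  <- function(p) paste(readLines(p), collapse = "\n")
catalogue <- parse_folder(demo_dir, pattern = "\\.txt$", read_fun = read_txt)
print(catalogue)
```

Returning NA for a field that is not found keeps every file at exactly one row, which makes it easy to spot the sheets whose layout differs from the rest.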