Can anyone help me identify the "error" in my Google Cloud ML training job?


I followed the guide at the link below to reproduce the process with new data and a new model:

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md

At the last step, I used the script below to start the training job:

gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%s` \
--runtime-version 1.4 \
--job-dir=gs://marksbucket0000/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--region us-east1 \
--config /Users/markli/Desktop/chase_ad_object/project_2/cluster_config/cloud.yml \
-- \
--train_dir=gs://marksbucket0000/train \
--pipeline_config_path=gs://marksbucket0000/data/ssd_mobilenet_v1_coco.config

It looks like the job was submitted successfully:

Job [xxx_object_detection_xxxxxxx] submitted successfully.
Your job is still active. You may view the status of your job with the command

$ gcloud ml-engine jobs describe xxx_object_detection_xxxxxxx

or continue streaming the logs with the command

However, the job then stops because of the following error in the logs:

[Image: screenshot of the error in the job logs (not available)]

Since I am very new to Google Cloud ML and the TensorFlow Object Detection API, I cannot find any clue in the logs that tells me which step I got wrong.

The YAML cluster configuration file I am using is:

trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard

I would really appreciate it if someone could at least point me in the right direction for debugging. Thanks a lot in advance!

---------------- Update on the question --------------

I actually got it working by changing setup.py as follows:

"""Setup script for object_detection."""

from setuptools import find_packages
from setuptools import setup


# REQUIRED_PACKAGES = ['Pillow>=1.0', 'Matplotlib>=2.1', 'Cython>=0.28.1']
REQUIRED_PACKAGES = ['Tensorflow>=1.4.0','Pillow>=1.0','Matplotlib>=2.1','Cython>=0.28.1','Jupyter']

setup(
    name='object_detection',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    include_package_data=True,
    packages=[p for p in find_packages() if p.startswith('object_detection')],
    description='Tensorflow Object Detection Library',
)

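I did run into a few "No module named ..." errors while running the training job, but there are plenty of threads online that lead to a quick fix, so I will not repeat them here.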
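For anyone hitting the same kind of missing-module error, a minimal sketch like the one below can help narrow down which imports are unavailable in a given Python environment before a job is submitted. The module list is an assumption for illustration, not something taken from the post, and the import name can differ from the pip package name (Pillow is imported as `PIL`, for example).

```python
import importlib.util

# Top-level import names to probe. This list is an assumption for illustration;
# adjust it to whatever your own trainer actually imports.
MODULES_TO_CHECK = ["tensorflow", "PIL", "matplotlib", "Cython", "pycocotools"]


def find_missing_modules(module_names):
    """Return the names in module_names that cannot be located for import."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_modules(MODULES_TO_CHECK)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```

This only checks the local environment. The workers of a cloud job install their own packages, so a clean local result does not guarantee the job will find the same modules.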

However, I then hit a problem when running the evaluation job, "cannot import pycocotool", and found the solution here: https://github.com/tensorflow/models/issues/3470

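Now both my training and evaluation jobs are up and running. However, it seems odd that no statistics from the eval job (e.g. the orange loss plot) appear on TensorBoard's scalars view, even though the eval job does show up as a checkbox among the run options: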

[Image: screenshot of the TensorBoard scalars view (not available)]

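I also checked the logs of the eval job and found that the node seems to keep skipping images. Is this the cause of the problem? Could there be something wrong with the evaluation dataset?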

Log output from the eval job:

[Image: screenshot of the eval job logs (not available)]

1 Answer

The parallel interleave feature is only available in TensorFlow 1.5 and later. Try changing the line in your YAML to:
```
runtimeVersion: "1.8"
```
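As a rough illustration of this suggestion, the sketch below reads `runtimeVersion` from the config text and compares it with a minimum version. The 1.5 threshold is taken from this answer and should be treated as an assumption, and the helper names are hypothetical. Versions are compared as integer tuples because comparing them as strings would rank "1.10" below "1.5".

```python
import re

# Minimum version taken from the answer above; treat it as an assumption.
MIN_RUNTIME_VERSION = (1, 5)


def parse_version(text):
    """Turn a string such as "1.8" into a tuple of ints, e.g. (1, 8)."""
    return tuple(int(part) for part in text.strip().split("."))


def read_runtime_version(yaml_text):
    """Extract runtimeVersion from config text with a simple regex.

    This is not a YAML parser; it only handles a line such as
    runtimeVersion: "1.8" and returns None when no such line exists.
    """
    match = re.search(
        r'runtimeVersion:\s*["\']?([0-9]+(?:\.[0-9]+)*)["\']?', yaml_text
    )
    return parse_version(match.group(1)) if match else None


def runtime_is_new_enough(yaml_text, minimum=MIN_RUNTIME_VERSION):
    """Return True when the config declares a runtime at or above minimum."""
    version = read_runtime_version(yaml_text)
    return version is not None and version >= minimum
```

Note that the submit command in the question also passes `--runtime-version 1.4`, so that flag would likely need the same change as the YAML file.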
Answered on 2024-05-15T21:53:27+08:00
